Re: memory-mapped files in Squid

From: Andres Kroonmaa <andre@dont-contact.us>
Date: Mon, 25 Jan 1999 12:24:38 +0300 (EETDST)

 Hi,

On 23 Jan 99, at 2:42, Henrik Nordstrom <hno@hem.passagen.se> wrote:

> Andres Kroonmaa wrote:
>
> > In addition, whatever memory cache there is, eventually page flushes
> > will also be done at ~50/sec,
>
> If using a cyclical filesystem where objects are stored linearly as they
> are received rather than randomly, this can be optimized down to 1-5 ops
> by coalescing the writes into larger sequential chunks.

 Not sure what you are talking about, but yes, I agree that there are ways
 to reduce the needed I/O to 1-2 ops per object. But compared to mmap this
 might be a topic for another thread.
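
 For illustration, a rough sketch of that kind of coalescing, assuming
 objects are simply appended at the tail of one cyclical store file (the
 names below are made up, not existing squid code):

   #include <sys/uio.h>
   #include <sys/types.h>
   #include <unistd.h>

   /* Hypothetical: gather several small completed objects and push them
    * out with one sequential writev() at the current store tail, so N
    * small objects cost one disk op instead of N seeks + N writes. */
   struct pending_obj {
       const void *data;
       size_t      len;
   };

   ssize_t store_flush_tail(int store_fd, off_t tail,
                            struct pending_obj *obj, int n)
   {
       struct iovec iov[16];            /* stay well under IOV_MAX */
       int i;

       if (n > 16)
           n = 16;
       for (i = 0; i < n; i++) {
           iov[i].iov_base = (void *) obj[i].data;
           iov[i].iov_len  = obj[i].len;
       }
       if (lseek(store_fd, tail, SEEK_SET) == (off_t) -1)
           return -1;
       return writev(store_fd, iov, n); /* one sequential disk op */
   }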
 
> > Knowing that disk io is 10msec average, that comes to 880msec per
> > second average io wait-time.
>
> Wouldn't it be nice to cut that in half?

 Why? As well we could insist on cutting it by 3 or 4. The calculation was,
 as I said, crude and wrong ;) And if we take that the average disk I/O is
 10 msec and we are doing about 88 random I/Os per second, I can't see how
 we could do that in 440 msec per second...
 
> Also, for loads of this magnitude asynchronous reads should be used.

 By async reads you mean threads? Are we ready to talk about threads
 as a portable solution? That would be cool, of course.

> There is absolutely no point in blocking the whole process for a pagein
> of a HIT.

 right.

> > The only way out is to _never_ let squid process block, and implement
> > all and any disk optimisations inside squid. One of the coolest disk
> > optimisations you can think of is elevator seeking disk io. You just
> > can't do that as long as you rely on kernel to do page io.
>
> Yes, and it is very easy to implement too. Have one reader thread /
> spindle calling readv() with a sorted list of blocks to read. But it
> needs careful thought to not waste too much CPU time on sorting. Having
> only one outstanding readv() operation at a time makes a natural
> interval for collecting and sorting operations.

 I'm afraid that if the data blocks are not sequential, readv does not give
 much benefit, sorted or not. You just avoid OS/app context switches.
 To arrange for readv you perhaps waste comparable CPU. And besides,
 you can't start writing to the network until _all_ blocks in the readv are
 serviced.
 I'd imagine this being done as a linked list where you can always find the
 right place to insert a new request into the chain, and then use regular
 read/write. Btw, I guess the OS also sorts multiple requests somehow.
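
 To make the idea concrete, a rough sketch of such a linked-list elevator,
 assuming one service pass per spindle and completely made-up names (not a
 proposal for the actual implementation):

   #include <pthread.h>
   #include <sys/types.h>
   #include <unistd.h>

   /* Pending reads kept sorted by block number; a sweep services them
    * low-to-high so the disk head moves in one direction per pass. */
   struct io_req {
       off_t          block;            /* block number on this spindle */
       void          *buf;
       size_t         len;
       void         (*done)(struct io_req *);
       struct io_req *next;
   };

   static struct io_req  *pending;      /* sorted ascending by block */
   static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;

   /* The "find the right place to insert into the chain" part. */
   void elevator_insert(struct io_req *r)
   {
       struct io_req **p;

       pthread_mutex_lock(&qlock);
       for (p = &pending; *p && (*p)->block < r->block; p = &(*p)->next)
           ;
       r->next = *p;
       *p = r;
       pthread_mutex_unlock(&qlock);
   }

   /* One elevator sweep, e.g. run by a per-spindle reader thread. */
   void elevator_sweep(int fd, size_t blocksize)
   {
       struct io_req *r, *next;

       pthread_mutex_lock(&qlock);
       r = pending;
       pending = NULL;
       pthread_mutex_unlock(&qlock);

       for (; r; r = next) {
           next = r->next;              /* r->done() may free the request */
           lseek(fd, r->block * (off_t) blocksize, SEEK_SET);
           read(fd, r->buf, r->len);
           r->done(r);                  /* hand data back to the caller */
       }
   }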

> For writing sequential I/O should be used, either using write threads or
> chunked mmap() combined with periodic (once per 1/2 second)

 I'm afraid sequential writes are only possible when there is ample free
 disk space, ie. the cache is not full and fragmentation is low. You can't
 count on that. Perhaps there are ways to ensure sequential free space, but
 it wouldn't be trivial.

> > Large files have many-many inodes. Don't forget that 1 inode
> > describes a max of some 12 direct disk blocks. For larger files
> > you'd need lots of indirect inodes.
>
> The indirect block pointers are full filesystem blocks, not inodes. A
> 2GB file on a 8K/block FS uses roughly:
> * 1 inode
> * 65 pointer blocks a (ceil 2GB / 8K / ( 8K / 2 ) + 1)
> * 65536 data blocks

 Yeah, here you got me, "fast fingers". That's the problem when we talk from
 memory. I checked it, and here's how it goes (assume 8KB blocks):
 - 1 inode per file of any size. The inode has 15 block pointers.
   The first 12 point to 12 direct data blocks
   (12 x 8KB describes 96KB of file).
 - The 13th pointer (if file size > 96KB) points to a data block holding
   indirect pointers. With 8K blocks and 4-byte pointers, 8K/4 = 2K pointers
   per block; 2K x 8KB describes 16MB.
 - If still not enough, the 14th pointer of the inode points to a
   double-indirect (pointers') block, where each pointer points to an
   indirect block (2K x 2K x 8KB describes ~34GB).
 - If still not enough, the 15th and last inode pointer points to a
   triple-indirect block, where each pointer points to a double-indirect
   block (2K x 2K x 2K x 8KB describes ~70TB. huh.)

 So, basically with an 8KB data block we'd never need triple-indirects. We
 wouldn't even need all the double-indirects.

 For a 2GB file, we need (see the small calc sketch below):
 * 1 inode (locked in ram anyway)
 * 12 direct pointers (inside the inode)
 * ~128 indirect pointer blocks of 8KB each (about 1MB)
 * 256K data blocks.
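
 To keep the arithmetic honest, a throwaway calc (plain C, made up for this
 mail) that reproduces these numbers for a given file size and block size:

   #include <stdio.h>

   /* Rough pointer-block math for a classic UFS-style inode:
    * 12 direct pointers, then single- and double-indirect blocks,
    * with 4-byte block pointers. An estimate only. */
   int main(void)
   {
       unsigned long long fsize = 2ULL << 30;        /* 2GB file */
       unsigned long long bsize = 8192;              /* 8KB blocks */
       unsigned long long ppb = bsize / 4;           /* 2K pointers/block */

       unsigned long long data = (fsize + bsize - 1) / bsize;
       unsigned long long remain = data > 12 ? data - 12 : 0;
       unsigned long long ptrblocks = 0;

       if (remain > 0) {                             /* single indirect */
           ptrblocks += 1;
           remain = remain > ppb ? remain - ppb : 0;
       }
       if (remain > 0)                               /* double indirect */
           ptrblocks += 1 + (remain + ppb - 1) / ppb;

       printf("data blocks:    %llu\n", data);       /* 262144 = 256K */
       printf("pointer blocks: %llu\n", ptrblocks);  /* about 1MB worth */
       return 0;
   }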

> Any decent OS should keep those few pointer blocks in cache along with
> the inode while the file is in active use.

 Yes, it should be able to. But it simply doesn't always.

> > So, you can't avoid indirection. With double-indirect blocks
> > you'd still have up to 3 disk ops until you really get to
> > the object. OS caching perhaps makes this look much better,
> > but for that you'd again need huge amount of cache ram.
>
> Not sure if max 256KB / GB is a huge amount of ram.. OS caching should
> get a very high hit rate on these pages, or you should be looking for
> another OS.

 I want to stress here that OS caching is an OS optimisation. About it we
 can only say "it should". When talking about the overhead of disk I/O and
 sub-optimal disk access, it is fairly justified to start from a system
 with zero caching, calculate what it would need to do, and see what can
 be done to eliminate the overhead. Only after that include cache-awareness
 to optimise even more.

 By a huge amount of cache I mean something else and not so obvious.
 Yes, to describe a 2GB file, the OS has to keep in ram a trivial amount of
 1MB. If that were the only amount to cache, no box would have any problem.
 But there are other things to cache, and they all add up. To avoid 2 disk
 ops per non-cached mmap()ed file area, you need to lock 20 Megs in ram to
 describe 40GB of data (40GB / 2GB per file = 20 files, about 1MB of pointer
 blocks each). Imho it is not justified to believe that this happens as we
 wish.
 Consider this: daily, my cache pumps 10GB through its disks, reads+writes
 combined. Having 40GB of disk means that only 25% of all disk space is
 even touched in 24 hours(!). What on earth would keep these 20 Megs locked
 in ram? Only some recently used subset will stick there. All else is
 cached only if randomness allows.

 So, if we don't have some indirect block in ram, we have a penalty of
 1-2 disk ops before we get to the data.
 Whatever cache optimisations there might be, whatever amount of cache
 ram there might be, this is still overhead, and we want to eliminate it.
 The OS cache is a shared resource. If there is a lack of ram, the cache is
 the first to suffer. Least recently used pages are stolen, and there is
 a very good chance that only a small part of this 20MB is cached on
 a busy cache server. To avoid that, we'd have to either eliminate the
 overhead completely, or install an excessive amount of cache ram.
 That is what I meant by needing a huge ram cache.
 
> > and that basically requires the use of raw partitions with all the
> > allocation and free space management implemented in squid.
>
> There is no big difference between raw partitions or large files here.
> The big difference between the two is perhaps in cache management, as
> most OSes support uncached raw partition I/O but all require hinting
> for file I/O. With today's smart OS cache managers combined with some
> hinting this shouldn't be a problem regardless of which is used.

 I disagree. There is one important difference: blocks on a raw partition
 are contiguous and in direct relation to block numbers. For large files,
 data blocks are scattered all over the disk and you have no guarantee
 whatsoever that, say, blocks 5 and 6 are not a "full stroke" away from each
 other. So there might be no benefit from batching writes together.

 In addition, I think that when accessing the disk via the cached interface,
 you have no guarantee that disk ops are scheduled in the same order as you
 dictate from the application. That makes it harder to implement elevator
 optimisations. But then again, a cached interface to a raw partition might
 have other benefits.

> > of cache ram. If you really want to avoid all overhead, your target
> > is really 1 disk op per any type of object operation.
>
> True. But the question remains on how to do this in an efficient and
> manageable way.

 Currently, squid does at least 3 operations with every object, dispersed
 in time: open-read-close or open-write-close.
 For small files (<4KB) the overhead of open-close is considerable.
 We should look at ways to manage objects <32KB internally in squid, and
 let much larger objects be managed by the OS fs. Objects >32K are rare, so
 their relative overhead is small.
 For small objects, it would be cool to even forget the notion of open,
 close, create, delete. Maintain a map of object->disk-blocks and operate
 on them atomically: readfileblocks, writefileblocks. This map could
 potentially be very small, say a bitmap. What has to be solved is that no
 single block size fits all; fragments need to be used and the fragmentation
 problem dealt with.
 But ideally, all indirection could be avoided.
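
 A very rough sketch of that idea, assuming one raw store file, a single
 fixed block size, and completely ignoring the fragmentation problem (all
 names are made up):

   #include <sys/types.h>
   #include <unistd.h>

   #define STORE_BLKSIZE 4096
   #define STORE_BLOCKS  (1 << 20)      /* 4GB store at 4KB blocks */

   /* One bit per store block: 1 = in use. 128KB of ram maps 4GB of disk. */
   static unsigned char blkmap[STORE_BLOCKS / 8];

   static int  blk_used(unsigned n) { return blkmap[n / 8] &   (1 << (n % 8)); }
   static void blk_set(unsigned n)  {        blkmap[n / 8] |=  (1 << (n % 8)); }
   static void blk_clr(unsigned n)  {        blkmap[n / 8] &= ~(1 << (n % 8)); }

   /* Allocate nblk contiguous blocks; naive first-fit scan. */
   long blk_alloc(unsigned nblk)
   {
       unsigned i, run = 0;
       for (i = 0; i < STORE_BLOCKS; i++) {
           run = blk_used(i) ? 0 : run + 1;
           if (run == nblk) {
               unsigned start = i - nblk + 1, j;
               for (j = start; j <= i; j++)
                   blk_set(j);
               return (long) start;
           }
       }
       return -1;                       /* store full */
   }

   /* "Delete" is just clearing bits; no unlink(), no directory update. */
   void blk_free(unsigned start, unsigned nblk)
   {
       while (nblk--)
           blk_clr(start++);
   }

   /* No open/close either: the object map gives (start block, length). */
   ssize_t writefileblocks(int fd, unsigned start, const void *buf, size_t len)
   {
       if (lseek(fd, (off_t) start * STORE_BLKSIZE, SEEK_SET) == (off_t) -1)
           return -1;
       return write(fd, buf, len);
   }

   ssize_t readfileblocks(int fd, unsigned start, void *buf, size_t len)
   {
       if (lseek(fd, (off_t) start * STORE_BLKSIZE, SEEK_SET) == (off_t) -1)
           return -1;
       return read(fd, buf, len);
   }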

 Note that mmap seems very attractive for that. But there are many
 objections to it.
 (btw, msync(MS_ASYNC) is not supported everywhere, eg. FreeBSD-3.0)
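
 For reference, the chunked-mmap write path being discussed would look
 roughly like this (a sketch only; whether msync(MS_ASYNC) really queues
 the writes without blocking is exactly the portability question above):

   #include <sys/mman.h>
   #include <sys/types.h>

   #define CHUNK_SIZE (1 << 20)         /* 1MB write chunk */

   static char *chunk;                  /* currently mapped store chunk */

   /* Map one chunk of the store file; offset must be page-aligned. */
   int chunk_map(int fd, off_t offset)
   {
       chunk = mmap(NULL, CHUNK_SIZE, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, offset);
       return chunk == MAP_FAILED ? -1 : 0;
   }

   /* Called periodically (e.g. twice a second) from the event loop:
    * schedule the dirty pages for write-out, but don't wait for them. */
   void chunk_flush(void)
   {
       if (chunk && chunk != MAP_FAILED)
           msync(chunk, CHUNK_SIZE, MS_ASYNC);
   }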
 
> As you may have guessed (from this message, and my previous message on
> this subject), my vote at the moment is for a multilevel cyclical file
> system with partially threaded I/O (threaded at a disk/spindle level,
> not in any way like the current async-io code).

 Please describe multilevel cyclical FS, or point me to relevant papers.

 ----------------------------------------------------------------------
  Andres Kroonmaa                          mail: andre@online.ee
  Network Manager
  Organization: MicroLink Online           Tel: 6308 909
  Tallinn, Sakala 19                       Pho: +372 6308 909
  Estonia, EE0001    http://www.online.ee  Fax: +372 6308 901
 ----------------------------------------------------------------------