Re: Squid performance wish-list

From: Andres Kroonmaa <andre@dont-contact.us>
Date: Thu, 27 Aug 1998 17:45:43 +0300 (EETDST)

On 24 Aug 98, at 12:20, Stewart Forster <slf@connect.com.au> wrote:

> > > > I've done some research on this. Squid would work well with a 4K
> >
> > I guess you have a few pointers to some performance comparisons and
> > differing FS internals analyses? Could you share?
>
> I ran disk access patterns from Squid through various FS designs and
> thought about the problems with regard to the need to keep objects around for
> possibly longer than a cyclical FS would allow. This meant still something
> along the lines of a UFS-style filesystem. I then decided we need to
> throw away the directory structure (Squid already has its own index file -
> all it needs is an index straight to the file's inode rather than an indirect
> filename).

 I agree that directory structure adds overhead, but it also adds an abstraction
 layer that allows you to use virtually any future FS type provided by the
 vendor. By bypassing this abstraction, you limit yourself to your own FS only.

 You may be right eventually, but I'm really worried about the complexity
 of the task. On Solaris, for example, the VM system and FS cache are very
 tightly coupled, and their efficiency is really hard to beat. If you want to
 implement all of it yourself, you'd have to deal with all those caches,
 and at the same time conserve precious RAM. Then you'd have to lock that
 cache into physical RAM, as there is no point in a cache that could get
 paged out to swap.
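
 (For reference, a minimal sketch of how a user-level cache arena is usually
 pinned in physical RAM with mlock(); the 64 MB arena size is just an example,
 and mlock() normally needs sufficient privilege. This is an illustration of
 the work involved, not anything squid does today.)

/* Sketch: allocate a cache arena and pin it so it cannot be paged out. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
    size_t arena_size = 64 * 1024 * 1024;   /* example: 64 MB cache arena */
    void *arena = malloc(arena_size);

    if (arena == NULL)
        return 1;
    if (mlock(arena, arena_size) != 0) {    /* pin pages in physical RAM */
        perror("mlock");
        return 1;
    }
    /* ... use arena as the object/metadata cache ... */
    munlock(arena, arena_size);
    free(arena);
    return 0;
}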

> > > > frag/ 8K chunk filesystem. Basically we want to write a stripped
> > > > down version of UFS that allows indexing by inode alone. I have a
> > > > design here at Connect that works out to an average of 2.5 disk
> > > > accesses per new object pulled in, and 1.7 disk accesses per cache
> > > > hit, based on a sample taken from our live caches.
> > > >
> > > > Compare this to an average of approx. 7 disk accesses per new object
> > > > write with UFS, and an average of 3.5 disk accesses per cache hit.
> >
> > How do you measure that?
>
> I plugged disk access traces into some simulations to arrive at the 2.5/1.7
> values. For UFS I just watched the number of disk accesses with OS tools
> as opposed to what Squid was doing over the same period. It would seem that
> UFS does a great deal more disk accesses than it needs to for Squid.

 Umm, what OS tools? If I look at how my cache is running, I see that for
 about 30 urls/sec my disks are doing about 16-20 reads/sec and about 25-35
 writes/sec, giving an average of about 1-1.5 disk accesses _per URL_.

98/08/26 11:30:01 - Total time: 30007 ms (0.01 hours)
       acl: 155 ms 0.52% 959 calls 0.162 ms/call 31.96 calls/sec
   connect: 164 ms 0.55% 703 calls 0.233 ms/call 23.43 calls/sec
     diskr: 2319 ms 7.73% 1665 calls 1.393 ms/call 55.49 calls/sec
     diskw: 823 ms 2.74% 2582 calls 0.319 ms/call 86.05 calls/sec
     openr: 1959 ms 6.53% 524 calls 3.739 ms/call 17.46 calls/sec
     openw: 2647 ms 8.82% 352 calls 7.520 ms/call 11.73 calls/sec
    unlink: 50 ms 0.17% 3 calls 16.667 ms/call 0.10 calls/sec

(output of iostat -x 30 for the same timeframe as squid stats above)
                               extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd8 16.0 26.8 110.0 198.7 1.3 0.3 37.9 2 30

98/08/26 11:30:31 - Total time: 30018 ms (0.01 hours)
       acl: 207 ms 0.69% 1161 calls 0.178 ms/call 38.68 calls/sec
   connect: 200 ms 0.67% 864 calls 0.231 ms/call 28.78 calls/sec
     diskr: 2817 ms 9.38% 1732 calls 1.626 ms/call 57.70 calls/sec
     diskw: 873 ms 2.91% 2787 calls 0.313 ms/call 92.84 calls/sec
     openr: 2369 ms 7.89% 605 calls 3.916 ms/call 20.15 calls/sec
     openw: 2036 ms 6.78% 433 calls 4.702 ms/call 14.42 calls/sec

                               extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd8 18.1 26.1 120.5 196.1 1.4 0.3 39.3 1 30

 I can see that squid did 20+14=34 opens/sec while the disks did 18 reads/sec
 and about 44 ops/sec in total. That's fairly efficient. If you sum up all of
 squid's disk calls (diskr + diskw + openr + openw, about 185 calls/sec in the
 second sample), then we have to face that for about 185 disk ops issued by
 squid, the system is doing only about 44 physical disk accesses, and that
 sounds way better than what you claimed for UFS.

 For me, it means that my OS's FS cache is working pretty efficiently: it has
 the most hot dir files in RAM, it has enough RAM for read and write buffering,
 and it is able to optimise disk accesses a lot.

( Note that the above samples are taken from a cache that is 97% full (13.6GB of
  14GB). You see so few unlinks because I use a modified squid that truncates and
  overwrites files rather than first unlinking and then creating them. For 2
  million hits in 24h we usually see only about 700 unlinks. )

 By writing your own FS, you'd have to use direct I/O (unbuffered and uncached
 by the OS), as you would write to raw partitions, and to achieve the same
 numbers you'd also need to implement very efficient caching. You'd need to
 allocate large amounts of RAM for that task, make sure that this RAM is not
 paged out to swap, and implement clustering, read-ahead, free-behind,
 fragmentation avoidance, etc. etc.

 In my view, that is too much work to overcome something that can be and
 should be fixed by efficient caching. What needs to be done around squid
 is to minimise the chances that it defeats any possible OS optimisations.
 If that requires more knowledge of the OS, that's fine, as you can't tune
 your box until you understand your OS internals anyway...

> > I'd rather focus on changing squid's disk usage patterns in such a way
> > that it would make it really easy for ufs and OS to do what we need.
> > I believe there are tons of things we can do in that direction.
>
> I believe we're starting to cap out in this area. I understand very well
> about disk optimisation and Squid is getting close to the end of things
> it can do with regards to UFS without making big assumptions about what
> an OS is doing and the disk layout. As for disk usage patterns, what
> did you have in mind with regard to a system that is fairly random? We
> already use the temporal and spatial locality inherent in cache operation
> to make things work better. I'm open to any suggestions as to what more
> can be done. Remember I'm after quantum leaps in performance, not just
> 5% here and 10% there from minor fiddling.

 OK, I'll give it a try ;)

 To optimise disk I/O, we first need to write down what makes it slow, why,
 and what can be done about it. I guess no one will argue that the slowest
 thing in squid is the file open; compared to subsequent reads/writes it is
 about 10 times slower, so this is where the most speedup lies.
 Although unlink is even slower, I believe that it can be avoided and
 should be removed from squid if at all possible.

 Here is how I understand what goes on in the OS and UFS, viewed in the context
 of squid workloads (there may be errors, ok?):

 1) Open for read:
  - read L1 dir, find inode for L2 dir
  - read L2 inode and find data blocks used by L2 dir
  - read L2 dir, find inode of actual file
  - read inode of actual file
  - write back L2 inode with updated timestamps
  - write back disk block with file inode timestamps Modified
 (6 disk ops, 4+2 r+w at worst)

 2) Create a new file.
  - read L1 dir, find inode for L2 dir
  - read L2 inode and find data blocks used by L2 dir
  - read L2 dir, find free inode for file
  - read inode disk block for actual file
  - write back disk block with new inode Created
  - write back L2 inode with updated timestamps
  - write back L2 data with created filename
 (7 disk ops, 4+3 (r+w) at worst)

 3) Delete file
  - read L1 dir, find inode for L2 dir
  - read L2 inode and find data blocks used by L2 dir
  - read L2 dir, find inode for file
  - read inode disk block for actual file
  - write back disk block with inode Deleted
  - write back L2 inode with updated timestamps
  - write back L2 data with Deleted filename
  - write back free list index blocks (deallocate data blocks)

 (8 disk ops 4+4 r+w)
  + a few more for data deallocation (I'm not sure how that's done)

 4) Open for write with truncate
  - read L1 dir, find inode for L2 dir
  - read L2 inode and find data blocks used by L2 dir
  - read L2 dir, find inode for file
  - read inode (disk block) for actual file
  - write back disk block with inode Modified
  - write back L2 inode with updated timestamps
  - write back free list index blocks (deallocate data blocks)

 (7 disk ops 4+3 r+w)
  + a few more for data deallocation (I'm not sure how that's done)

 5) Delete file with Create file in fast succession
  (Delete file)
  - read L1 dir, find inode for L2 dir
  - read L2 inode and find data blocks used by L2 dir
  - read L2 dir, find inode for file
  - read inode disk block for actual file
  - write back disk block with inode Deleted
  - write back L2 inode with updated timestamps
  - write back L2 data with Deleted filename
  - write back free list index blocks (deallocate data blocks)
 (8 disk ops 4+4 r+w)

 (Create file)
  - cached: read L1 dir, find inode for L2 dir
  - cached: read L2 inode and find data blocks used by L2 dir
  - cached: read L2 dir, find free inode for file
  - cached: read inode disk block for actual file
  - cached: write back disk block with new inode Created
  - cached: write back L2 inode with updated timestamps
  - cached: write back L2 data with created filename
  - write back free list index blocks (allocate data blocks)
 (8 disk ops, 4+4 r+w at worst)

 Total: probably the write ops consolidate and most reads are cached
 (perhaps 8 ops; 16 ops at worst without any caching)

 Apart from the fact that this looks like a stupid thing to do (i.e. delete a
 file in one area of the disk only to create a new file in another area, both
 of which are most probably not cached), it is also incredible food for
 fragmentation...

 ---

 So, given that average seek times for modern disks are about 7ms and an
 average of 7 disk ops per open, we get that any open operation should take
 about 50ms(!)
 Not true.
 The whole issue is the efficiency of caching.

 1) It is clear that all L1 dirs will end up in the OS cache, so all
    disk ops (seeks) regarding them are avoided.

 2) It is clear that those L2 directories that have been accessed
    recently will also be served from cache, but this is already
    less certain and depends heavily on squid's dir access patterns.

 3) File inodes are unrealistic to cache in full, because that takes too
    much RAM, so only recently accessed file inodes are cached.

 4) Squid creates its directory structure upon installation, which means
    that all directory files and their inodes are located close to each
    other (UFS clustering).

    This also means that a seek between file data and dir data is nearly
    a full-stroke job, i.e. expensive.

 --- Now, what can be done?

 1) Synchronous metadata updates mean physical disk I/O before open()
    returns. You _want_ to avoid that.

  - Enable async metadata updates on your volume.
   (external to squid)

 2) Who needs updated timestamps on the squid volume anyway?

  - Disable these metadata updates altogether if you can (FreeBSD, for example).
   (external to squid)

 3) You want to make sure that all L2 dirs fit in memory cache always.

  - Use the minimal number of L1 x L2 directories your disk actually needs!!
    (if your disk is 500M and the average object is 10K, then you can
     hold 50K objects; there is no point in creating L1/L2 dirs that
     can hold 1M objects! you _only_ waste your RAM; see the sizing
     sketch at the end of this item)

  - Make both L1 and L2 dir files as fully utilised as possible.
    (the current default of 128 files per dir fills only about 5K of the
     dir file; given that the OS operates with pages, you have a second
     page almost totally unused.)
    Disk ops happen in disk blocks, so your optimal dir size is equal
    to the disk block size (8K?). You can put 384 files in 8K, and this is
    the optimal number of files per L2 dir (leaving 488 bytes unused).
    Given that L2 dir names are 2-3 chars long, you can put 640 L2
    dirs into an L1 dir (512 + 128) instead of 256.
  - The fewer pages your dir cache uses, the hotter they are and the
    more likely they will stay in cache, speeding up file opens.

    For the same reason I still believe it is much more efficient to let
    squid always fill all dirs from filenumber 1, i.e. always pick the
    lowest free fileno, instead of letting the fileno counter wrap at
    16M and leaving just a few files in each L2 dir, or, when setting
    L1/L2 to something low, worrying about running out of filenumbers.
    With fill-from-1 you simply would never reach fileno's you do not
    need, and thus you use the minimal possible subset of L1/L2 dirs.
  - You want your L1/L2 dirs to always be full.
  - You want your L1/L2 dirs to be accessed frequently.

  - Increase your DNLC and inode caches WAY up from the OS defaults.
   (as the inode cache is shared by both dirs and files, if you want
    squid to survive a few thousand file opens and still have all your
    dirs in the OS cache, you need huge DNLC/inode caches indeed)

    Open files have their inodes locked in RAM.
  - Try cheating the OS into locking all pages related to L2 dirs in RAM.
    (I have a few ideas for Solaris, but am not sure in general. But I believe
     this is external to squid anyway)
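
   The sizing sketch mentioned above, as a back-of-the-envelope program. The
   500M disk, 10K mean object size, 384 entries per 8K dir block and 64 L2
   dirs per L1 are just the illustrative assumptions from the text, not
   measured values:

/* Sketch: how many L1/L2 dirs does a given disk actually need, if we keep
 * every L2 dir file about one 8K disk block large?  Assumptions only. */
#include <stdio.h>

int main(void)
{
    const double disk_mb       = 500.0;   /* size of the cache disk      */
    const double avg_object_kb = 10.0;    /* mean object size            */
    const int    files_per_dir = 384;     /* fills one 8K dir block      */
    const int    l2_per_l1     = 64;      /* an arbitrary L2 choice      */

    double objects   = disk_mb * 1024.0 / avg_object_kb;
    int    dirs_need = (int)(objects / files_per_dir) + 1;
    int    l1_need   = dirs_need / l2_per_l1 + 1;

    printf("objects the disk can hold : %.0f\n", objects);
    printf("L2 dirs needed            : %d\n", dirs_need);
    printf("=> an L1 x L2 of %d x %d is plenty (instead of e.g. 16 x 256)\n",
           l1_need, l2_per_l1);
    return 0;
}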

 4) Delete + Create takes many more disk ops than open with truncate.
  - Avoid deleting files, avoid creating files. Leave them alone; if
    you need disk space, overwrite old files with truncate.

 5) Truncate does unneeded deallocation of data blocks. If we are going
    to write a file, we definitely are going to need a data block. So can
    we avoid data block deallocation?
  - Do not truncate the file upon open(); instead truncate it before close().
  - Try selecting for overwrite those files that are about the same size
    as or smaller than the new object.
    (a truncate at the current offset before close, with no data blocks to
    free, is equivalent to a nop; see the sketch below)

   If we need to delete unused files, we can do that by truncating them
   to 0 or to the smallest fragment size, or, if we are sure we will
   never need those files again (I mean really never), finally unlink them.
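
   A minimal sketch of what I mean (a hypothetical helper, not actual squid
   code): open the old swap file _without_ O_TRUNC, overwrite it, and only
   ftruncate() it to the final length just before close(); if the new object
   is at least as big as the old one, the truncate has nothing to deallocate.

/* Sketch: overwrite an old swap file in place instead of unlink()+creat(). */
#include <fcntl.h>
#include <unistd.h>

int overwrite_swapfile(const char *path, const void *obj, size_t len)
{
    /* no O_TRUNC here: we do not want to deallocate blocks on open() */
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    if (write(fd, obj, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }

    /* trim leftovers from the previous (possibly larger) object only now;
     * if the new object is the same size or bigger, this frees nothing */
    if (ftruncate(fd, (off_t)len) < 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}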
  
 6) The dir structure is created by squid upon first start, which makes it
    all located at the front of the disk volume, with all file data added
    later far away from it, increasing seek times.

  - Let squid create directories as it needs them, after it has filled the
    previous one with files. This will allow UFS clustering to place the L2
    dir structure closer to the inodes and data blocks used by the actual files.

  - Or, let squid hook half the volume's free space into a dummy file, _then_
    create the dir structure, and then delete the dummy file. This should
    place the dir structure in the middle of the spindle, that is, at the
    centre of a full-stroke seek (see the sketch below).
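
   A rough sketch of the dummy-file trick (an illustration of how one could
   script it, not something squid does; the /cache path, 16x64 layout and
   error handling are all simplifying assumptions):

/* Sketch: push the L1/L2 directory tree towards the middle of the volume
 * by allocating a dummy file first, then creating the dirs, then removing
 * the dummy.  Illustrative only. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/statvfs.h>

int main(int argc, char **argv)
{
    const char *root = argc > 1 ? argv[1] : "/cache";
    char path[1024];
    static char buf[65536];              /* zero-filled filler block */
    struct statvfs vfs;
    long long dummy_bytes, written = 0;
    int i, j, fd;

    if (statvfs(root, &vfs) < 0)
        return 1;
    dummy_bytes = (long long)vfs.f_bavail * (long long)vfs.f_frsize / 2;

    /* 1. eat roughly half of the free space with a throw-away file */
    snprintf(path, sizeof(path), "%s/dummy", root);
    fd = open(path, O_WRONLY | O_CREAT, 0600);
    if (fd < 0)
        return 1;
    while (written < dummy_bytes && write(fd, buf, sizeof(buf)) > 0)
        written += sizeof(buf);
    close(fd);

    /* 2. create the L1/L2 tree now, so it gets clustered past the dummy */
    for (i = 0; i < 16; i++) {
        snprintf(path, sizeof(path), "%s/%02X", root, i);
        mkdir(path, 0755);
        for (j = 0; j < 64; j++) {
            snprintf(path, sizeof(path), "%s/%02X/%02X", root, i, j);
            mkdir(path, 0755);
        }
    }

    /* 3. drop the dummy, giving the first half of the volume back */
    snprintf(path, sizeof(path), "%s/dummy", root);
    unlink(path);
    return 0;
}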

 7) Disks operate in blocks, either 4K or 8K (16K?). The OS fs cache operates
    in pages: mostly 4K, on some systems 8K. Any partial page or disk block
    needs special care and adds overhead that might at times mean additional
    disk ops. (For example, modifying 500 bytes in an 8K disk block would
    almost always mean the OS first reads the 8K block, modifies it, then
    writes it back. If we wrote a full 8K, the OS would simply overwrite the
    8K block without needing to read it in first.)
  - So, to make it easier for the OS, we want to use 8K blocks and 4K fragments
    (see the sketch below for finding the preferred I/O size at run time).
    There is some waste of disk space, but as this is external to squid
    anyway, everyone can experiment for themselves.
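
   One portable way to learn what a "full block" means for the filesystem
   squid is writing to is st_blksize from stat(); the buffer rounding below
   is my own illustration, not something squid currently does, and /cache is
   a placeholder path:

/* Sketch: size I/O buffers from the filesystem's preferred block size,
 * so writes cover whole blocks and the OS never has to read-modify-write. */
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    struct stat st;

    if (stat("/cache", &st) < 0)        /* any file/dir on the cache volume */
        return 1;

    /* st_blksize is the preferred I/O block size for this object;
     * round squid's swap buffers up to a multiple of it (e.g. 8K) */
    long blk = (long)st.st_blksize;
    long bufsize = ((32 * 1024 + blk - 1) / blk) * blk;

    printf("preferred block size: %ld, rounded 32K buffer: %ld\n",
           blk, bufsize);
    return 0;
}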

 8) Btw, to answer those who support using a few huge files on top of the OS's
    fs and allocating space for objects inside such files:
  - a seek inside a file constitutes a random access pattern, and any sensible
    OS notices that:
    - the OS disables read-ahead optimisations, which means that reads are slow,
    - the OS disables free-behind optimisations, which means that the OS keeps
      just-read pages in the fs cache as if they were needed again immediately.
      This just wastes memory and blows something out of the cache that would
      have been much more beneficial to keep.
  - this means that you have to do-it-all-yourself or accept suboptimal
    and increased disk traffic.

 --- Next comes swapin and swapout.

 1) It is clear to everyone that writing/reading 512 bytes at a time with
    random seeks in between defeats any disk optimisations.

  - We want to increase the amount of data read or written at a time.

  - We want squid's fetch buffers to be around 32-64K, to allow
    writes to happen in larger blocks.
  - We want to open/write/close in fast succession. For objects that fit
    fully into squid's internal buffers, it is a "crime" to issue open(),
    then wait sometimes up to a few minutes until the first block (8K)
    arrives, and only then write it out to the OS. By that time, the whole
    OS fs cache has been reused many times over and has had to deal with
    partially filled pages, disk blocks, etc. Better to let the squid
    buffers get paged out to swap, because when you later write out the
    whole buffer, the pages are neatly together.
  - For reads, there's little we can do, so we might want to assign
    different-sized buffers for reads vs. writes to conserve memory.

  - Maybe batch (gather and scatter) writes to disks per L2 dir.

  When you have suffered the pain of opening a file in an L2 dir, you want
  to remember that and issue the next write to this L2 dir, because it's hot
  and fast. If you have many free files in there, then you want several of
  the next files to go in there in one shot, because that's faster. So you
  gather several small objects fetched from the net, delay their flushing,
  and when you have gathered a critical mass (say 128K or so), you schedule
  their writeout into one L2 dir if possible. The OS will be able to
  complete most of it in _very_ few disk ops.
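
  A very rough sketch of this gather-per-L2-dir idea. The structures, names
  and the 128K threshold are made up for illustration; real code would have
  to handle errors, memory pressure and fairness:

/* Sketch: queue finished small objects per L2 dir and flush them in one
 * burst, so every open/write/close hits a directory that is already hot. */
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

#define FLUSH_THRESHOLD (128 * 1024)    /* the "critical mass" from above */

struct pending_object {
    char   path[256];                   /* final swap file name           */
    char  *data;                        /* whole object, already in RAM   */
    size_t len;
    struct pending_object *next;
};

struct l2_queue {
    struct pending_object *head;
    size_t bytes;
};

/* add a finished object; flush the whole queue once it is big enough */
void queue_object(struct l2_queue *q, struct pending_object *obj)
{
    struct pending_object *p, *dead;
    int fd;

    obj->next = q->head;
    q->head = obj;
    q->bytes += obj->len;

    if (q->bytes < FLUSH_THRESHOLD)
        return;

    /* one burst: all opens/writes/closes land in the same, now hot, L2 dir */
    p = q->head;
    while (p != NULL) {
        fd = open(p->path, O_WRONLY | O_CREAT, 0644);
        if (fd >= 0) {
            (void)write(fd, p->data, p->len);  /* sketch: no short-write handling */
            close(fd);
        }
        dead = p;
        p = p->next;
        free(dead->data);
        free(dead);
    }
    q->head = NULL;
    q->bytes = 0;
}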

 ---

 In theory, if all UFS metadata is constantly cached in RAM, then to get
 any object onto or from the disks the OS needs a minimal number of disk
 ops; where clustering is efficient and fragmentation allows, a single
 disk operation.
 We CAN cache all the data for the L1 and L2 dirs, but we do need some care
 to make sure it is not kicked out of RAM.

 So it all comes down to the amount of cache needed to keep UFS metadata
 in RAM, or to the most efficient use of it.
 Of course, keeping all metadata in RAM with UFS is unrealistic
 (given 1M objects and a 512-byte inode size, you'd need 512MB of RAM),
 but we don't really care about that, because UFS is supposed to put the
 inode and file data close to each other, so the OS is able to do
 all the needed work in a single or a few disk ops.

 What can be kept in RAM is the directory structure:
   for example, given 16 L1 dirs and 512 L2 dirs (and assuming 384 files
   per dir), we can address about 3M objects, and with each L1/L2 dir
   taking about 8K of RAM it takes 16x512x8K = 64MB of RAM.
 This removes the need for step 1 in the lists above, avoiding 2 disk ops.
 We would also need to keep in RAM all inodes used by the L1 and L2 dirs
 (16x512x512 bytes = 4MB).
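
 The same arithmetic as a small sketch, with the assumptions (8K of dir data
 per directory, 512 bytes per inode, 384 entries per dir) spelled out as
 parameters; these are the example figures from the text, not measurements:

/* Sketch: RAM needed to pin all L1/L2 directory data and inodes in cache. */
#include <stdio.h>

int main(void)
{
    const long l1            = 16;
    const long l2            = 512;       /* L2 dirs per L1 dir          */
    const long files_per_dir = 384;       /* one 8K dir block per L2 dir */
    const long dir_block     = 8 * 1024;  /* bytes of dir data per dir   */
    const long inode_size    = 512;       /* on-disk inode size          */

    long dirs       = l1 * l2;
    long objects    = dirs * files_per_dir;
    long dir_data   = dirs * dir_block;          /* 64 MB                */
    long dir_inodes = (dirs + l1) * inode_size;  /* about 4 MB           */

    printf("addressable objects : %ld (~%.1fM)\n", objects, objects / 1e6);
    printf("dir data to cache   : %ld bytes (%ld MB)\n",
           dir_data, dir_data >> 20);
    printf("dir inodes to cache : %ld bytes (~%ld MB)\n",
           dir_inodes, dir_inodes >> 20);
    return 0;
}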

 If we can arrange all of that, we only do disk ops to actually
 transfer data, that is, about 1-2 disk ops per object.

 4) Open for write with overwrite:
  - cached: read L1 dir, find inode for L2 dir
  - cached: read L2 inode and find data blocks used by L2 dir
  - cached: read L2 dir, find inode for file
  - read inode (disk block) for actual file
( - Async: write back disk block with inode Modified )
( - Async: write back L2 inode with updated timestamps )
  - avoid: write back free list index blocks (deallocate data blocks)
 (1 disk op, 1+0 r+w)

 Then, upon close():
  - Async: write back free list index blocks (deallocate data blocks)

 It is hard to believe that putting very much effort into creating a
 specialised FS for squid would give any substantial benefit, other
 than perhaps conserving some of the memory needed to cache all the
 directory metadata...

 What do you think?

> > > > This means about a 250% gain can be had from a customised UFS filesystem.
> >
> > Where does this gain come from? Could you give more detail?
>
> Okay, my final design was to do this:
>
> 4K fragments, 8K chunks.
>
> Each file starts off with a 4K fragment that also contains that file's inode.
> The inode simply lists the length of the file and 63 direct block pointers
> and 64 single indirect block pointers, for an inode length of 512 bytes (8-byte
> pointers). The inode number is also the location of that inode on the disk.
> eg. inode 4856 is 4856 * 4K bytes from the start of the filesystem. Most disks
> have 512 byte blocks so working with block devices we easily can find the inode
> and the first 3584 (4096 - 512) bytes of data of the file.
>
> 50% of all squid disk objects are < 3584 bytes (on our system anyway) so
> that means 50% of all squid disk object writes will simply be 1 disk access
> to write the data plus inode.
>
> To write more data requires more disk accesses (of course), but since squid's
> disk objects aren't vitally important we can delay updating of inode block
> data and bitmap info for extended periods for disk optimisation.

 This sounds good. But:
 - how would you implement the cache in RAM? How would you lock the cache
   from being paged out to swap?
 - would you implement read-ahead?
 - would you implement free-behind?
 - would you implement delayed writes?
 - clustering?
 - data caching and its LRU?
 - fsck?
 - working over a network (LAN)?
 - spanning multiple spindles?
 etc. etc.
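
 For concreteness, here is a rough sketch of what the on-disk inode in the
 quoted design could look like. One detail I could not make add up: with
 8-byte pointers, a length field plus 63 direct and 64 indirect pointers
 come to 1024 bytes, not 512, so the sketch below assumes 32-bit block
 numbers, which gives exactly 512 bytes and still leaves the quoted 3584
 bytes of data in the first fragment. Treat the field layout as my
 assumption, not as the actual design.

/* Sketch of the proposed squid-FS on-disk inode, co-located with the first
 * 4K fragment of the file.  Field sizes are an assumption: with 32-bit
 * block numbers the structure is exactly 512 bytes. */
#include <stdio.h>
#include <stdint.h>

#define SQFS_NDIRECT    63
#define SQFS_NINDIRECT  64
#define SQFS_FRAG_SIZE  4096

struct sqfs_inode {
    uint32_t length;                      /* object size in bytes         */
    uint32_t direct[SQFS_NDIRECT];        /* fragment numbers of data     */
    uint32_t indirect[SQFS_NINDIRECT];    /* fragments full of pointers   */
};                                        /* 4 * (1 + 63 + 64) = 512      */

struct sqfs_first_fragment {
    struct sqfs_inode inode;                                  /* bytes 0..511 */
    uint8_t data[SQFS_FRAG_SIZE - sizeof(struct sqfs_inode)]; /* 3584 bytes   */
};

/* As quoted above, the inode number doubles as its location: inode N lives
 * at byte offset (off_t)N * SQFS_FRAG_SIZE from the start of the partition. */

int main(void)
{
    printf("inode size      : %zu bytes\n", sizeof(struct sqfs_inode));
    printf("data in fragment: %zu bytes\n",
           sizeof(((struct sqfs_first_fragment *)0)->data));
    return 0;
}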

 ----------------------------------------------------------------------
  Andres Kroonmaa mail: andre@online.ee
  Network Manager
  Organization: MicroLink Online Tel: 6308 909
  Tallinn, Sakala 19 Pho: +372 6308 909
  Estonia, EE0001 http://www.online.ee Fax: +372 6308 901
 ----------------------------------------------------------------------