Clustered Store

From: Andres Kroonmaa <andre@dont-contact.us>
Date: Wed, 11 Feb 1998 15:36:26 +0200 (EETDST)

> >> >> Changes to squid-1.2.beta13 (Feb 4, 1998):
> >> >>
> >> >> - #defining MONOTONIC_STORE will support the creation of disk
> >> >> objects clustered into directories. This GREATLY improves disk
> >> >> performance (factor of 3) over old `write-over-old-object'
> >> >> method. If using the MONOTONIC_STORE, the
> >> >> {get/put}_unusedFileno stack stuff is disabled. This is
> >> >> actually a good thing and greatly reduces the risk of serving
> >> >> up bad objects (SLF).
>
> It changes the order in which new swapfiles are placed into
> the directories.
>
> OLD
> cache/00/00/00000000
> cache/01/00/00000001
> cache/02/00/00000002
> cache/03/00/00000003
>
> NEW
> cache/00/00/00000000
> cache/00/00/00000001
> cache/00/00/00000002
> cache/00/00/00000003

 I see. That's a good start. I guess it uses a round-robin counter that
 wraps at around 2M? Some time ago Arjan and I debated whether it is better
 to select _the_ lowest available file_no (my idea) or to select the _next_
 one with wrap-around. Both improve disk performance up to a point.
 I would like to bring it up again, as I think it has somewhat wider effects.
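
 Just to illustrate what I mean, here is a rough sketch of the two policies
 (the function names and the simple bitmap file_map below are made up for
 illustration, not what squid actually does):

    /* "my idea": always pick the lowest free file_no */
    static int
    fileno_lowest_free(const char *map, int map_size)
    {
        int fn;
        for (fn = 0; fn < map_size; fn++)
            if (!map[fn])           /* 0 = slot unused */
                return fn;
        return -1;                  /* store completely full */
    }

    /* monotonic/round-robin: keep a counter and wrap when it overflows */
    static int
    fileno_next_wrap(const char *map, int map_size, int *counter)
    {
        int tried;
        for (tried = 0; tried < map_size; tried++) {
            int fn = *counter;
            *counter = (*counter + 1) % map_size;
            if (!map[fn])
                return fn;
        }
        return -1;
    }

 The first keeps the set of directories in use as small as possible; the
 second keeps writes sequential until the counter wraps.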

 Arjan made a patch for 1.1.10 that implemented this scheme, with the added
 feature of filling each dir with a configurable count of files before going
 on to the next. 384 files/dir seems optimal, as the dir block size becomes
 8192 bytes (exactly 1 disk block), thus increasing the bang-for-a-hit in
 terms of the OS dir cache. It makes open-for-writes pretty fast.
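
 Roughly, with a sequential file_no and a fixed fill factor, the path could
 be derived like this (only a sketch; the real patch surely differs, and
 FILES_PER_DIR / L2_DIRS are just assumed values):

    #include <stdio.h>

    #define FILES_PER_DIR 384       /* fill factor, assumed */
    #define L2_DIRS       256       /* L2 dirs per L1 dir, assumed */

    /* Build "cache/L1/L2/XXXXXXXX" for a sequentially allocated file_no. */
    static void
    fileno_to_path(int file_no, char *buf, size_t len)
    {
        int dir = file_no / FILES_PER_DIR;   /* which directory overall */
        int l2 = dir % L2_DIRS;
        int l1 = dir / L2_DIRS;
        snprintf(buf, len, "cache/%02X/%02X/%08X", l1, l2, file_no);
    }

 With such a layout a new object lands in the same directory as the previous
 ~383 ones, so that directory block is almost always warm in the OS cache.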

 My idea was to always select the lowest free file for the next write. The
 effect is that squid never uses directories it doesn't need.
 If the cache can hold 1M objects and the fill-factor is 384, squid needs
 about 1M/384 ≈ 2800 dirs. With 256 L2 dirs, that makes about 11 L1 dirs,
 almost 2 times fewer. At 8KB per dir, this takes about 22MB of RAM for the
 OS to cache _all_ the dir files for efficient file ops like open().
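
 The arithmetic behind those numbers, for the example values above:

    #include <stdio.h>

    int
    main(void)
    {
        long objects = 1024 * 1024;       /* ~1M cacheable objects */
        int files_per_dir = 384;          /* fill factor */
        int l2 = 256;                     /* L2 dirs per L1 dir */
        long dir_block = 8192;            /* bytes per directory block */

        long dirs = (objects + files_per_dir - 1) / files_per_dir;  /* ~2731 */
        long l1 = (dirs + l2 - 1) / l2;                             /* ~11 */
        double ram_mb = (double) dirs * dir_block / (1024 * 1024);  /* ~21 MB */

        printf("dirs=%ld l1=%ld dir-cache=%.1f MB\n", dirs, l1, ram_mb);
        return 0;
    }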

 Well, while the cache is filling, there is almost no difference between
 these two methods. The next open-for-write is very probably fast because
 the "current" dir is "warm" in the OS cache.

 Problems appear after a few "wraps", when the cache is full, as the
 file_map becomes more and more fragmented due to differing object
 lifetimes. Thus "clustering" becomes less and less efficient and objects
 get spread across more and more directories. Both methods start to fill in
 the holes, and eventually writes go into pretty random directories. Disk
 thrashing returns. The same is true for my suggestion too, with a few
 differences:

 the active working directory set is reduced to the minimum possible, thus
 pushing the dir cache hit rate towards its theoretical maximum (on average,
 I guess), and writes stay sequential as long as no holes appear due to
 object releases. Does squid ever release an object without "touching" it
 on disk? If it usually does touch it, then it "warms up" the dir cache for
 the next write anyway, so using the unused_files cache is good and filling
 the hole is not a "penalty".

 Well, with a very fragmented disk store, Arjan's method is perhaps slightly
 more efficient on writes, especially if normal garbage collection keeps in
 sync with new incoming objects and "clears the road" for the file_map
 counter. It is slightly less efficient on reads, though.

 I've been running squid with these patches for quite some time now. The
 cache is very full, and "emergency LRU" is not so rare. What can I say
 about performance? Compared to default squid it is a tremendous speedup,
 as Arjan has pointed out on his web page. The daily (10-20h) average time
 for open(RO) is 6.4 msec, open(WR) is 12 msec.

 I've added some measurements for unlink() calls into squid, and found
 that unlink() takes 40 msec on average (with some peaks at 100 msec).
 So, unlinks really are very expensive.

 Now, compared to Arjan's results this seems not quite so good, but I'm not
 sure whether Arjan's cache was full and into LRU problems by the time he
 took his samples. Perhaps he can clarify. When our cache was not full, the
 times for RO/WR were pretty close, at 4 msec for both.
 (Our cache size is 14GB (1M objects), doing 50K TCP reqs/h, NOVM.18 train)

 I guess that one of the reasons our open(WR) times are about 2x slower
 than open(RO) is that they truncate already-existing files from the pool
 of unused_files instead of waiting for the unlink to happen.

 Alright, that's about all on performance.

 Now I'd like to speculate about some possible features that arise if we
 use a tightly clustered store space (that is, always fill the
 smallest-numbered files first):

 1) Of course, we can skip precreation of directories and do it on demand
    (items 1 and 3 are sketched in code after this list).
 2) We can dismiss the L1/L2 configuration and calculate the optimal
    structure based on the filesystem's block size only. (As this usually
    only changes via newfs, we are quite safe from otherwise lethal L1/L2
    reconfigurations.)
 3) Basically, as we almost immediately overwrite released files, we can
    think of dropping unlinkd altogether.

 4) After switching to md5 keys and keeping the URL separately, we have some
    nice (IMHO) possibilities: as a StoreEntry is fixed size, it is a good
    candidate for being kept in an array. The index into the array could be
    the file_no, with the md5 keys stored outside the StoreEntry in a
    separate index (a hash) giving only md5-key -> file_no mappings.
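
 For items 1 and 3, here is roughly what the open-for-write path could look
 like (function names made up for illustration, not the actual squid code):

    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <errno.h>

    /* Open a swap file for writing, creating its directory on demand and
     * truncating any old object that still occupies this file_no, so no
     * separate unlink pass (unlinkd) is needed. */
    static int
    swapfile_open_for_write(const char *path, const char *dirpath)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0 && errno == ENOENT) {
            /* L1/L2 dir does not exist yet: create it and retry.
             * (A real version would create the L1 parent too.) */
            if (mkdir(dirpath, 0755) == 0 || errno == EEXIST)
                fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        }
        return fd;
    }

 The O_TRUNC on an already-existing file is what replaces the separate
 unlink.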

/*
 BTW, md5 is already a very good hash, so why use another hash algorithm on
 it for the key search? Why not use the 10-16 lowest bits of the md5 key as
 an index into an array of pointers to linked lists where the fine-grained
 compare can be done? Having 1024 pointers takes 4KB, and the linked lists
 would be about 1000 items deep each to accommodate 1M objects. Or, if we
 take 65536 pointers (256KB), each linked list would be about 32 deep for
 2M objects. Hash lookups would be tremendously fast, I think.
*/
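
 A sketch of that lookup, using the low 16 bits of the md5 as the bucket
 index (all names here are illustrative only, not existing squid code):

    #include <string.h>

    #define HASH_BITS    16
    #define HASH_BUCKETS (1 << HASH_BITS)   /* 65536 buckets, 256KB of
                                               pointers on a 32-bit box */

    typedef struct _md5_node {
        unsigned char key[16];       /* full md5 of the URL */
        int file_no;                 /* index into the StoreEntry array */
        struct _md5_node *next;      /* chain for bucket collisions */
    } md5_node;

    static md5_node *buckets[HASH_BUCKETS];

    /* Use the low 16 bits of the md5 itself as the bucket index. */
    static unsigned int
    md5_bucket(const unsigned char key[16])
    {
        return (key[0] | (key[1] << 8)) & (HASH_BUCKETS - 1);
    }

    static int
    md5_lookup(const unsigned char key[16])
    {
        md5_node *n;
        for (n = buckets[md5_bucket(key)]; n; n = n->next)
            if (memcmp(n->key, key, 16) == 0)
                return n->file_no;
        return -1;                   /* not in the store */
    }

 The memcmp() on the full 16 bytes does the fine-grained compare.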

 Now, what about creating a fixed-size "swap_log" in each cache_dir, sized
 to hold the maximum possible number of store entries this particular
 cache_dir can ever hold, and mmap()-ing it into memory instead of reading
 it in?

 If it is an array, we can mmap() it and use it right away; dynamic
 structures can be attached via malloc when needed, with their pointers
 initialized to NULL at startup. The file never shrinks or grows, unused
 upper entries are never paged in and thus take no RAM, and the whole array
 takes no VM (sort of), because the file itself is the swap space for the
 data in it. There is no need to malloc/free from the heap and no need to
 bother writing to disk: the OS syncs dirty pages to disk in the background.
 In theory, squid could even survive a totally sudden crash, since the
 unwritten store would be flushed independently. Of course, there is an
 added risk of corrupting this mmaped memory area, but such a risk is always
 present anyway. The linked-list *next could be just the next file_no
 instead of a pointer. I think we could reduce the in-memory data size a lot
 this way.
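
 A minimal sketch of the idea, assuming a made-up swap_entry record (the
 struct layout and the helper name are hypothetical, not an existing squid
 format):

    #include <sys/types.h>
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <time.h>

    typedef struct {
        unsigned char md5[16];   /* object key */
        off_t object_size;
        time_t expires;
        int next_file_no;        /* chain by file_no instead of pointer */
        char in_use;
    } swap_entry;

    /* Map the per-cache_dir swap_log so the OS pages entries in on demand
     * and writes dirty ones back in the background. */
    static swap_entry *
    swap_log_map(const char *path, int max_objects)
    {
        size_t len = (size_t) max_objects * sizeof(swap_entry);
        int fd = open(path, O_RDWR | O_CREAT, 0644);
        if (fd < 0)
            return NULL;
        if (ftruncate(fd, (off_t) len) < 0) {   /* fix the file size once */
            close(fd);
            return NULL;
        }
        void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);                               /* mapping stays valid */
        return mem == MAP_FAILED ? NULL : (swap_entry *) mem;
    }

 The entry for file_no N is then simply table[N]; no explicit reading or
 writing of the swap log is needed, and msync() can force a flush if we
 want one.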

 Wild. But is it crazy? ;)

 ----------------------------------------------------------------------
  Andres Kroonmaa                         mail: andre@online.ee
  Network Manager
  Organization: MicroLink Online          Tel: 6308 909
  Tallinn, Sakala 19                      Pho: +372 6308 909
  Estonia, EE0001  http://www.online.ee   Fax: +372 6308 901
 ----------------------------------------------------------------------