Re: memory-mapped files in Squid

From: Andres Kroonmaa <andre@dont-contact.us>
Date: Tue Jul 29 13:15:56 2003

On 2 Feb 99, at 11:30, Kevin Littlejohn <darius@connect.com.au> wrote:

> > > We need to rewrite squid store interface I guess.
> >
> > Yes. Some kind of abstract store interface is needed. It is kind of hard
> > to tweak the fd-centric design into a non-fd centric system.. This
> > applies to all communication and not only disk I/O.
>
> This stuff applies to any squidFS - the aim should be to have squid store,
> internally, a direct pointer to the location on disk for the object, rather
> than any level of indirection. I think that's one of the crucial points
> - and yeah, we could do with a slight abstraction of the current disk
> handling, so there's not read() and write() sprinkled through asyncio and
> disk.c, and so what's stored as a 'name' (and what's stored as a 'fd' for
> open files) is easily changeable.

 Actually, I believe it would be beneficial to change squid to be more
 object-oriented than hack-oriented ;) More and more stuff gets tied to
 it and it becomes increasingly difficult to see all the interactions.
 I believe that everything should revolve around the StoreEntry structure.
 It should have all the bits needed to accomplish everything we want.
 Like vnodes in some OS'es, it should represent the in-memory version of
 a cache object; it should have its own locks, temp areas, handlers, etc.
 It should be arranged so that parts of it can be kept in ram, like
 currently a memObject can be attached to it if needed. The focus of most
 activity should be this one pointer to the StoreEntry. This would let us
 pass only pointers between calls, avoid lots of copying of data around
 in memory and needless malloc/frees, and, with adequate locking, would
 let us easily add threading later on. Squid 2 has taken a long step in
 that direction and I hope it will stay that way.
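
 Roughly what I have in mind, as a sketch only (the field and function
 names below are invented for this mail, they are not the actual Squid 2
 declarations):

    #include <sys/types.h>   /* off_t */
    #include <stddef.h>      /* size_t */

    /* everything of interest hangs off one StoreEntry pointer */
    typedef struct _StoreEntry {
        struct _MemObject *mem_obj;   /* attached only while data is in ram   */
        int      swap_dir;            /* which store the object lives on      */
        off_t    swap_offset;         /* where on that store, no name lookups */
        int      lock_count;          /* pass pointers around, not copies     */
        unsigned short flags;
    } StoreEntry;

    /* the store interface then deals only in StoreEntry pointers;
     * whether a 'name' or an 'fd' sits behind it is hidden in here */
    StoreEntry *storeCreate(const char *url);
    void storeAppendData(StoreEntry *e, const char *buf, size_t len);
    void storeReadData(StoreEntry *e, off_t offset, size_t size,
                       void (*callback)(void *cbdata, char *buf, size_t len),
                       void *cbdata);
    void storeUnlockAndRelease(StoreEntry *e);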

> > Writing and reading should be isolated from each other. Of course
> > objects should be able to be written back to another disk than the one
> > they were read from. Objects should always be written to the most
> > suitable disk, regardless of whether the object came from the network,
> > disk or whatever. An object is an object.
>
> Except if you've already incurred the overhead of writing to disk, why
> incur it again? I'm still not convinced that increasing the workload of
> the disk is a good thing to do in the process of attempting to speed up
> disk access. I know that there are many other things affecting disk access
> speeds - but that medium is still the slow part of the chain (well, after
> the network), so it makes sense to me to keep disk use low.

 It has to do with exploiting the disk's sequential bandwidth, which keeps
 growing, while disk access times improve very little.

 Calculate the cost of a disk write for different disks and segment sizes.
 Let's suppose:
  disk1: avg access time = 20ms, sustained sequential write:  5MB/s
  disk2: avg access time = 20ms, sustained sequential write: 20MB/s
  disk3: avg access time = 10ms, sustained sequential write: 20MB/s
  disk4: avg access time = 10ms, sustained sequential write:  5MB/s

 1) segment size of 8KB

   time to write one random segment:
    disk1: 20ms + 8K/5M  = 21.6 ms
    disk2: 20ms + 8K/20M = 20.4 ms
    disk3: 10ms + 8K/20M = 10.4 ms
    disk4: 10ms + 8K/5M  = 11.6 ms

   max random write bandwidth:
    disk1: 8KB/21.6ms = 370 KB/s
    disk2: 8KB/20.4ms = 392 KB/s
    disk3: 8KB/10.4ms = 769 KB/s
    disk4: 8KB/11.6ms = 689 KB/s

 As you see, although the disks' performance parameters differ considerably,
 the achievable random write bandwidth is pretty close, especially between
 disks with comparable access times.

 ** with small segments, disk access time dictates the disk's maximum
    random bandwidth.

 This is the reason why elevator optimisation rulez: it minimises the
 average seek time between successive disk operations.

 2) segment size of 1MB

   time to write one random segment:
    disk1: 20ms + 1M/5M  = 220 ms
    disk2: 20ms + 1M/20M =  70 ms
    disk3: 10ms + 1M/20M =  60 ms
    disk4: 10ms + 1M/5M  = 210 ms

   max random write bandwidth:
    disk1: 1MB/220ms =  4545 KB/s
    disk2: 1MB/70ms  = 14286 KB/s
    disk3: 1MB/60ms  = 16667 KB/s
    disk4: 1MB/210ms =  4762 KB/s

 As you can see, with an 8KB segment size and random io we can transfer only
 roughly 400-800KB per second, but if we cluster writes together we can write
 close to 15MB every second on disks with fast transfers.

 ** for large sequential segments, transfer bandwidth dominates overall
    write performance, and disk access times influence it very little.

 As you can also see, elevator optimisation gains very little here. But
 elevator still gives us an edge on read performance.
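
 For completeness, the model behind the numbers above is trivial: one seek
 plus one sequential transfer per segment. A throwaway program to reproduce
 both tables (a sketch; it assumes access time and sustained write rate
 describe the disk completely, which is of course a simplification):

    #include <stdio.h>

    /* crude cost model: one seek plus one sequential transfer per segment */
    static void cost(const char *name, double access_ms,
                     double write_mb_s, double seg_mb)
    {
        double t_ms = access_ms + seg_mb / write_mb_s * 1000.0;
        double kb_s = seg_mb * 1000.0 / (t_ms / 1000.0);
        printf("%-10s %7.1f ms per write, ~%6.0f KB/s\n", name, t_ms, kb_s);
    }

    int main(void)
    {
        cost("disk1 8KB", 20.0,  5.0, 0.008);
        cost("disk2 8KB", 20.0, 20.0, 0.008);
        cost("disk3 8KB", 10.0, 20.0, 0.008);
        cost("disk4 8KB", 10.0,  5.0, 0.008);
        cost("disk1 1MB", 20.0,  5.0, 1.0);
        cost("disk2 1MB", 20.0, 20.0, 1.0);
        cost("disk3 1MB", 10.0, 20.0, 1.0);
        cost("disk4 1MB", 10.0,  5.0, 1.0);
        return 0;
    }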

 We are not increasing the disk workload, we are trying to move it to what
 disks are best at; we are trying to do more work in the same amount
 of time.
 We do not want to keep disk use low, we want to keep the disk seek count low.
 We can't avoid random access for reads, but we can avoid it for writes.

 So, big and sequential is fast; small and random is slow. That's the whole
 rationale behind log-FS'es and the FIFO design.

 So, if you append some 64 8K objects (512K) to a write queue, the added
 overhead only halves the write performance for new stuff, keeping it still
 way higher than random access. And because an object is added to the write
 queue only if it is already being accessed by a read, there is no added
 read overhead.
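
 The write queue itself is nothing fancy, roughly like this (a sketch only;
 the cluster size and names are invented here, and it assumes objects no
 larger than the cluster):

    #include <stddef.h>
    #include <string.h>
    #include <unistd.h>

    #define CLUSTER_SIZE (512 * 1024)        /* 64 x 8KB per physical write */

    static char   wq_buf[CLUSTER_SIZE];      /* in-memory write queue       */
    static size_t wq_len = 0;

    /* append a small object (already in ram from serving the read hit) and
     * flush the whole cluster with ONE sequential write when it fills up;
     * assumes len <= CLUSTER_SIZE */
    static void wq_append(int store_fd, const void *obj, size_t len)
    {
        if (wq_len + len > CLUSTER_SIZE) {
            write(store_fd, wq_buf, wq_len); /* one big sequential write    */
            wq_len = 0;
        }
        memcpy(wq_buf + wq_len, obj, len);
        wq_len += len;
    }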

 As we always need to do strictly sequential writes, we have to keep the
 free space sequential, and FIFO is best at this. Unfortunately, FIFO
 drops objects at the fifo tail, and those objects could be quite popular.

 So here we basically only decide whether to overwrite once-written objects
 with new ones, thus probably losing some hits on them, or to append
 those objects to the write queue, thus forming an LRU out of the FIFO.

 As long as the objects appended to the write queue do not cost more than
 keeping totally random writes would, or than refetching them from the
 network, we are pretty cool.

 It should be trivial to leave config options both to disable this LRU and
 to tweak the max object size that gets appended to the write queue.
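
 In code it is just one decision point, roughly (option and helper names
 are invented, nothing like this exists in squid.conf today; StoreEntry as
 in the sketch earlier):

    #include <stddef.h>

    /* tunables one would expose in squid.conf (names invented) */
    static int    lru_from_fifo    = 1;           /* 0 = pure FIFO behaviour */
    static size_t lru_max_obj_size = 64 * 1024;   /* don't re-queue bigger   */

    /* hypothetical helpers, assumed to exist elsewhere */
    extern size_t objSize(StoreEntry *e);
    extern int    wasHitRecently(StoreEntry *e);
    extern void   wqAppendEntry(StoreEntry *e);   /* back onto write queue   */
    extern void   dropEntry(StoreEntry *e);

    /* called for an object at the fifo tail, i.e. about to be overwritten
     * by the advancing write head */
    static void fifoReclaim(StoreEntry *e)
    {
        if (lru_from_fifo && wasHitRecently(e) && objSize(e) <= lru_max_obj_size)
            wqAppendEntry(e);     /* popular and small: rewrite -> LRU-ish */
        else
            dropEntry(e);         /* unpopular or too big: just let it go  */
    }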

> > > uhh. can we avoid large multipart files on fifo storage?
> >
> > As you said a hybrid design could be used, where some storage is FIFO
> > and some filesystem based.
> >
> > A third possibility is a spool area for larger objects, from which
> > completed objects are written to the FIFO. This area can also be managed
> > using FIFO to automatically clean out any fragmentation.
>
> I'd be curious to see what the 'best' lower size limit for objects is before
> you start using this 'staging area'. I'd also be curious to see what impact
> it has on performance if the object sizes drift - if you're heavily hitting
> that area, you may start to lose some of the cyclic gains :(
 
 I guess so too. I believe we don't want to move large objects around too
 often. Beyond some point there may be more overhead in handling them this
 way than in placing them once into the system fs and living with the fs
 overhead.

> There are definitely some nifty ideas there, but I'm not convinced that the
> gains from a cyclic fs over a more traditional style fs are enough to warrant
> the extra management complexity - shuffling objects around on disk, etc.

 Agree that the beast should be simple. LRU is famous for being simple yet
 effective. For small objects FIFO seems cool; for large objects a more
 traditional FS could be better, if not ufs then perhaps something else.
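
 i.e. the dispatch could be as dumb as this (a sketch; 'Store', the two
 store handles and the cutoff value are all invented):

    #define LARGE_OBJECT_MIN (256 * 1024)   /* wild guess, would be configurable */

    /* pick a store by object size; small objects go to the cyclic/FIFO
     * store, large ones to a traditional filesystem store */
    static Store *pickStore(StoreEntry *e)
    {
        if (objSize(e) < LARGE_OBJECT_MIN)
            return fifoStore;     /* small: cyclic/FIFO store                */
        return plainFsStore;      /* large: ufs or some other traditional fs */
    }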

 ----------------------------------------------------------------------
  Andres Kroonmaa                         mail: andre@online.ee
  Network Manager
  Organization: MicroLink Online          Tel: 6308 909
  Tallinn, Sakala 19                      Pho: +372 6308 909
  Estonia, EE0001  http://www.online.ee   Fax: +372 6308 901
 ----------------------------------------------------------------------
Received on Tue Jul 29 2003 - 13:15:56 MDT
