Re: memory-mapped files in Squid

From: Andres Kroonmaa <andre@dont-contact.us>
Date: Fri, 22 Jan 1999 15:45:27 +0300 (EETDST)

On 21 Jan 99, at 16:36, Carlos Maltzahn <carlosm@moet.cs.colorado.edu> wrote:

> On Wed, 20 Jan 1999, Andres Kroonmaa wrote:
> On 20 Jan 99, at 10:30, Carlos Maltzahn <carlosm@mroe.cs.colorado.edu> wrote:
> >
> > The idea is to store all objects <=8K into one large memory mapped file.
>
> The one dead-end is that you cannot mmap more than 3GB of a file at a
> time. That's awfully little, and remapping some huge file is a pretty
> expensive system call.
>
> That's only true for 32bit machines, right? In a year or so, we'll
> probably all use 64bit machines. Large proxy servers with huge disks

 don't bet on it. PPro's, Pentiums and P-II's won't disappear any time soon,
 and they don't support 64bit stuff. It's not a matter of whether the OS API
 is 64bit or not; for mmap() it's a matter of whether the CPU can manage more
 than 4GB of address-space for user apps.
 As Unix is currently done on x86, it can't. E.g. Solaris 7 is 64bit
 on Sparcs only.

> already use 64bit architectures. Aside from that, remember that

 _portability_ - that's the keyword squid is built around.

> I'm using this file to only store small objects. According to my traces
> the average object size of <=8K objects is about 2.6K. For >8K objects I
> still use the normal file system.

              class -   count  (cnt %)        average
                 32 -    2484  ( 0.1%)      28.592 bytes avg
                 64 -     605  ( 0.0%)      46.898 bytes avg
                128 -   16653  ( 1.0%)     105.956 bytes avg
                256 -   92422  ( 5.4%)     170.156 bytes avg
                512 -  148549  ( 8.7%)     381.014 bytes avg
               1024 -  163440  ( 9.6%)     737.179 bytes avg
               2048 -  246320  (14.5%)    1528.806 bytes avg
               4096 -  351723  (20.7%)    2896.821 bytes avg
               8192 -  260802  (15.3%)    5844.554 bytes avg
              16384 -  204140  (12.0%)   11530.280 bytes avg
              32768 -  110066  ( 6.5%)   22678.601 bytes avg
              65536 -   50017  ( 2.9%)   44958.557 bytes avg
             131072 -   16766  ( 1.0%)   87796.877 bytes avg
                +++ -    4868  ( 0.3%) 1627970.653 bytes avg
  Total: 1.961e+10 -  1668857  (98.1%)   11750.551 bytes avg

 (1.6 million objects, about 19GB in total)

 And it all depends on hell knows what.
 But you are right, and we are on the same side - small objects (for me,
 anything <32K) should be handled differently, especially because
 UFS overhead is tremendous compared to the actual object transfer times.

> If 3GB is getting too small, one could come up with a scheme where all
> objects are originally stored in files, but objects (<=8K) which produce
> more than one hit per day move into a memory mapped file.

 Yep, it could, but it would be an implementation nightmare...
 Of course, squid could mmap just a chunk for an in-mem object cache. There
 is a difference between mmap()ing a file for such a cache and using swap
 space.
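 Something like the sketch below is what I mean; a rough illustration only
 (the window size and function name are made up, this is not squid code):

  #include <fcntl.h>
  #include <sys/mman.h>
  #include <sys/types.h>
  #include <unistd.h>

  /* Assumed window size of 256MB - small enough to stay well under the
   * 3GB address-space ceiling discussed above. */
  #define CACHE_WINDOW (256 * 1024 * 1024)

  /* Map one fixed-size, page-aligned window of a cache file read/write.
   * Returns MAP_FAILED on error. */
  static void *
  map_cache_window(const char *path, off_t offset)
  {
      int fd = open(path, O_RDWR);
      void *base;

      if (fd < 0)
          return MAP_FAILED;
      base = mmap(NULL, CACHE_WINDOW, PROT_READ | PROT_WRITE,
                  MAP_SHARED, fd, offset);
      close(fd);              /* the mapping stays valid after close() */
      return base;
  }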

> The other drawback is that physical disk io becomes pretty unpredictable.
> Under heavy load disk times go up into the xxx ms range, and you can't
> afford to have the whole squid process blocked by page-io. One solution is
> fully threaded code, or, in other words, quite a huge rewrite.
>
> Yes. I found that especially the 30sec sync wreaks havoc on large memory
> mapped files. But for dedicated machines one could easily change the sync
> rate to a much larger interval, say 5mins. If the machine crashes, 5mins
> or even an hour of missed caching is not going to kill me.

 You should _never_ call sync from an app. The kernel should automatically
 flush dirty pages periodically; you should only tune what percentage of all
 dirty pages gets flushed per run.
 That way dirty pages stay in memory for at most some maximum time, there are
 constant pageouts going on, and it all happens in the background, as page
 flushes are done by the kernel. So ideally your process is never blocked.
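 If you really must push a dirty region out early, the closest thing to
 "don't block me" is an asynchronous write-back hint. Just a sketch, assuming
 the region was mmap()ed as above:

  #include <stddef.h>
  #include <sys/mman.h>

  /* Ask the kernel to schedule write-back of a dirty, page-aligned
   * region without waiting for it to complete.  MS_SYNC (or a plain
   * sync()) would block the caller instead. */
  static int
  schedule_flush(void *addr, size_t len)
  {
      return msync(addr, len, MS_ASYNC);
  }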

 But this is not what I was talking about. I hope you got me right.

 OK, let me make a rude and buggy calculation ;)
 If squid is serving 100 obj/sec, and we agree that at least about 80% is
 cacheable and 40% are hits, we can take it that it'd do about 88 disk
 ops/sec (that works out to 40 reads/sec for the hits plus 48 writes/sec for
 the cacheable misses).
 This means that you'd block squid about 40 times per second just to page in
 the hit objects.
 In addition, whatever memory cache there is, page flushes will eventually
 also be done at ~50/sec, and every once in a while, when you need a new ram
 page, there will be a shortage of memory and the OS will have to free some
 pages by flushing them. All this means that at the given load you'd have
 about 88 page-ios per second that block the whole squid process
 (_on average_: 40/sec of totally random reads plus about 48 writes/sec;
 the writes are gathered into spikes of flushes occurring regularly, but the
 reads are random).
 Knowing that a disk io takes 10msec on average, that comes to 880msec of io
 wait-time per second; in other words, most of the time squid is blocked by
 page io.
 And that's at a load of 100 obj/sec. Take 200/sec and it becomes outright
 impossible.
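 Spelled out, the arithmetic above looks like this (the rates and the 10msec
 figure are the assumptions from the paragraph, nothing else):

  #include <stdio.h>

  int
  main(void)
  {
      /* assumptions from above: 100 obj/sec, 80% cacheable, 40% hits,
       * 10 msec per random disk op */
      double rate   = 100.0;
      double reads  = rate * 0.40;            /* hit page-ins per second  */
      double writes = rate * 0.60 * 0.80;     /* cacheable misses written */
      double ops    = reads + writes;         /* ~88 blocking ops per sec */
      double busy   = ops * 0.010;            /* seconds of io wait / sec */

      printf("%.0f ops/sec -> %.0f msec of io wait per second\n",
             ops, busy * 1000.0);
      return 0;
  }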

 In fact, things are much worse. The kernel is too simple to do any clever
 tricks when optimising page flushes: the VM system looks at memory page by
 page. It does not matter how many spindles there are, and any randomness
 in page locations on disk only makes things worse.

 The only way out is to _never_ let the squid process block, and to implement
 any and all disk optimisations inside squid itself. One of the coolest disk
 optimisations you can think of is elevator-seek disk io, and you just
 can't do that as long as you rely on the kernel to do page io.
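 The core of elevator seeking is trivial once squid owns the io queue: keep
 the pending requests sorted by disk offset and service them in one sweep.
 A toy sketch (the struct and names are invented for illustration; real code
 would carry buffers, callbacks and per-spindle queues):

  #include <stdlib.h>
  #include <sys/types.h>

  /* one pending disk request, identified here only by its byte offset */
  struct disk_req {
      off_t offset;
  };

  static int
  by_offset(const void *a, const void *b)
  {
      off_t x = ((const struct disk_req *) a)->offset;
      off_t y = ((const struct disk_req *) b)->offset;
      return (x > y) - (x < y);
  }

  /* order the queue so the head sweeps across the platter in one
   * direction instead of seeking back and forth at random */
  static void
  elevator_order(struct disk_req *queue, size_t n)
  {
      qsort(queue, n, sizeof(*queue), by_offset);
  }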

> Unmapping a file with dirty pages adds fuel to the fire - all dirty pages
> are flushed to disk, and as they are paged out directly from user space,
> the whole process is blocked for the duration of the unmap.
>
> The bottom line seems to be that mmap is suited for cases where you
> mmap/unmap rarely. For squid this would hit the 3GB limit.
>
> Yes. I don't propose to mmap/unmap frequently.
 
 Then it's not a viable idea. The current limit is 3GB, and it will stay that
 way for years. And 3GB is a negligibly small cache, especially if you take
 into account that nearly 50% of the cache volume is held in objects <16K.

 We should seek other ways to achieve the same goal.

> > I'm not familiar with Squid's source code and the various sub systems it
> > uses. But ideally one could integrate memory mapped files with Squid's
> > memory pools so that loaded parts of the memory mapped file don't have to
> > compete with web objects in memory pools.
>
> You are on the right track, but mmap is not the way to go. UFS overhead
> is what makes squid slow; to avoid that, squid would need its own FS.
> As most of the overhead comes from directory structure maintenance and free
> space handling, these are the most obvious areas of work.
>
> Isn't most of the UFS overhead inode management and the fact that you use
> individual files? With mmap you have _one_ inode that is constantly used
> and therefore cached. You don't have to worry about file-system-level
> fragmentation. So even though a Squid FS would be nice, I disagree with
> you that it is necessary in order to significantly reduce disk I/O.

 A large file doesn't get by with just its one inode. Don't forget that the
 inode itself describes a max of some 12 direct disk blocks; for larger files
 you need lots of indirect blocks, and for a 2-4GB ufs file you need full
 indirection, with nearly every data block reached through double-indirect
 blocks. BTW, the current ufs file size limit is 2-4GB; that's another
 portability issue.
 (Try to find a paper on UFS internals, you'd understand much better what
  the OS has to cope with. Also check out how the most popular OSes' VM
  systems work. I can only point you to Solaris VM internals, but that is
  also a pretty good reference.)
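 To put rough numbers on it (assuming a classic UFS-ish geometry of 8K
 blocks, 4-byte block pointers and 12 direct pointers; your fs will differ
 in the details):

  #include <stdio.h>

  int
  main(void)
  {
      /* assumed geometry, not taken from any particular UFS */
      long long bsize = 8192, nptr = 8192 / 4, ndirect = 12;

      long long direct = ndirect * bsize;              /*  ~96 KB */
      long long single = direct + nptr * bsize;        /*  ~16 MB */
      long long dble   = single + nptr * nptr * bsize; /*  ~32 GB */

      printf("direct blocks only: %lld KB\n", direct / 1024);
      printf("+ single indirect:  %lld MB\n", single / (1024 * 1024));
      printf("+ double indirect:  %lld GB\n", dble / (1024LL * 1024 * 1024));
      return 0;
  }

 So anything past the first ~16MB of a multi-GB file sits behind double
 indirection.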

 So, you can't avoid indirection. With double-indirect blocks you'd still
 have up to 3 disk ops before you really get to the object. OS caching
 perhaps makes this look much better, but for that you'd again need a
 huge amount of cache ram.
 Inode management is an issue, you are right, but it's small compared to
 dir and free space management. We want to eliminate all of those overheads,
 and that basically requires the use of raw partitions with all the
 allocation and free space management implemented in squid. And that's
 basically squid-FS, whichever way it's done. I've been trying to show
 that to some extent squid could be tuned a lot to make it easier for the OS
 to optimise disk access, but that too has a hard dependence on the amount
 of cache ram. If you really want to avoid all overhead, your target
 is really 1 disk op per object operation of any type.
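 The allocation side of such a squid-FS can start out really dumb: one bit
 per fixed-size block on the raw partition, plus an in-core map from object
 to (block, length). A toy sketch with made-up sizes and names, just to show
 the shape:

  /* one bit per 4K block; 1<<20 blocks covers a 4GB partition */
  #define BLK_SIZE  4096
  #define NBLOCKS   (1 << 20)

  static unsigned char blockmap[NBLOCKS / 8];   /* bit set = block in use */

  /* find and claim a free block; returns block number, or -1 if full */
  static long
  alloc_block(void)
  {
      long i;
      for (i = 0; i < NBLOCKS; i++) {
          if (!(blockmap[i / 8] & (1 << (i % 8)))) {
              blockmap[i / 8] |= (unsigned char) (1 << (i % 8));
              return i;
          }
      }
      return -1;
  }

  static void
  free_block(long i)
  {
      blockmap[i / 8] &= (unsigned char) ~(1 << (i % 8));
  }

 With the whole map held in core (128KB for 4GB here), a hit costs exactly
 one seek to block*BLK_SIZE - that's the 1-disk-op target.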

> > Have people tried something similar? What is your experience? Would you
> > extend Squid's swapout/swapin modules or Squid's memory pools? What other
> > possibilities exist?
>
> "squid-FS and alikes".
> In fact, squid should be rewritten somehow so that it has some
> generalised (object) file io api that could be more easily modified. Then
> it would be possible to experiment with many different algorithms.
> Currently, if you want to try your stuff out, you'd have to rewrite a
> pretty big chunk of squid, and then you are on your own.
>
> Yup - too bad. I'm currently looking into other open source web
> server/proxy software that might be easier to extend. Medusa
> (www.nightmare.com) comes to mind...
>
> > I ran two trace-driven file system load generators, one similar to Squid's
> > current file access, and one with a combination of memory-mapped file
> > access for <=8K objects and normal file access for >8K objects. The
> > results were very encouraging. So I'd really like to implement something
> > like this in Squid and see whether it indeed saves a lot of disk I/O.
>
> It will, until you go beyond a few tens of GBs. Then mmap()ing overhead
> will take over. The main benefit you get is that you avoid the dir structure
> indirection overhead, perhaps up to 6 disk ios per 8K object. That's
> a lot. But it should be handled by some other means.
>
> I dunno what OS you run on, but in my experience, Solaris 2.6 goes boo-boo
> once you try handling >~15GB with 512MB of RAM and 10% fragmentation.
> Only after this point can you really see the drastic influence of differing
> algorithms.
>
> I used DUNIX 4.0 with 512MB and three disks with a total of 10GB. The two
> days of trace used 4GB for >8K objects and about 1.7GB of the 2GB
> memory-mapped file for <=8K objects.

 That's too small a set for real-life tests. 4GB with 512MB of ram - you
 can't really notice any considerable disk bottlenecks. Put 40GB on the same
 box, pinch it with 100 reqs/sec, and then take your samples.
 
> Are you talking about running Squid on a 15GB/512MB system? The effect you
> are describing could be because 512MB might not be enough to handle the
> metadata of a full 15GB cache without swapping, right? As long as the
> size of the metadata is linearly dependent on the size of the cache you
> should run into this problem in any case.

 No, my squid process size is near 170MB, and 100% of it is in RAM.
 (Sol 2.6 has a cute patch that allows keeping app code and data in memory
 in preference to disk cache buffers.) I have zero swap activity (in fact,
 32MB of ram is kept free). Everything else goes to the OS fs buffer cache,
 both for delayed writes and for the hottest objects.
 The problem is not the linear dependency; linear is justified. The problem
 is that to get effective disk io the box has to have LOTS of ram to avoid
 most of the dir (or inode) structure overhead. And that does not grow
 linearly with the increasing amount of disk space, at least not with the
 current allocation methods. This is exactly why we need something
 more clever.
     
> I'd suggest looking into using just a bunch of plain swapfiles, either
> on top of UFS or on raw partitions. Implement block allocation and
> maintenance yourself, tightly coupled with squid's needs. There is
> one work in progress on squid-FS that looks very interesting, and
> if you like, I can present you my own ideas on the subject some time.
>
> I'd be very interested in finding out more about this. Who is working on
> squid-FS?

 I know Stewart Forster, and now Kevin Littlejohn, have been working on a
 squid-FS. It has many interesting ideas in it, but imho it also has
 portability problems, and it is not even alpha yet - better ask Kevin.

 ----------------------------------------------------------------------
  Andres Kroonmaa                         mail: andre@online.ee
  Network Manager
  Organization: MicroLink Online          Tel: 6308 909
  Tallinn, Sakala 19                      Pho: +372 6308 909
  Estonia, EE0001   http://www.online.ee  Fax: +372 6308 901
 ----------------------------------------------------------------------