Re: memory-mapped files in Squid from Henrik Nordstrom on 1999-01-22 (squid-dev)

From: Henrik Nordstrom <hno@dont-contact.us>
Date: Sat, 23 Jan 1999 02:42:26 +0100

Andres Kroonmaa wrote:

> In addition, whatever memory cache there is, eventually page flushes
> will also be done at ~50/sec,

If using a cyclical filesystem where objects are stored linearly as they
are received rather than randomly, this can be optimized down to 1-5
ops/sec by coalescing the writes into larger sequential chunks.
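
To make the coalescing idea concrete, here is a minimal sketch in C. It
is not Squid code: the struct, the function names and the 64K chunk
size are all assumptions made for illustration. Objects are packed into
one buffer as they arrive, and the buffer goes to disk as one large
sequential write when it fills.

```c
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK_SIZE (64 * 1024)   /* assumed flush size, not from the post */

/* Hypothetical append-only store: objects are packed into one buffer
 * and written sequentially when the buffer fills. */
struct chunk_writer {
    int fd;                 /* store file, opened for writing */
    off_t next_off;         /* file offset where the buffer will land */
    size_t used;            /* bytes currently buffered */
    char buf[CHUNK_SIZE];
};

/* Write out the buffered bytes; returns 0 on success, -1 on error.
 * The loop only repeats if pwrite() returns a short count. */
static int cw_flush(struct chunk_writer *w)
{
    size_t done = 0;
    while (done < w->used) {
        ssize_t n = pwrite(w->fd, w->buf + done, w->used - done,
                           w->next_off + (off_t)done);
        if (n < 0)
            return -1;
        done += (size_t)n;
    }
    w->next_off += (off_t)w->used;
    w->used = 0;
    return 0;
}

/* Buffer one object; returns the file offset it will occupy, or -1. */
static off_t cw_append(struct chunk_writer *w, const void *obj, size_t len)
{
    if (len > CHUNK_SIZE)
        return -1;          /* objects larger than a chunk are out of scope */
    if (w->used + len > CHUNK_SIZE && cw_flush(w) < 0)
        return -1;
    off_t where = w->next_off + (off_t)w->used;
    memcpy(w->buf + w->used, obj, len);
    w->used += len;
    return where;
}
```

With small objects, many cw_append() calls turn into a single disk
write per chunk, which is where the reduction in ops comes from.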

> Knowing that disk io is 10msec average, that comes to 880msec per
> second average io wait-time.

Wouldn't it be nice to cut that in half?

Also, for loads of this magnitude asynchronous reads should be used.
There is absolutely no point in blocking the whole process for a page-in
of a HIT.

> In fact, things are much worse. Kernel is too simple to do any clever
> tricks in optimising page flushes.

This is because Squid currently uses random I/O in random files, and
very few operating systems are optimized for random I/O.

> The only way out is to _never_ let squid process block, and implement
> all and any disk optimisations inside squid. One of the coolest disk
> optimisations you can think of is elevator seeking disk io. You just
> can't do that as long as you rely on kernel to do page io.

Yes, and it is very easy to implement too. Have one reader thread per
spindle calling readv() with a sorted list of blocks to read. But it
needs careful thought to avoid wasting too much CPU time on sorting.
Having only one outstanding readv() operation at a time makes a natural
interval for collecting and sorting operations.
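
A minimal sketch of one such sorted pass, in C. The request struct and
function names are hypothetical. Note one substitution: POSIX readv()
scatters a single contiguous file range into several buffers, so this
sketch issues one pread() per request in ascending offset order
instead. The batch handed to elevator_pass() would be whatever
accumulated while the previous pass was running.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical read request: where to read, how much, and into what. */
struct read_req {
    off_t offset;     /* file offset of the block */
    size_t len;       /* number of bytes wanted */
    void *buf;        /* destination buffer, at least len bytes */
    ssize_t result;   /* filled in by the pass: bytes read, or -1 */
};

/* qsort() comparator: ascending file offset. */
static int by_offset(const void *a, const void *b)
{
    const struct read_req *x = a;
    const struct read_req *y = b;
    return (x->offset > y->offset) - (x->offset < y->offset);
}

/* One elevator pass: sort the batch by offset, then read in ascending
 * order so the requests reach the disk in a single sweep direction. */
static void elevator_pass(int fd, struct read_req *reqs, size_t n)
{
    qsort(reqs, n, sizeof(*reqs), by_offset);
    for (size_t i = 0; i < n; i++)
        reqs[i].result = pread(fd, reqs[i].buf, reqs[i].len,
                               reqs[i].offset);
}
```

Sorting once per pass keeps the CPU cost at one qsort() per batch
rather than one insertion per incoming request.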

For writing, sequential I/O should be used, either using write threads
or chunked mmap() combined with periodic (once per 1/2 second)
msync(MS_ASYNC). Which of write and mmap to use depends a bit on the
operating system selected, but chances are high that mmap+msync is the
most effective on most operating systems. The exception is perhaps where
mmap is implemented on top of the disk cache (doubles the amount of
memory used). The reason why periodic syncing should be used is to keep
the dirty page list short, and to hint the OS into making the writes
sequential.
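
A minimal sketch of the chunked mmap() + periodic msync() approach, in
C. The struct and function names are assumptions for illustration, and
the timer that would call chunk_sync() every half second is left out.

```c
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical chunk of the store file, mapped and filled front to back. */
struct mapped_chunk {
    char *base;        /* start of the mapping */
    size_t size;       /* mapping length in bytes */
    size_t filled;     /* bytes written so far */
    size_t synced;     /* bytes already handed to msync() */
};

/* Map size bytes of fd at a page-aligned offset, growing the file if
 * it is too short to back the whole mapping. */
static int chunk_map(struct mapped_chunk *c, int fd, off_t offset,
                     size_t size)
{
    struct stat st;
    off_t end = offset + (off_t)size;
    if (fstat(fd, &st) < 0)
        return -1;
    if (st.st_size < end && ftruncate(fd, end) < 0)
        return -1;
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
                   fd, offset);
    if (p == MAP_FAILED)
        return -1;
    c->base = p;
    c->size = size;
    c->filled = 0;
    c->synced = 0;
    return 0;
}

/* Append by plain memcpy() into the mapping; -1 if the chunk is full. */
static int chunk_append(struct mapped_chunk *c, const void *obj, size_t len)
{
    if (len > c->size - c->filled)
        return -1;
    memcpy(c->base + c->filled, obj, len);
    c->filled += len;
    return 0;
}

/* Periodic sync: schedule writeback of the range dirtied since the
 * last call. msync() wants a page-aligned start address, so the start
 * is rounded down to a page boundary. */
static int chunk_sync(struct mapped_chunk *c)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t start = c->synced - (c->synced % page);
    if (c->filled == c->synced)
        return 0;                       /* nothing new to flush */
    if (msync(c->base + start, c->filled - start, MS_ASYNC) < 0)
        return -1;
    c->synced = c->filled;
    return 0;
}
```

Because the chunk is filled front to back, each chunk_sync() call
covers one contiguous range, which is the sequential hint described
above.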

No separate writing thread should be required as long as the free page
list is properly maintained. Some tuning of minfree may be required to
ensure this if short on memory, but in most configurations this
shouldn't be needed.

The only problem with mmap() is that it does not mix well with
simultaneous reading from the same file (the same mmap() segment is ok,
but not the file).

> Large files have many-many inodes. Don't forget that 1 inode
> describes a max of some 12 direct disk blocks. For larger files
> you'd need lots of indirect inodes.

The indirect block pointers are full filesystem blocks, not inodes. A
2GB file on an 8K/block FS uses roughly:
* 1 inode
* 65 pointer blocks (ceil(2GB / 8K / (8K / 2)) + 1)
* 262144 data blocks (2GB / 8K)

Any decent OS should keep those few pointer blocks in cache along with
the inode while the file is in active use.
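
The pointer-block estimate above can be checked with a few lines of C.
This is only a sketch of the arithmetic: the function name is made up,
and reading the "+ 1" as the one extra top-level (double-indirect)
block is an assumption. The pointers-per-block figure is passed in as a
parameter, since it depends on the pointer size of the filesystem.

```c
#include <stdint.h>

/* Pointer blocks needed to map a file, following the formula in the
 * post: ceil(data_blocks / ptrs_per_block) + 1. The "+ 1" is taken
 * here to be the top-level (double-indirect) block; that reading is
 * an assumption. */
static uint64_t pointer_blocks(uint64_t file_size, uint64_t block_size,
                               uint64_t ptrs_per_block)
{
    /* Round up both divisions so partial blocks are counted. */
    uint64_t data_blocks = (file_size + block_size - 1) / block_size;
    uint64_t leaves = (data_blocks + ptrs_per_block - 1) / ptrs_per_block;
    return leaves + 1;
}
```

With the figures used above, pointer_blocks(2GB, 8K, 8K / 2) gives 65.
If the pointers were 4 bytes each (8K / 4 per block, an assumption not
made in the post), the same file would need 129 pointer blocks.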

> So, you can't avoid indirection. With double-indirect blocks
> you'd still have up to 3 disk ops until you really get to
> the object. OS caching perhaps makes this look much better,
> but for that you'd again need huge amount of cache ram.

Not sure if max 256KB / GB is a huge amount of RAM. OS caching should
get a very high hit rate on these pages, or you should be looking for
another OS.

> and that basically requires the use of raw partitions with all the
> allocation and free space management implemented in squid.

There is no big difference between raw partitions and large files here.
The big difference between the two is perhaps in cache management, as
most OSes support uncached raw partition I/O but all require hinting
for file I/O. With today's smart OS cache managers combined with some
hinting this shouldn't be a problem regardless of which is used.

> of cache ram. If you really want to avoid all overhead, your target
> is really 1 disk op per any type of object operation.

True. But the question remains how to do this in an efficient and
manageable way.

As you may have guessed (from this message, and my previous message on
this subject), my vote at the moment is for a multilevel cyclical file
system with partially threaded I/O (threaded at a disk/spindle level,
not in any way like the current async-io code).

/Henrik
Received on Tue Jul 29 2003 - 13:15:55 MDT
