Re: memory-mapped files in Squid

From: Andres Kroonmaa <andre@dont-contact.us>
Date: Tue Jul 29 13:15:56 2003

On 30 Jan 99, at 2:27, Henrik Nordstrom <hno@hem.passagen.se> wrote:

> Sounds like you have read my first message on cyclic filesystems, sent
> some weeks ago ;-)

 I guess I missed it then. In the recent discussion I was just objecting to
 throwing away objects from the fifo tail, and you weren't clear enough ;)
 Only after I got a hint on how to turn a fifo into an LRU did it occur to me...

> To avoid having frequently
> requested objects taking up a unproportionably large portion of the
> cache my idea is to rewrite hits if they are located more than a certain
> (configurable) distance from the head.

 Hmm, what do you mean by "taking up a large portion of the cache"? I imagine
 that when an accessed object is rewritten onto the fifo head, its space at
 the old location is released? Aah, you mean that when some object is hit
 10/sec, it shouldn't be rewritten to the fifo head at the same rate?

> * To be able to guarantee that writes can be done sequentially a FIFO
> type storage policy should be used, possibly with extensions to avoid
> throwing away objects we want to keep (probably) and to defragment
> unused space (less likely, early estimates puts this in the range of
> 2-5% and it is non-trivial to implement).

 btw, how did you estimate that?

> * Also, with a FIFO style replacement policy, it is very hard to justify
> the overhead incurred by a standard filesystem. Probably better to use
> raw partitions, both from a I/O performance level and how much memory
> that is required by the OS for caches/buffers.

 Regarding fifo on top of UFS, absolutely. But there may be some benefit in
 using OS-buffered raw partitions. As I understand it, unbuffered partitions
 can't be poll()ed (they are always ready and block on read/write), while
 buffered ones can (the OS handles disk access scheduling and write
 buffering). This could allow us to continue using select()-style coding
 without threads if needed.
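
 A minimal sketch of what that could look like, assuming the OS lets us open
 the buffered block device directly; the device path and helper names here
 are hypothetical, and whether select() ever reports the fd as not ready is
 OS dependent:

    #include <fcntl.h>
    #include <sys/select.h>

    /* open the OS-buffered block device, not the raw character device */
    int
    storeOpenPartition(void)
    {
        return open("/dev/dsk/c0t1d0s4", O_RDWR);   /* hypothetical path */
    }

    /* fold the store fd into the existing select() loop next to the
     * network sockets */
    void
    storeRegisterFd(fd_set * readfds, int store_fd, int *maxfd)
    {
        FD_SET(store_fd, readfds);
        if (store_fd > *maxfd)
            *maxfd = store_fd;
    }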

> * Disk store index is physical location. How closely objects can be
> packed inside each chunk is determined by the index key size, and key
> size is in turn determined by the amount of memory we can spend on it. A

> * Checkpoints are kept as two storage chunk pointers: last fully
> completed storage chunk, and the oldest non-reused chunk. These
> checkpoints may actually be implemented as timestamps in the chunks
> themselves.

> * In-memory index are rebuilt by reading the meta data indexes from the
> storage blocks, beginning with the oldest.

> * Chaining of object fragments is needed to store larger objects where
> we could not afford to wait for the whole object until storing it on
> disk.

 A few thoughts.

 * More than 90% of objects are <32K in size.

  Can we eliminate the need to handle chained multi-fragment large objects?

  Say we let UFS handle objects >32KB and only put smaller ones into the
  fifo store. We'd need to add some code for handling that, but it gives us
  a lot more freedom. For example, we can keep pending objects in RAM until
  they are fully fetched (keeping up to 32KB per object is not a big deal),
  and then write each object out virtually atomically. There are no worries
  about partially written objects, no worries about multiple disk ops per
  object, no worries about multiple object pieces on disk, chunk alignment,
  etc.

  As >32KB objects are relatively rare, the overhead of UFS is not so big.
  Besides, big files are what UFS is much better at.

  It also makes it much more attractive to append recently used objects onto
  the write queue; that would be much more difficult and costly to manage
  with wildly differing object sizes.

  Perhaps add cache_dir types (fifo/ufs) and a max object size for each.
  Have different routines for each type, and decide in the configuration
  whether large objects are serviced by UFS or fifo...
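
  Just to make the idea concrete, a minimal sketch of the dispatch this
  implies; the 32KB limit, the stand-in types and the storeFifoSwapOut()/
  storeUfsSwapOut() names are all hypothetical:

    #define FIFO_MAX_OBJSZ (32 * 1024)

    /* stand-ins for the real StoreEntry/MemObject from squid.h */
    typedef struct {
        long object_sz;         /* -1 while the reply length is unknown */
    } MemObject;

    typedef struct {
        MemObject *mem_obj;
    } StoreEntry;

    void storeFifoSwapOut(StoreEntry * e);  /* buffer in RAM, one atomic write */
    void storeUfsSwapOut(StoreEntry * e);   /* current filename-based path */

    static void
    storeSwapOutDispatch(StoreEntry * e)
    {
        long sz = e->mem_obj->object_sz;
        if (sz >= 0 && sz <= FIFO_MAX_OBJSZ)
            storeFifoSwapOut(e);
        else
            storeUfsSwapOut(e); /* too large, or size not known up front */
    }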

 * We need to rewrite the squid store interface, I guess. Fifo style needs
  no open/close calls and no filenames. Squid just swaps the object into
  memory (something like get_object(&StoreEntrywithMemObject)).
  If the object fits into the buffer, it is done in a single disk io. If
  not, the request is perhaps handled by the UFS interface instead, which
  handles filenames and open/read/close stuff itself.
  So, at some abstraction level, squid should expect objects to be fetched
  from disk in one call. When we need a filename to locate an object in
  UFS, we calculate it when needed and handle it as we do currently.
  Basically, we'd want to move squid's disk routines from FD-centric
  activity to something more abstract and object-centric (a sketch of such
  an interface follows below).
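
  Roughly something like this, with all names being hypothetical
  placeholders for whatever the interface ends up being called:

    #include <stddef.h>

    typedef struct _StoreEntry StoreEntry;  /* squid's existing type */

    /* completion callback: the whole object (or an error) in one shot */
    typedef void STOREGETCB(void *cbdata, const char *buf, size_t len, int error);

    int storedInFifo(StoreEntry * e);
    void fifoStoreFetch(StoreEntry * e, STOREGETCB * cb, void *cbdata);
    void ufsStoreFetch(StoreEntry * e, STOREGETCB * cb, void *cbdata);

    /* the only call the rest of squid sees: no fd, no filename, no
     * open/read/close at this level */
    void
    get_object(StoreEntry * e, STOREGETCB * cb, void *cbdata)
    {
        if (storedInFifo(e))
            fifoStoreFetch(e, cb, cbdata);  /* one disk read, then callback */
        else
            ufsStoreFetch(e, cb, cbdata);   /* filename handling hidden here */
    }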

 * We don't want to add an indirection layer between swap_file_number and
  the physical location on disk. That only slows us down. Because we want
  to change the physical location on disk occasionally, we'd like to give
  the IO routines the freedom to change swap_file_number in squid's
  metadata structures directly. This allows the fifo writer to sort its
  queue after an object is handed to it, and later, whenever needed.

  The main point is that when squid creates a new disk object, it does not
  know exactly what its swap_file_number will be. Ideally the object's
  fileno is picked by the IO routines, based on where the fifo pointer is
  at that exact time and which spindle is most idle.
  Perhaps the disk IO routines should even decide which cache_dir to pick.
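
  A rough sketch of what that could look like (hypothetical names; the
  point is only that the writer patches the fileno into squid's metadata
  when it commits the object, not earlier):

    typedef int sfileno;

    typedef struct _StoreEntry {
        sfileno swap_file_number;   /* -1 until the writer commits it */
        /* ... rest of squid's StoreEntry ... */
    } StoreEntry;

    typedef struct _QueuedObject {
        StoreEntry *e;
        struct _QueuedObject *next;
    } QueuedObject;

    typedef struct {
        QueuedObject *head;         /* sortable until commit time */
        sfileno write_pointer;      /* current position of the cyclic head */
    } FifoWriter;

    static void
    fifoWriterCommit(FifoWriter * w, QueuedObject * q)
    {
        /* decided only now, wherever the fifo head happens to be */
        q->e->swap_file_number = w->write_pointer++;
        /* ... write data + per-chunk metadata to disk ... */
    }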

 In a sense, squid's metadata becomes an integral part of the storage. The
 only key is the URL/MD5 hash, which resolves directly into the (current)
 location on disk in the case of fifo/lru storage, or into a file number
 in the case of UFS.

 As I now understand it, this is how you mean it to be, i.e. squid's
 metadata is located directly on the disks near the data, and the URL
 database is rebuilt during startup directly from this disk data. Above I
 assumed that there was some global URL database in squid that implemented
 LRU and mapped URLs to cache_dirs/locations.
 If we drop global metadata, then we have a per-cache_dir LRU that is
 implemented inside the FS. But perhaps there might be some benefit in
 being able to write a just-fetched disk object to another disk's write
 queue?
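
 For what it's worth, the in-memory index this implies could be as small as
 something like the following (field names hypothetical):

    typedef struct {
        unsigned char key[16];      /* MD5 of the URL, the only lookup key */
        unsigned char cache_dir;    /* which store the object lives in */
        unsigned int fileno;        /* fifo location, or UFS file number */
    } StoreIndexEntry;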

> * On-disk metadata is kept separately for each chunk together with the
> data, The exception is when objects has to be released without being
> replaced by another one. When this happens a release entry is kept in a
> later chunk (the chunk where Squid found that it needs to remove the
> object from cache by some reason).

 Like a sort of transaction log? If we had a global URL database, we would
 not need this. But this solution might be even better: hotswap the drive,
 restart squid, and it runs with the other subset of the URL database...
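
 If I read it right, the per-chunk metadata would then carry entries
 something like these (again just a sketch; the tags and fields are my
 guess, not your format):

    enum chunk_md_op {
        MD_ADD,                     /* object stored in this chunk */
        MD_RELEASE                  /* object removed without replacement;
                                     * logged in a later chunk, like a
                                     * transaction log entry */
    };

    typedef struct {
        enum chunk_md_op op;
        unsigned char key[16];      /* MD5 of the URL */
        unsigned int fileno;        /* location the entry refers to */
    } ChunkMetaEntry;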

> Another exception is when a (large) object crosses chunk boundaries,
> here the metadata log entry points to a earlier chunk.

 Uhh, can we avoid large multipart files in the fifo storage?

 ----------------------------------------------------------------------
  Andres Kroonmaa                        mail: andre@online.ee
  Network Manager
  Organization: MicroLink Online         Tel: 6308 909
  Tallinn, Sakala 19                     Pho: +372 6308 909
  Estonia, EE0001  http://www.online.ee  Fax: +372 6308 901
 ----------------------------------------------------------------------