Re: memory-mapped files in Squid

From: Henrik Nordstrom <hno@dont-contact.us>
Date: Tue, 02 Feb 1999 00:00:35 +0100

Andres Kroonmaa wrote:

> I guess i missed it then. In recent discussion I was just
> objecting throwing away objects from fifo tail, and you weren't
> clear enough ;) Only after I got a hint on how to make fifo
> into LRU, it occurred to me...

Well, I assumed that people had seen the first message, where it was
mentioned that reused objects can be written back to the head. Most of
what I have said so far has been a random collection of thoughts on a
changing subject.

> Hmm. what you mean by "taking up large portion of cache"? I imagine
> that when you rewrite accessed object onto fifo head, its space at
> its old location is released?

Released but not reused. It is the storage policy that is FIFO, not the
object policy. Writing is always done at the head, so the space won't be
refilled until the head gets there again.
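As a minimal sketch (hypothetical code, not Squid's): all writes advance a
single head offset around a circular log, so a reused object rewritten at
the head leaves its old slot empty until the head wraps back to it.

```python
# Hypothetical sketch of the FIFO storage policy: all writes go to the
# head of a circular log, so space released at an object's old location
# is not reused until the head wraps around to it again.

STORE_SIZE = 16  # assumed log size, in bytes

class FifoStore:
    def __init__(self):
        self.head = 0  # next write offset; wraps at STORE_SIZE

    def put(self, length):
        """Allocate `length` bytes at the head, return the offset used."""
        off = self.head
        self.head = (self.head + length) % STORE_SIZE
        return off

store = FifoStore()
a = store.put(4)    # object A lands at offset 0
b = store.put(4)    # object B lands at offset 4
# A is hit again: the LRU-like add-on rewrites it at the head.  Its old
# slot at offset 0 is released but stays unused until the head wraps.
a2 = store.put(4)   # fresh copy of A at offset 8
```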

> Aah, you mean that when some object hits 10/sec then it shouldn't
> be rewritten to the fifo head at the same rate?

Yes, among other things.

> btw, how did you estimate?

A rough calculation of refresh rates (both TCP_REFRESH and
TCP_CLIENT_REFRESH). But as I said, it is an early estimate and I may
have missed something important. As Alex said, some simulations should be
done to get a clearer picture of how things behave.

> Regarding fifo ontop of ufs, absolutely. But it may have some
> benefits to use OS-buffered raw partitions. As I understand,
> unbuffered partitions can't be poll()ed (always ready and block
> on read/write) while buffered can (OS handles disk access
> scheduling and write buffering).

I doubt that poll() can be used on any disk media, raw or filesystem,
without blocking. But you are welcome to prove me wrong.
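For what it's worth, POSIX specifies that regular files always poll as
ready for reading and writing, which is why readiness notification cannot
help avoid blocking on disk. A quick Python check of that behaviour:

```python
# poll() reports a regular-file descriptor as ready immediately, so a
# "ready" result says nothing about whether the read will actually
# block on the physical disk.
import os
import select
import tempfile

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"some data")
    p = select.poll()
    p.register(fd, select.POLLIN | select.POLLOUT)
    events = p.poll(0)   # zero timeout: returns at once, fd is "ready"
finally:
    os.close(fd)
    os.unlink(path)
```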

> Can we eliminate the need to handle chained multi-fragment
> large objects?

Not in a pure FIFO store, but as you say it may be done in a hybrid
design.

> Also, makes it much more attractive to append recently used
> objects onto the write queue. It would be much more difficult
> to manage with wildly differing objects sizes, and increases
> the cost of doing so.

Frankly, I do not see the difference. In my thoughts I only write
recently used objects to the head when they are requested, where the
write-back sort of piggybacks on the swap-in. Regardless of how the
object is laid out, you have to read it in for the client, and write it
out at the head if deemed necessary.

> Perhaps add cache_dir types (fifo/ufs) and max object sizes to
> put on each. Have different routines for each, and in
> configuration decide how large objects are serviced by UFS or
> fifo...

Configuration possibilities are numerous, especially when combining
different kind of storage.

> We need to rewrite squid store interface I guess.

Yes. Some kind of abstract store interface is needed. It is kind of hard
to tweak the fd-centric design into a non-fd-centric system. This
applies to all communication and not only disk I/O.

> We don't want to add indirection layer between swap_file_number and
> physical location on disk.

No, and it has never been the intention to use such a redirection
layer. That would essentially reimplement a kind of directory
structure, which was one of the things I wanted to get rid of.

> The most sense in this is that when squid creates new disk object,
> it does not know exactly what its swap_file_number will be. Object's
> fileno will be picked by IO routines ideally, based on where the fifo
> pointer is at that exact time and which spindle is most idle.

Yes. The disk "swap file number" has to be picked when the store layout
manager decides to actually store the object. All communication to/from
the store has to be at the object level, not the swap file number level.

Actually the store has to maintain the metadata, as the store decides on
replacement policy, so this is not really a problem.

The abstraction level should be above this, with operations like "create
object", "open object", "write object data", "read object data", "stat
object info" (timestamps, ...).
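A sketch of the shape such an object-level abstraction might take (the
class and method names are illustrative, not an actual Squid API):

```python
# Illustrative sketch of the object-level store abstraction: callers
# speak in terms of objects, and each store implementation decides on
# placement, layout and replacement internally.
import time
from abc import ABC, abstractmethod

class ObjectStore(ABC):
    @abstractmethod
    def create(self, key):          # "create object": store picks the location
        ...
    @abstractmethod
    def open(self, key):            # "open object"
        ...
    @abstractmethod
    def write(self, handle, data):  # "write object data"
        ...
    @abstractmethod
    def read(self, handle):         # "read object data"
        ...
    @abstractmethod
    def stat(self, key):            # "stat object info" (timestamps, ...)
        ...

class MemStore(ObjectStore):
    """Trivial in-memory implementation, just to show the shape."""
    def __init__(self):
        self.objects = {}
    def create(self, key):
        self.objects[key] = {"data": b"", "mtime": time.time()}
        return key
    def open(self, key):
        return key
    def write(self, handle, data):
        self.objects[handle]["data"] += data
        self.objects[handle]["mtime"] = time.time()
    def read(self, handle):
        return self.objects[handle]["data"]
    def stat(self, key):
        return {"mtime": self.objects[key]["mtime"]}
```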

> Perhaps disk IO routines should even decide which cache_dir to pick.

Or put in another way: Each cache_dir can decide on when it stores
objects, and where these objects are stored.

> In a sense, squid metadata becomes integral part of storage. The only
> key is URL/MD5 hash which resolves directly into (current) location
> on disk. (in case of fifo/lru storage, or into filenumber in case
> of UFS)

Well, this is almost how it already is. If we are peering, then some
additional metadata is needed to quickly determine whether the object is
fresh or not, both for ICP and HTTP peering, so we can't get the
in-memory index down to simply a store index. Some additional metadata
will be needed in memory.

If we want to design an efficient store for a non-peering cache, then a
different set of rules is in effect. One interesting question is whether
it may be possible to drop the in-memory index altogether on a
non-peering cache and use the URL hash as the store index, combined with
a digest-like design to hint whether the object is there or not. But this
is very much separate from the FIFO discussion. Neither LRU nor FIFO
would be used in such a system (more like a random replacement policy).
The big plus is that there is almost no limit on cache size.
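A hypothetical sketch of that index-less design: the URL's hash picks the
on-disk slot directly (collisions simply overwrite, giving the random
replacement behaviour), and a small bit vector plays the role of the
digest-like presence hint. All names here are made up for illustration.

```python
# Hypothetical index-less cache: the URL hash IS the store index, and a
# bit vector hints at presence so most misses avoid a disk access.
import hashlib

N_SLOTS = 1024  # assumed number of fixed-size on-disk slots

def slot_for(url):
    h = int(hashlib.md5(url.encode()).hexdigest(), 16)
    return h % N_SLOTS

class IndexlessCache:
    def __init__(self):
        self.slots = [None] * N_SLOTS   # stands in for on-disk slots
        self.hint = [False] * N_SLOTS   # digest-like presence hint

    def store(self, url, body):
        s = slot_for(url)
        self.slots[s] = (url, body)     # collision = random replacement
        self.hint[s] = True

    def lookup(self, url):
        s = slot_for(url)
        if not self.hint[s]:
            return None                 # hint says definitely absent
        entry = self.slots[s]
        if entry and entry[0] == url:   # verify: hint can be stale/colliding
            return entry[1]
        return None
```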

> As I now understand, this is how you mean it to be, ie. squid
> metadata is located directly on the disks near data, and URL
> database is rebuilt during startup directly from this disk data.

Yes. The on-disk database is what gets stored together with (or rather,
close to) the objects.

> Above I assumed that there was some global URL database in squid
> that implemented LRU and mapped URLs to cache_dirs/locations.

In my design NO LRU list is used. A FIFO is used with some add-ons to
mimic some properties of an LRU, but there is no real LRU in action.

> But perhaps there might be some benefit from being able to write
> just fetched disk object to another disk's write queue?

Writing and reading should be isolated from each other. Of course
objects should be able to be written back to another disk than the one
they were read from. Objects should always be written to the most
suitable disk, regardless of whether the object came from the network,
disk or whatever. An object is an object.

> Like sort of transaction log? If we had global URL database, we'd not
> need this. But this solution might be even better. hotswap the drive,
> restart squid, and it runs with other subset of URL database...

Yes, I see the metadata logs as a sort of transaction log.

In theory Squid can be programmed to hot-swap the drive without a
restart. What is needed are the functions "disable and release
everything from cache_dir X" and "activate cache_dir X".

> uhh. can we avoid large multipart files on fifo storage?

As you said a hybrid design could be used, where some storage is FIFO
and some filesystem based.

A third possibility is a spool area for larger objects, from which
completed objects are written to the FIFO. This area can also be managed
as a FIFO to automatically clean out any fragmentation.
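Sketched very roughly (hypothetical names, not Squid code): large objects
accumulate chunk by chunk in the spool, and only a completed object is
appended to the FIFO in one contiguous write, so the FIFO itself never
holds chained partial fragments.

```python
# Hypothetical spool in front of the FIFO: objects grow in the spool
# while being fetched, and only complete objects reach the FIFO, each
# as a single contiguous write at the head.
class SpooledFifo:
    def __init__(self):
        self.spool = {}   # url -> list of chunks still arriving
        self.fifo = []    # completed objects, written contiguously

    def append_chunk(self, url, chunk):
        self.spool.setdefault(url, []).append(chunk)

    def complete(self, url):
        body = b"".join(self.spool.pop(url))
        self.fifo.append((url, body))   # one contiguous write at the head
        return body
```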

There are a couple of other possibilities as well.

You may remember my wording "multilevel" in earlier messages. This came
from the idea that the store could be maintained at multiple levels,
where the first level blindly writes objects and the second (and third
...) level eats objects from the tail onto the next level's FIFO. This
is a good idea if the first level wastes a lot of space on objects that
are then thrown away (refreshed or aborted during storage), but it may
be hard to load-balance such a system...
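The multilevel idea above might look like this (a sketch under assumed
names; the abort-tracking set is an invented stand-in for whatever marks
an object refreshed or aborted while still in level 1):

```python
# Hypothetical multilevel FIFO: level 1 blindly accepts every object;
# objects falling off its tail are promoted to level 2 only if still
# valid, so space wasted on refreshed/aborted objects never reaches
# the deeper level.
from collections import deque

class MultiLevelFifo:
    def __init__(self, l1_size, l2_size):
        self.l1, self.l1_size = deque(), l1_size
        self.l2, self.l2_size = deque(), l2_size
        self.aborted = set()   # objects invalidated while still in level 1

    def add(self, obj):
        self.l1.appendleft(obj)           # level 1 blindly writes the object
        while len(self.l1) > self.l1_size:
            victim = self.l1.pop()        # eaten from the level-1 tail...
            if victim in self.aborted:
                self.aborted.discard(victim)
                continue                  # ...dropped if no longer valid
            self.l2.appendleft(victim)    # ...else written to the level-2 head
            if len(self.l2) > self.l2_size:
                self.l2.pop()             # finally evicted from the cache

m = MultiLevelFifo(l1_size=2, l2_size=4)
for obj in ("a", "b", "c", "d"):
    m.add(obj)
m.aborted.add("c")   # "c" gets refreshed/aborted while still in level 1
m.add("e")           # pushes "c" off the level-1 tail; it is dropped
```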

/Henrik
