Re: Storing partial responses

From: Joe Cooper <joe@dont-contact.us>
Date: Tue, 12 Dec 2000 23:59:03 -0600

With more thought on the subject, and on the discussion so far, I can
see that we are going down a road that could very easily hurt Squid's
FS performance, which by all accounts could use some fixing already.
A possible solution came to me; it may be farther off the mark than I
imagine, so I hope folks will set me straight if it is. Discussion
below.

Alex Rousskov wrote:

> On Wed, 13 Dec 2000, Robert Collins wrote:
>
>> What about a mixed approach?
>
>> - meta level support for the global cases, the logic on caching,
>> deciding what's needed upstream, and what downstream format to
>> send the response in (specifically a Content-Range or
>> multipart/byteranges).
>
>> - register callbacks into the FS to perform FS routines such as
>> storing the actual partial ranges. Then the FS can daisy chain
>> them, use sparse files, whatever takes the FS implementor's fancy.
>
> Looks like I need to improve my classification:
>
> (0) FSs assume that an object key corresponds to a single,
> complete response (object). A FS is a database of
> blobs, known to be complete objects, with a
> store(key, object)
> object = load(key)
> interface. This is the current approach. It makes it
> impossible to support partial objects.
>
> (1) FSs know absolutely nothing about partial objects
> (or complete objects for that matter). A FS is a
> database of blobs with unique keys, with a
> store(key, blob)
> blob = load(key)
> interface. FS optimizations are limited because the
> semantics of a key are not known (it's just a unique opaque
> string). Meta-level code assembles blobs into complete
> or partial objects.

Ok, it strikes me that (1) here could form a good part of the
implementation of the cyclic object store that Henrik, Adrian, Alex,
and several others discussed a while back, documented on Henrik's
webpage at Sourceforge. I recently discussed this briefly with Henrik
and Adrian offlist, purely on performance issues (since it had been
hashed to death here with no one stepping in to actually work on it),
and the part of the design touched on as the Probable Right Way(tm)
(at least IMHO) was a cyclic object store with a fixed object size.
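
To make (1) concrete, the whole FS contract could boil down to
something like the following C sketch. (The names here are mine,
purely hypothetical, and not actual Squid code.)

    #include <stddef.h>

    /* A FS that is nothing more than a database of opaque blobs.
     * The semantics of a key live entirely in the meta level. */
    typedef struct {
        const char *data;
        size_t      len;
    } blob_t;

    /* Store a blob under an opaque key; returns 0 on success. */
    int fs_store(const char *key, const blob_t *blob);

    /* Load the blob stored under 'key'; returns 0 on success, or
     * nonzero if the key is unknown (e.g. the blob has since been
     * swept over by the cyclic writer). */
    int fs_load(const char *key, blob_t *out);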

Now, take Alex's suggestion (1) and limit it further (in complexity
at the FS level) to fixed blocks of 8, 16, or 32kb (or some other
fixed size -- whatever proves fastest in benchmarks). Every object,
large or small, gets broken up, padded, or possibly combined with
others to fit that block size, and is then striped onto disk in the
same manner as the Typhoon news server does (and purportedly Cisco's
Cache Engine, though I can't confirm that one).
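
The chunking step itself is cheap; a minimal sketch, assuming a 16kb
block (both the names and the size are hypothetical):

    #include <string.h>

    #define FS_BLOCK_SIZE (16 * 1024)   /* fixed on-disk block size */

    /* How many fixed-size blocks 'len' bytes of object occupy;
     * the final block is zero-padded out to FS_BLOCK_SIZE. */
    static size_t fs_blocks_needed(size_t len)
    {
        return (len + FS_BLOCK_SIZE - 1) / FS_BLOCK_SIZE;
    }

    /* Copy block 'n' of the object into 'block', padding the tail.
     * 'n' must be less than fs_blocks_needed(len). */
    static void fs_fill_block(char block[FS_BLOCK_SIZE],
                              const char *obj, size_t len, size_t n)
    {
        size_t off  = n * FS_BLOCK_SIZE;
        size_t take = len - off < FS_BLOCK_SIZE ? len - off
                                                : FS_BLOCK_SIZE;
        memcpy(block, obj + off, take);
        memset(block + take, 0, FS_BLOCK_SIZE - take);
    }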

This forces the 'meta-level' code to assemble our objects and to know
which objects live in which blocks. Recall the idea of rewriting
partial objects when they become complete and expiring the old
ones... or even just rewriting them and 'forgetting' the old copies
as we sweep across the disk. (If you haven't reviewed the cyclical
object store discussion lately: the idea is that writes always
proceed from the head of the disk to the tail, and then start over at
the head again. Reads, of course, must still be random, but writes
can be done in relatively large and very cheap operations, while
removal costs essentially zero disk operations -- just drop the entry
from the in-core hash table or object db, and it will eventually be
written over in the normal course of disk events.)
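
The write path of such a store is almost trivially simple. A sketch
of the core idea (hypothetical code, reusing FS_BLOCK_SIZE from
above, and ignoring the in-core index for brevity):

    #include <sys/types.h>
    #include <unistd.h>

    /* Blocks are always appended at 'head', which wraps back to
     * zero at the end of the spool.  Whatever sat there before is
     * simply overwritten, so removal never touches the disk: the
     * victim just drops out of the in-core index. */
    typedef struct {
        int    fd;        /* preallocated spool file */
        size_t nblocks;   /* spool size, in blocks */
        size_t head;      /* next block to be written */
    } cyclic_store_t;

    static ssize_t cyclic_write_block(cyclic_store_t *cs,
                                      const char block[FS_BLOCK_SIZE])
    {
        off_t   where = (off_t)cs->head * FS_BLOCK_SIZE;
        ssize_t n = pwrite(cs->fd, block, FS_BLOCK_SIZE, where);
        if (n == FS_BLOCK_SIZE)
            cs->head = (cs->head + 1) % cs->nblocks;  /* wrap */
        return n;
    }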

 
> In either case, a meta level that knows how to handle partial objects
> will be required. For example, HTTP "ranges" logic will reside on that
> level.

This seems to me to indicate that we're about to add some complexity
(maybe a lot) to the meta and FS levels of the code in order to deal
with objects that can vary in size or be divided across the disk.
One way or another we've got to increase the knowledge and logic of
the file system and the upper levels by quite a large amount. Maybe
we should take the extra two steps and solve a few more issues along
the way, particularly the issue this change will itself present:
speed.

Now... I should also point out that I'm not proposing here (though it
could certainly seem so) an actual 'filesystem' implemented at the OS
kernel level, as may have been on the table in the discussion
referenced earlier -- though I don't think it was ever established
that it /had/ to be done at that layer. I'm actually leaning more
towards an OS-independent implementation, as done by the
aforementioned Typhoon. In that case the news spool is built from
standard UFS files within the normal UFS directory tree: one file
(2GB in size on 2.2 Linux, or whatever limit other host OSes dictate)
or several files, 'striped' into blocks with the space locked down as
in use. The server then simply stripes news from one end to the
other, most likely in fixed block sizes, and in strict adherence to
an end-to-end write schedule.
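
Locking down that space requires nothing exotic from the OS. A sketch
of preallocating such a spool file (again hypothetical, reusing
FS_BLOCK_SIZE, with error handling kept minimal):

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    /* Reserve 'nblocks' fixed-size blocks in a plain file, writing
     * real zeros so the space is allocated up front, not sparse. */
    static int spool_create(const char *path, size_t nblocks)
    {
        char   zero[FS_BLOCK_SIZE];
        size_t i;
        int    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

        if (fd < 0)
            return -1;
        memset(zero, 0, sizeof(zero));
        for (i = 0; i < nblocks; i++) {
            if (write(fd, zero, sizeof(zero)) != sizeof(zero)) {
                close(fd);
                return -1;
            }
        }
        return fd;   /* ready for the cyclic writer */
    }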

I think this solves part of the partial-content problem for us. We
will already have the code to store pieces of objects in separate
places (admittedly in sequence, but since we will be rewriting
popular objects anyway, these growing objects can be rewritten along
with them), while metadata describing which ranges we hold goes into
the in-core hash or db.
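
That in-core metadata could be as simple as a per-object list of the
byte ranges we hold and where each begins in the spool. A
hypothetical layout:

    #include <stddef.h>

    /* One contiguous byte range of an object that we actually hold.
     * A range written in one pass occupies consecutive spool
     * blocks, so its starting block is enough to find the rest. */
    typedef struct range_extent {
        size_t offset;              /* first byte of the range */
        size_t length;              /* bytes held */
        size_t first_block;         /* spool block where it starts */
        struct range_extent *next;  /* later ranges, in offset order */
    } range_extent_t;

    /* The meta level walks 'ranges' to answer a Range request, or
     * to decide what still has to be fetched from upstream. */
    typedef struct {
        const char     *key;     /* opaque object key */
        size_t          total;   /* full object size, if known */
        range_extent_t *ranges;  /* what we actually have on disk */
    } partial_entry_t;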

I hope I'm not rehashing territory that has been laid to rest here.
I just want to make sure we don't end up doing things twice or
thrice, if we can merge a couple of goals and do them with less work
overall.
                                   --
                      Joe Cooper <joe@swelltech.com>
                  Affordable Web Caching Proxy Appliances
                         http://www.swelltech.com