Re: storemanager

From: Robert Collins <robert.collins@dont-contact.us>
Date: Sun, 14 Jan 2001 10:38:06 +1100

----- Original Message -----
From: "Henrik Nordstrom" <hno@hem.passagen.se>
To: "Adrian Chadd" <adrian@creative.net.au>
Cc: "Robert Collins" <robert.collins@itdomain.com.au>; <squid-dev@squid-cache.org>
Sent: Sunday, January 14, 2001 6:30 AM
Subject: Re: storemanager

> Adrian Chadd wrote:
> >
> > On Sat, Jan 13, 2001, Robert Collins wrote:
> > > Adrian,
> > > does the modio code you're doing include putting the http headers into metadata, rather than just part of the data stream sent
> > > to storeAppend? We must decode them something like three times during a typical MISS. (Once in http.c, once in client_side.c and
> > > (i'm guessing once in store.c)
> >
> > The storage manager doesn't touch HTTP headers.
>
> I have thought this over a couple of times, and I mostly second this.
>
> Let's take the chunked encoding example. Decoding chunked to non-chunked
> on the fly is not that trivial, especially considering non-order ranges
> or trailers. Such decoding can be performed from a spooled reply since
> then we can "know the future", but I don't see it very practical on a
> live stream.

Uhmm, actually it is trivial. Trailers are part of the chunked spec: you CANNOT receive trailers without using chunking.
Multipart/byteranges makes up the entity body for complex range requests.

i.e.
==HTTP protocol data===
   HTTP/1.1 206 Partial Content
   Date: Wed, 15 Nov 1995 06:25:24 GMT
   Last-Modified: Wed, 15 Nov 1995 04:58:08 GMT
   Content-type: multipart/byteranges; boundary=THIS_STRING_SEPARATES

   --THIS_STRING_SEPARATES
   Content-type: application/pdf
   Content-range: bytes 500-999/8000

   ...the first range...
   --THIS_STRING_SEPARATES
   Content-type: application/pdf
   Content-range: bytes 7000-7999/8000

   ...the second range
   --THIS_STRING_SEPARATES--
====

The entire body here gets chunked on transmission, and then at our Squid it is dechunked and passed as a stream, as it is received,
to the next step in the handling process - the storemanager.
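
To put some weight behind "trivial", here is a rough sketch of a streaming dechunker. None of these names are real Squid
identifiers - emit() just stands in for whatever the next stage is (storeAppend or otherwise), and the structure is mine, not the
modio code:

==sketch: on-the-fly dechunking (hypothetical C)===
#include <stddef.h>

typedef enum { CH_SIZE, CH_DATA, CH_CRLF, CH_TRAILER, CH_DONE } chunk_state;

typedef struct {
    chunk_state state;          /* zero-initialise: starts in CH_SIZE */
    size_t remaining;           /* bytes left in the current chunk */
    int in_ext;                 /* inside a chunk-extension: ignore to LF */
    size_t trailer_line;        /* trailer line length, to spot the blank line */
} chunk_decoder;

static int
hexval(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    c |= 0x20;
    return (c >= 'a' && c <= 'f') ? c - 'a' + 10 : -1;
}

/* Consume one network buffer; decoded identity bytes go straight to emit(). */
static void
dechunk(chunk_decoder *d, const char *buf, size_t len,
        void (*emit)(const char *data, size_t size))
{
    size_t i = 0;

    while (i < len && d->state != CH_DONE) {
        switch (d->state) {
        case CH_SIZE: {                 /* hex chunk-size line */
            char c = buf[i++];
            if (c == '\n') {
                d->state = d->remaining ? CH_DATA : CH_TRAILER;
                d->in_ext = 0;
            } else if (!d->in_ext) {
                int v = hexval(c);
                if (v >= 0)
                    d->remaining = d->remaining * 16 + (size_t)v;
                else
                    d->in_ext = 1;      /* CR or ";ext": skip rest of line */
            }
            break;
        }
        case CH_DATA: {                 /* stream the payload straight on */
            size_t n = len - i < d->remaining ? len - i : d->remaining;
            emit(buf + i, n);
            d->remaining -= n;
            i += n;
            if (d->remaining == 0)
                d->state = CH_CRLF;
            break;
        }
        case CH_CRLF:                   /* CRLF terminating the chunk data */
            if (buf[i++] == '\n')
                d->state = CH_SIZE;
            break;
        case CH_TRAILER:                /* trailer headers after the 0 chunk */
            if (buf[i] == '\n') {
                if (d->trailer_line == 0)
                    d->state = CH_DONE; /* blank line: message complete */
                d->trailer_line = 0;
            } else if (buf[i] != '\r') {
                d->trailer_line++;      /* a real decoder parses these */
            }
            i++;
            break;
        case CH_DONE:
            break;
        }
    }
}
====

The decoder state survives across calls, so it doesn't matter how the network splits the buffers: each call carries on where the
previous one stopped, and the trailers arrive exactly where the spec says they do.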

>
> Also, I expect that in most cases where we once have been able to send
> HTTP/1.1 to the client, most cache hits for that object will also be
> HTTP/1.1 capable.

Because most clients on the web today are HTTP/1.1 capable? Let's hope there are no downstream caches present :]

> The major exception to the above is if the cache hit is a range request.
>
> However, chunked encoding is used mostly for generated objects where
> filesize or data is not immediately known. I expect most such objects to
> not be cacheable for quite a while.

Chunked encoding is also used for signed messages with Digest authentication, or in any situation where the network conditions are
possibly unsafe. The real story is in compression and other encoding transfer-encodings. For example, it is possible and legitimate
to encrypt cache-to-cache object bodies. Of course it'd be a big performance hit, but it's an example of what can be done within the
spec. The one I am excited about is cache-to-cache and cache-to-client compression, and then (round 2) delta-based cache-to-cache
transmission. The big advantage of chunked is that it's light-weight and keeps the client up to date on entity termination.

> So with all this in mind, my opinion is that TE-decoding should be done
> only when needed by the client. This might be a HTTP/1.0 client cache
> hit, or a range request. To cover for the small "race" where an object is
> stored using HTTP/1.1 and then hit many times by a HTTP/1.0 client the
> object can be respooled in normalized format when decoded. Yes,
> respooling is a small performance hit, but I expect the rate of those to
> be quite low, making it a total gain by not having to perform the
> decoding on every encoded object.

I think your analysis of the performance issues is spot on, but the solution isn't generic enough:
* response modification will likely require decoding to identity format before storage occurs
* Advanced transfer codings may be completely different between the upstream connection and each cache-to-client connection
* We limit ourselves severely by doing that.
* We should _allow_ for decode->store->client, and also allow a hint as to what the lowest common coding to be sent is, and
optionally only decode to that level for the client-side callbacks (roughly sketched below). The store should store identity encoding.
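
Roughly what I mean by the hint, sketched with invented names (nothing here exists in Squid today):

==sketch: "lowest common coding" hint (hypothetical C)===
/* The store always receives the identity-coded body; the client-side reply
 * path decides how far to decode using a hint about the lowest common coding
 * the client accepts.  transfer_coding_t and clientCodingFor() are invented. */
typedef enum {
    CODING_IDENTITY,            /* fully decoded entity body */
    CODING_CHUNKED,             /* TE: chunked */
    CODING_GZIP                 /* e.g. TE: gzip */
} transfer_coding_t;

static transfer_coding_t
clientCodingFor(transfer_coding_t upstream, transfer_coding_t lowest_common)
{
    if (lowest_common == CODING_IDENTITY)
        return CODING_IDENTITY;     /* e.g. an HTTP/1.0 client: decode fully */
    if (upstream == lowest_common)
        return upstream;            /* pass through, no decode work at all */
    return CODING_IDENTITY;         /* mismatch: normalise and re-encode */
}
====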

> This applies to TE as well as ranges.

I don't agree.

Here's an example:

We make three upstream requests for ranges in an object O:
  A: 0-99
  B: 100-199
  C: 200-299

A is returned chunked (we didn't ask for chunked, we were just given it).
B is returned chunked.
C is returned chunked.

We now have 3 cached partial ranges (assuming the store doesn't merge them all immediately - which could be harsh on performance
with multi-megabyte partials).

If we store identity-coded data in the store (and I'm assuming all headers are stored out of band, or that we know where the body
starts), a request for 50-149 is trivial to supply: read 50-99 from A and 100-149 from B. Then client_side can chunk them.
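
In code that hit really is just offset arithmetic over the stored partials. A rough sketch - partial_t, copyToClient() and the
sorted-partials assumption are all mine, not the actual store interface:

==sketch: range hit from identity-coded partials (hypothetical C)===
#include <stddef.h>

typedef struct {
    size_t lo, hi;              /* inclusive byte range this partial covers */
    /* ... handle to the stored object data ... */
} partial_t;

/*
 * Copy bytes [want_lo, want_hi] of the object to the client by walking the
 * stored partials (assumed sorted by lo).  Returns 0 on success, -1 if part
 * of the wanted range is not in the store.
 */
static int
serveRange(partial_t *parts, int nparts, size_t want_lo, size_t want_hi,
           void (*copyToClient)(partial_t *p, size_t off, size_t n))
{
    size_t pos = want_lo;
    int i;

    for (i = 0; i < nparts && pos <= want_hi; i++) {
        partial_t *p = &parts[i];
        size_t end;

        if (p->hi < pos || p->lo > pos)
            continue;                   /* this partial can't supply 'pos' */
        end = want_hi < p->hi ? want_hi : p->hi;
        copyToClient(p, pos - p->lo, end - pos + 1);    /* plain offsets */
        pos = end + 1;
    }
    return pos > want_hi ? 0 : -1;      /* -1: a hole in the stored ranges */
}
====

For the 50-149 request above this copies offset 50, length 50 from A, then offset 0, length 50 from B, and client_side is free to
chunk the result on the way out.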

If we stored chunked coding in the store, we have to:
  read the first line of A,
  skip x bytes to the next chunk if 50 is not present in this chunk,
  repeat,
  and start copying data out once we get into chunks that are part of the wanted response.

Then repeat on B.

We have no control over the chunking sent to us. It MAY be any size chunks, as long as it doesn't exceed the content length - i.e. we
might receive 64kb chunks, or MTU-sized chunks. This makes efficient range combining a nightmare.
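
For contrast, here is roughly what locating a single identity byte offset inside chunk-coded stored data looks like - every
chunk-size line before the one we want has to be parsed. (Sketch only; it assumes the stored object is one contiguous, well-formed
buffer.)

==sketch: seeking a byte offset in stored chunked data (hypothetical C)===
#include <stddef.h>

static int
hexdigit(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    c |= 0x20;
    return (c >= 'a' && c <= 'f') ? c - 'a' + 10 : -1;
}

/*
 * Return the buffer offset of identity byte 'target' within the chunk-coded
 * data in buf[0..len), or (size_t)-1 if it isn't there.
 */
static size_t
seekInChunked(const char *buf, size_t len, size_t target)
{
    size_t pos = 0;             /* offset within the chunk-coded data */
    size_t decoded = 0;         /* identity bytes accounted for so far */

    while (pos < len) {
        size_t sz = 0;
        int v;

        while (pos < len && (v = hexdigit(buf[pos])) >= 0) {    /* chunk-size */
            sz = sz * 16 + (size_t)v;
            pos++;
        }
        while (pos < len && buf[pos] != '\n')   /* skip extensions and CR */
            pos++;
        pos++;                                  /* the LF */
        if (sz == 0)
            break;                              /* last-chunk reached first */
        if (target < decoded + sz)
            return pos + (target - decoded);    /* byte lives in this chunk */
        decoded += sz;
        pos += sz + 2;                          /* skip chunk data plus CRLF */
    }
    return (size_t)-1;
}
====

And that's just one offset in one partial; a 50-149 hit means doing it for A and B, over chunk sizes we never chose.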

Rob