[RFC] Have-Digest and duplicate transfer suppression

From: Alex Rousskov <rousskov_at_measurement-factory.com>
Date: Wed, 10 Aug 2011 10:11:09 -0600

Hello,

    I would like to add an optimization to Squid. I will describe the
optimization briefly and then ask whether it is something the Squid
Project can accept.

    Folks using pairs of Squids over slow or low-bandwidth links often
want to optimize Squid-to-Squid communication. One such optimization is
to avoid transmitting the response body when the receiving Squid already
has an identical body. This wasteful transmission can happen in at least
two scenarios:

    A) Two different URLs correspond to the same raw content bytes.
    B) A refresh of the same URL results in the same raw content bytes.
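
In both scenarios, the wasted transfer looks the same on the wire: the
parent relays a full response whose body is byte-for-byte identical to
something the child already holds. A made-up illustration (the URL and
size are invented for this sketch):

    GET /reports/q2.pdf HTTP/1.1
    ....

    HTTP/1.1 200 OK
    Content-Length: 1048576
    ....

    <1 MB of bytes identical to the child's cached body>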

Case (A) has been studied extensively by Jeff Mogul and others. Jeff and
his team came up with a set of HTTP extensions for caches to advertise
"I have content with such and such checksum" information, which is then
used to avoid sending unchanged content to the cache. Here is one of
Jeff's papers:
http://www.hpl.hp.com/techreports/2004/HPL-2004-29.pdf

Case (B) can be viewed as a sub-case of (A), but it does not require the
extra HTTP exchanges (bad for slow links!), the database of content
digests, or the other complications of (A). The basic idea behind
optimizing case (B) is similar though:

   1) Child Squid has URL U cached. This Squid needs to request U from a
parent Squid (because the entity has expired, because the client
requested revalidation, etc.). The child Squid sends a regular request
for U to the parent Squid and tells the parent about the cached content
checksum:

    GET U HTTP/1.1
    Have-Digest: md5=foo
    ....

   2) Parent Squid processes the request for U as usual. This may or may
not result in fetching fresh content from the origin server. If the
parent Squid determines that the final response body it is about to send
to the child Squid matches the advertised checksum, the parent does
_not_ send the message body. Instead, the parent tells the child to use
the cached body.

To tell the child Squid that it can use the cached body, the parent
Squid could violate the HTTP message length rules and send the
regular/true response header without the body, but it is probably better
to encapsulate that header without violating HTTP. There are several
HTTP-compliant ways to do that. For example:

    HTTP/1.1 ??? Use What You Have
    Have-Digest: md5=foo
    Transfer-Encoding: use-what-you-have, chunked

    <just the encapsulated true HTTP response header here>

This approach does not violate HTTP but does require a little bit of
extra ICAP-like message composition and parsing by Squid.
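
To illustrate the complete hit-case exchange end to end, here is a
sketch (the digest value, the final status code, and the encapsulated
header fields are placeholders and assumptions of mine; chunked framing
is elided for readability). The child strips the use-what-you-have
encoding, takes the encapsulated true response header, and attaches its
own cached body:

    Child to parent:

        GET U HTTP/1.1
        Have-Digest: md5=foo
        ....

    Parent to child (digests match, so no body bytes cross the link):

        HTTP/1.1 ??? Use What You Have
        Have-Digest: md5=foo
        Transfer-Encoding: use-what-you-have, chunked

        HTTP/1.1 200 OK
        Cache-Control: max-age=3600
        ....

    Child to its client (assembled locally):

        HTTP/1.1 200 OK
        Cache-Control: max-age=3600
        ....

        <body taken from the child's cache>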

The above is just a sketch. The exact details will differ (e.g., we will
need a TE request header if we want to use a custom transfer encoding,
which may imply that we do not need a Have-Digest header at all). Let's
not discuss those details for now.
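
Just to make the TE remark above concrete: a single request header
could, in principle, advertise support for the new transfer encoding
and carry the digest at the same time. The parameter syntax below is
purely a hypothetical illustration, not a proposal:

    GET U HTTP/1.1
    TE: use-what-you-have;md5=foo
    ....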

Question: Can we accept a quality implementation of optimization (B)
into Squid?

It may be possible to avoid some Squid modifications by implementing the
checksum computation and message manipulation logic in an adaptation
service, but the most complex modifications must still be done in Squid:
At the very least, we would have to modify Squid to preserve the cached
body while the request is pending and to allow an adaptation service to
tell Squid to use that cached body with updated headers. Thus, it feels
wrong to split a single feature implementation into in-Squid parts and
adaptation parts. What do you think?

Thank you,

Alex.
P.S. Case (B) is also related to Reload-into-IMS and such, but it is
more general and does not violate HTTP.