[RFC] If-Not-Digest and duplicate transfer suppression

From: Alex Rousskov <rousskov_at_measurement-factory.com>
Date: Fri, 14 Oct 2011 13:34:53 -0600

Hello,

    This is my second attempt to get the following optimization project
approved for inclusion in Squid. The second version takes into account
most of the objections and ideas discussed earlier, while trying to
preserve the scope of the project.

Executive summary: The second version of the specs uses instance
digests, relies on RFC 3230 Digest and Have-Digest mechanisms, and uses
a new If-Not-Digest conditional request with a standard 304 response.
Motivation and details are documented below.

Folks using pairs of Squids over slow or low-bandwidth links often want
to optimize Squid-to-Squid communication. Both satellite and submarine
links were mentioned as examples of applicable environments. One such
optimization is avoiding transmitting the response body when the
receiving Squid already has it. This wasteful transmission can happen
in at least two scenarios:

    A) Two different URLs correspond to the same raw content bytes.
    B) A refresh of the same URL results in the same raw content bytes.

Case (A) has been studied extensively by Jeff Mogul and others. Jeff and
his team came up with a set of HTTP extensions for caches to advertise
"I have content with such and such checksum" information, which is then
used to avoid sending unchanged content to the cache.

Here is one of Jeff's papers discussing how to optimize (A):
    http://www.hpl.hp.com/techreports/2004/HPL-2004-29.pdf

And here is a related IETF RFC that defines Digest headers without
specifying how to use them for specific optimizations:
    http://tools.ietf.org/html/rfc3230

Case (B) can be viewed as a sub-case of (A), but it does not require the
extra HTTP exchanges (bad for slow links!), the database of content
digests, or the other complications of (A). The basic idea behind
optimizing case (B) is similar, though:

   1) Child Squid has URL U cached. This Squid needs to request U from a
parent Squid (because the entity has expired, because the client
requested revalidation, etc.). The child Squid sends a regular request
for U to the parent Squid and tells the parent not to bother sending the
same instance back:

    GET U HTTP/1.1
    If-Not-Digest: md5=foo
    ....
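For concreteness, RFC 3230 defines an md5 digest value as the base64
encoding of the 128-bit MD5 digest of the instance. A minimal sketch of
how the child could compute the value it advertises (the "foo" above is
a placeholder; these function names are illustrative, not Squid code):

```python
import base64
import hashlib

def md5_digest_value(instance_body: bytes) -> str:
    """RFC 3230 md5 digest value: the base64 encoding of the
    128-bit MD5 digest of the instance body."""
    return base64.b64encode(hashlib.md5(instance_body).digest()).decode("ascii")

def if_not_digest_header(instance_body: bytes) -> str:
    """Build the proposed (non-standard) If-Not-Digest request header
    for a cached instance body."""
    return "If-Not-Digest: md5=" + md5_digest_value(instance_body)

print(if_not_digest_header(b"hello world"))
```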

Please note that we do not want to use If-None-Match or a similar
standard ETag-based header because ETags do not specify/standardize
digest generation algorithms, and we do not want to confuse origin
servers by producing ETags that conflict with theirs. We can still use
ETags if they are available, of course, with or without Digests.

   2) Parent Squid processes the request for U as usual. This may or may
not result in getting fresh content from the origin server. If the
parent Squid realizes that the final response is 200 OK, and if the
instance it is about to send to the child Squid has the same digest,
then the parent MAY respond with 304 (Not Modified) instead.
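The parent-side decision in step 2 can be sketched roughly as follows
(illustrative helper only, not actual Squid internals; real parsing of
the header would need to be more careful than this):

```python
import base64
import hashlib

def maybe_304(status: int, body: bytes, if_not_digest: str):
    """Step-2 decision under the proposed extension: if the final
    response is 200 OK and the instance digest matches what the child
    advertised, answer 304 (Not Modified) with no body instead."""
    if status != 200 or not if_not_digest:
        return status, body
    # Parse "md5=<base64>, sha=<base64>" into a dict; base64 values may
    # contain '=' padding, so split each item on the first '=' only.
    advertised = dict(item.strip().split("=", 1)
                      for item in if_not_digest.split(","))
    ours = base64.b64encode(hashlib.md5(body).digest()).decode("ascii")
    if advertised.get("md5") == ours:
        return 304, b""  # child already has this instance
    return status, body
```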

  3) Servers (including parent caches) MAY advertise cached instance
checksums so that the child caches or clients may reuse them in future
GET requests:

    HTTP/1.1 200 OK
    Digest: md5=bar

  4) Clients (including child caches) MAY advertise their desire to
receive digest(s) for the requested instance:

    GET U HTTP/1.1
    Want-Digest: md5
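Steps 3 and 4 fit together: a server that receives Want-Digest can
advertise the digest for the client to reuse later. A sketch
(hypothetical helper; it ignores the qvalue syntax that RFC 3230 also
permits in Want-Digest):

```python
import base64
import hashlib

def digest_response_header(want_digest: str, body: bytes):
    """If the client asked for an algorithm we support, return a Digest
    response header advertising the instance digest, so the client can
    reuse it in a later If-Not-Digest request. Returns None otherwise."""
    # Strip any ";q=..." parameter from each requested algorithm.
    wanted = [alg.split(";")[0].strip().lower()
              for alg in want_digest.split(",")]
    if "md5" in wanted:
        value = base64.b64encode(hashlib.md5(body).digest()).decode("ascii")
        return "Digest: md5=" + value
    return None  # no algorithm we can satisfy
```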

The above functionality is meant to be compliant with Jeff's RFC 3230,
but items 1 and 2 are not directly covered by that RFC and differ from
Jeff's approach to solving problem (A): Jeff suggests using a HEAD
request to learn whether the parent cache has the same instance, but
that introduces extra RTT delays and does not work well enough for
misses and refreshes at the parent cache.

I believe many secondary details (not described above) are important but
should not affect the overall go/no-go decision. Here are some:

  a) handling of Range requests (I think we can use similar 304
responses there);

  b) optimal place to compute digests (it is the parent for the specific
use case I am addressing, but other use cases may place all the burden
on the child or share the burden);

  c) use of trailers to send Digest headers (with just-computed digests);

  d) inclusion of modified headers in 304 responses (to update the
cached entity);

  e) how the optimization is enabled/configured (e.g., response size or
digest computation time limits);

  f) dealing with a combination of If-Not-Digest and other conditional
headers or cache control directives as well as ETags (e.g., the cache
admin may want to add If-Not-Digest to client's "reload" requests).

Question: Can we accept a quality implementation of the above
optimization into Squid?

Thank you,

Alex.
Received on Fri Oct 14 2011 - 19:35:19 MDT
