Re: [RFC] Have-Digest and duplicate transfer suppression

From: Henrik Nordström <henrik_at_henriknordstrom.net>
Date: Sun, 14 Aug 2011 12:36:35 +0200

On Wed 2011-08-10 at 10:11 -0600, Alex Rousskov wrote:

> A) Two different URLs correspond to the same raw content bytes.
> B) A refresh of the same URL results in the same raw content bytes.

Both are very interesting I think.

And I would take a simpler approach: build on the HTTP Instance Digest
defined by Jeff, and always add a suitable instance digest to
cached/buffered content (regardless of the use of Want-Digest). Any
received instance digest MUST be verified before cache reuse. If the
received message has the same instance digest as a previously cached
instance, then abort the retrieval and reuse what you have in the cache.
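The compare-and-reuse step above can be sketched as follows. This is a
minimal illustration, not Squid code: the helper names are made up, and it
assumes RFC 3230-style `algorithm=base64(hash)` digest values with SHA-256
as the algorithm.

```python
import base64
import hashlib


def instance_digest(body: bytes, algorithm: str = "sha-256") -> str:
    """Compute an RFC 3230-style instance digest value over a full
    representation (hypothetical helper)."""
    h = hashlib.new(algorithm.replace("-", ""))
    h.update(body)
    return f"{algorithm}={base64.b64encode(h.digest()).decode()}"


def should_reuse_cached(received_digest: str, cached_body: bytes) -> bool:
    """True if the digest advertised in the incoming response matches the
    body we already hold, i.e. we can abort the retrieval and reuse the
    cached instance."""
    algorithm, _, _ = received_digest.partition("=")
    return received_digest == instance_digest(cached_body, algorithm)
```

Verification before reuse is the important part: a digest that does not
match the cached bytes means the cached entry must not be substituted for
the incoming body.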

In requests you can optionally add a digest-based condition similar to
If-None-Match, but If-None-Match already serves that purpose quite well,
so use of the digest condition should probably be limited to cases where
there is no ETag.

To reduce the bandwidth wasted on unneeded transmission, a slow-start
mechanism can be used where the sending side waits a couple of RTTs
before starting to transmit the body of a large response in which an
instance digest is presented. This gives the receiving end time to check
the received instance digest and abort the request if it is not
interested in receiving the body.
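A rough sender-side sketch of that slow-start idea, under stated
assumptions: the size threshold, the RTT estimate, and the function names
are all invented for illustration, and a real implementation would poll
for an abort from the peer rather than sleep blindly.

```python
import socket
import time

LARGE_BODY_THRESHOLD = 256 * 1024  # assumed: only delay bodies worth skipping
RTT_ESTIMATE = 0.05                # assumed: would come from the connection


def send_response(conn: socket.socket, headers: bytes, body: bytes) -> bool:
    """Send headers (which carry the instance digest) immediately, but
    delay a large body a couple of RTTs so the receiver can abort."""
    conn.sendall(headers)
    if len(body) >= LARGE_BODY_THRESHOLD:
        # Give the receiver time to compare digests and abort the request.
        time.sleep(2 * RTT_ESTIMATE)
    try:
        conn.sendall(body)
    except (BrokenPipeError, ConnectionResetError):
        return False  # receiver aborted: the body transfer was saved
    return True
```

The delay only pays off when the body is large relative to the round trips
spent waiting, which is why small responses should be sent without pause.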

I would probably not advise going the route of message digests and
hop-by-hop semantics. The main difference between message digests and
instance digests is their meaning in 206 responses. Message digests
mainly serve the purpose of very weak integrity protection, detecting
accidental in-transit modifications to a given message, and their use
outside that scope is pretty limited.

The drawback of the above proposal is that it cannot deal well with
partial objects where the full representation is not known to the
upstream cache. But for that case I think we need to rely on the server
presenting an ETag. If that is not sufficient, then a new type of digest
needs to be defined which can be calculated over ranges of an instance
(not over the 206 message representation, as Content-MD5 does if applied
at the message level, which btw is something I disagree was the
intention for Content-MD5).

Note regarding Content-MD5: its use in 206 responses has been deprecated
in HTTPbis, as implementations are inconsistent and there is no clear
consensus on the meaning of Content-MD5 in a 206 response.

> Case (A) has been studied extensively by Jeff Mogul and others. Jeff and
> his team came up with a set of HTTP extensions for caches to advertise
> "I have content with such and such checksum" information, which is then
> used to avoid sending unchanged content to the cache. Here is one of
> Jeff's papers:
> http://www.hpl.hp.com/techreports/2004/HPL-2004-29.pdf

The trouble with Jeff's proposal and other similar approaches is the
added overhead of discovering that two objects have identical
representations. I do not like Jeff's proposal because it adds a
significant amount of latency, which is a major bottleneck today, and
optimistically sending digests of other URLs is not practical and has
some nasty security implications (plus it significantly increases
request bandwidth overhead).

If case (A) is to be addressed, then I would do so in the more relaxed
manner described above.

> Case (B) can be viewed as a sub-case of (A), but does not require extra
> HTTP exchanges (bad for slow links!), a database of content digests, and
> other complications of (A). The basic idea behind optimizing case (B) is
> similar though:

Case (B) mainly optimizes for servers that do not support ETag.

If servers do send ETag (and do not randomly change them for tracking
purposes) then If-None-Match is sufficient for (B).

Extending (B) with an Instance-Digest based condition may be interesting
for dealing with the numerous servers that do not send ETag or that use
ETag badly.

> 1) Child Squid has URL U cached. This Squid needs to request U from a
> parent Squid (because the entity has expired, because the client
> requested revalidation, etc.). The child Squid sends a regular request
> for U to the parent Squid and tells the parent about the cached content
> checksum:
>
> GET U HTTP/1.1
> Have-Digest: md5=foo
> ....

Have-Digest should be an If-something, imho. If-None-Digest-Match?

> To tell the child Squid that it can use the cached body, the parent
> Squid can violate the HTTP message length rules and send the
> regular/true response header without the body, but it is probably better
> to just encapsulate the regular/true response header without violating
> HTTP.

Why not simply use 304 which already exists for the purpose?

A 304 provides entity headers and body identifier.
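The client side of that 304 flow could look roughly like this. Everything
here is a sketch: `If-None-Digest-Match` is the made-up header name
floated above (not a standard), and the helper names and cache-entry
shape are assumptions for illustration only.

```python
def build_conditional_headers(cached_etag, cached_digest):
    """Prefer If-None-Match when an ETag exists; fall back to the
    hypothetical digest condition only when there is no ETag."""
    headers = {}
    if cached_etag:
        headers["If-None-Match"] = cached_etag
    elif cached_digest:
        # Assumed header name from the discussion above, not a standard.
        headers["If-None-Digest-Match"] = cached_digest
    return headers


def handle_response(status, cached_entry, fresh_headers, fresh_body):
    """On 304, refresh the stored metadata and reuse the cached body;
    otherwise replace the cached entry with the new response."""
    if status == 304:
        cached_entry["headers"].update(fresh_headers)
        return cached_entry["body"]
    cached_entry["headers"] = dict(fresh_headers)
    cached_entry["body"] = fresh_body
    return fresh_body
```

This reuses the existing 304 machinery rather than inventing a new way of
sending a bodiless 200, which is the point of the objection above.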

> Question: Can we accept a quality implementation of optimization (B)
> into Squid?

I would rather see one that can be extended to work for (A) than just
optimizing (B). The amount of redundant data on the web is very large
today.

Additionally, as Amos already mentioned, If-None-Match is an existing
mechanism for dealing with (B), and a good first step would be fixing our
implementation of it.

> P.S. Case (B) is also related to Reload-into-IMS and such, but it is
> more general and does not violate HTTP.

Reload-into-IMS is a bastard because it adds a rather weak validator
(If-Modified-Since) to the request when none was sent by the client,
possibly resulting in stale content being served as fresh from the
cache.

Adding strong conditions to forwarded requests has a much more limited
impact, and I have a hard time seeing it cause any issues, provided the
party that adds the condition is prepared to deal with the possible
outcomes.

Regards
Henrik
Received on Sun Aug 14 2011 - 10:36:45 MDT
