Re: [RFC] Have-Digest and duplicate transfer suppression

From: Alex Rousskov <rousskov_at_measurement-factory.com>
Date: Mon, 15 Aug 2011 09:50:04 -0600

On 08/14/2011 04:36 AM, Henrik Nordström wrote:
> On Wed 2011-08-10 at 10:11 -0600, Alex Rousskov wrote:
>
>> A) Two different URLs correspond to the same raw content bytes.
>> B) A refresh of the same URL results in the same raw content bytes.
>
> Both are very interesting I think.

Yes, but our focus is on (B), at least for now.

> And I would take a simpler approach. Build on the HTTP Instance Digest
> defined by Jeff, and always add a suitable instance digest to
> cached/buffered content (this regardless of the use of Want-Digest). Any
> received instance digests MUST be verified before cache reuse. If the
> received message has the same instance digest as a previously cached
> instance then abort the retrieval and reuse what you have in the cache.

I do not like aborted retrievals as the default method of handling a
digest-based hit. Aborted transactions have negative side-effects and
some of those effects are not controlled by Squid (e.g., monitoring
software may trigger an alert if too many requests are aborted).

I agree that we can switch from entities to instances, provided we are
OK with excluding 206, 302, and similar non-200 responses from the
optimization. By instance definition, Squid would not be able to compute
or use an instance digest if the response is not 200 OK. We can hope
that the vast majority of non-200 responses are either not cachable or
are very small and not worth optimizing.
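
A minimal Python sketch of that restriction, assuming an MD5 digest
written as md5=<base64> in the style of instance digests. The function
names are hypothetical and this is not Squid code; it only illustrates
that a digest is computed for a complete 200 OK body and skipped for
everything else.

```python
import base64
import hashlib
from typing import Optional


def instance_digest(body: bytes) -> str:
    """Return 'md5=<base64 of MD5(body)>' (value format is an assumption)."""
    raw = hashlib.md5(body).digest()
    return "md5=" + base64.b64encode(raw).decode("ascii")


def digest_for_response(status: int, body: bytes) -> Optional[str]:
    """Hypothetical helper: digest only complete 200 OK bodies.

    A 206, 302, or other non-200 body is not the full instance, so no
    instance digest is computed and the optimization is skipped.
    """
    if status != 200:
        return None
    return instance_digest(body)
```

With this shape, a cache would store the returned value next to the
cached body and treat None as "not eligible for digest-based reuse".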

> In requests you can optionally add a digest-based condition similar to
> If-None-Match but here If-None-Match already serves the purpose quite
> well, so use of the digest condition should probably be limited to cases
> where there is no ETag.

Or to cases where ETag lies about response content changes.

> To optimize bandwidth loss due to unneeded transmission a slow start
> mechanism can be used where the sending party waits a couple of RTTs before
> starting to transmit the body of a large response where an instance
> digest is presented. This allows the receiving end to check the received
> instance digest and abort the request if not interested in receiving the
> body.

Besides my general dislike for aborted transactions becoming a norm (see
above), "a couple RTT" delay is a high price to pay because each RTT is
a few seconds already.

> I probably would not advise going the route of message digests &
> hop-by-hop. The main difference between message digests and instance
> digests is their meaning in 206 responses. Message digests mainly serve
> the purpose of very weak integrity protection detecting accidental
> in-transit modifications to a given message and their use outside that
> scope is pretty limited.

The proposed optimization for (B) enlarges that scope. As discussed
above, I think it would be OK to switch to instance digests instead, but
we would have to give up ability to optimize body transfer for cached
non-200 responses.

> The drawback of the above proposal is that it can not deal well with
> partial objects where the full representation is not known to the
> upstream cache. But for that case I think we need to rely on ETag being
> presented by the server. If that is not sufficient then a new type of
> digest needs to be defined which can be calculated over ranges of an
> instance (not the 206 message representation as done in Content-MD5 if
> applied at message level, which btw is something I disagree was the
> intention for Content-MD5)
>
> Note regarding Content-MD5: its use in 206 responses has been
> deprecated in HTTPbis as there are inconsistent implementations and no
> clear consensus on the meaning of Content-MD5 in 206 responses.

I agree that we should not use Content-MD5. We need to define a new
digest from scratch or use Jeff's instance digests.

>> Case (A) has been studied extensively by Jeff Mogul and others. Jeff and
>> his team came up with a set of HTTP extensions for caches to advertise
>> "I have content with such and such checksum" information, which is then
>> used to avoid sending unchanged content to the cache. Here is one of
>> Jeff's papers:
>> http://www.hpl.hp.com/techreports/2004/HPL-2004-29.pdf
>
> The trouble with Jeff's proposal and other similar approaches is the added
> overhead in discovering that there are two objects with identical
> representations. I do not like the proposal by Jeff as it adds
> significant amount of latency which is a major bottleneck today, and
> optimistically sending some digests of other URLs is not practical and
> adds some nasty security implications (plus that it significantly adds
> request bandwidth overhead)
>
> If case (A) is to be addressed then I would do so in a more relaxed
> manner like what I describe above.

I am not interested in supporting case (A) at this time (in part for the
same reasons you mention above), but others might be.

>> Case (B) can be viewed as a sub-case of (A), but does not require extra
>> HTTP exchanges (bad for slow links!), a database of content digests, and
>> other complications of (A). The basic idea behind optimizing case (B) is
>> similar though:
>
> Case (B) is mainly to optimize the case where servers do not support ETag.
>
> If servers do send ETag (and do not randomly change them for tracking
> purposes) then If-None-Match is sufficient for (B).

Yes.

> Extending (B) with an Instance-Digest based condition may be interesting
> to deal with the numerous servers not sending ETag or where ETag is used
> badly.

Yes, that is exactly the optimization target.

>> 1) Child Squid has URL U cached. This Squid needs to request U from a
>> parent Squid (because the entity has expired, because the client
>> requested revalidation, etc.). The child Squid sends a regular request
>> for U to the parent Squid and tells the parent about the cached content
>> checksum:
>>
>> GET U HTTP/1.1
>> Have-Digest: md5=foo
>> ....
>
> Have-Digest: should be an If-something imho. If-None-Digest-Match ?
>
>
>> To tell the child Squid that it can use the cached body, the parent
>> Squid can violate the HTTP message length rules and send the
>> regular/true response header without the body, but it is probably better
>> to just encapsulate the regular/true response header without violating
>> HTTP.
>
> Why not simply use 304 which already exists for the purpose?
>
> A 304 provides entity headers and body identifier.

We wanted to be able to optimize transfer of non-200 responses and be
able to update headers. In other words, we wanted the child cache to be
able to restore the exact origin server response, including status code
and header details.

If we focus on 200 responses exclusively and do not support updating
headers prohibited by 304 responses, then 304 is the best response
status code, of course.
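
To make the 304-based variant concrete, here is a minimal Python sketch
of the parent-side decision, assuming the child advertised the digest of
its cached body (the Have-Digest value from the example request) and the
parent has already obtained the fresh response. The function names and
the md5=<base64> value format are illustrative assumptions, not Squid
code.

```python
import base64
import hashlib
from typing import Optional, Tuple


def instance_digest(body: bytes) -> str:
    """Return 'md5=<base64 of MD5(body)>' (value format is an assumption)."""
    raw = hashlib.md5(body).digest()
    return "md5=" + base64.b64encode(raw).decode("ascii")


def parent_reply(have_digest: Optional[str],
                 fresh_status: int,
                 fresh_body: bytes) -> Tuple[int, bytes]:
    """Hypothetical parent logic: pick the (status, body) sent to the child.

    Only a fresh 200 OK whose body digest equals the advertised digest
    is turned into a bodiless 304. Everything else is forwarded as is.
    """
    if (have_digest is not None
            and fresh_status == 200
            and instance_digest(fresh_body) == have_digest):
        return 304, b""
    return fresh_status, fresh_body
```

The sketch deliberately ignores headers, which is exactly where the
trade-off lies: a 304 saves the body transfer but cannot carry every
header of the fresh response.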

>> Question: Can we accept a quality implementation of optimization (B)
>> into Squid?
>
> I would rather see one that can be extended to work for (A) than just
> optimizing (B). The amount of redundant data on the web is very large
> today.
>
> Additionally as already mentioned by Amos, If-None-Match is an already
> existing mechanism for dealing with (B), and a good first step is fixing
> our implementation of that.

Fixing relevant parts of If-None-Match support can indeed be a part of
the project. I do not think we can do (A), but I agree that it would be
nice if our solution for (B) can be, at least theoretically, extended to
(A) later.

>> P.S. Case (B) is also related to Reload-into-IMS and such, but it is
>> more general and does not violate HTTP.
>
> Reload-into-IMS is a bastard because it adds a quite weak validator
> (If-Modified-Since) to the request when none was sent by the client,
> possibly resulting in stale content being served as fresh from the
> cache.
>
> Adding strong conditions to forwarded requests has a much more limited
> impact and I have a hard time seeing this cause any issues, provided the
> part that adds the condition is prepared to deal with the possible
> outcomes.

I agree that the original proposal for (B) should not cause
Reload-into-IMS-problems. However, if we go the adjusted route of 304
responses and ignoring certain origin server response headers, we may
create similar, albeit less likely problems: The client will receive the
right message body with wrong/stale headers because 304 responses
prohibit inclusion of certain headers.

On the other hand, we can violate an RFC 2616 SHOULD and include all
headers:

> If the conditional GET used a strong cache validator (see section
> 13.3.3), the response SHOULD NOT include other entity-headers.

>> If we use the If-None-Match approach, it would be just
>>
>> If-None-Match: edigest_md5=foo
>>
>> or similar.

> If-None-Match syntax does not allow for extensions.

My bad. I forgot entity-tag does not have a "kind=" prefix. One more
reason _not_ to use If-None-Match even if we are reusing its semantics.
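
For illustration, a small Python check against a simplified form of the
RFC 2616 entity-tag grammar (an optional W/ followed by a quoted string).
The regular expression is an approximation and the function name is
hypothetical; the point is only that a bare kind=value token does not
parse as an entity-tag.

```python
import re

# entity-tag = [ "W/" ] quoted-string  (simplified from RFC 2616).
# The quoted string allows backslash-escaped characters inside the quotes.
ENTITY_TAG = re.compile(r'(?:W/)?"(?:[^"\\]|\\.)*"')


def is_entity_tag(value: str) -> bool:
    """Hypothetical helper: True if value looks like a valid entity-tag."""
    return ENTITY_TAG.fullmatch(value) is not None


# A regular strong tag and a weak tag both parse.
print(is_entity_tag('"xyzzy"'))
print(is_entity_tag('W/"xyzzy"'))
# The value I suggested earlier does not: there is no "kind=" prefix.
print(is_entity_tag('edigest_md5=foo'))
```

A strict recipient applying this grammar would treat the last value as
malformed.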

> But yes, it can obviously be extended in non-ambiguous ways without
> conflicting with existing use. But any such extensions need to be
> enabled carefully as we cannot assume the receiving end can parse them
> at all and may reject the header or complete message as invalid.

Or may crash, serve the wrong response, etc.

Thank you,

Alex.
Received on Mon Aug 15 2011 - 15:50:34 MDT
