Re: gzip support

From: Henrik Nordstrom <hno@dont-contact.us>
Date: Sun, 15 Dec 2002 13:03:45 +0100

On Sunday 15 December 2002 08.52, Robert Collins wrote:

> Note that your idea MUST NOT be turned on by default. And should
> *never* be turned on for intercepting proxies. See the RFC 2616
> requirements for transparent proxies for more detail.

RFC 2616 "transparent" has nothing whatsoever to do with
intercepting proxies. The RFC's use of "transparent" means
"semantically transparent", i.e. not changing the semantic meaning of
requests/responses, including content negotiation. The same request
via the proxy and direct SHOULD give the same response entity where
applicable, and the same side effects on the origin server should be
performed in both cases (i.e. there SHOULD be no difference in a
POST/PUT sent via the proxy or direct, etc.). Note that semantically
transparent does not by any means exclude caching.

You are very unlikely to find any references to "interception"
techniques in any standards track RFC documents, as this is a very
gross violation of the IP standard. A single IP address MUST only
exist on a single host. An intercepting proxy simply steals the
origin server's IP address for the TCP session, which is in full
violation of the single host per IP address requirement of IP.

A semantically nontransparent proxy (again, not related to
interception) is free to implement almost any recoding of an entity
into other entities, such as uncompressed->gzip/deflate, gif->png or
whatever, just as an origin server may. In both cases the recoded
entity is another entity in terms of HTTP, and you MUST adjust ETag
accordingly to separate the different variants from each other and
SHOULD indicate via Vary the request-dependent rules used for
determining which entity to use.
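
As a rough illustration of the ETag/Vary adjustment, here is a minimal
Python sketch. The function name recode_to_gzip and the way the new
ETag is derived (a hash of the encoded bytes) are assumptions made for
this example only, not how Squid or any particular server does it.

```python
import gzip
import hashlib


def recode_to_gzip(headers, body):
    """Return (headers, body) for a gzip-coded variant of an identity response.

    Hypothetical helper: 'headers' is a plain dict of response headers.
    """
    encoded = gzip.compress(body, mtime=0)
    new_headers = dict(headers)
    new_headers["Content-Encoding"] = "gzip"
    new_headers["Content-Length"] = str(len(encoded))

    # The recoded entity is a different entity, so it needs its own ETag.
    # Deriving it from the encoded bytes is an assumption of this sketch.
    digest = hashlib.sha1(encoded).hexdigest()[:16]
    new_headers["ETag"] = '"' + digest + '"'

    # Tell caches that the choice of variant depends on Accept-Encoding,
    # keeping any Vary fields that were already present.
    vary = [v.strip() for v in headers.get("Vary", "").split(",") if v.strip()]
    if "accept-encoding" not in [v.lower() for v in vary]:
        vary.append("Accept-Encoding")
    new_headers["Vary"] = ", ".join(vary)
    return new_headers, encoded
```

The identity response keeps its original ETag; only the recoded
variant gets the new one.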

Note: You should not need to care about HTTP version in content
negotiation, and if you do there is not really any need to have this
reflected in Vary.

To get a good understanding of Content-Encoding it is best viewed as
the server driven content negotiation it is. A server capable of
providing content in multiple different formats for the same URL may
select the most suitable among the choices it has, based on local
preferences configured in the server and on which types are indicated
as acceptable by the request. These formats may either be dynamically
generated by engines like mod_gzip or generated earlier and stored as
separate files on the server. HTTP does not care how the different
variants of the same object are/have been generated. A gzipped object
is a gzipped object even if generated on the fly, and different from
the base entity (if such an entity exists at all) just as a Swedish
entity is different from an English entity or a gif is different from
a jpeg or bmp format of the same image.
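
A simplified Python sketch of this kind of selection follows. The
function names are made up for the example, and the policy (take the
first coding in the server's preference order that the request accepts
with q > 0) is just one possible reading of "local preferences"; a real
server may weigh q-values differently.

```python
def parse_accept_encoding(value):
    """Parse an Accept-Encoding value into {coding: qvalue}."""
    result = {}
    for item in value.split(","):
        parts = [p.strip() for p in item.split(";")]
        coding = parts[0].lower()
        if not coding:
            continue
        q = 1.0
        for param in parts[1:]:
            if param.lower().startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0  # treat a malformed q-value as "not acceptable"
        result[coding] = q
    return result


def select_variant(server_preference, accept_encoding):
    """Pick the first server-preferred coding that the request accepts.

    server_preference: codings the server can deliver, best first,
    e.g. ["gzip", "identity"]. Returns None if nothing is acceptable.
    """
    accepted = parse_accept_encoding(accept_encoding or "")
    for coding in server_preference:
        if coding in accepted:
            q = accepted[coding]
        elif "*" in accepted:
            q = accepted["*"]
        elif coding == "identity":
            q = 1.0  # identity is acceptable unless explicitly excluded
        else:
            q = 0.0
        if q > 0:
            return coding
    return None
```

Whether the chosen variant is then produced on the fly or read from a
pre-built file is invisible to the client, which is the point made
above.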

Strong ETag headers are used to uniquely identify a specific entity
for a given URL if needed. A strong ETag is guaranteed to uniquely
identify the specific entity down to the binary bits, and is needed
to safely merge ranges from two requests.

A weak ETag allows for semantic comparison of two entities, such
as a gif vs a jpg of the same image, or identity vs gzip encoding.
Two response entities of the same URL with the same weak ETag are said
to be semantically exchangeable to the user, but are not the same
entity in terms of HTTP. For example a gzipped object MAY have the
same weak ETag as the base object, or an image/gif content may have
the same weak ETag as an image/jpeg if they represent the same image.
However, you cannot merge ranges from objects with a weak ETag, as the
weak ETag does not tell if the objects are binary equivalent, only that
to the end user it (the object as a whole) has the same meaning.
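
The difference can be sketched with the two comparison functions that
RFC 2616 section 13.3.3 describes. The helper names below are
illustrative only, and ETag values are assumed to be already-parsed
strings such as '"abc"' or 'W/"abc"'.

```python
def is_weak(etag):
    """True if the entity tag carries the weak prefix W/."""
    return etag.startswith("W/")


def opaque_tag(etag):
    """The quoted opaque part of the tag, without any W/ prefix."""
    return etag[2:] if is_weak(etag) else etag


def strong_match(a, b):
    """Strong comparison: neither tag may be weak, and both must be identical."""
    return not is_weak(a) and not is_weak(b) and a == b


def weak_match(a, b):
    """Weak comparison: opaque parts must be identical, weakness is ignored."""
    return opaque_tag(a) == opaque_tag(b)


def can_merge_ranges(a, b):
    """Ranges may only be combined when the tags match strongly."""
    return strong_match(a, b)
```

For example, weak_match('W/"img-7"', 'W/"img-7"') is true, so the two
entities are interchangeable for the user, but can_merge_ranges on the
same pair is false.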

And no, it is not possible to get all this correct in terms of caching
until the servers, clients and caches properly support at a minimum
Vary, and preferably ETag as well. ETag support is still missing
from Squid, but there exists a patch to Squid-2.5 with at least a
basic implementation.

ETag is needed to guarantee cache consistency when merging ranges,
verifying freshness etc.

Note: Ranges of a "Content-Encoding: gzip" object are ranges within the
gzipped content, not within the identity encoded object.
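
A small Python demonstration of what that means in practice. The sample
data and the apply_range helper are made up for this example.

```python
import gzip

identity = b"The quick brown fox jumps over the lazy dog. " * 20
encoded = gzip.compress(identity, mtime=0)


def apply_range(body, first, last):
    """Return the bytes selected by 'Range: bytes=first-last' (inclusive)."""
    return body[first:last + 1]


# A range request against the gzip variant selects compressed bytes...
range_of_encoded = apply_range(encoded, 0, 9)
# ...which are unrelated to the same byte positions of the identity variant.
range_of_identity = apply_range(identity, 0, 9)

print(range_of_encoded[:2] == b"\x1f\x8b")    # gzip magic number, not text
print(range_of_encoded == range_of_identity)
```

A client holding part of the gzip variant therefore cannot complete it
with a range fetched from the identity variant, or the other way round.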

Implementing a content reencoding proxy which affects any of the
Content-Encoding/Type/Language headers requires a great deal of care
and understanding of HTTP, so as not to paint yourself into yet
another corner and make it even harder for general HTTP compliance to
move forward.

What can however be fully safely implemented in an HTTP/1.1 proxy
without too much hassle with HTTP semantics is Transfer-Encoding.
Transfer-Encoding preserves the same identity of the object, and only
compresses the data (for gzip/deflate transfer encoding) for the
purpose of transmission. An HTTP/1.1 cache MAY precalculate gzip
encoded reply bodies to increase performance.
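
To show why this is the safer route, here is a rough Python sketch of
one hop applying and removing a "gzip, chunked" transfer coding. All
function names are hypothetical, negotiation via the TE request header
is assumed to have already happened, and chunk extensions and trailers
are ignored.

```python
import gzip


def encode_chunked(data, chunk_size=1024):
    """Wrap data in HTTP/1.1 chunked framing."""
    out = bytearray()
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        out += b"%x\r\n" % len(chunk) + chunk + b"\r\n"
    out += b"0\r\n\r\n"  # last-chunk, no trailers
    return bytes(out)


def decode_chunked(data):
    """Undo chunked framing (no trailer support in this sketch)."""
    out = bytearray()
    pos = 0
    while True:
        eol = data.index(b"\r\n", pos)
        size = int(data[pos:eol].split(b";")[0], 16)
        if size == 0:
            return bytes(out)
        start = eol + 2
        out += data[start:start + size]
        pos = start + size + 2  # skip chunk data and its trailing CRLF


def send_hop(headers, body):
    """Sender side: compress for this hop only; entity headers stay as they are."""
    hop = dict(headers)
    hop.pop("Content-Length", None)  # not sent together with Transfer-Encoding
    hop["Transfer-Encoding"] = "gzip, chunked"
    return hop, encode_chunked(gzip.compress(body))


def receive_hop(headers, wire):
    """Receiver side: strip the transfer coding and recover the same entity."""
    entity = dict(headers)
    del entity["Transfer-Encoding"]
    body = gzip.decompress(decode_chunked(wire))
    entity["Content-Length"] = str(len(body))
    return entity, body
```

Note that ETag, Content-Type and friends pass through untouched: after
the hop the receiver holds the same entity, byte for byte, that the
sender started with.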

Note: The lack of Transfer-Encoding support in Squid is THE major
reason why Squid is still HTTP/1.0.

> This will be interesting :}. You'll need to start by coding a
> storeobject->storeobject copy engine. Definitely doable though.

The existing storeclient and store data provider interfaces
(storeAppend etc) are quite sufficient for the job of content
reencoding I think. Finding and reading the cached object is pretty
straightforward.

Things may be a little more complicated if/when the store data
provider is changed to a pull model (needed to get rid of deferred
reads), but I guess an interface similar to the current push model
(storeAppend etc) will still be provided for internally generated
static objects then. This is however all in the future.

> Having said all that, this could be very useful for reverse
> proxies.

Indeed.

A reverse proxy in terms of HTTP is an origin server and is free to do
basically whatever it wants. The semantics of a reverse proxy are those
of an HTTP server that uses HTTP to fetch the content + content headers
instead of local files + configuration (extension->mime tables etc).

Regards
Henrik
Received on Sun Dec 15 2002 - 05:03:23 MST
