Antwort: Re: Antwort: Re: Antwort: [Mod_gzip] Vary: header and mod_gzip

From: <Michael.Schroepl@dont-contact.us>
Date: Wed, 28 Aug 2002 02:52:38 +0200

Hi Henrik,

> What you need to care about is the rules for THIS object
> content for a specific URL based on the request headers
> or other external input. Any static rules based on the
> actual response object do not need to be mentioned,
> neither do you need to mention "random" rules depending
> on internal server state independent of the user unless
> you really want to (see below). A threshold rule telling
> that all responses above a certain size may be compressed
> is a typical static rule that does not need to be mentioned.
> For the same object the rule will always trigger in the
> same manner.
> If your server has dynamic rules that might give different
> responses for the exact same request and URL with no changes
> in the content then you should include a "Vary: *" header
> to indicate that special content negotiation rules apply
> that cannot be expressed in terms of HTTP and that the
> server must therefore always be queried on which response
> entity is the correct one for this user. I don't think this
> really applies to mod_gzip. In such a case you really SHOULD
> support ETag and If-None-Match or else caching in shared
> caches is kind of pointless as the cached content can then
> never be reused.

At this point in the discussion I believe the correct
implementation in mod_gzip might be the following:

- Parse the Apache configuration record that applies
  for this URL - very similar to what the mod_gzip
  internal rules validator is already doing.

- While doing so, collect all include/exclude rules
  of the type "reqheader". The operand-1 values of
  these rules are the HTTP headers that this request
  _may_ be conditional upon; each rule will have one
  HTTP header name only.

- Add "Accept-Encoding" to the list of HTTP headers
  collected this way.

- Form a "Vary:" http header and output the list of
  these headers as parameters.

This will compute the _maximum_ list of HTTP header
names that the compression decision may depend upon.
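
To illustrate the idea, here is a rough C sketch of such
a function. The rule table names (mod_gzip_conf, imap,
imap_total, type, name) are my assumptions modelled after
mod_gzip's internals, not verbatim code; only the Apache
1.3 API calls (ap_pstrcat, ap_table_merge) are real:

     static void mod_gzip_send_vary(request_rec *r,
                                    mod_gzip_conf *conf)
     {
         int   i;
         char *vary = "Accept-Encoding"; /* always listed */

         for (i = 0; i < conf->imap_total; i++) {
             mod_gzip_imap *item = &conf->imap[i];

             /* collect the header name of every
              * "reqheader" rule, no matter whether
              * include or exclude */
             if (item->type == MOD_GZIP_ITEM_TYPE_REQHEADER) {
                 vary = ap_pstrcat(r->pool, vary, ", ",
                                   item->name, NULL);
             }
         }

         /* merge instead of set, so "Vary:" values added
          * by other modules are preserved */
         ap_table_merge(r->headers_out, "Vary", vary);
     }

(A real implementation would also have to suppress
duplicate header names in this list.)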

The C function "mod_gzip_validate1" (from line 738 to
line 1219) is already parsing these rules - it should
serve as the reference implementation for finding out
which HTTP headers might potentially be relevant.

But currently I am not able to write such a function
in C myself - I have experience neither with Apache's
internal API structures nor with C coding.

Maybe someone on the list can help out? All I can do
is provide algorithms and discuss dependencies.

> The minimum requirement of Vary is to include information
> expressing to caches who might receive this kind of reply.
> For mod_gzip the minimal requirement is that compressed
> content may never be sent to user-agents not supporting
> compression, and this can easily be expressed in terms of
> Vary. (see below)

An alternative way to get a mod_gzip version working
correctly might of course be this:
a) Look at your mod_gzip configuration and collect
   all "mod_gzip_item_**clude reqheader" rules.
b) Add the list of the HTTP headers used there to the
   "Vary:" line in the patched mod_gzip 1.3.19.1b
   source code.
c) Recompile mod_gzip.

This will reliably send all potentially relevant HTTP
header names in the "Vary:" header - unless of course
the admin changes his configuration and doesn't update
his patch. :-(
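
For example, if the configuration contains a rule like

     mod_gzip_item_exclude reqheader "User-Agent: Mozilla/4\.0[678]"

(just an illustrative value, modelled after the well-known
Netscape 4 workaround), then the hardcoded line in the
patched source would have to produce

     Vary: Accept-Encoding, User-Agent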

> If the reply is such that mod_gzip might compress the
> reply for certain browsers/users then you should include
> a Vary header.

This can only be the case if at least one of those
"reqheader" rules is being used in the configuration.

> If the reply is such that mod_gzip would never compress
> the reply no matter who requested it then no Vary header
> should be included.

I understand, but finding out this fact from the
current rule system might be a _very_ tricky thing.
(sigh, the next couple of sections will be very
boring for some readers, I am afraid.)

This would happen if at least one 'reliable' (see
below) "exclude" rule fires. Remember that there
are at least two possible reasons why mod_gzip
would not compress:
a) there is no "include" rule firing,
b) there is at least one "exclude" rule firing.
(And there are the other reasons, not related to
HTTP headers directly.)

But now we can classify the mod_gzip rules.

Given a constant Apache configuration and file
set (!, see below), the items "file" and "uri"
would lead to reproducible reactions, so if any
"exclude" rule of one of these types fires then
mod_gzip will definitely not compress this
request.
"mime", "handler" and "rspheader" rules are
irrelevant for the "Vary:" list, although not
irrelevant for the compression decision;
"reqheader" rules are not reliable, as they
refer to the content of HTTP headers being
sent by browsers.

So if an "exclude" rule of the "file" or "uri"
type will fire then it would fire reproducably
(under the restrictions named above).
If we have an exclude of one of these types,
mod_gzip is able to know that a request will
never be compressed. No "include" is able to
override an "exclude".
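
For example, with a rule like

     mod_gzip_item_exclude file \.jpg$

(an illustrative value) every request mapping to a *.jpg
file will reproducibly be declined, so for such responses
no "Vary:" header would be needed at all.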

Unfortunately, the code structure of mod_gzip
does not support this kind of information very
well: The validate function (which is evaluating
the configuration rules) quits checking once it
has found some firing "exclude" rule, and doesn't
care about other rules.
Furthermore, the return value of the validate
function is a boolean - the caller does not even
have a clue _why_ the validate function decided
to accept or decline a request.
This has not been a requirement until now, and
surely quitting the rules analysis early is a
performance optimization, but it is not
sufficient to identify the exact information
needed to create a "Vary:" header the way we
now want it.

_But_ there is a big caveat. I stated that a
constant configuration _and_ file set would be
a prerequisite for the behaviour of the "file"
and "uri" rules. Let me explain why.

The problem is URL translations inside Apache.
Let a user agent request the URL "/", which will
at this moment be mapped to "/index.html" by
Apache's directory defaulting mechanism.
The mod_gzip configuration may include a "file"
rule that makes *.html files be compressed - but
let there be just no "uri" rule that would do
the same. An absolutely normal mod_gzip
configuration - "file" rules are very reliable,
as no 3rd-party Apache module can easily change
the name of a file, while there are modules
that do internally change the URI (like mod_ssl
and mod_proxy, which both even change the
"protocol" part of the URI), let alone URL
translations of all kinds (mod_rewrite etc.).


Now let some user remove the "index.html" file
and replace it with some "index.shtml" file.
This file will now be selected to map the "/"
request to, given an appropriate priority list
in the corresponding Apache directive.
("index.*" is quite common in these cases.)
It will be parsed by the Server Side Includes
handler, and this will lead to content that is
sent as "chunked" data - this is how Apache 1.3
implements SSI and CGI.

mod_gzip has a configuration directive to entitle
it to collect these chunks, remove the chunking
information, compress the whole packet and send
it to the client. This is additive to the whole
rule set - compressing SSI content requires both
the "dechunk" option being activated _and_ some
configuration rule to explicitly accept this
object for compression.
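
For illustration, compressing SSI output therefore needs
both of the following (real mod_gzip directives, values
chosen as an example - "server-parsed" is the Apache 1.3
SSI handler):

     mod_gzip_dechunk      Yes
     mod_gzip_item_include handler ^server-parsed$
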
So even the knowledge of the _whole_ item rule
set would not be enough to decide whether a
request will always or never be compressed.
The effective meaning of the "/" URL can change
at any time without any change to the mod_gzip
configuration, because Apache's URL mapping is
dynamic anyway, and it may well remap the same
URL from a file that will always be compressed
to a file that will never be compressed, or
anything else.

Right now I doubt that there can be a solution
even if you evaluate the whole of Apache's
configuration knowledge, because just removing
or adding a file in the document tree is enough
to change the effective meaning of some URL and
thus its "compression behaviour class".

One may be able to find out _whether_ such a
translation has taken place (I guess this
knowledge is available in some Apache request
representation record), but _if_ it has taken
place there will probably be _no_ way to find
out whether a "Vary:" header _needs_ to be sent
or not. (And this applies to each and every
request for a URL ending in a "/"!)
If no translation has taken place, then the
logic described above _might_ be a clue for
coding something that will prevent sending the
"Vary:" header in a reasonable set of cases.

And then, there are configuration options that
are based on the combination of the features
of mod_gzip and Apache.
You can write some Apache configuration section
like

     <LocationMatch \.html$>
      mod_gzip_on No
     </LocationMatch>

This _might_ be semantically identical to

     mod_gzip_item_exclude uri \.html$

... not even I am sure about this right now.
But both methods are allowed, and in both cases
this URI would never be compressed ... I think.

So don't be too optimistic about mod_gzip telling
you that a request will never be compressed - this
isn't easy to find out if you want to detect all
of these cases. mod_gzip is simply too powerful.
(But detecting _some_ of them would already help
- in fact each case that will end up without a
"Vary:" header would help.)

> Likewise if the configuration is
> such that mod_gzip would always compress the reply
> no matter who requested it.

I am afraid it is even _more_ difficult to find out
about this one.
Again, the validate function looks for _at_least_
one include rule to fire, and once it has found one,
it doesn't care about other rules.
This rule might well be one of the "reqheader" ones
that vary from request to request, while there might
well be another rule (of the "file" or "uri" type)
that would reliably fire for every request.

And even more difficult: there would have to be
no exclude rule of _any_ of the other rule types,
and no other exclusion mechanism either (there are
several other reasons for mod_gzip not to compress).

Even if there were an algorithm to accurately find
out whether a request will always be compressed, in
reality I doubt that there is a single productive
mod_gzip configuration with a rule set that would
allow for any such request. Just let the minimum
required HTTP level be anything but a wildcard, or
let the dechunking switch be turned off, or have
some file size limit active, or ... in all these
cases you have no chance to guarantee that a
request will always be compressed.
There are just so many reasons to exclude that at
least _one_ of them _might_ always apply.
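
For illustration, any single one of these quite common
settings already rules out "always compressed" (the
directive names are real, the values are just examples):

     mod_gzip_min_http            1001
     mod_gzip_dechunk             No
     mod_gzip_minimum_file_size   300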

Anyway, finding out which _class_ of rule would
cause a request to be compressed would require the
rule evaluation concept to be changed.

Maybe some priority list model would help:
a) classify the mod_gzip rule types (level 1: "uri"
   and "file"; level 2: "reqheader"; level 3: the
   remaining three rule classes)
b) make the validate routine scan for them in some
   order so that the maximum information usable for
   the "Vary:" header to be created later can be
   stored in some data structure
c) use this information when the time has come to
   _decide_ whether a "Vary:" header should be
   sent and which information it should contain.
But as shown above, even this will not be enough to
reliably find out whether a "Vary:" header (beyond
"Accept-Encoding", that is) will be needed or not.

> Alternative 1 is the "correct" one, telling caches exactly
> what to do, and provides optimal hit ratio if the HTTP server
> and cache are capable of ETag and If-None-Match.

At the moment, I am not able to discuss the ETag and
If-None-Match issues. I understand the concept of
ETag (because I know exactly one browser that is
sending ETag HTTP headers: Opera 6), but I am not
sure what mod_gzip would have to do in the
"If-None-Match" area.
Please forgive me if I concentrate on the "Vary:"
issue first and learn the rest later. ;-)

>> There have been examples in the past where "mod_gzip_item_exclude
>> reqheader" has been used to detect proxy servers that are known
>> to unconditionally store compressed content ...
> Squid DOES NOT unconditionally cache compressed content ...
>> Or resulting in the mod_gzip configuration adding the proxy to a
>> non-compression blacklist, denying compression for all requests
>> coming from this direction - if the proxy tells who it is.
> Sorry, I do not see the point here in this discussion. Squid
> is doing the best it can. mod_gzip has intentionally chosen to
> tell Squid and other caches to do the wrong thing, so why
> should mod_gzip users blacklist Squid and other caches rather
> than tell them correct information?

Please let me apologize for being unclear.
I didn't mean Squid, either explicitly or implicitly.

> What you should do in such case is to fall back onto the failsafe
> approach and send back an uncompressed reply.
> Do not build "whitelists" of browsers known to send Accept-Encoding:
> gzip, for such browsers you should use Accept-Encoding exclusively.
> You do not know why the "Accept-Encoding" header has been excluded.

And even if I did, I would still be aware of the fact
that browsers that do not ask for gzipped content are
currently declaring that they do not understand it,
even if they have the decompression code implemented.
So sending compressed content without an
"Accept-Encoding: gzip" request header is simply not
an option for mod_gzip anyway.

> Note: Squid-2.X always sends a Via: header in the request unless
> intentionally disabled by the cache administrator for privacy
> reasons. Not that I personally think this is something you should
> make use of in mod_gzip, but you asked.

Do you consider mod_gzip to behave like a cache in this area?
Would this help anyone?

I am asking mostly for other reasons:

I myself have an implementation of a compressing HTTP cache
(a Perl CGI script) that is to be embedded into Apache via
the "AddHandler" hook.
(And yes, I do send the "Vary" header. ;-)

If you are interested, please read
     http://www.schroepl.net/projekte/gzip_cnc/
and I would be glad to learn about everything this program
is doing right or wrong about handling the HTTP protocol ...

Greetings, Michael