Re: refresh patterns and other nitpicks

From: WWW server manager <webadm@dont-contact.us>
Date: Sat, 15 Nov 1997 13:27:20 +0000 (GMT)

Stefan Monnier wrote:
>[snip]
> In the same kind of unimportant requests:
>
> how about indexing the cache based on an hash of the content ?
> This way, if a page is accessible via two different URLs, it will be cached
> only once (but will of course still miss twice, this seems unavoidable).

I think that would be problematic. The cacheability of a document, and the
amount of time for which it can be cached, depend on a lot of factors (far
more than content type, the topic of the *snipped* part of the quoted
message). You certainly could not assume that refreshing (via one particular
URL) a document that was referenced by 500 URLs on 100 different servers
would give you a current version valid for all 500 URLs. The header
information from some sites might indicate that revalidation should be done
every 30 minutes, while for others 30 days might be OK (and yet other sites
might return the same document in a form which was uncacheable; should that
render them all uncacheable, since they are "the same"?).

Thinking about this briefly, it looks like the only safe/reliable way to do
this would involve storing the response headers in the cache separately from
the document content, and allowing everything to proceed as normal (including
revalidating each cached URL independently). The "only" difference would be
that there would only ever be a single copy of the document content for each
unique document, accessed indirectly when examination of the cached headers
for a requested URL showed the content should still be considered current
for that URL. There would be a lot of added complexity, though - e.g.
keeping track of which document content matched each set of response
headers, and recognising when no response headers referenced a particular
"content" file any longer, so that it could be deleted.
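Very roughly, something along these lines (hypothetical names, an in-memory
toy, nothing like Squid's actual storage layout):

import hashlib

class DedupCache:
    def __init__(self):
        self.meta = {}       # url -> (headers, content_key)
        self.content = {}    # content_key -> body bytes, stored once
        self.refs = {}       # content_key -> number of URLs using it

    def store(self, url, headers, body):
        key = hashlib.sha1(body).hexdigest()
        old = self.meta.get(url)
        self.meta[url] = (headers, key)
        if key not in self.content:
            self.content[key] = body
            self.refs[key] = 0
        self.refs[key] += 1
        if old is not None:              # URL previously pointed elsewhere
            self._release(old[1])

    def _release(self, key):
        self.refs[key] -= 1
        if self.refs[key] == 0:          # no headers reference this content
            del self.content[key], self.refs[key]

    def lookup(self, url):
        """Return (headers, body) if the URL is cached; the freshness
        check against the per-URL headers happens in the caller."""
        if url not in self.meta:
            return None
        headers, key = self.meta[url]
        return headers, self.content[key]

The reference counts are exactly the sort of extra bookkeeping that has to
be kept consistent with the metadata and content files.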

I suspect that getting this right (including recovery after system crashes
that left the various files inconsistent, etc.) would be "tricky". And that's
ignoring related issues like the possibility of hash collisions causing a
request to be answered with the wrong content (which, as noted in other
recent messages, can be "unfortunate", depending on the nature of the wrong
content...).
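If collisions were a concern, one (purely hypothetical) mitigation would be
to treat the hash only as a hint and compare the stored bytes before sharing
a content file, falling back to a separate copy when they differ:

import hashlib

def content_key(existing_bodies, body):
    # Only reuse an existing content entry if its bytes really are
    # identical; a hash match alone is not taken as proof of identity.
    key = hashlib.sha1(body).hexdigest()
    stored = existing_bodies.get(key)
    if stored is not None and stored != body:
        # A genuine collision: file this body under a distinct key.
        key = "sha256:" + hashlib.sha256(body).hexdigest()
    return key

The obvious cost is that every store then has to read the existing content
back in order to compare it.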

                                John Line

-- 
University of Cambridge WWW manager account (usually John Line)
Send general WWW-related enquiries to webmaster@ucs.cam.ac.uk
Received on Sat Nov 15 1997 - 05:32:59 MST
