Re: modules

From: Terence Kelly <tpkelly@dont-contact.us>
Date: Mon, 31 May 1999 20:15:56 -0400 (EDT)

On Mon, 31 May 1999, Alex Rousskov wrote:

> On Mon, 31 May 1999, Terence Kelly wrote:
>
> > I'm aware of exactly one
> > widely-cited paper that deals with reasonable cache sizes: Rizzo &
> > Vicisano's "Replacement Policies for a Proxy Cache". They use the
> > DEC trace (22M requests before filtering) and simulate up to 10GB.
>
> Sigh. I was hoping for something fresh. 1996 DEC traces are getting quite
> old. Also, I do not recall _significant_ improvements over LRU in their
> paper, but apparently I have to re-read it.

Define "significant". :) If algorithm X consistently outperforms
algorithm Y and isn't much harder to implement correctly and
efficiently, why shouldn't we prefer X? I don't yet have a good feel
for the kinds of implementation issues & constraints that real cache
developers face. Could you provide a list of good & bad qualities
for removal policies, for the benefit of researchers?

For instance, does it matter whether a replacement policy operates in
O(N) or O(log N) time? Is it feasible for an LFU-like policy to
store reference counts for URLs even after they are evicted?
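To make the question concrete, here is a minimal sketch (in Python, purely
illustrative; the class and field names are my own, not from Squid or any
real cache) of an LFU-like policy with O(log N) eviction via a min-heap,
where reference counts live in a separate table so they survive eviction:

```python
import heapq

class LFUCache:
    """Sketch of an LFU removal policy with O(log N) eviction.
    Reference counts are kept in a separate dict so they persist
    after eviction, as contemplated above. Stale heap entries are
    skipped lazily rather than deleted in place."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.sizes = {}      # url -> size, for currently cached objects
        self.refcount = {}   # url -> count, retained even after eviction
        self.heap = []       # (count, seq, url) entries; some stale
        self.seq = 0         # tie-breaker: equal counts evict oldest first

    def access(self, url, size):
        """Record a reference; return True on hit, False on miss."""
        self.refcount[url] = self.refcount.get(url, 0) + 1
        if url in self.sizes:  # hit: push an entry with the new count
            heapq.heappush(self.heap, (self.refcount[url], self.seq, url))
            self.seq += 1
            return True
        # miss: evict least-frequently-used objects until the new one fits
        while self.used + size > self.capacity and self.heap:
            count, _, victim = heapq.heappop(self.heap)
            # skip entries whose count is stale or whose object is gone
            if victim in self.sizes and count == self.refcount[victim]:
                self.used -= self.sizes.pop(victim)
        if size <= self.capacity:
            self.sizes[url] = size
            self.used += size
            heapq.heappush(self.heap, (self.refcount[url], self.seq, url))
            self.seq += 1
        return False
```

The lazy-deletion trick keeps every operation at O(log N) without needing
an indexed heap; whether that constant factor is acceptable in a real
proxy is exactly the kind of thing I'd like implementors to weigh in on.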

> > In my WCW99 presentation I showed data on cache sizes up to 8GB on
> > two-week NLANR request streams of 5-7 million requests. Now I'm
> > working with six 28-day NLANR request streams, each of which contains
> > between 10 and 35 million requests. I'm simulating caches up to
> > 32GB. My data show that LFU with an aging mechanism offers
> > substantially higher byte hit rates than LRU; for instance, on our
> > largest trace, an 8GB LRU cache gives you a 51% BHR whereas LFU gives
> > you 55%. I'll have a paper ready for submission in a few weeks.
>
> Hmm.. Except UC and SD (which are special), NLANR caches get ~15% byte hit
> ratio (max 25%) despite cache sizes larger than 8GB. Either you are not
> accounting for updates, reloads, expirations etc. or Squid LRU is *really*
> far from classic LRU. Based on statistics reported by other cache admins, I
> would suggest the former (not too many people can brag about 50% BHRs
> without violating HTTP and doing other extreme optimizations). Note that
> NLANR caches (or traces) have even lower than average hit rates because of
> our location in the hierarchy.

The data I reported above are from the SD site, 1-28 March 1999. If
the numbers still smell wrong, the discrepancy likely arises from the
way I filter NLANR access logs before running simulations. I remove
entirely all requests with an HTTP reply code other than 200, and all
requests that appear to be for dynamic content. The size of a URL is
the maximum transfer size ever reported for it, and leading "http://"
and trailing "/" strings in URLs are not significant, so all of the
following are considered references to the same URL:

    http://foo.com/
    http://foo.com
    foo.com
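In code, the normalization amounts to something like this (a sketch;
the function name is mine):

```python
def normalize_url(url):
    """Treat "http://foo.com/", "http://foo.com", and "foo.com"
    as references to the same URL, per the rules above."""
    if url.startswith("http://"):
        url = url[len("http://"):]
    return url.rstrip("/")
```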

I assume that URLs do not change, and I look only at successful
requests for URLs not present in client caches (thus code 304 log
entries are discarded). I know this isn't realistic, but until
something like an MD5 hash of the data behind a URL is included in
log entries, the logs alone cannot establish with certainty whether
the "foo.com" requested this time is the same one requested last
time.
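For what it's worth, the whole filtering step reduces to a predicate
like the following sketch. The dynamic-content heuristic shown (query
strings and cgi-bin paths) is an assumption on my part; nothing above
pins down the exact test:

```python
def keep_request(status, url):
    """Return True if a log entry survives the trace filtering
    described above. The dynamic-content heuristic is assumed,
    not specified."""
    if status != 200:   # drops 304s and every other reply code
        return False
    if "?" in url or "/cgi-bin/" in url:   # assumed dynamic-content test
        return False
    return True
```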

Suggestions for increasing the realism of my work at reasonable cost
are very welcome.
Received on Tue Jul 29 2003 - 13:15:58 MDT