Re: [squid-users] Ideal cache placement (was Re: Why Squid is great (was: fourth cache off??))

From: Jon Kay <jkay@dont-contact.us>
Date: Thu, 20 Dec 2001 12:58:57 -0600

> I think this is a potentially dangerous argument to make without
> any supporting evidence.

"Dangerous." I like that.

Finally I have an excuse to buy some black leather.

Actually, I do have supporting evidence. See [OSDI98], at
http://www.pushcache.com/tr98-04.pdf

> To demonstrate why, let's push this to its natural conclusion, where
> each user has his/her own private cache. In this scenario, it's hard
> to see how the caches are of much benefit, because of the lack of
> sharing.

I remember reading about this debating technique when I was 14.
"Reductio ad absurdum." Let me try it on your argument.

OK, so if we should centralize everything, then let's have One Big
Fast Cache in the center to serve all requests.

      One Cache to rule them all, One Cache to find them,
      One Cache to bring them all and in the darkness bind them.

That is, when it isn't crashing from overload.

Of course, it'll take forever and a day for user requests to get there
and make it through the backed-up request queue, but never mind: it
will respond with near-oracular wisdom.

I have heard rumors that AT&T actually deployed such a thing. We all
know how they are a caching industry leader now.

Back in the Real World (tm)(r), though, let's rewind that last
sentence of yours and take a look at it.

> In this scenario, it's hard to see how the caches are of much
> benefit, because of the lack of sharing.

...unless they are running hint cache or Cache Digests or ICP or even
old-fashioned simple hierarchies. In that case, there will be sharing.
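For anyone unfamiliar with how Cache Digests make that sharing cheap: each cache publishes a compact Bloom-filter summary of its contents, so a neighbor can check "does my peer probably have this URL?" without a per-request ICP round trip. Here is a minimal sketch of the idea; the sizes, hash construction, and class names are illustrative assumptions, not Squid's actual digest format:

```python
import hashlib

class Digest:
    """Toy Bloom-filter digest of a cache's contents (illustrative,
    not Squid's on-the-wire digest format)."""

    def __init__(self, n_bits: int = 8192, n_hashes: int = 4):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, url: str):
        # Derive n_hashes bit positions from salted SHA-256 of the URL.
        for i in range(self.n_hashes):
            h = hashlib.sha256(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.n_bits

    def add(self, url: str) -> None:
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, url: str) -> bool:
        # False means definitely absent; True means probably present
        # (false positives are possible, false negatives are not).
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(url))

# A leaf cache consults the neighbor's digest before forwarding a miss.
peer = Digest()
peer.add("http://example.com/popular.html")
print(peer.might_contain("http://example.com/popular.html"))  # True
print(peer.might_contain("http://example.com/unseen.html"))
```

The point is that the digest is exchanged once per rebuild interval, so the per-request cost of sharing is a local bit test, not a network query.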

Plus, I didn't suggest one box per user, but one box per internal group.
In the real world, people setting up caches are able to make reasonable
decisions about these things.

> Adding new communications into the mix may not be a clear
> win, especially if the extra communication doesn't scale well with the
> number of caches.

What "extra communication that doesn't scale well with the number of
caches" are we talking about?

> Our experience, and the experience of others, seems to indicate that
> a cache's hit rate generally increases as the client population
> increases. This concept is the reason why you'd expect any bandwidth
> savings from hierarchical caches. The drawback to hierarchical caches
> is the additional latency involved in the hierarchy - much worse
> than router hops or line losses. There's a reasonably good paper
> that puts things in perspective:
> http://www.cs.washington.edu/research/networking/websys/pubs/sosp99/

That paper presents results that appear to suggest that any kind of
cooperative cache cloud is a waste of time. It arrives at that
conclusion by simulating algorithms that poorly approximate reality,
with cache population distributions that are just silly.

Hint Cache and Cache Digest are more like the "directory" caches in
that paper. Except that the "directory" simulated is completely
different from the workings of any actual "directory" scheme out
there. It approximates hint cache, but is wrong in two key ways:

1) On a leaf cache miss / hint cache hit, the request is redirected to
   the CLOSEST cache holding the object, NOT to the middle of the
   cloud. With middle-of-cloud, we might as well stick with
   hierarchical and not waste time with all this hacking.
     1a) With Cache Digests, on a leaf miss / digest hit, requests go
         to neighbors ONLY.

2) A population per cache of 50,000 is already wrong, and 500,000 and
   5,000,000 are even more ludicrous. The reason hint caches are a
   good idea is that a cache of that size is already going to cause
   big slowdowns. Hint caches are designed for per-cache client
   populations in the hundreds to thousands, at most tens of
   thousands - a region where good hit latency is easy to achieve.
   
   It is plain from the fact that they start their user populations at
   the point of diminishing returns that they chose their numbers to
   make a more interesting and controversial paper rather than to
   enlighten.

   In fact, even that "diminishing return" region is pretty suspect.
   It doesn't even begin to show up until you have an order of
   magnitude more users than they actually had. They need to explain
   a bit more about that "linear fit" of theirs. The cacheable hit
   rate they say flattens is still going strong - growing
   logarithmically, as dictated by Zipf's Law - at the point where
   they run out of real data.
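The Zipf's-Law point is easy to check with a toy simulation: draw requests from a Zipf-like popularity distribution and watch the hit rate of an idealized infinite cache keep climbing as the request count grows, rather than hitting a hard plateau. The object-universe size and Zipf exponent below are assumed values for illustration, not numbers from either paper:

```python
import random

random.seed(42)

N_OBJECTS = 100_000   # assumed universe of cacheable objects
ALPHA = 1.0           # classic Zipf exponent: popularity ~ 1/rank

# Zipf popularity weights over object ranks 1..N_OBJECTS.
weights = [1.0 / (rank ** ALPHA) for rank in range(1, N_OBJECTS + 1)]

def hit_rate(n_requests: int) -> float:
    """Hit rate of an ideal infinite cache: every object misses the
    first time it is requested and hits every time after that."""
    seen = set()
    hits = 0
    for obj in random.choices(range(N_OBJECTS), weights=weights,
                              k=n_requests):
        if obj in seen:
            hits += 1
        else:
            seen.add(obj)
    return hits / n_requests

# Hit rate keeps improving as the client request stream grows.
rates = [hit_rate(n) for n in (1_000, 10_000, 100_000)]
print(rates)
```

Run it and the hit rate rises with each tenfold increase in requests - slowing down, as a log curve does, but never flat. Which is exactly the shape you'd expect their "flattening" cacheable region to have if they had kept collecting real data.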
Received on Thu Dec 20 2001 - 12:00:31 MST
