CARP revisited: a better solution?

From: Scott Lystig Fritchie <fritchie@dont-contact.us>
Date: Wed, 08 Apr 1998 23:36:29 -0500

Though I don't operate a large cluster of caching proxy boxes (yet),
the CARP "protocol" draft & MS's propaganda caught my interest. I
agree that, given a "cluster" of M caching proxy boxes with N gigs of
storage each, it would be nice if the cluster had an effective cache
size of MxN gigs instead of N gigs. (I don't know whether
ICP-communicating proxies actually end up with only N gigs of
effective cache, but MS's argument sounds plausible to me.)

What if there were a state midway between the usual caching proxy mode
and the proxy only mode:

        * if a file is retrieved DIRECT or from a "far-away" parent
        or sibling cache, then cache the file yourself.
        * if a positive ICP response from a "near" sibling is
        received, retrieve the file from that sibling but don't cache
        it yourself.
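
The two rules above could be sketched roughly as follows. This is only an illustration in Python, not Squid code; the names `Source` and `should_cache_locally` are made up for the example.

```python
from enum import Enum


class Source(Enum):
    """Where the object came from (hypothetical labels for this sketch)."""
    DIRECT = "direct"              # fetched from the origin server
    FAR_PEER = "far_peer"          # "far-away" parent or sibling cache
    NEAR_SIBLING = "near_sibling"  # "near" sibling that answered ICP with a hit


def should_cache_locally(source: Source) -> bool:
    """Return True if this proxy should keep its own copy of the object."""
    # Second rule: a near sibling already holds the object, so only relay it.
    if source is Source.NEAR_SIBLING:
        return False
    # First rule: DIRECT or far-away peer, so keep a copy here.
    return True
```

With this policy an object fetched via a near sibling stays on exactly one box in the cluster, which is what keeps duplication down.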

MS argues that ICP traffic would increase exponentially as boxes are
added to the cluster. Assuming that the multicast ICP code is robust,
problem solved, no?
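
A back-of-the-envelope count may help here. It assumes each cache miss triggers one ICP query to every other cluster member and one reply from each, and that multicast replaces the separate queries with a single packet while the replies stay unicast. Those are assumptions about the traffic pattern, not measurements, and `icp_packets_per_miss` is a made-up helper.

```python
def icp_packets_per_miss(members: int, multicast: bool = False) -> int:
    """Estimate ICP packets generated by one cache miss in a cluster.

    Assumes one query per peer (or a single multicast query) and one
    unicast reply from every peer.
    """
    peers = members - 1
    # Multicast sends the query once; unicast sends it to each peer.
    queries = 1 if (multicast and peers > 0) else peers
    replies = peers
    return queries + replies


if __name__ == "__main__":
    for m in (2, 4, 8, 16):
        print(m, icp_packets_per_miss(m), icp_packets_per_miss(m, multicast=True))
```

Under these assumptions the per-miss packet count grows linearly with the number of boxes, and multicast roughly halves it rather than eliminating it, since every peer still replies.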

MS's other main argument (as far as I'm concerned) is that CARP would
reduce the number of files stored on multiple machines within the
cluster. This scheme would have the same effect.

[ _Just_ before sending this message, I took a look at the comments in
  the Squid 1.1.21 "squid.conf" config file, and it looks like a
  statement such as:

        cache_host mpls-1.cache.mr.net sibling 3128 3130 proxy-only

  ... would already do exactly what I'm suggesting ought to happen.
  If that's indeed true, then I feel slightly less foolish. :-)
]

An additional bonus of this ICP-based scheme is that it does _not_
rely on the clients implementing CARP. (*) This would be awfully
handy for keeping the effective cluster cache size "big" for:

        1. Browsers unable (or users unwilling) to use the Javascript
        autoconfiguration mechanism

        2. Proxies lower in your caching/proxy hierarchy which do not
        implement CARP.

For larger ISPs and corporate environments, #2 seems to me to be a
bigger win. We've got approximately 25 customers (1/4 of which are in
turn ISPs) with Squid, Harvest, Netscape, Microsoft, and various
firewall vendors' proxies using MRNet's hierarchy. I don't wanna have
to tell them, "Oh, our cluster of servers in Minneapolis has grown
from N to N+2. Please change your CARP configs appropriately." We
have enough problems with communication with customers for *really*
important changes. :-)

Negative features not addressed by this scheme:

        * Still have additional data on the cluster LAN being
          forwarded from the sibling with the file to the proxy
          which received the original request from the client. With
          a switched network environment, I'd hope this wouldn't be a
          big problem....

        * The forced expiration problem, mentioned in
          http://cache.is.co.za/squid/opt/performance.html. An
          administrator wishing to flush certain objects from the
          cache cluster as a whole would have to request a purge from
          each cluster member. This task could be easily automated
          and thus doesn't strike me as a big deal, either.
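
As a rough idea of what that automation might look like, here is a minimal Python sketch. It assumes every cluster member is configured to accept an HTTP `PURGE` request for the object's URL (Squid can be set up to allow this via an ACL); the host names and the helper names `purge_targets` and `purge_everywhere` are hypothetical.

```python
import http.client


def purge_targets(members, url):
    """Build one (host, port, method, url) tuple per cluster member."""
    return [(host, port, "PURGE", url) for host, port in members]


def purge_everywhere(members, url, timeout=5.0):
    """Send the purge request to every member.

    Returns a dict mapping each host to the HTTP status it returned,
    or to the exception raised while contacting it.
    """
    results = {}
    for host, port, method, target in purge_targets(members, url):
        conn = http.client.HTTPConnection(host, port, timeout=timeout)
        try:
            conn.request(method, target)
            results[host] = conn.getresponse().status
        except (OSError, http.client.HTTPException) as exc:
            # Record the failure and keep going with the other members.
            results[host] = exc
        finally:
            conn.close()
    return results


if __name__ == "__main__":
    # Hypothetical cluster members; replace with real hosts and ports.
    cluster = [("cache1.example.net", 3128), ("cache2.example.net", 3128)]
    print(purge_everywhere(cluster, "http://www.example.com/stale.html"))
```

Collecting a per-host result instead of stopping at the first error makes it easy to see which members still hold the object.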

One potential positive thing (depending on how CARP is supposed to
behave when using persistent connections):

        * http://naragw.sharp.co.jp/sps/ Appendix A suggests that HTTP
          1.1 pipeline/persistent connections cause problems for a
          client-hashing scheme like CARP or Sharp's. The solution
          proposed in the URL above is to exclude the filename from
          the hash computation. I don't know how CARP would get
          around this problem, other than to open multiple persistent
          connections to different proxies. Otherwise, it seems to me
          you'd end up with file duplication within the cluster.
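
To make the hashing issue concrete, here is a small Python sketch of hash-based proxy selection. It uses a highest-score rule with MD5 purely for illustration; it is NOT the hash function from the CARP draft or from Sharp's scheme, and `pick_proxy` and `host_of` are made-up names.

```python
import hashlib
from urllib.parse import urlsplit


def pick_proxy(key: str, proxies: list[str]) -> str:
    """Pick the proxy with the highest hash score for this key.

    Illustrative highest-score selection over (proxy, key) pairs;
    not the actual CARP hash.
    """
    def score(proxy: str) -> int:
        digest = hashlib.md5(f"{proxy}|{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(proxies, key=score)


def host_of(url: str) -> str:
    """Return only the host part of a URL, dropping path and filename."""
    return urlsplit(url).hostname or ""


if __name__ == "__main__":
    proxies = ["proxy-a", "proxy-b", "proxy-c"]  # hypothetical members
    urls = [
        "http://www.example.com/index.html",
        "http://www.example.com/logo.gif",
        "http://www.example.com/news/today.html",
    ]
    for u in urls:
        # Full-URL key may differ per file; host-only key never does.
        print(u, pick_proxy(u, proxies), pick_proxy(host_of(u), proxies))
```

Hashing on the full URL can send different files from one server to different proxies, so a single persistent connection cannot serve them all. Hashing on the host alone sends every file from that server to the same proxy, at the cost of coarser load spreading.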

-Scott

(*) Backing up a step, it's my understanding that CARP would be used
in one of two ways (or mixed):

        1. Clients use CARP to figure out which proxy to ask for
        a particular file
        2. Clients talk to a box (or small cluster of boxes) which
        stands in front of the real cluster; these proxy-only boxes
        use CARP to figure out which machine in the real cache cluster
        to ask for a file, then simply transparently forward bits.

#2 would make the cache cluster's log files weird (you'd lose the
source address of the client), so #1 is probably more likely.
Received on Wed Apr 08 1998 - 21:44:52 MDT
