Re: Reliability and multiple disks

From: Eric Stern <estern@dont-contact.us>
Date: Tue, 7 Jul 1998 15:42:23 -0400 (EDT)

On Tue, 7 Jul 1998 hamster@lspace.org wrote:

> Hi all,
>
> I'm busy speccing up a new proxy/accelerator (It's nice to be able to
> plan my dream cache ;) Given that this is going to be an accelerator
> for a large number of customer virtual servers, it has to be as
> resilient to failure as possible (up to and including a disk crash in
> the cache).
>
> I've been reading the list and making frantic notes about RAID and
> other factors affecting performance. We were planning a RAID cache to
> start with, to give the resilience to handle disk crashes, but I've
> now been reading about multiple cache_dir entries.

OK, I think this might be my fault, since I'm the one that's been dissing
RAID, so I'll take a stab at this.

Let me qualify my remarks about RAID, as I don't think I made myself
clear. My comments about RAID vs. separate disks were driven purely by
performance considerations. Fault tolerance, however, is a different
issue. I don't think anyone could deny that nothing beats a good RAID
setup when it comes to fault tolerance, especially if you do it "right"
and get a system with full hot-swap and hot-rebuild capabilities. You can
get a near-100% uptime guarantee with that.

> For the moment let's assume the following
>
> --[ squid.conf ]--
> cache_dir /cache/disk1
> cache_dir /cache/disk2
> cache_dir /cache/disk3
> cache_dir /cache/disk4
> --[ squid.conf ]--
>
> Assuming one of these disks dies and goes off to whatever heaven disks
> go to what is going to happen to squid, will it handle the failure
> gracefully and keep going with a reduced cache or will it barf and
> fall over completely?

AFAIK, this depends on your OS. I'm most familiar with Linux, and I think
what would happen there is that a crashed HD would cause a SCSI bus error,
which would eventually lead to a kernel panic (ie the machine is dead).

I've read that other OSes can handle drive failures more gracefully. Given
a situation where the OS can detect and deal with a drive failure
(presumably simply by making the filesystems on that drive unavailable),
squid could *probably* survive. It would just get errors when trying to
read or write any cache files it had on that drive. When squid gets a read
error, it simply pulls the entry out of the cache and pretends it didn't
have it in the first place (ie it retrieves the object from the source).
When it gets a write error, it assumes the disk is full and shrinks the
size of that cache_dir. Presumably it would get continuous write errors
until it had shrunk the cache_dir size down to 0, at which point it would
just stop using it.
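
To make that concrete, here's a rough sketch in C of what those two error
paths amount to. This is purely illustrative -- it is NOT the actual squid
source, just a model of the behaviour described above, with made-up names
and sizes:

--[ errsketch.c ]--
/* Hypothetical model of the behaviour described above: a read error
 * drops the object and falls back to the source, a write error shrinks
 * the configured size of the affected cache_dir. */
#include <stdio.h>

struct cache_dir {
    const char *path;
    long max_size_kb;            /* configured size of this cache_dir */
};

/* Read error: forget we ever had the object, fetch it from the source. */
static void handle_read_error(const char *url)
{
    printf("read error: releasing %s, refetching from source\n", url);
}

/* Write error: assume the disk is full and shrink the cache_dir. */
static void handle_write_error(struct cache_dir *cd, long shrink_kb)
{
    cd->max_size_kb -= shrink_kb;
    if (cd->max_size_kb < 0)
        cd->max_size_kb = 0;     /* at 0 the dir effectively stops being used */
    printf("write error on %s: size shrunk to %ld KB\n",
           cd->path, cd->max_size_kb);
}

int main(void)
{
    struct cache_dir d = { "/cache/disk2", 1024 * 1024 };
    handle_read_error("http://www.example.com/logo.gif");
    handle_write_error(&d, 256 * 1024);
    return 0;
}
--[ errsketch.c ]--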

It would be trivial to patch squid to do something like "if you get x many
read/write errors on a cache_dir, just stop using it".
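
Again only a sketch (hypothetical, not a real diff against the squid
source), but the patch would basically amount to a per-cache_dir error
counter with a threshold:

--[ patchsketch.c ]--
/* Hypothetical sketch of the "stop using it after X errors" idea:
 * count I/O errors per cache_dir and stop selecting that dir for new
 * objects once the count passes a threshold. */
#include <stdio.h>

#define MAX_IO_ERRORS 10         /* the arbitrary threshold "x" */

struct cache_dir {
    const char *path;
    int io_errors;
    int disabled;
};

/* Call this from the read/write error paths. */
static void note_io_error(struct cache_dir *cd)
{
    if (cd->disabled)
        return;
    if (++cd->io_errors >= MAX_IO_ERRORS) {
        cd->disabled = 1;        /* never pick this dir for new objects again */
        fprintf(stderr, "cache_dir %s disabled after %d I/O errors\n",
                cd->path, cd->io_errors);
    }
}

int main(void)
{
    struct cache_dir d = { "/cache/disk3", 0, 0 };
    int i;
    for (i = 0; i < MAX_IO_ERRORS; i++)
        note_io_error(&d);
    return 0;
}
--[ patchsketch.c ]--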

Of course, another option is to accept that a disk failure will bring down
the system, and install a backup cache server using a virtual IP failover
setup. This could get expensive, but your backup doesn't need to be
nearly as powerful as your main server, as it just needs to hold you over
until you get the main one fixed.
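
In case it helps, the takeover on the backup box could be as simple as a
little watchdog along these lines (again only a sketch -- the interface
name, the addresses and the ifconfig alias syntax are placeholders and
will vary by OS):

--[ takeover.c ]--
/* Hypothetical watchdog for the backup cache: ping the primary, and if
 * it stops answering, bring up the shared (virtual) IP locally.
 * PRIMARY_IP, VIRTUAL_IP and the eth0:0 alias syntax are placeholders. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define PRIMARY_IP "192.168.1.10"   /* real address of the main cache   */
#define VIRTUAL_IP "192.168.1.100"  /* address the clients actually use */
#define CHECK_CMD  "ping -c 1 " PRIMARY_IP " > /dev/null 2>&1"
#define TAKEOVER   "ifconfig eth0:0 " VIRTUAL_IP " netmask 255.255.255.0 up"

int main(void)
{
    for (;;) {
        if (system(CHECK_CMD) != 0) {   /* primary not answering */
            fprintf(stderr, "primary down, taking over %s\n", VIRTUAL_IP);
            system(TAKEOVER);           /* claim the virtual IP */
            break;
        }
        sleep(5);                       /* poll every few seconds */
    }
    return 0;
}
--[ takeover.c ]--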

/-----------------------------------------------------------------------/
/ Eric Stern - PacketStorm Technologies - (519) 837-0824 /
/ http://www.packetstorm.on.ca /
/ WebSpeed - a transparent web caching server - available now! /
/-----------------------------------------------------------------------/
Received on Tue Jul 07 1998 - 12:51:48 MDT

This archive was generated by hypermail pre-2.1.9 : Tue Dec 09 2003 - 16:41:04 MST