[squid-users] squid crashing consistently under high loads (2.4STABLE1)

From: Adam Woodbridge <adam@dont-contact.us>
Date: Mon, 7 May 2001 22:55:59 -0500

I'd like to bring the following issue to the attention of this mailing list.

I work for a large Canadian cable company that provides cable modem service
to over ten thousand customers. Over the last few months I've devoted a
considerable amount of time to setting up a group of proxy servers as part
of a transition away from @Home's equipment. I decided to use squid, mostly
because of my past experience with the product, the obvious cost benefit,
and my tendency to lean toward open-source solutions (we run Linux on most
of our servers) when they are available.

The (identical) configuration of the three boxes is as follows:

- Intel Pentium III 833MHz processor x 2
- Intel STL2 motherboard
- 1 GB memory
- Intel SRCU3-1 Ultra 160 LVD RAID controllers
- IBM DDYS-T18350M 18GB SCSI drives x 3 (configured as RAID level 0,
totaling approx. 50GB of cache storage per box).

The operating system is Linux (Red Hat 6.2) running kernel 2.4.4. Squid is
2.4STABLE1.

The boxes are set up in a round-robin fashion behind layer 4 switches (F5's
BigIP product), utilizing cache digests for inter-cache communication.
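
For reference, the sibling relationships are ordinary cache_peer lines of
roughly this form (the hostnames here are placeholders, not our real ones,
and proxy-only is shown only as an example option):

cache_peer proxy2.example.com sibling 3128 3130 proxy-only
cache_peer proxy3.example.com sibling 3128 3130 proxy-only

With squid built with cache digest support (--enable-cache-digests), the
digests are fetched from these siblings automatically.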

My problem is that, under high loads during peak times, one or more of the
boxes will essentially crash. I say "essentially" because, when this
happens, pinging the box still gets roughly a 50% response rate, yet I'm
unable to ssh or telnet into it. I can get as far as receiving a login
banner through telnet, but the login program never successfully executes a
shell after authentication.

During peak times, each box handles between 75 and 100 HTTP requests/second,
delivering up to 10Mbps out of its 100Mbps Ethernet interface.

Fortunately, our layer 4 switches have a health monitoring system so that
when one (or more) of these boxes goes down, there is no customer impact;
requests are immediately redirected to one or more of the proxy systems
that are still alive.

Initially I suspected the RAID controller, so both the on-board memory
(64MB) on the RAID card and the backplane that controls the SCSI disks were
replaced (the latter did actually fix a documented issue whereby the
backplane would mark disks as "offline" under high loads). Neither
replacement has resolved the problem.

Mysteriously, squid doesn't report anything seriously wrong in cache.log
just before the box crashes. However, after recently enabling kernel console
logging to the serial port, I saw the following error message
repeatedly displayed while the problem was happening on one of the boxes:

__alloc_pages: 1-order allocation failed.

If I understand this message correctly, the kernel has failed to allocate
memory. But how can this be when, even during peak usage, the squid process
uses only about 300MB? Each box has over 1GB of RAM in it, plus another
1.5GB in swap space (which I've never seen used)!
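
(If I read the message right, an order-1 request asks for 2^1 = 2 physically
contiguous pages, i.e. 8KB on x86, so perhaps it can fail even when plenty
of memory is free but fragmented?)

In case it's useful, here is a minimal, untested sketch of a logger that
could be left running alongside squid to capture the memory state leading up
to a hang. It simply appends a timestamped copy of /proc/meminfo to a log
file every few seconds; the log path and interval are arbitrary choices:

#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    for (;;) {
        FILE *in = fopen("/proc/meminfo", "r");
        FILE *out = fopen("/var/log/meminfo.log", "a");
        if (in && out) {
            char buf[512];
            time_t now = time(NULL);
            /* ctime() already appends a newline */
            fprintf(out, "--- %s", ctime(&now));
            while (fgets(buf, sizeof(buf), in))
                fputs(buf, out);
        }
        if (in)
            fclose(in);
        if (out)
            fclose(out);
        sleep(5);          /* sample interval in seconds */
    }
    return 0;
}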

Some critical settings in my squid.conf file are:

cache_mem 64 MB
maximum_object_size 65536 KB
ipcache_size 2048
fqdncache_size 2048
cache_dir diskd /cache0 50000 16 256

(Note: the problem occurs regardless of storage type; I have seen it with
ufs, aufs, and diskd.)

Aside from the sibling settings and ACLs, everything else is left at the defaults.

A bit more information that might be of use is this snapshot of resource
usage as reported by CacheMgr:

Memory usage for squid via mallinfo():
        Total space in arena: 311321 KB
        Ordinary blocks: 306132 KB 5417 blks
        Small blocks: 0 KB 0 blks
        Holding blocks: 9676 KB 7 blks
        Free Small blocks: 0 KB
        Free Ordinary blocks: 5188 KB
        Total in use: 315808 KB 101%
        Total free: 5188 KB 2%
Memory accounted for:
        Total accounted: 238156 KB
        memPoolAlloc calls: 1487958379
        memPoolFree calls: 1481693539
File descriptor usage for squid:
        Maximum number of file descriptors: 24576
        Largest file desc currently in use: 1477
        Number of file desc currently in use: 721
        Files queued for open: 0
        Available number of file descriptors: 23855
        Reserved number of file descriptors: 100
        Store Disk files open: 0
Internal Data Structures:
        2006700 StoreEntries
          8993 StoreEntries with MemObjects
          8960 Hot Object Cache Items
        2005775 on-disk objects

page_faults = 0.000000/sec
select_loops = 294.824806/sec
select_fds = 457.364591/sec
average_select_fd_period = 0.001929/fd
median_select_fds = 0.000000
swap.outs = 9.779385/sec
swap.ins = 26.388342/sec

cpu_usage = 60.819511%

Any information or suggestions that could be provided towards resolving this
problem would be very much appreciated. I'd be happy to forward more
information if required.

Regards,

Adam Woodbridge