[squid-users] cpu usage increases over time, squid performance declines

From: Mike Solomon <mas63@dont-contact.us>
Date: Tue, 14 Feb 2006 22:31:38 -0800

Hi all,

I've been having some problems with squid in reverse proxy mode. I
will preface this by saying that I realize I'm pushing squid to the
limits and I've been following the recent discussions on
performance. I'll try to be as clear as possible to expose the crux
of my problem, but please bear with me as the explanation is rather
long.

Hardware:
DualCore Opeteron 270, 1800MHz
8GB RAM
2x73GB 10k RPM SCSI drives (no raid)

Software:
RedHat Enterprise Linux 4 (64-bit) (kernel 2.6.9-22.0.2.ELsmp) (some
RedHat frankenkernel)
squid-2.5-patch12
added epoll patch (verified that epoll is getting used, had to tweak
the Makefile and a constant or two)
./configure --with-maxfd=32768 --disable-poll --disable-select --
enable-epoll --enable-async-io=48 --prefix=/usr/local/encap/
squid-2.5.12-64bit-epoll

I have two AUFS cache dirs:
cache_dir aufs /cache1 24000 8 256
cache_dir aufs /mnt/drive2/cache2 24000 8 256

I've overridden the refresh:
refresh_pattern . 525600 20% 525600 override-expire override-lastmod
ignore-reload

This machine basically only serves a huge set of images, which do not
change over time, so refresh isn't really relevant.

Ok, so now for the good news. I have about 17,000 active connections
(keep-alive is set to 30 seconds) doing 1700 requests/sec. Disk I/O
is negligible, as the cache is partitioned and all I/O is coming from
the page cache. There is cpu to spare - a vmstat over 60 seconds
reveal plenty of idle, 0 iowait, and an even split between user and
sys cpu time. All this yields about 43Mbps of traffic on a 100Mbps
NIC. Most files are ~4k.

This would be fantastic, but the machines "fall over" after several
hours. I have 4 machines, each configured identically. They last a
few hours - they slowly consume more and more cpu, all in user space,
until it starts affecting the median HTTP repsonse time. Then
throughput drops precipitously.

If I interrupt and start a new process, everything returns to normal
- much lower cpu load, response times drop, throughput returns. The
store log reveals nothing - just the usual RELEASE messages. Running
an strace on a good and bad machine simultaneously doesn't reveal
anything obviously bad. There aren't any "queue congestion"
messages, and I'm not seeing SO_FAIL messages (I upped the number of
threads to 48 after I saw those messages reported with an earlier
machine). Additionally, the cache manager general info doesn't show
any particularly distressed metric.

Basically, something in squid is chewing user-space cpu over time,
but I can't figure it out. I'm pretty new to squid, and I'm pretty
much stalled.

Turning down the persistent_request_timeout did not seem to have any
effect as I recall.
request_timeout 30 seconds
persistent_request_timeout 30 seconds

I've attached a bit more data in case that can help.

I'd really appreciate any advice - squid is by far the best thing
I've tried and if I can just make it stable, I'd be done. :-)

Thanks,

-Mike

Sample data from the cache manager:

File descriptor usage for squid:
        Maximum number of file descriptors: 32768
        Largest file desc currently in use: 18675
        Number of file desc currently in use: 17869
        Files queued for open: 4
        Available number of file descriptors: 14895
        Reserved number of file descriptors: 100
        Store Disk files open: 15
Internal Data Structures:
        885561 StoreEntries
          2916 StoreEntries with MemObjects
          2891 Hot Object Cache Items
        884513 on-disk objects

Results of cumulative strace:

% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
37.45 56.910200 13 4332990 3 futex
14.76 22.434863 11 1973577 epoll_ctl
14.30 21.724293 26 846843 550 write
13.02 19.786709 12 1642753 830428 read
11.45 17.393603 9 1920158 fcntl
   6.63 10.075447 12 829458 close
   1.15 1.751153 126 13919 epoll_wait
   0.82 1.244980 9 139200 accept
   0.30 0.461178 3 139200 getsockname
   0.08 0.114513 4 27840 gettimeofday
   0.01 0.019765 29 693 693 connect
   0.01 0.013483 19 693 socket
   0.01 0.008909 49 180 getdents64
   0.00 0.005713 13 430 stat
   0.00 0.005326 8 693 bind
   0.00 0.003456 5 693 getsockopt
   0.00 0.003367 5 693 setsockopt
   0.00 0.000936 20 48 open
   0.00 0.000347 3 103 lseek
   0.00 0.000241 5 48 fstat
   0.00 0.000110 22 5 brk
   0.00 0.000067 5 13 getrusage
------ ----------- ----------- --------- --------- ----------------
100.00 151.958659 11870230 831674 total
Received on Tue Feb 14 2006 - 23:31:46 MST

This archive was generated by hypermail pre-2.1.9 : Wed Mar 01 2006 - 12:00:03 MST