RE: [squid-users] Sporadic high CPU usage, no traffic

From: Chris Robertson <crobertson@dont-contact.us>
Date: Mon, 1 Nov 2004 11:49:29 -0900

Yet more data:
Version 2.5.STABLE7 configure options: --bindir=/home/squid2/bin
--sbindir=/home/squid2/bin --libexecdir=/home/squid2/bin
--datadir=/home/squid2/etc --sysconfdir=/etc/squid
--localstatedir=/home/squid2 --mandir=/usr/man --enable-ssl
--enable-err-languages=English --disable-ident-lookups
--enable-storeio=ufs,null --enable-snmp --disable-poll

Changed the cache_dir line to read :
cache_dir null /dev/null

Now the access.log (with full debugging) at high CPU usage shows:

2004/11/01 11:32:55| cbdataUnlock: 0x83f0560
2004/11/01 11:32:55| cbdataUnlock: Freeing 0x83f0560
2004/11/01 11:32:55| cbdataFree: 0x84dae78
2004/11/01 11:32:55| cbdataFree: Freeing 0x84dae78
2004/11/01 11:32:55| cbdataUnlock: 0x84dfa88
2004/11/01 11:32:55| cbdataUnlock: 0x88830e8
2004/11/01 11:32:55| cbdataFree: 0x83ef270
2004/11/01 11:32:55| cbdataFree: 0x83ef270 has 1 locks, not freeing
2004/11/01 11:32:55| cbdataUnlock: 0x83ef270
2004/11/01 11:32:55| cbdataUnlock: Freeing 0x83ef270
2004/11/01 11:33:08| eventRun: RUN ID 1342
2004/11/01 11:33:08| eventRun: Running 'idnsCheckQueue', id 1340
2004/11/01 11:33:08| comm_select: 54+0 FDs ready at 1099341188
2004/11/01 11:33:08| comm_select: FD 25 ready for reading
2004/11/01 11:33:08| httpReadReply: FD 25: len 0.
2004/11/01 11:33:08| fwdComplete: http://ps.galenaalaska.org/
        status 302
2004/11/01 11:33:08| fwdReforward: http://ps.galenaalaska.org/?
2004/11/01 11:33:08| fwdReforward: No, ENTRY_FWD_HDR_WAIT isn't set
2004/11/01 11:33:08| fwdComplete: not re-forwarding status 302

This is really killing me. Is it the eventRun that is causing the stall, is
it breaking the stall, or is it completely unrelated?

Thanks for your time,

Chris

-----Original Message-----
From: Chris Robertson [mailto:crobertson@gci.com]
Sent: Friday, October 29, 2004 4:40 PM
To: 'squid-users@squid-cache.org'
Subject: RE: [squid-users] Sporadic high CPU usage, no traffic

More data, same problem. I've made distclean, reconfigured (Squid Cache:
Version 2.5.STABLE7 configure options: --bindir=/home/squid2/bin
--sbindir=/home/squid2/bin --libexecdir=/home/squid2/bin
--datadir=/home/squid2/etc --sysconfdir=/etc/squid
--localstatedir=/home/squid2 --mandir=/usr/man
--enable-err-languages=English --disable-ident-lookups
--enable-storeio=diskd,ufs --enable-snmp --disable-poll) to eliminate the
possibility of it being poor thread support in glibc, or polling in Linux
2.2 and still see the same problems. I've tried both diskd and ufs as
cache_dir types. I picked a fairly low traffic (2-4 req/sec & 10-15 KB/sec)
client and added debug_options ALL,9 to the conf file, and watched the CPU
usage.

At around 16:09:25 AKST October 29, 2004) I noticed a CPU spike. Here is an
excerpt from the access.log:

1099094958.155 2259 10.200.6.29 TCP_MISS/302 510 GET
http://a.tribalfusion.com/j.ad? - ROUNDROBIN_PARENT/proxy2.schoolaccess.net
text/html
1099094958.155 862 10.200.6.29 TCP_MISS/302 0 GET
http://media72.fastclick.net/w/safepop.cgi? -
ROUNDROBIN_PARENT/proxy2.schoolaccess.net text/plain
1099094958.762 1468 10.200.6.6 TCP_MISS/200 42328 GET
http://scrapaddictinc.com/phpBB2/viewtopic.php? -
ROUNDROBIN_PARENT/proxy1.schoolaccess.net text/html
1099094958.882 750 10.200.6.29 TCP_MISS/200 14386 GET
http://a1444.g.akamai.net/7/1444/1104/1096471980/www.burstnet.com/gifs/oreck
_air8_728x90_092904.gif - ROUNDROBIN_PARENT/proxy1.schoolaccess.net
image/gif
1099094958.907 1043 10.200.22.39 TCP_MISS/200 4483 GET
http://xml.eshop.msn.com/xmlbuddy/eShopOffer.aspx? -
ROUNDROBIN_PARENT/proxy1.schoolaccess.net text/html
1099094958.921 446 10.200.6.29 TCP_MISS/200 1582 GET
http://a.tribalfusion.com/jr.ad? - ROUNDROBIN_PARENT/proxy2.schoolaccess.net
application/x-javascript
1099094960.116 1961 10.200.6.29 TCP_MISS/200 568 GET
http://te.burstnet.com/tte/blank.gif? -
ROUNDROBIN_PARENT/proxy2.schoolaccess.net image/gif
1099094961.816 1788 10.200.22.39 TCP_MISS/200 2960 GET
http://di.image.eshop.msn.com/img/sinv/2773/2/08f341e.jpg -
ROUNDROBIN_PARENT/proxy2.schoolaccess.net image/jpeg
1099094961.820 2898 10.200.6.29 TCP_MISS/200 6788 GET
http://view.atdmt.com/TLA/iview/trblfask0070000058tla/direct;wi.120;hi.600/0
1/733113640? - ROUNDROBIN_PARENT/proxy1.schoolaccess.net text/html
1099094961.923 1806 10.200.6.6 TCP_MISS/200 1454 GET
http://scrapaddictinc.com/phpAdsNew/adjs.php? -
ROUNDROBIN_PARENT/proxy2.schoolaccess.net application/x-javascript
1099094962.015 10314 10.200.6.5 TCP_MISS/000 348 CONNECT login.icq.com:443
- ROUNDROBIN_PARENT/proxy2.schoolaccess.net -
1099094987.107 26991 10.200.6.6 TCP_MISS/000 0 GET
http://www.scrapbookingtop50.com/cgi-bin/counter.cgi? -
ROUNDROBIN_PARENT/proxy1.schoolaccess.net -
1099094987.107 98887 10.200.22.20 TCP_MISS/200 1025228 GET
http://motionslow.espn.go.com/persistent/ad_ebay_clock.wm9_548k.motion_persi
st.gif - ROUNDROBIN_PARENT/proxy1.schoolaccess.net image/gif
1099094987.107 68887 10.200.22.39 TCP_MISS/000 0 POST
http://ads.centralmedia.ws/requestimpression.aspx? -
ROUNDROBIN_PARENT/proxy1.schoolaccess.net -
1099094987.358 251 10.200.22.50 TCP_IMS_HIT/304 265 GET
http://hp.msn.com/c/home/bullets/000000.gif - NONE/- image/gif
1099094987.358 251 10.200.22.50 TCP_IMS_HIT/304 265 GET
http://hp.msn.com/c/home/bullets/000000.gif - NONE/- image/gif
1099094987.358 251 10.200.22.50 TCP_IMS_HIT/304 265 GET
http://hp.msn.com/c/home/bullets/000000.gif - NONE/- image/gif
1099094987.358 251 10.200.22.50 TCP_IMS_HIT/304 265 GET
http://hp.msn.com/c/home/bullets/000000.gif - NONE/- image/gif
1099094987.744 27717 10.200.22.39 TCP_MISS/200 6326 GET
http://context3.kanoodle.com/cgi-bin/context.cgi? -
ROUNDROBIN_PARENT/proxy1.schoolaccess.net text/html
1099094987.789 25701 10.200.6.29 TCP_MISS/200 20432 GET
http://spe.atdmt.com/ds/TBTLAASKJASK/coding_creative/120x600_top10_pics_1022
.swf? - ROUNDROBIN_PARENT/proxy1.schoolaccess.net
application/x-shockwave-flash
1099094987.941 833 10.200.22.126 TCP_MISS/200 6743 GET
http://www.cbs.state.or.us/ins/docs/consumer/tomake.htm -
ROUNDROBIN_PARENT/proxy2.schoolaccess.net text/html

Notice the 25 second gap in traffic at 1099094962.015

Here's an excerpt from the cache.log:

2004/10/29 16:09:22| cbdataFree: 0x8ad9aa8
2004/10/29 16:09:22| cbdataFree: Freeing 0x8ad9aa8
2004/10/29 16:09:22| comm_select: FD 42 ready for writing
2004/10/29 16:09:22| sslWriteServer: FD 42, 232 bytes to write
2004/10/29 16:09:22| sslWriteServer: FD 42, 232 bytes written
2004/10/29 16:09:22| cbdataLock: 0x962d620
2004/10/29 16:09:22| cbdataValid: 0x962d620
2004/10/29 16:09:22| commSetSelect: FD 40 type 1
2004/10/29 16:09:22| commSetSelect: FD 42 type 1
2004/10/29 16:09:22| cbdataUnlock: 0x962d620
2004/10/29 16:09:47| eventRun: RUN ID 9626
2004/10/29 16:09:47| eventRun: Running 'MaintainSwapSpace', id 9623
2004/10/29 16:09:47| storeMaintainSwapSpace: f=0.000000, max_scan=100,
max_remove=10
2004/10/29 16:09:47| cbdataFree: 0x8a11610
2004/10/29 16:09:47| cbdataFree: Freeing 0x8a11610
2004/10/29 16:09:47| storeUfsDirMaintain: /home/squid2/cache removed 0/10
f=0.000 max_scan=100
2004/10/29 16:09:47| eventAdd: Adding 'MaintainSwapSpace', in 1.000000
seconds
2004/10/29 16:09:47| comm_select: 16+0 FDs ready at 1099094987
2004/10/29 16:09:47| checkTimeouts: FD 34 Expired
2004/10/29 16:09:47| checkTimeouts: FD 34: Call timeout handler

Again, there is a 25 second gap. Looking for at the first thing to be
logged after the gap, I see Running 'MaintainSwapSpace'. Searching the
cache.log for this string shows it usually runs without a hitch, but not
always. Prior to this particular gap, there is another:

2004/10/29 16:08:00| httpPconnTransferDone: FD 62
2004/10/29 16:08:00| commSetTimeout: FD 62 timeout 900
2004/10/29 16:08:00| commSetSelect: FD 62 type 1
2004/10/29 16:08:00| comm_select: FD 54 ready for writing
2004/10/29 16:08:00| commHandleWrite: FD 54: off 0, sz 353.
2004/10/29 16:08:00| commHandleWrite: write() returns 353
2004/10/29 16:08:00| cbdataValid: 0x8db6550
2004/10/29 16:08:00| httpSendComplete: FD 54: size 353: errflag 0.
2004/10/29 16:08:00| commSetTimeout: FD 54 timeout 900
2004/10/29 16:08:00| cbdataUnlock: 0x8db6550
2004/10/29 16:08:05| eventRun: RUN ID 9524
2004/10/29 16:08:05| eventRun: Running 'MaintainSwapSpace', id 9523
2004/10/29 16:08:05| storeMaintainSwapSpace: f=0.000000, max_scan=100,
max_remove=10
2004/10/29 16:08:05| cbdataFree: 0x8a11610
2004/10/29 16:08:05| cbdataFree: Freeing 0x8a11610
2004/10/29 16:08:05| storeUfsDirMaintain: /home/squid2/cache removed 0/10
f=0.000 max_scan=100
2004/10/29 16:08:05| eventAdd: Adding 'MaintainSwapSpace', in 1.000000
seconds
2004/10/29 16:08:05| comm_select: 15+0 FDs ready at 1099094885
2004/10/29 16:08:05| comm_select: FD 23 ready for reading
2004/10/29 16:08:05| clientReadRequest: FD 23: reading request...

Is this just coincidence? Am I way off base here? Would it help to make
these logs accessible?

Chris

-----Original Message-----
From: Chris Robertson [mailto:crobertson@gci.com]
Sent: Thursday, October 28, 2004 3:57 PM
To: 'squid-users@squid-cache.org'
Subject: [squid-users] Sporadic high CPU usage, no traffic

I'm at a loss. Here's the situation. We have a couple hundred squid
clients forced (using never-direct allow all) to run through a parent proxy
array. On occasion (no discernable pattern) the squid process on the
clients uses all the available CPU cycles, and tailing the access log shows
that nothing is being served. Attempting to use the cachemgr.cgi (from a
different box) or the squidclient (on the client proxy) returns no results
while the CPU usage is running wild. Once the CPU usage slows down, results
are returned. Running a squid -k debug (for the usual 2 seconds) and
tailing the cache.log seems to show acls being processed. Often in
conjunction with these CPU spikes I'll see "TCP connection to [parent-proxy]
failed", but not always. Sometimes these connection failed messages appear
in the cache log without a CPU spike.

There are two round-robin parent proxies, which peer (using ICP) and a third
"master" parent (also peering with ICP) which is running two squid processes
(one per CPU) due to some sites not handling "sessions" changing IPs. All
three are on the same network switch. They are being monitored with MRTG
(http://mrtg.schoolaccess.net/squid/) and never seem to slow down.

The connection failed messages don't favor one server over the other.

Most of the clients are accessing the internet over satellite, but some are
on terrestrial links. The problem does not occur more often with either
connection method.

I thought that I had found the problem when I read about the
half_closed_clients bug, but after patching, compiling and installing
SQUID2.5-STABLE7 on a couple of sites, the problem persists (though perhaps
less often). Disabling half-closed-clients in the squid.conf,
unsurprisingly had no noticeable effect.

The version I replaced for testing is SQUID2.5-STABLE4.

STABLE7 was compiled with --bindir=/home/squid2/bin
--sbindir=/home/squid2/bin --libexecdir=/home/squid2/bin
--datadir=/home/squid2/etc --sysconfdir=/etc/squid
--localstatedir=/home/squid2 --mandir=/usr/man --enable-ssl
--enable-err-languages=English --disable-ident-lookups --with-pthreads
--enable-storeio=ufs,aufs,diskd --enable-snmp --enable-async-io --with-aio

STABLE4 was compiled with --bindir=/home/squid2/bin
--sbindir=/home/squid2/bin --libexecdir=/home/squid2/bin
--datadir=/home/squid2/etc --sysconfdir=/etc/squid
--localstatedir=/home/squid2 --mandir=/usr/man --enable-xmalloc-statistics
--enable-useragent-log --enable-referer-log --enable-err-languages=English
--disable-ident-lookups --with-pthreads --enable-storeio=ufs,aufs,diskd

In the interest of preventing this message from becoming too long, I have
posted the conf files at http://mrtg.schoolaccess.net/squidconf/. The
-sparse files have all comments and blank lines stripped out.

What further tools can I use (on a Linux 2.2 kernel) to figure out what
squid is doing when the CPU usage spikes? What information did I leave out
that might be relevant?
Received on Mon Nov 01 2004 - 13:49:34 MST

This archive was generated by hypermail pre-2.1.9 : Wed Dec 01 2004 - 12:00:01 MST