Re: [squid-users] squid consuming near all (95+ %) CPU, it is normal?

From: Amos Jeffries <squid3_at_treenet.co.nz>
Date: Tue, 20 Nov 2012 11:19:28 +1300

On 20.11.2012 09:33, Frantisek Hanzlik wrote:
> Amos Jeffries wrote:
>> On 12.11.2012 03:25, Frantisek Hanzlik wrote:
>>> With this squid configuration:
>>> acl localnet src 172.16.0.0/12
>>> acl localnet src 192.168.0.0/16
>>> acl SSL_ports port 443
>>> acl SSL_ports port 85
>>> acl SSL_ports port 81
>>> acl SSL_ports port 5443
>>> acl Safe_ports port 80
>>> acl Safe_ports port 21
>>> acl Safe_ports port 443
>>> acl Safe_ports port 70
>>> acl Safe_ports port 210
>>> acl Safe_ports port 1025-65535
>>> acl Safe_ports port 280
>>> acl Safe_ports port 488
>>> acl Safe_ports port 591
>>> acl Safe_ports port 777
>>> acl Safe_ports port 5443
>>> acl Safe_ports port 85
>>> acl Safe_ports port 81
>>> acl CONNECT method CONNECT
>>> http_access allow manager localhost
>>> http_access deny manager
>>> http_access deny !Safe_ports
>>> http_access deny CONNECT !SSL_ports
>>> http_access allow localnet
>>> http_access allow localhost
>>> http_access deny all
>>> http_port 3128
>>> hierarchy_stoplist cgi-bin ?
>>> cache_dir ufs /var/spool/squid 1000 16 256 max-size=999000
>>> cache_mem 512 MB
>>> maximum_object_size 4096 KB
>>> memory_pools off
>>> cache_swap_low 90
>>> cache_swap_high 95
>>> dns_nameservers 172.16.1.1
>>> client_db off
>>> half_closed_clients off
>>> max_filedesc 4096
>>> coredump_dir /var/spool/squid
>>> refresh_pattern ^ftp: 1440 20% 10080
>>> refresh_pattern ^gopher: 1440 0% 1440
>>> refresh_pattern -i (/cgi-bin/|\?) 0 0% 0
>>> refresh_pattern . 0 20% 4320
>>> acl users src 172.31.0.0/16
>>> delay_pools 1
>>> delay_class 1 2
>>> delay_parameters 1 5000000/10000000 5000000/10000000
>>> delay_access 1 allow users
>>>
>>> Squid very often loads the CPU to near 100%, with circa 200 users and
>>> 4000 connections (~2000 to users, 2000 to the internet). Removing the
>>> delay pool configuration has no big effect.
>>
>> Users and connections are meaningless measures in HTTP. Requests per
>> second flowing over those connections is what counts.
>>
>> The proxy might have all 4000 links idle (low CPU; zero bandwidth; zero
>> disk I/O), be downloading video (or .iso) images simultaneously (low
>> CPU; maxed-out bandwidth; high disk I/O), or be parsing and processing
>> header-only requests (100% CPU; moderate or low bandwidth; no disk I/O).
>>
>> NP: 3.2 uses HTTP/1.1. A lot of the protocol performance features in
>> HTTP/1.1 are achieved by omitting object bodies and reducing
>> transactions to header-only requests and responses.
>>
>>
>>> HW configuration: dual-core E8500 @ 3.16GHz CPU, 4GB RAM, 2x SATA 7k2
>>> RAID Edition disks in SW RAID1 for the squid cache (disk performance
>>> doesn't seem to be the problem, IOWAIT is small), gigabit ethernet
>>> cards to the internet (~800 Mbps line) and to the LAN.
>>> It is squid-3.2.3.20121106.r11695-1.fc14.i686 on Fedora 14 i686
>>> (I tested it with an older squid 3.1 version and the same
>>> configuration too, but the results were the same, or rather worse).
>>>
>>> Is this CPU load normal, or can some performance tuning be done
>>> for it?
>>
>> In order of likelihood:
>>
>> Experiment #1 is to remove that SW RAID and test again.
>> Sure, iowait is not bad (to the *master* disk it is the same as
>> accessing without RAID at all), however iowait is only half the story
>> with SW RAID. Being software, every I/O operation sucks away some CPU.
>> Disk I/O CPU loading in particular can be doubled depending on the
>> implementation's buffering efficiency.
>> If you need RAID at all, use HW RAID instead of SW. The only benefit
>> you get from RAID under Squid is some advance notice of disks failing
>> before the proxy crashes (or starts TCP_SWAPFAIL_MISS'ing - but the UFS
>> cache type still crashes, so maybe you do need RAID).
>> By using SW RAID in particular you are taking CPU cycles away from
>> Squid, which would otherwise use them to process a higher peak req/sec
>> capacity. If your peak traffic is low req/sec this is not a problem,
>> but for several thousand users I expect your peak capacity needs are
>> high.
>> The choice is yours, but I seriously advise moving away from SW RAID.
>>
>>
>> Experiment #2 is to set "memory_pools on".
>> Up to 90% of malloc/free calls are for very short strings and small
>> objects. Squid can save a lot of system allocator cycles, and shrink
>> the overall system RAM requirements a little, by allocating these in
>> batches/blocks/pools. This will help raise req/sec capacity on traffic
>> that consists mostly of HTTP headers.
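>>
>> Spelled out, the squid.conf change is simply (the memory_pools_limit
>> line is optional and the 64 MB value here is only an illustration -
>> size it to your spare RAM or leave it at the default):
>>
>>   memory_pools on
>>   # optional cap on how much idle pooled memory Squid keeps around
>>   memory_pools_limit 64 MB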
>>
>>
>> Experiment #3 is to raise max_filedesc.
>> I suspect this setting is the only thing limiting your proxy to 2000
>> user connections and 2000 internet connections. Notice that makes 4000,
>> and with 100 connections held in reserve for unexpected disk access it
>> would seem that you are not using the disks at all for many of these
>> connections (TCP_MISS and TCP_MEM_HIT being most of your load?).
>> When there are not enough FD to service all incoming requests, Squid
>> starts limiting them and spends extra CPU cycles managing the waiting
>> client connection queue.
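>>
>> As a rough sketch (16384 is only an example value - pick something
>> with headroom above your expected concurrent connections plus disk and
>> helper descriptors):
>>
>>   max_filedesc 16384
>>
>> and make sure the ulimit in Squid's startup environment allows at
>> least that many.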
>>
>>
>> Experiment #4 is to set "cache_swap_low 97" and "cache_swap_high 98".
>> Periodic garbage collection happens when the cache fills up. With a 1GB
>> disk cache the default settings mean roughly 50-100MB of objects get
>> processed in the cache and erased from disk on each pass. Most objects
>> will be only a few KB - see your average object size stats.
>> NOTE: this may only marginally appear in iowait, but it shows up better
>> in the related erase/unlink operation stats.
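>>
>> Written as config lines that is:
>>
>>   cache_swap_low 97
>>   cache_swap_high 98
>>
>> which, with the 1000 MB cache_dir above, narrows each cleanup pass to
>> very roughly 10-30 MB of objects instead of 50-100 MB.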
>>
>> Amos
>
> Thank you for the recommendations. I think #3 perhaps isn't the problem
> here, as Squid's cache.log says there are 16384 file descriptors (I do
> not understand why it is this number; in the startup script I set
> 'ulimit -HSn 65536' just before starting squid - so why is the number
> of FDs only 16384? Some other system limitation?):
>
> 2012/11/17 08:40:29 kid1| Starting Squid Cache version 3.2.3-20121106-r11695 for i686-pc-linux-gnu...
> 2012/11/17 08:40:29 kid1| Process ID 2497
> 2012/11/17 08:40:29 kid1| Process Roles: worker
> 2012/11/17 08:40:29 kid1| With 16384 file descriptors available
> 2012/11/17 08:40:29 kid1| Initializing IP Cache...
> 2012/11/17 08:40:29 kid1| DNS Socket created at 0.0.0.0, FD 9
> 2012/11/17 08:40:29 kid1| Adding nameserver 172.16.1.1 from squid.conf
> 2012/11/17 08:40:29 kid1| Logfile: opening log daemon:/var/log/squid/access.log
> 2012/11/17 08:40:29 kid1| Logfile Daemon: opening log /var/log/squid/access.log
> 2012/11/17 08:40:29 kid1| Unlinkd pipe opened on FD 15
> 2012/11/17 08:40:29 kid1| Local cache digest enabled; rebuild/rewrite every 3600/3600 sec
> 2012/11/17 08:40:29 kid1| Store logging disabled
> 2012/11/17 08:40:29 kid1| Swap maxSize 1024000 + 524288 KB, estimated 119099 objects
> 2012/11/17 08:40:29 kid1| Target number of buckets: 5954
> 2012/11/17 08:40:29 kid1| Using 8192 Store buckets
> 2012/11/17 08:40:29 kid1| Max Mem size: 524288 KB
> 2012/11/17 08:40:29 kid1| Max Swap size: 1024000 KB
> 2012/11/17 08:40:29 kid1| Rebuilding storage in /var/spool/squid (dirty log)
> 2012/11/17 08:40:29 kid1| Using Least Load store dir selection
> 2012/11/17 08:40:29 kid1| Set Current Directory to /var/spool/squid
> 2012/11/17 08:40:30 kid1| Loaded Icons.
> 2012/11/17 08:40:30 kid1| HTCP Disabled.
> 2012/11/17 08:40:30 kid1| Squid plugin modules loaded: 0
> 2012/11/17 08:40:30 kid1| Adaptation support is off.
> 2012/11/17 08:40:30 kid1| Accepting HTTP Socket connections at local=0.0.0.0:3128 remote=[::] FD 18 flags=9
> 2012/11/17 08:40:30 kid1| Store rebuilding is 0.11% complete
> 2012/11/17 08:40:45 kid1| Store rebuilding is 40.44% complete
> 2012/11/17 08:41:00 kid1| Store rebuilding is 83.15% complete
> 2012/11/17 08:41:05 kid1| Done reading /var/spool/squid swaplog (3699238 entries)
> 2012/11/17 08:41:05 kid1| Finished rebuilding storage from disk.
> 2012/11/17 08:41:05 kid1| 1877119 Entries scanned
> 2012/11/17 08:41:05 kid1| 6 Invalid entries.
> 2012/11/17 08:41:05 kid1| 0 With invalid flags.
> 2012/11/17 08:41:05 kid1| 54901 Objects loaded.
> 2012/11/17 08:41:05 kid1| 0 Objects expired.
> 2012/11/17 08:41:05 kid1| 1821036 Objects cancelled.
> 2012/11/17 08:41:05 kid1| 53 Duplicate URLs purged.
> 2012/11/17 08:41:05 kid1| 1129 Swapfile clashes avoided.
> 2012/11/17 08:41:05 kid1| Took 35.95 seconds (1527.34 objects/sec).
> 2012/11/17 08:41:05 kid1| Beginning Validation Procedure
> 2012/11/17 08:41:06 kid1| Completed Validation Procedure
> 2012/11/17 08:41:06 kid1| Validated 54895 Entries
> 2012/11/17 08:41:06 kid1| store_swap_size = 922068.00 KB
> 2012/11/17 08:41:06 kid1| storeLateRelease: released 6 objects
>
>
>
> Thus I started with recommendation #2: setting 'memory_pools on'.
> At first this seemed to help a lot: CPU load dropped to approx. 40-50%
> at the time of highest load (which is only between 18 and 21 hours our
> time /UTC+1/; at all other times Squid's load is much smaller and the
> CPU load was reasonable). But then I noticed that the '(squid-1)'
> process (PID 14667 below, child of 11769) changes its PID, as if it
> died and the parent started a new copy:
>
> 1 11769 11769 11769 ? -1 Ss 0 0:00 squid -f /etc/squid/squid.conf
> 11769 14667 11769 11769 ? -1 R 23 0:11 (squid-1) -f /etc/squid/squid.conf
> 14667 14670 11769 11769 ? -1 S 23 0:00 (logfile-daemon) /var/log/squid/access.log
>
> And then I found these lines in /var/log/messages
> (/var/log/squid/cache.log contained nothing suspicious(*)):
>
> Nov 19 16:53:40 172 squid[2834]: Squid Parent: (squid-1) process 32702 exited due to signal 6 with status 0

Ouch. That needs to be found and fixed.
Are you able to assist with that by running Squid under a debugger and
getting a backtrace?
(http://wiki.squid-cache.org/SquidFaq/BugReporting#Using_gdb_debugger_on_a_live_proxy_.28with_minimal_downtime.29)
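
Roughly, one way to do that (a sketch only - the /usr/sbin/squid path
and the <worker-pid> placeholder are illustrative; attach to the
'(squid-1)' worker, not the parent, and have the squid debuginfo
package installed so the trace has symbols):

  gdb /usr/sbin/squid <worker-pid>
  (gdb) handle SIGPIPE pass nostop noprint
  (gdb) continue
  ... wait for the worker to hit the SIGABRT/SIGSEGV ...
  (gdb) backtrace
  (gdb) quit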

<snip repeats>
> Nov 19 19:14:19 172 squid[2834]: Squid Parent: (squid-1) process 8804 will not be restarted due to repeated, frequent failures
> Nov 19 19:14:19 172 squid[2834]: Exiting due to repeated, frequent failures
>
> At this point the parent squid process exited as well and it was
> necessary to restart squid manually. Note this complete exit occurred
> after seven days of running. And before these 7 days, squid crashed
> with this message in "/var/log/messages":
>
> Nov 12 20:14:38 172 kernel: [9002060.957453] squid[2494]: segfault at e51 ip 082d4917 sp bfc7dbc0 error 6 in squid[8048000+460000]
> Nov 12 20:14:38 172 squid[32322]: Squid Parent: (squid-1) process 2494 exited due to signal 11 with status 0
>
> Signal 6 is SIGABRT; how can that appear here?
>
> (*) /var/log/squid/cache.log contains several messages like this; I do
> not know if they are important:
<snip>
> 2012/11/19 20:55:52 kid1| Failed to select source for 'http://u1353354961016s928369347.r6.td.ipv6test.semnicneposilejte.cz/1x1.png?u1353354961016s928369347.r6.td'
> 2012/11/19 20:55:52 kid1| always_direct = 0
> 2012/11/19 20:55:52 kid1| never_direct = 0
> 2012/11/19 20:55:52 kid1| timedout = 0

--> IPv6-only domain

> 2012/11/19 20:55:52 kid1| Failed to select source for 'http://pages.etology.com/crossdomain.xml'
> 2012/11/19 20:55:52 kid1| always_direct = 0
> 2012/11/19 20:55:52 kid1| never_direct = 0
> 2012/11/19 20:55:52 kid1| timedout = 0

--> NXDOMAIN

> 2012/11/19 20:55:53 kid1| ipcacheParse: No Address records in response to 'ipv6.msftncsi.com'

--> IPv6-only domain

You can ignore *most* of these for now; there is a bug open, and being
worked on, to clean up the unnecessary logging from normal DNS failures
and make any important ones stand out better.

>
> What now?
> Should I try a newer version than the just-tested 3.2.3.20121106.r11695?
> Or do I have some wrong settings?
> I'm not using an IPv6 address on any interface and the ipv6.ko kernel
> module isn't loaded (it is Fedora 14 i686, which still has ipv6 as a
> module; later Fedora versions have ipv6 compiled into the kernel). Is
> that OK? (Squid doesn't have any command-line switches for disabling
> IPv6, right?)

Squid auto-detects IPv6 on startup, and that seems to be working fine. It
has already detected its absence in your kernel, and the DNS results above
are what happens when your clients ask for IPv6-only domains - they get an
error page back instead of the requested object.

Amos
Received on Mon Nov 19 2012 - 22:19:33 MST
