Re: [squid-users] Squid losing connectivity for 30 seconds

From: Amos Jeffries <squid3_at_treenet.co.nz>
Date: Fri, 02 Dec 2011 16:15:46 +1300

On 2/12/2011 3:16 a.m., Elie Merhej wrote:
>
>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I am currently facing a problem that I wasn't able to find a
>>>>>>>> solution for in the mailing list or on the internet.
>>>>>>>> My squid is dying for 30 seconds every hour, at the exact
>>>>>>>> same time. The squid process is still running, but
>>>>>>>> I lose my wccp connectivity, the cache peers detect the squid
>>>>>>>> as a dead sibling, and the squid cannot serve any requests.
>>>>>>>> The network connectivity of the server is not affected (a ping
>>>>>>>> to the squid's ip doesn't time out).
>>>>>>>>
>>>>>>>> The problem doesn't start immediately when squid is
>>>>>>>> installed on the server (the server is dedicated to squid).
>>>>>>>> It starts when the cache directories start to fill up.
>>>>>>>> I started my setup with 10 cache directories; squid
>>>>>>>> starts having the problem once the cache directories are above
>>>>>>>> 50% full.
>>>>>>>> When I change the number of cache directories (9, 8, ...), squid
>>>>>>>> works for a while, then hits the same problem:
>>>>>>>> cache_dir aufs /cache1/squid 90000 140 256
>>>>>>>> cache_dir aufs /cache2/squid 90000 140 256
>>>>>>>> cache_dir aufs /cache3/squid 90000 140 256
>>>>>>>> cache_dir aufs /cache4/squid 90000 140 256
>>>>>>>> cache_dir aufs /cache5/squid 90000 140 256
>>>>>>>> cache_dir aufs /cache6/squid 90000 140 256
>>>>>>>> cache_dir aufs /cache7/squid 90000 140 256
>>>>>>>> cache_dir aufs /cache8/squid 90000 140 256
>>>>>>>> cache_dir aufs /cache9/squid 90000 140 256
>>>>>>>> cache_dir aufs /cache10/squid 80000 140 256
>>>>>>>>
>>>>>>>> I have 1 terabyte of storage.
>>>>>>>> Finally I created two cache directories (one on each HDD), but
>>>>>>>> the problem persisted.
>>>>>>>
>>>>>>> You have 2 HDDs, yet 10 cache_dir entries?
>>>>>>> We repeatedly say "one cache_dir per disk" or similar; in
>>>>>>> particular, one cache_dir per physical drive spindle (for "disks"
>>>>>>> made up of multiple physical spindles) wherever possible, with
>>>>>>> the drives/spindles mounted separately to ensure the
>>>>>>> pairing. Squid performs a very unusual pattern of disk I/O which
>>>>>>> stresses drives down to the hardware controller level and makes
>>>>>>> this kind of detail critical for anything like good speed. Avoiding
>>>>>>> per-cache_dir object limits by adding more UFS-based dirs to
>>>>>>> one disk does not improve the situation.
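>>>>>>> For example, with one dir per drive your whole cache_dir set
>>>>>>> collapses to something like this (sizes illustrative, mount
>>>>>>> points assumed from your existing layout):
>>>>>>>
>>>>>>>    # one cache_dir per physical spindle
>>>>>>>    cache_dir aufs /cache1/squid 320000 140 256
>>>>>>>    cache_dir aufs /cache2/squid 480000 140 256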
>>>>>>>
>>>>>>> That problem will be affecting your Squid all the time, though,
>>>>>>> possibly making the source of the pause worse.
>>>>>>>
>>>>>>> From the description I believe it is garbage collection on the
>>>>>>> cache directories. The pauses can be visible when garbage
>>>>>>> collecting any cache over a few dozen GB. The Squid defaults for
>>>>>>> "swap_high" and "swap_low" are 5 apart, with 0 apart being the
>>>>>>> minimum. These are whole percentage points of the total
>>>>>>> cache size, erased from disk in a somewhat random-access
>>>>>>> style across the cache area. I did mention uncommon disk I/O
>>>>>>> patterns, right?
>>>>>>>
>>>>>>> To be sure what it is, you can attach the "strace" tool to the
>>>>>>> squid worker process (the second PID in current stable Squids)
>>>>>>> and see what is running. But given the hourly regularity and
>>>>>>> past experience with others on similar cache sizes, I'm almost
>>>>>>> certain it's the garbage collection.
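>>>>>>> For the strace, something like this (PID placeholder is
>>>>>>> hypothetical; -c summarises syscall counts rather than logging
>>>>>>> every call):
>>>>>>>
>>>>>>>    strace -c -p <worker-pid>   # substitute the squid worker PID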
>>>>>>>
>>>>>>> Amos
>>>>>>>
>>>>>>
>>>>>> Hi Amos,
>>>>>>
>>>>>> Thank you for your fast reply,
>>>>>> I have 2 HDDs (450GB and 600GB);
>>>>>> df -h shows I have 357GB and 505GB available.
>>>>>> In my last test, my settings were:
>>>>>> cache_swap_low 90
>>>>>> cache_swap_high 95
>>>>>
>>>>> That is not good. For anything more than 10-20 GB I recommend
>>>>> setting them no more than 1 apart, possibly to the same value if
>>>>> that works. Squid has a light but CPU-intensive and possibly long
>>>>> garbage removal cycle above cache_swap_low, and a much more
>>>>> aggressive but faster and less CPU-intensive removal above
>>>>> cache_swap_high. On large caches it is better, in terms of
>>>>> downtime, to go straight to the aggressive removal and clear disk
>>>>> space fast, despite the bandwidth cost of re-fetching items the
>>>>> light removal would have left in place.
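>>>>> In squid.conf terms that means something like:
>>>>>
>>>>>    # equal thresholds: go straight to the aggressive removal
>>>>>    cache_swap_low 90
>>>>>    cache_swap_high 90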
>>>>>
>>>>>
>>>>> Amos
>>>>>
>>>> Hi Amos,
>>>>
>>>> I have changed to swap_high 90 and swap_low 90 with two cache dirs
>>>> (one for each HDD); I still have the same problem.
>>>> I did an strace when the problem occurred:
>>>> % time     seconds  usecs/call     calls    errors syscall
>>>> ------ ----------- ----------- --------- --------- ----------------
>>>> 23.06 0.004769 0 85681 96 write
>>>> 21.07 0.004359 0 24658 5 futex
>>>> 19.34 0.004001 800 5 open
>>>> 6.54 0.001352 0 5101 5101 connect
>>>> 6.46 0.001337 3 491 epoll_wait
>>>> 5.34 0.001104 0 51938 9453 read
>>>> 3.90 0.000806 0 39727 close
>>>> 3.54 0.000733 0 86400 epoll_ctl
>>>> 3.54 0.000732 0 32357 sendto
>>>> 2.02 0.000417 0 56721 recvmsg
>>>> 1.84 0.000381 0 24064 socket
>>>> 0.96 0.000199 0 56264 fcntl
>>>> 0.77 0.000159 0 6366 329 accept
>>>> 0.53 0.000109 0 24033 bind
>>>> 0.52 0.000108 0 30085 getsockname
>>>> 0.21 0.000044 0 11200 stat
>>>> 0.21 0.000044 0 6998 359 recvfrom
>>>> 0.09 0.000019 0 5085 getsockopt
>>>> 0.06 0.000012 0 2887 lseek
>>>> 0.00 0.000000 0 98 brk
>>>> 0.00 0.000000 0 16 dup2
>>>> 0.00 0.000000 0 10314 setsockopt
>>>> 0.00 0.000000 0 4 getdents
>>>> 0.00 0.000000 0 3 getrusage
>>>> ------ ----------- ----------- --------- --------- ----------------
>>>> 100.00 0.020685 560496 15343 total
>>>>
>>>> This is the strace of squid when it is working normally:
>>>> % time     seconds  usecs/call     calls    errors syscall
>>>> ------ ----------- ----------- --------- --------- ----------------
>>>> 24.88 0.015887 0 455793 169 write
>>>> 13.72 0.008764 0 112185 epoll_wait
>>>> 11.67 0.007454 0 256256 27158 read
>>>> 8.47 0.005408 0 169133 sendto
>>>> 6.94 0.004430 0 159596 close
>>>> 6.85 0.004373 0 387359 epoll_ctl
>>>> 6.42 0.004102 0 19651 19651 connect
>>>> 5.54 0.003538 0 290289 recvmsg
>>>> 3.81 0.002431 0 116515 socket
>>>> 3.53 0.002254 0 164750 futex
>>>> 1.68 0.001075 0 207688 fcntl
>>>> 1.53 0.000974 0 95228 23139 recvfrom
>>>> 1.29 0.000821 0 33408 12259 accept
>>>> 1.14 0.000726 0 46582 stat
>>>> 1.11 0.000707 0 110826 bind
>>>> 0.85 0.000544 0 137574 getsockname
>>>> 0.32 0.000204 0 21642 getsockopt
>>>> 0.26 0.000165 0 39502 setsockopt
>>>> 0.01 0.000007 0 8092 lseek
>>>> 0.00 0.000000 0 248 open
>>>> 0.00 0.000000 0 4 brk
>>>> 0.00 0.000000 0 88 dup2
>>>> 0.00 0.000000 0 14 getdents
>>>> 0.00 0.000000 0 6 getrusage
>>>> ------ ----------- ----------- --------- --------- ----------------
>>>> 100.00 0.063864 2832429 82376 total
>>>>
>>>> Do you have any suggestions to solve the issue? Can I run the
>>>> garbage collector more frequently? Is it better to change the
>>>> cache_dir type from aufs to something else?
>>>> Do you see the problem in the strace?
>>>>
>>>> Thank you,
>>>> Elie
>>>>
>>>>
>>> Hi,
>>>
>>> Please note that squid is facing the same problem even when there is
>>> no activity and no clients connected to it.
>>>
>>> Regards
>>> Elie
>>>
>> Hi,
>>
>> here is the strace result
>> -----------------------------------------------------------------------------------------------------
>>
<snip: looks like perfectly normal traffic; file opening and closing, data
reading, DNS lookups and other network reads/writes>
>> read(165, "!", 256) = 1
<snip bunch of other normal traffic>

>> read(165, "!", 256) = 1
>> ----------------------------------------------------------------------------------------------------------------------------------------------------
>>
>> Squid is freezing at this point

The 1-byte read on FD #165 seems odd. It is particularly suspicious,
coming just before a pause and with only a constant 256-byte buffer
available. No ideas what it is yet, though.

>
> Here is my compilation options
> --------------------------------------------------------------------------------------------------------------------------------------------------------
>
> ./configure '--prefix=/usr' '--exec-prefix=/usr' '--bindir=/usr/bin'
> '--sbindir=/usr/sbin' '--sysconfdir=/etc' '--datadir=/usr/share'
> '--includedir=/usr/include' '--libdir=/usr/lib64'
> '--libexecdir=/usr/libexec' '--sharedstatedir=/var/lib'
> '--mandir=/usr/share/man' '--infodir=/usr/share/info'
> '--exec_prefix=/usr' '--libexecdir=/usr/lib64/squid'
> '--localstatedir=/var' '--datadir=/usr/share/squid'
> '--sysconfdir=/etc/squid' '--with-logdir=/var/log/squid'
> '--with-pidfile=/var/run/squid.pid' '--disable-dependency-tracking'
> '--enable-arp-acl' '--enable-follow-x-forwarded-for'
> '--enable-auth=basic,digest,negotiate'
> '--enable-external-acl-helpers=ip_user,unix_group,wbinfo_group'
> '--enable-cache-digests' '--enable-cachemgr-hostname=localhost'
> '--enable-delay-pools' '--enable-epoll' '--enable-icap-client'
> '--enable-ident-lookups' '--enable-linux-netfilter'
> '--enable-referer-log' '--enable-removal-policies=lru' '--enable-snmp'
> '--enable-ssl' '--enable-storeio=aufs,ufs' '--enable-wccpv2'
> '--enable-esi' '--with-aio' '--with-default-user=proxy'
> '--with-filedescriptors=65536' '--with-dl' '--with-pthreads'
> '--with-libcap' '--with-netfilter-conntrack' '--with-openssl'
> '--enable-inline' '--enable-uselect' '--enable-disk-io'
> '--disable-htcp' '--with-gnu-ld' '--with-build-environment=default'
> '--enable-carp' '--enable-async-io=26' --with-squid=/home/squid-3.1.15
> --enable-ltdl-convenience
> ---------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Here is my squid.conf:

Can't help myself, I digress into a config audit... completely off-topic.

> ---------------------------------------------------------------------------------------------------------------------------------------------------------
>
> acl manager proto cache_object
> acl localhost src 127.0.0.1/32
> acl to_localhost dst 127.0.0.0/8 0.0.0.0/32
> acl localnet src 10.0.0.0/8 # RFC1918 possible internal network
> acl localnet src 172.16.0.0/12 # RFC1918 possible internal network
> acl localnet src 192.168.0.0/16 # RFC1918 possible internal network
> acl SSL_ports port 443
> acl Safe_ports port 80 # http
> acl Safe_ports port 21 # ftp
> acl Safe_ports port 443 # https
> acl Safe_ports port 70 # gopher
> acl Safe_ports port 210 # wais
> acl Safe_ports port 1025-65535 # unregistered ports
> acl Safe_ports port 280 # http-mgmt
> acl Safe_ports port 488 # gss-http
> acl Safe_ports port 591 # filemaker
> acl Safe_ports port 777 # multiling http
> acl CONNECT method CONNECT
> acl clients src x.x.x.x
> #icp acl
> acl squidFarm src x.x.x.x
> acl self src x.x.x.x
>
> #ICAP acl
> acl icap_port1 myportname 3144
> acl icap_port2 myportname 3145
> acl icap_port3 myportname 3146
> acl icap_port5 myportname 3148
>
> http_access allow manager localhost
> http_access deny manager
> http_access deny !Safe_ports
> http_access deny CONNECT !SSL_ports
> http_access allow localnet
> http_access allow localhost
> #prevent digest loop
> http_access deny self
> http_access allow clients
> http_access deny all
> http_reply_access allow all
> #icp_access allow all
> icp_port 3130
> icp_access allow squidFarm
> icp_access deny all
>
> http_port 3129 tproxy
> http_port 3128 transparent
> http_port 3144 tproxy
> http_port 3146 tproxy
> http_port 3145 tproxy
> http_port 3148 tproxy
>
> forwarded_for off
> via off
> visible_hostname x.x.x.x
> hierarchy_stoplist cgi-bin ?
> coredump_dir /var/spool/squid
> # Image files
> refresh_pattern -i \.png$ 10080 90% 43200
> refresh_pattern -i \.gif$ 10080 90% 43200
> refresh_pattern -i \.jpg$ 10080 90% 43200
> refresh_pattern -i \.jpeg$ 10080 90% 43200
> refresh_pattern -i \.bmp$ 10080 90% 43200
> refresh_pattern -i \.tif$ 10080 90% 43200
> refresh_pattern -i \.tiff$ 10080 90% 43200

This is *the* most inefficient way to do this. The refresh_pattern set
is tested for every single cached object load. On top of that, each
pattern line is an individual regex pattern, which is almost the slowest
match type Squid can perform.
You will gain proxy performance by collapsing these regex patterns
down into one line. Like so:

    refresh_pattern -i \.(png|gif|jpe?g|bmp|tiff?)$ 10080 90% 43200

same for the others...
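
For instance, the compressed, binary, multimedia and document groups
quoted below are equivalent to:

    # collapsed from the individual lines quoted below
    refresh_pattern -i \.(zip|rar|tar|gz|tgz|z|arj|lha|lzh)$ 10080 90% 43200
    refresh_pattern -i \.(exe|msi)$ 10080 90% 43200
    refresh_pattern -i \.(mp3|wav|midi?|ram?|mov|avi|wmv|mpe?g|swf)$ 10080 90% 43200
    refresh_pattern -i \.(pdf|ps|doc|pps|ppt)$ 10080 90% 43200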
>
> # Compressed files
> refresh_pattern -i \.zip$ 10080 90% 43200
> refresh_pattern -i \.rar$ 10080 90% 43200
> refresh_pattern -i \.tar$ 10080 90% 43200
> refresh_pattern -i \.gz$ 10080 90% 43200
> refresh_pattern -i \.tgz$ 10080 90% 43200
> refresh_pattern -i \.z$ 10080 90% 43200
> refresh_pattern -i \.arj$ 10080 90% 43200
> refresh_pattern -i \.lha$ 10080 90% 43200
> refresh_pattern -i \.lzh$ 10080 90% 43200
>
> # Binary files
> refresh_pattern -i \.exe$ 10080 90% 43200
> refresh_pattern -i \.msi$ 10080 90% 43200
>
> # Multimedia files
> refresh_pattern -i \.mp3$ 10080 90% 43200
> refresh_pattern -i \.wav$ 10080 90% 43200
> refresh_pattern -i \.mid$ 10080 90% 43200
> refresh_pattern -i \.midi$ 10080 90% 43200
> refresh_pattern -i \.ram$ 10080 90% 43200
> refresh_pattern -i \.ra$ 10080 90% 43200
> refresh_pattern -i \.mov$ 10080 90% 43200
> refresh_pattern -i \.avi$ 10080 90% 43200
> refresh_pattern -i \.wmv$ 10080 90% 43200
> refresh_pattern -i \.mpg$ 10080 90% 43200
> refresh_pattern -i \.mpeg$ 10080 90% 43200
> refresh_pattern -i \.swf$ 10080 90% 43200
>
> # Document files
> refresh_pattern -i \.pdf$ 10080 90% 43200
> refresh_pattern -i \.ps$ 10080 90% 43200
> refresh_pattern -i \.doc$ 10080 90% 43200
> refresh_pattern -i \.ppt$ 10080 90% 43200
> refresh_pattern -i \.pps$ 10080 90% 43200
> #windows update refresh paterns
> refresh_pattern windowsupdate.com/.*\.(cab|exe|psf) 4320 100% 43200 reload-into-ims
> refresh_pattern download.microsoft.com/.*\.(cab|exe|psf) 4320 100% 43200 reload-into-ims
> refresh_pattern armdl.adobe.com/.*\.(cab|msp|msi) 4320 100% 43200 reload-into-ims
> #default refresh paterns
> refresh_pattern ^ftp: 1440 20% 10080
> refresh_pattern ^gopher: 1440 0% 1440
> refresh_pattern -i (/cgi-bin/|\?) 0 0% 0
> refresh_pattern . 0 20% 4320
>
> client_db on
> cache_mem 256 MB
> cache_swap_low 90
> cache_swap_high 90
> maximum_object_size 512 MB
> maximum_object_size_in_memory 20 KB
> cache_dir aufs /cache1/squid 320000 480 256
> cache_dir aufs /cache2/squid 480000 700 256
>
> #logformat modified %tl Request time:%tr Status:%Ss/%03>Hs Client:%>a URL:%ru Server:%<A Type:%mt
> logformat modified %tl %>a %ru %<A
> access_log /var/log/squid/access.log modified
> cache_log /var/log/squid/cache.log
> cache_store_log none
> coredump_dir /var/spool/squid
> half_closed_clients off
>
> snmp_port x
> acl cacti src x.x.x.x
> acl snmpcommunity snmp_community xxxx
> snmp_access allow snmpcommunity xxxx
> snmp_access allow snmpcommunity localhost
> snmp_access deny all
>
> wccp2_router x.x.x.x
> wccp2_forwarding_method l2
> wccp2_return_method l2
> wccp2_service dynamic x
> wccp2_service_info x protocol=tcp flags=src_ip_hash priority=240 ports=80
> wccp2_service dynamic x
> wccp2_service_info x protocol=tcp flags=dst_ip_hash,ports_source priority=240 ports=80
> wccp2_assignment_method mask
>
>
> #icp configuration
> maximum_icp_query_timeout 30
> cache_peer x.x.x.x sibling 3128 3130 proxy-only no-tproxy
> cache_peer x.x.x.x sibling 3128 3130 proxy-only no-tproxy
> cache_peer x.x.x.x sibling 3128 3130 proxy-only no-tproxy
> log_icp_queries off
> miss_access allow squidFarm
> miss_access deny all

So, if I understand this right: you have a layer of proxies defined as
"squidFarm" which client traffic MUST pass through *first* before it
is allowed to fetch MISS requests from this proxy. Yet you are
receiving WCCP traffic directly at this proxy, with both NAT and TPROXY?

This miss_access policy seems decidedly odd. Perhaps you can enlighten me?

>
>
>
> # ICAP configuration
> icap_enable on
> icap_send_client_ip on
> icap_send_client_username on
> icap_client_username_encode off
> icap_client_username_header X-Client-Username
> icap_preview_enable on
> icap_preview_size 1024
>
>
>
> logformat squid %tl %icap::tt %icap::tr %>a %icap::rm %icap::ru
> icap_log /var/log/squid/icap.log squid
> icap_service service_req reqmod_precache bypass=1 icap://x.x.x.x:x/reqmod
> adaptation_access service_req allow icap_port1
> icap_service service_req_2 reqmod_precache bypass=1 icap://x.x.x.x:x/reqmod
> adaptation_access service_req_2 allow icap_port2
> icap_service service_req_3 reqmod_precache bypass=1 icap://x.x.x.x:x/reqmod
> adaptation_access service_req_3 allow icap_port3
> icap_service service_req_5 reqmod_precache bypass=1 icap://x.x.x.x:x/reqmod
> adaptation_access service_req_5 allow icap_port5
> ---------------------------------------------------------------------------------------------------------------------------------------------------------
>
>
> Please advise,
> Thank you
> Elie