Re: [squid-users] Squid losing connectivity for 30 seconds from Elie Merhej on 2011-11-28 (squid-users)

From: Elie Merhej <emerhej_at_wise.net.lb>
Date: Mon, 28 Nov 2011 11:17:42 +0200

>>>> Hi,
>>>>
>>>> I am currently facing a problem that I wasn't able to find a
>>>> solution for in the mailing list or on the internet.
>>>> My squid is dying for 30 seconds every hour at the same exact
>>>> time. The squid process is still running, but
>>>> I lose my wccp connectivity, the cache peers detect the squid as a
>>>> dead sibling, and the squid cannot serve any requests.
>>>> The network connectivity of the server is not affected (a ping to
>>>> the squid's ip doesn't time out).
>>>>
>>>> The problem doesn't start immediately when the squid is installed
>>>> on the server (the server is dedicated to squid).
>>>> It starts when the cache directories start to fill up.
>>>> I started my setup with 10 cache directories; the squid
>>>> starts having the problem when the cache directories are above 50%
>>>> full.
>>>> When I change the number of cache directories (9, 8, ...) the squid
>>>> works for a while, then the same problem returns.
>>>> cache_dir aufs /cache1/squid 90000 140 256
>>>> cache_dir aufs /cache2/squid 90000 140 256
>>>> cache_dir aufs /cache3/squid 90000 140 256
>>>> cache_dir aufs /cache4/squid 90000 140 256
>>>> cache_dir aufs /cache5/squid 90000 140 256
>>>> cache_dir aufs /cache6/squid 90000 140 256
>>>> cache_dir aufs /cache7/squid 90000 140 256
>>>> cache_dir aufs /cache8/squid 90000 140 256
>>>> cache_dir aufs /cache9/squid 90000 140 256
>>>> cache_dir aufs /cache10/squid 80000 140 256
>>>>
>>>> I have 1 terabyte of storage.
>>>> Finally I created two cache directories (one on each HDD) but the
>>>> problem persisted.
>>>
>>> You have 2 HDD? but, but, you have 10 cache_dir.
>>> We repeatedly say "one cache_dir per disk" or similar. In
>>> particular, one cache_dir per physical drive spindle (for "disks"
>>> made up of multiple physical spindles) wherever possible, with
>>> physical drives/spindles mounted separately to ensure the pairing.
>>> Squid performs a very unusual pattern of disk I/O which stresses
>>> disks down to the hardware controller level and makes this kind of
>>> detail critical for anything like good speed. Avoiding cache_dir
>>> object limitations by adding more UFS-based dirs to one disk does
>>> not improve the situation.
>>>
>>> That is a problem which will be affecting your Squid all the time
>>> though, possibly making the source of the pause worse.
>>>
>>> From the description I believe it is garbage collection on the cache
>>> directories. The pauses can be visible when garbage collecting any
>>> caches over a few dozen GB. The squid default "swap_high" and
>>> "swap_low" values are "5" apart, with the minimum being 0
>>> apart. These are whole % points of the total cache size, being
>>> erased from disk in a somewhat random-access style across the cache
>>> area. I did mention uncommon disk I/O patterns, right?
>>>
>>> To be sure what it is, you can attach the "strace" tool to the squid
>>> worker process (the second PID in current stable Squids) and see
>>> what is running. But given the hourly regularity and past experience
>>> with others on similar cache sizes, I'm almost certain it's the
>>> garbage collection.
>>>
>>> Amos
>>>
>>
>> Hi Amos,
>>
>> Thank you for your fast reply.
>> I have 2 HDD (450GB and 600GB).
>> df -h displays that I have 357GB and 505GB available.
>> In my last test, my cache settings were:
>> cache_swap_low 90
>> cache_swap_high 95
>
> This is not. For anything more than 10-20 GB I recommend setting it to
> no more than 1 apart, possibly the same value if that works.
> Squid has a light but CPU-intensive and possibly long garbage removal
> cycle above cache_swap_low, and a much more aggressive but faster and
> less CPU intensive removal above cache_swap_high. On large caches it
> is better in terms of downtime to go straight to the aggressive
> removal and clear disk space fast, despite the bandwidth cost of
> replacing any items the light removal would have left.
>
>
> Amos
>
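To put the watermark advice in numbers, here is a small Python sketch. It assumes, following the description above, that the gap between cache_swap_low and cache_swap_high is a whole percentage of the total configured cache size, and it uses the ten cache_dir sizes (in MB) from the original post purely as an illustration. The helper name watermark_gap_mb is hypothetical and not part of Squid.

```python
# cache_dir sizes in MB, taken from the ten cache_dir lines in the original post.
CACHE_DIR_MB = [90000] * 9 + [80000]


def watermark_gap_mb(total_mb, swap_low, swap_high):
    """MB of cache between the low and high watermarks (both in whole percent).

    Assumption: the watermarks are percentages of the total cache size.
    """
    return total_mb * (swap_high - swap_low) // 100


total_mb = sum(CACHE_DIR_MB)
for low, high in [(90, 95), (90, 91), (90, 90)]:
    gap = watermark_gap_mb(total_mb, low, high)
    print(f"cache_swap_low {low}, cache_swap_high {high}: "
          f"{gap} MB between watermarks")
```

Under that assumption, the 890000 MB configured in the original post gives 44500 MB between the watermarks with a 5-point gap, and 8900 MB with a 1-point gap.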
Hi Amos,

I have set cache_swap_high to 90 and cache_swap_low to 90 with two
cache_dirs (one for each HDD), but I still have the same problem.
I ran strace while the problem was occurring:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 23.06    0.004769           0     85681        96 write
 21.07    0.004359           0     24658         5 futex
 19.34    0.004001         800         5           open
  6.54    0.001352           0      5101      5101 connect
  6.46    0.001337           3       491           epoll_wait
  5.34    0.001104           0     51938      9453 read
  3.90    0.000806           0     39727           close
  3.54    0.000733           0     86400           epoll_ctl
  3.54    0.000732           0     32357           sendto
  2.02    0.000417           0     56721           recvmsg
  1.84    0.000381           0     24064           socket
  0.96    0.000199           0     56264           fcntl
  0.77    0.000159           0      6366       329 accept
  0.53    0.000109           0     24033           bind
  0.52    0.000108           0     30085           getsockname
  0.21    0.000044           0     11200           stat
  0.21    0.000044           0      6998       359 recvfrom
  0.09    0.000019           0      5085           getsockopt
  0.06    0.000012           0      2887           lseek
  0.00    0.000000           0        98           brk
  0.00    0.000000           0        16           dup2
  0.00    0.000000           0     10314           setsockopt
  0.00    0.000000           0         4           getdents
  0.00    0.000000           0         3           getrusage
------ ----------- ----------- --------- --------- ----------------
100.00    0.020685                560496     15343 total

This is the strace of squid when it is working normally:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 24.88    0.015887           0    455793       169 write
 13.72    0.008764           0    112185           epoll_wait
 11.67    0.007454           0    256256     27158 read
  8.47    0.005408           0    169133           sendto
  6.94    0.004430           0    159596           close
  6.85    0.004373           0    387359           epoll_ctl
  6.42    0.004102           0     19651     19651 connect
  5.54    0.003538           0    290289           recvmsg
  3.81    0.002431           0    116515           socket
  3.53    0.002254           0    164750           futex
  1.68    0.001075           0    207688           fcntl
  1.53    0.000974           0     95228     23139 recvfrom
  1.29    0.000821           0     33408     12259 accept
  1.14    0.000726           0     46582           stat
  1.11    0.000707           0    110826           bind
  0.85    0.000544           0    137574           getsockname
  0.32    0.000204           0     21642           getsockopt
  0.26    0.000165           0     39502           setsockopt
  0.01    0.000007           0      8092           lseek
  0.00    0.000000           0       248           open
  0.00    0.000000           0         4           brk
  0.00    0.000000           0        88           dup2
  0.00    0.000000           0        14           getdents
  0.00    0.000000           0         6           getrusage
------ ----------- ----------- --------- --------- ----------------
100.00    0.063864               2832429     82376 total
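
To make the two summaries easier to compare, here is a minimal Python sketch that parses a few rows copied from the tables above and recomputes the average time per call. It assumes the rows follow the usual strace -c column order (% time, seconds, usecs/call, calls, optional errors, syscall). The names parse_row, parse_summary and usecs_per_call are made up for this illustration.

```python
# Rows copied from the two strace summaries above (a subset, for brevity).
PROBLEM = """\
23.06 0.004769 0 85681 96 write
21.07 0.004359 0 24658 5 futex
19.34 0.004001 800 5 open
6.46 0.001337 3 491 epoll_wait
"""

NORMAL = """\
24.88 0.015887 0 455793 169 write
3.53 0.002254 0 164750 futex
0.00 0.000000 0 248 open
13.72 0.008764 0 112185 epoll_wait
"""


def parse_row(line):
    """Split one summary row; the errors column is absent when there were none."""
    fields = line.split()
    name = fields[-1]
    seconds = float(fields[1])
    calls = int(fields[3])
    errors = int(fields[4]) if len(fields) == 6 else 0
    return name, {"seconds": seconds, "calls": calls, "errors": errors}


def parse_summary(text):
    """Map syscall name to its parsed counters for every non-empty row."""
    return dict(parse_row(line) for line in text.splitlines() if line.strip())


def usecs_per_call(entry):
    """Average microseconds per call, recomputed from seconds and calls."""
    return entry["seconds"] * 1e6 / entry["calls"] if entry["calls"] else 0.0


problem = parse_summary(PROBLEM)
normal = parse_summary(NORMAL)

for name in problem:
    print(f"{name:12s} problem={usecs_per_call(problem[name]):8.2f} us/call "
          f"normal={usecs_per_call(normal[name]):8.2f} us/call")
```

With these rows, open is the call that differs most: about 800 microseconds per call during the problem, against effectively zero during normal operation, although it was only called 5 times in the problem sample.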

Do you have any suggestions to solve the issue? Can I run the garbage
collector more frequently? Is it better to change the cache_dir type
from aufs to something else?
Do you see the problem in the strace?

Thank you,
Elie
Received on Mon Nov 28 2011 - 09:17:52 MST
