Re: [squid-users] Squid losing connectivity for 30 seconds

From: Elie Merhej <emerhej_at_wise.net.lb>
Date: Tue, 29 Nov 2011 08:04:54 +0200

>>>>> Hi,
>>>>>
>>>>> I am currently facing a problem that I wasn't able to find a
>>>>> solution for in the mailing list or on the internet.
>>>>> My Squid dies for 30 seconds every hour, at exactly the same
>>>>> time; the squid process keeps running.
>>>>> I lose my WCCP connectivity, the cache peers detect the Squid as a
>>>>> dead sibling, and Squid cannot serve any requests.
>>>>> The network connectivity of the server is not affected (a ping to
>>>>> the Squid's IP doesn't time out).
>>>>>
>>>>> The problem doesn't start immediately after Squid is installed
>>>>> on the server (the server is dedicated to Squid).
>>>>> It starts when the cache directories begin to fill up.
>>>>> I started my setup with 10 cache directories; Squid starts
>>>>> having the problem when the cache directories are more than 50%
>>>>> full.
>>>>> When I change the number of cache directories (9, 8, ...), Squid
>>>>> works for a while, then the same problem returns.
>>>>> cache_dir aufs /cache1/squid 90000 140 256
>>>>> cache_dir aufs /cache2/squid 90000 140 256
>>>>> cache_dir aufs /cache3/squid 90000 140 256
>>>>> cache_dir aufs /cache4/squid 90000 140 256
>>>>> cache_dir aufs /cache5/squid 90000 140 256
>>>>> cache_dir aufs /cache6/squid 90000 140 256
>>>>> cache_dir aufs /cache7/squid 90000 140 256
>>>>> cache_dir aufs /cache8/squid 90000 140 256
>>>>> cache_dir aufs /cache9/squid 90000 140 256
>>>>> cache_dir aufs /cache10/squid 80000 140 256
>>>>>
>>>>> I have 1 terabyte of storage.
>>>>> Finally I created two cache directories (one on each HDD), but
>>>>> the problem persisted.
>>>>
>>>> You have 2 HDD? but, but, you have 10 cache_dir.
>>>> We repeatedly say "one cache_dir per disk" or similar. In
>>>> particular, one cache_dir per physical drive spindle (for "disks"
>>>> made up of multiple physical spindles) wherever possible, with the
>>>> drives/spindles mounted separately to enforce the pairing.
>>>> Squid performs a very unusual pattern of disk I/O which stresses
>>>> disks down to the hardware controller level and makes this kind
>>>> of detail critical for anything like good speed. Working around
>>>> the per-cache_dir object limit by adding more UFS-based dirs on
>>>> one disk does not improve the situation.
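>>>>
>>>> For example, with two disks mounted separately, something along
>>>> these lines (a sketch only; the mount points and MB sizes here
>>>> are assumptions, and the sizes should stay well below what "df"
>>>> reports so the filesystem keeps some headroom):
>>>>
>>>> # one cache_dir per physical disk, size in MB
>>>> cache_dir aufs /cache1/squid 300000 140 256
>>>> cache_dir aufs /cache2/squid 420000 140 256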
>>>>
>>>> That problem will be affecting your Squid all the time though,
>>>> and is possibly making the source of the pauses worse.
>>>>
>>>> From the description I believe it is garbage collection on the
>>>> cache directories. The pauses can be visible when garbage
>>>> collecting any cache over a few dozen GB. The Squid defaults for
>>>> "cache_swap_high" and "cache_swap_low" are 5 apart (the minimum
>>>> possible gap being 0). These are whole percentage points of the
>>>> total cache size, erased from disk in a somewhat random-access
>>>> style across the cache area. I did mention uncommon disk I/O
>>>> patterns, right?
>>>>
>>>> To be sure what it is, you can attach the "strace" tool to the
>>>> Squid worker process (the second PID in current stable Squids) and
>>>> see what it is doing. But given the hourly regularity and past
>>>> experience with others on similar cache sizes, I'm almost certain
>>>> it's the garbage collection.
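>>>>
>>>> Something like this (a sketch; substitute the actual worker PID
>>>> on your system):
>>>>
>>>> strace -c -p <worker PID>
>>>>
>>>> Leave it attached across one of the hourly pauses, then interrupt
>>>> it to get the per-syscall summary table.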
>>>>
>>>> Amos
>>>>
>>>
>>> Hi Amos,
>>>
>>> Thank you for your fast reply,
>>> I have 2 HDD (450GB and 600GB)
>>> df -h shows that I have 357GB and 505GB available.
>>> In my last test, my settings were:
>>> cache_swap_low 90
>>> cache_swap_high 95
>>
>> That gap is too wide. For anything more than 10-20 GB I recommend
>> setting the two to no more than 1 apart, possibly the same value if
>> that works.
>> Squid has a light but CPU-intensive, and possibly long, garbage
>> removal cycle above cache_swap_low, and a much more aggressive but
>> faster and less CPU-intensive removal above cache_swap_high. On
>> large caches it is better, in terms of downtime, to go straight to
>> the aggressive removal and clear disk space fast, despite the
>> bandwidth cost of re-fetching any items the light removal would
>> have kept.
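>>
>> For example (a sketch of the idea only, not a tested tuning):
>>
>> # gap of 1; the same value for both may also work
>> cache_swap_low 94
>> cache_swap_high 95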
>>
>>
>> Amos
>>
> Hi Amos,
>
> I have changed to cache_swap_high 90 and cache_swap_low 90, with two
> cache_dir entries (one for each HDD), and I still have the same
> problem.
> I did an strace when the problem occurred:
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
> 23.06 0.004769 0 85681 96 write
> 21.07 0.004359 0 24658 5 futex
> 19.34 0.004001 800 5 open
> 6.54 0.001352 0 5101 5101 connect
> 6.46 0.001337 3 491 epoll_wait
> 5.34 0.001104 0 51938 9453 read
> 3.90 0.000806 0 39727 close
> 3.54 0.000733 0 86400 epoll_ctl
> 3.54 0.000732 0 32357 sendto
> 2.02 0.000417 0 56721 recvmsg
> 1.84 0.000381 0 24064 socket
> 0.96 0.000199 0 56264 fcntl
> 0.77 0.000159 0 6366 329 accept
> 0.53 0.000109 0 24033 bind
> 0.52 0.000108 0 30085 getsockname
> 0.21 0.000044 0 11200 stat
> 0.21 0.000044 0 6998 359 recvfrom
> 0.09 0.000019 0 5085 getsockopt
> 0.06 0.000012 0 2887 lseek
> 0.00 0.000000 0 98 brk
> 0.00 0.000000 0 16 dup2
> 0.00 0.000000 0 10314 setsockopt
> 0.00 0.000000 0 4 getdents
> 0.00 0.000000 0 3 getrusage
> ------ ----------- ----------- --------- --------- ----------------
> 100.00 0.020685 560496 15343 total
>
> this is the strace of squid when it is working normally:
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
> 24.88 0.015887 0 455793 169 write
> 13.72 0.008764 0 112185 epoll_wait
> 11.67 0.007454 0 256256 27158 read
> 8.47 0.005408 0 169133 sendto
> 6.94 0.004430 0 159596 close
> 6.85 0.004373 0 387359 epoll_ctl
> 6.42 0.004102 0 19651 19651 connect
> 5.54 0.003538 0 290289 recvmsg
> 3.81 0.002431 0 116515 socket
> 3.53 0.002254 0 164750 futex
> 1.68 0.001075 0 207688 fcntl
> 1.53 0.000974 0 95228 23139 recvfrom
> 1.29 0.000821 0 33408 12259 accept
> 1.14 0.000726 0 46582 stat
> 1.11 0.000707 0 110826 bind
> 0.85 0.000544 0 137574 getsockname
> 0.32 0.000204 0 21642 getsockopt
> 0.26 0.000165 0 39502 setsockopt
> 0.01 0.000007 0 8092 lseek
> 0.00 0.000000 0 248 open
> 0.00 0.000000 0 4 brk
> 0.00 0.000000 0 88 dup2
> 0.00 0.000000 0 14 getdents
> 0.00 0.000000 0 6 getrusage
> ------ ----------- ----------- --------- --------- ----------------
> 100.00 0.063864 2832429 82376 total
>
> Do you have any suggestions to solve the issue? Can I run the
> garbage collector more frequently? Would it be better to change the
> cache_dir type from aufs to something else?
> Do you see the problem in the strace?
>
> Thank you,
> Elie
>
>
Hi,

Please note that Squid hits the same problem even when there is no
activity and no clients connected to it.

Regards
Elie
Received on Tue Nov 29 2011 - 06:05:04 MST
