>>>>  Hi,
>>>>
>>>> I am currently facing a problem that I wasn't able to find a
>>>> solution for in the mailing list or on the internet.
>>>> My Squid dies for 30 seconds every hour, at the same exact time.
>>>> The squid process keeps running, but I lose my WCCP connectivity,
>>>> the cache peers detect the Squid as a dead sibling, and Squid
>>>> cannot serve any requests.
>>>> The network connectivity of the server is not affected (a ping to
>>>> the Squid's IP doesn't time out).
>>>>
>>>> The problem doesn't start immediately after Squid is installed
>>>> on the server (the server is dedicated to Squid).
>>>> It starts when the cache directories begin to fill up.
>>>> I started my setup with 10 cache directories; Squid starts having
>>>> the problem once the cache directories are more than 50% full.
>>>> When I change the number of cache directories (9, 8, ...), Squid
>>>> works for a while and then the same problem returns.
>>>> My cache_dir lines were:
>>>> cache_dir aufs /cache1/squid 90000 140 256
>>>> cache_dir aufs /cache2/squid 90000 140 256
>>>> cache_dir aufs /cache3/squid 90000 140 256
>>>> cache_dir aufs /cache4/squid 90000 140 256
>>>> cache_dir aufs /cache5/squid 90000 140 256
>>>> cache_dir aufs /cache6/squid 90000 140 256
>>>> cache_dir aufs /cache7/squid 90000 140 256
>>>> cache_dir aufs /cache8/squid 90000 140 256
>>>> cache_dir aufs /cache9/squid 90000 140 256
>>>> cache_dir aufs /cache10/squid 80000 140 256
>>>>
>>>> I have 1 terabyte of storage.
>>>> Finally I created two cache directories (one on each HDD), but
>>>> the problem persisted.
>>>
>>> You have 2 HDDs?  But, but, you have 10 cache_dir lines.
>>>  We repeatedly say "one cache_dir per disk" or similar. In
>>> particular, one cache_dir per physical drive spindle (for "disks"
>>> made up of multiple physical spindles) wherever possible, with the
>>> physical drives/spindles mounted separately to ensure the pairing.
>>> Squid performs a very unusual pattern of disk I/O which stresses
>>> the disks down to the hardware controller level and makes this
>>> kind of detail critical for anything like good speed. Working
>>> around cache_dir object limits by adding more UFS-based dirs to
>>> one disk does not improve the situation.
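>>>
>>> A minimal sketch of what that looks like with your two disks
>>> (mount points and sizes here are illustrative only; each cache_dir
>>> size must leave headroom below the filesystem's real capacity):
>>>
>>>   cache_dir aufs /cache1/squid 300000 140 256
>>>   cache_dir aufs /cache2/squid 400000 140 256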
>>>
>>> That is a problem which will be affecting your Squid all the
>>> time, though, and it possibly makes the source of the pause worse.
>>>
>>> From the description I believe it is garbage collection on the
>>> cache directories. The pauses can become visible when garbage
>>> collecting any cache over a few dozen GB. Squid's default
>>> "swap_high" and "swap_low" values are 5 apart, with the minimum
>>> possible being 0 apart. These are whole percentage points of the
>>> total cache size, being erased from disk in a somewhat
>>> random-access style across the cache area. I did mention uncommon
>>> disk I/O patterns, right?
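>>>
>>> As a rough worked example (taking the cache_dir sizes you posted
>>> as given): 9 x 90000 MB + 80000 MB = 890000 MB, i.e. roughly
>>> 890 GB of cache space. A 5-point gap then means up to about
>>> 0.05 x 890 GB = ~44 GB of objects erased in one collection pass,
>>> which is a lot of scattered disk deletion to do in one go.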
>>>
>>> To be sure what it is, you can attach the "strace" tool to the
>>> squid worker process (the second PID in current stable Squids) and
>>> see what is running. But given the hourly regularity and past
>>> experience with others on similar cache sizes, I'm almost certain
>>> it's the garbage collection.
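>>>
>>> A minimal sketch of such a run (the PID is hypothetical; -c
>>> aggregates per-syscall counts, -f follows threads; attach just
>>> before the pause is due and press Ctrl-C afterwards to print the
>>> summary table):
>>>
>>>   strace -c -f -p 1234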
>>>
>>> Amos
>>>
>>
>> Hi Amos,
>>
>> Thank you for your fast reply.
>> I have 2 HDDs (450 GB and 600 GB);
>> df -h shows that I have 357 GB and 505 GB available.
>> In my last test, my swap settings were:
>> cache_swap_low 90
>> cache_swap_high 95
>
> That gap is too wide. For anything more than 10-20 GB I recommend
> setting the two values no more than 1 apart, possibly even the same
> value if that works for you.
> Squid has a light but CPU-intensive and possibly long garbage
> removal cycle above cache_swap_low, and a much more aggressive but
> faster and less CPU-intensive removal above cache_swap_high. On
> large caches it is better, in terms of downtime, to go straight to
> the aggressive removal and clear disk space fast, despite the
> bandwidth cost of replacing any items the light removal would have
> left in place.
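>
> A minimal sketch of that tuning (the values are illustrative; equal
> values skip the light removal cycle entirely):
>
>   cache_swap_low 95
>   cache_swap_high 95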
>
>
> Amos
>
Hi Amos,
I have changed swap_high to 90 and swap_low to 90 with two cache
dirs (one for each HDD), and I still have the same problem.
I ran an strace summary while the problem was occurring:
------ ----------- ----------- --------- --------- ----------------
  23.06    0.004769           0     85681        96 write
  21.07    0.004359           0     24658         5 futex
  19.34    0.004001         800         5           open
   6.54    0.001352           0      5101      5101 connect
   6.46    0.001337           3       491           epoll_wait
   5.34    0.001104           0     51938      9453 read
   3.90    0.000806           0     39727           close
   3.54    0.000733           0     86400           epoll_ctl
   3.54    0.000732           0     32357           sendto
   2.02    0.000417           0     56721           recvmsg
   1.84    0.000381           0     24064           socket
   0.96    0.000199           0     56264           fcntl
   0.77    0.000159           0      6366       329 accept
   0.53    0.000109           0     24033           bind
   0.52    0.000108           0     30085           getsockname
   0.21    0.000044           0     11200           stat
   0.21    0.000044           0      6998       359 recvfrom
   0.09    0.000019           0      5085           getsockopt
   0.06    0.000012           0      2887           lseek
   0.00    0.000000           0        98           brk
   0.00    0.000000           0        16           dup2
   0.00    0.000000           0     10314           setsockopt
   0.00    0.000000           0         4           getdents
   0.00    0.000000           0         3           getrusage
------ ----------- ----------- --------- --------- ----------------
100.00    0.020685                560496     15343 total
This is the strace of Squid when it is working normally:
------ ----------- ----------- --------- --------- ----------------
  24.88    0.015887           0    455793       169 write
  13.72    0.008764           0    112185           epoll_wait
  11.67    0.007454           0    256256     27158 read
   8.47    0.005408           0    169133           sendto
   6.94    0.004430           0    159596           close
   6.85    0.004373           0    387359           epoll_ctl
   6.42    0.004102           0     19651     19651 connect
   5.54    0.003538           0    290289           recvmsg
   3.81    0.002431           0    116515           socket
   3.53    0.002254           0    164750           futex
   1.68    0.001075           0    207688           fcntl
   1.53    0.000974           0     95228     23139 recvfrom
   1.29    0.000821           0     33408     12259 accept
   1.14    0.000726           0     46582           stat
   1.11    0.000707           0    110826           bind
   0.85    0.000544           0    137574           getsockname
   0.32    0.000204           0     21642           getsockopt
   0.26    0.000165           0     39502           setsockopt
   0.01    0.000007           0      8092           lseek
   0.00    0.000000           0       248           open
   0.00    0.000000           0         4           brk
   0.00    0.000000           0        88           dup2
   0.00    0.000000           0        14           getdents
   0.00    0.000000           0         6           getrusage
------ ----------- ----------- --------- --------- ----------------
100.00    0.063864               2832429     82376 total
Do you have any suggestions to solve the issue? Can I run the
garbage collector more frequently? Is it better to change the
cache_dir type from aufs to something else?
Do you see the problem in the strace?
Thank you,
Elie