Re: [squid-users] Squid losing connectivity for 30 seconds

From: Elie Merhej <emerhej_at_wise.net.lb>
Date: Tue, 29 Nov 2011 10:43:58 +0200

>>>>>> Hi,
>>>>>>
>>>>>> I am currently facing a problem that I wasn't able to find a
>>>>>> solution for in the mailing list or on the internet,
>>>>>> My Squid is dying for 30 seconds every hour at the exact same
>>>>>> time, while the squid process is still running.
>>>>>> I lose my WCCP connectivity, the cache peers detect the Squid as
>>>>>> a dead sibling, and Squid cannot serve any requests.
>>>>>> The network connectivity of the server is not affected (a ping
>>>>>> to the Squid's IP doesn't time out).
>>>>>>
>>>>>> The problem doesn't start immediately after Squid is installed
>>>>>> on the server (the server is dedicated to Squid).
>>>>>> It starts when the cache directories begin to fill up.
>>>>>> I started my setup with 10 cache directories; Squid starts
>>>>>> having the problem when the cache directories are above 50%
>>>>>> full.
>>>>>> When I change the number of cache directories (9, 8, ...), Squid
>>>>>> works for a while, then the same problem recurs:
>>>>>> cache_dir aufs /cache1/squid 90000 140 256
>>>>>> cache_dir aufs /cache2/squid 90000 140 256
>>>>>> cache_dir aufs /cache3/squid 90000 140 256
>>>>>> cache_dir aufs /cache4/squid 90000 140 256
>>>>>> cache_dir aufs /cache5/squid 90000 140 256
>>>>>> cache_dir aufs /cache6/squid 90000 140 256
>>>>>> cache_dir aufs /cache7/squid 90000 140 256
>>>>>> cache_dir aufs /cache8/squid 90000 140 256
>>>>>> cache_dir aufs /cache9/squid 90000 140 256
>>>>>> cache_dir aufs /cache10/squid 80000 140 256
>>>>>>
>>>>>> I have 1 terabyte of storage
>>>>>> Finally I created two cache directories (one on each HDD), but
>>>>>> the problem persisted.
>>>>>
>>>>> You have 2 HDD? but, but, you have 10 cache_dir.
>>>>> We repeatedly say "one cache_dir per disk" or similar. In
>>>>> particular, use one cache_dir per physical drive spindle (for
>>>>> "disks" made up of multiple physical spindles) wherever possible,
>>>>> with the physical drives/spindles mounted separately to ensure
>>>>> the pairing. Squid performs a very unusual pattern of disk I/O
>>>>> which stresses disks down to the hardware controller level and
>>>>> makes this kind of detail critical for anything like good speed.
>>>>> Avoiding per-cache_dir object limits by adding more UFS-based
>>>>> dirs to one disk does not improve the situation.
>>>>>
>>>>> That problem will be affecting your Squid all the time, though,
>>>>> possibly making the source of the pauses worse.
>>>>>
>>>>> From the description I believe it is garbage collection on the
>>>>> cache directories. The pauses can be visible when garbage
>>>>> collecting any cache over a few dozen GB. The Squid defaults for
>>>>> "cache_swap_high" and "cache_swap_low" are 5 apart, with the
>>>>> minimum being 0 apart. These are whole % points of the total
>>>>> cache size, being erased from disk in a somewhat random-access
>>>>> style across the cache area. I did mention uncommon disk I/O
>>>>> patterns, right?
>>>>>
>>>>> To be sure what it is, you can attach the "strace" tool to the
>>>>> Squid worker process (the second PID in current stable Squids)
>>>>> and see what is running. But given the hourly regularity and past
>>>>> experience with others on similar cache sizes, I'm almost certain
>>>>> it's the garbage collection.
>>>>>
>>>>> Amos
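[For a sense of scale, the gap between the swap marks translates into a large one-shot deletion on a cache this size. A rough back-of-the-envelope sketch — the ~890 GB total is taken from the cache_dir lines above, and the function name and mark values here are purely illustrative:]

```python
def gc_removal_gb(total_cache_gb: float, swap_low: int, swap_high: int) -> float:
    """GB erased when Squid trims usage from swap_high back down to swap_low.

    The swap marks are whole percentage points of the total cache size.
    """
    return total_cache_gb * (swap_high - swap_low) / 100.0

# ~890 GB of cache_dir space with a default-style 5-point gap:
print(gc_removal_gb(890, 90, 95))   # 44.5 GB removed in one cycle

# with the marks only 1 apart, each cycle is far smaller:
print(gc_removal_gb(890, 94, 95))   # ~8.9 GB
```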
>>>>>
>>>>
>>>> Hi Amos,
>>>>
>>>> Thank you for your fast reply,
>>>> I have 2 HDDs (450GB and 600GB);
>>>> df -h shows that I have 357GB and 505GB available.
>>>> In my last test, my settings were:
>>>> cache_swap_low 90
>>>> cache_swap_high 95
>>>
>>> This is not ideal. For anything more than 10-20 GB I recommend
>>> setting them no more than 1 apart, possibly to the same value if
>>> that works. Squid has a light but CPU-intensive and possibly long
>>> garbage removal cycle above cache_swap_low, and a much more
>>> aggressive but faster and less CPU-intensive removal above
>>> cache_swap_high. On large caches it is better, in terms of
>>> downtime, to go straight to the aggressive removal and clear disk
>>> space fast, despite the bandwidth cost of replacing any items the
>>> light removal would have left.
>>>
>>>
>>> Amos
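[Putting both recommendations together — one cache_dir per physical disk, and swap marks no more than 1 apart — a squid.conf sketch might look like the following. The mount points and sizes are illustrative, assuming the two disks are mounted separately and each cache_dir is sized below the free space df -h reports (357GB and 505GB):]

```
# one aufs cache_dir per physical disk, sized below the free space
cache_dir aufs /cache1/squid 300000 140 256
cache_dir aufs /cache2/squid 450000 140 256

# keep the garbage-collection watermarks close together so each
# cleanup cycle erases only ~1% of the total cache at a time
cache_swap_low  94
cache_swap_high 95
```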
>>>
>> Hi Amos,
>>
>> I have changed cache_swap_high and cache_swap_low to 90, with two
>> cache_dirs (one for each HDD); I still have the same problem.
>> I ran strace when the problem occurred:
>> % time     seconds  usecs/call     calls    errors syscall
>> ------ ----------- ----------- --------- --------- ----------------
>> 23.06 0.004769 0 85681 96 write
>> 21.07 0.004359 0 24658 5 futex
>> 19.34 0.004001 800 5 open
>> 6.54 0.001352 0 5101 5101 connect
>> 6.46 0.001337 3 491 epoll_wait
>> 5.34 0.001104 0 51938 9453 read
>> 3.90 0.000806 0 39727 close
>> 3.54 0.000733 0 86400 epoll_ctl
>> 3.54 0.000732 0 32357 sendto
>> 2.02 0.000417 0 56721 recvmsg
>> 1.84 0.000381 0 24064 socket
>> 0.96 0.000199 0 56264 fcntl
>> 0.77 0.000159 0 6366 329 accept
>> 0.53 0.000109 0 24033 bind
>> 0.52 0.000108 0 30085 getsockname
>> 0.21 0.000044 0 11200 stat
>> 0.21 0.000044 0 6998 359 recvfrom
>> 0.09 0.000019 0 5085 getsockopt
>> 0.06 0.000012 0 2887 lseek
>> 0.00 0.000000 0 98 brk
>> 0.00 0.000000 0 16 dup2
>> 0.00 0.000000 0 10314 setsockopt
>> 0.00 0.000000 0 4 getdents
>> 0.00 0.000000 0 3 getrusage
>> ------ ----------- ----------- --------- --------- ----------------
>> 100.00 0.020685 560496 15343 total
>>
>> This is the strace of Squid when it is working normally:
>> % time     seconds  usecs/call     calls    errors syscall
>> ------ ----------- ----------- --------- --------- ----------------
>> 24.88 0.015887 0 455793 169 write
>> 13.72 0.008764 0 112185 epoll_wait
>> 11.67 0.007454 0 256256 27158 read
>> 8.47 0.005408 0 169133 sendto
>> 6.94 0.004430 0 159596 close
>> 6.85 0.004373 0 387359 epoll_ctl
>> 6.42 0.004102 0 19651 19651 connect
>> 5.54 0.003538 0 290289 recvmsg
>> 3.81 0.002431 0 116515 socket
>> 3.53 0.002254 0 164750 futex
>> 1.68 0.001075 0 207688 fcntl
>> 1.53 0.000974 0 95228 23139 recvfrom
>> 1.29 0.000821 0 33408 12259 accept
>> 1.14 0.000726 0 46582 stat
>> 1.11 0.000707 0 110826 bind
>> 0.85 0.000544 0 137574 getsockname
>> 0.32 0.000204 0 21642 getsockopt
>> 0.26 0.000165 0 39502 setsockopt
>> 0.01 0.000007 0 8092 lseek
>> 0.00 0.000000 0 248 open
>> 0.00 0.000000 0 4 brk
>> 0.00 0.000000 0 88 dup2
>> 0.00 0.000000 0 14 getdents
>> 0.00 0.000000 0 6 getrusage
>> ------ ----------- ----------- --------- --------- ----------------
>> 100.00 0.063864 2832429 82376 total
>>
>> Do you have any suggestions to solve the issue? Can I run the
>> garbage collector more frequently? Is it better to change the
>> cache_dir type from aufs to something else?
>> Do you see the problem in the strace?
>>
>> Thank you,
>> Elie
>>
>>
> Hi,
>
> Please note that Squid is facing the same problem even when there is
> no activity and no clients connected to it.
>
> Regards
> Elie
>
Hi,

Here is the strace result:
-----------------------------------------------------------------------------------------------------
connect(7595, {sa_family=AF_INET, sin_port=htons(80),
sin_addr=inet_addr("2.22.50.75")}, 16) = -1 EINPROGRESS (Operation now
in progress)
epoll_ctl(6, EPOLL_CTL_ADD, 7595, {EPOLLOUT|EPOLLERR|EPOLLHUP,
{u32=7595, u64=13133095884688465323}}) = 0
futex(0x9f65ec, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x9f65e8, {FUTEX_OP_SET, 0,
FUTEX_OP_CMP_GT, 1}) = 1
futex(0x9f65c0, FUTEX_WAKE_PRIVATE, 1) = 1
write(35,
"\2\0\0\0\375\254\21\0002r\316N\0\0\0\0002r\316N\0\0\0\0\262\245\257P\0\0\0\0"...,
72) = 72
futex(0x9f65ec, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x9f65e8, {FUTEX_OP_SET, 0,
FUTEX_OP_CMP_GT, 1}) = 1
futex(0x9f65c0, FUTEX_WAKE_PRIVATE, 1) = 1
write(35,
"\2\0\0\0\306\254\21\0000r\316N\0\0\0\0000r\316N\0\0\0\0\377\377\377\377\377\377\377\377"...,
72) = 72
futex(0x9f65ec, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x9f65e8, {FUTEX_OP_SET, 0,
FUTEX_OP_CMP_GT, 1}) = 1
futex(0x9f65c0, FUTEX_WAKE_PRIVATE, 1) = 1
write(35,
"\2\0\0\0\262\251\21\0\360\376\315N\0\0\0\0002r\316N\0\0\0\0\377\377\377\377\377\377\377\377"...,
72) = 72
futex(0x9f65ec, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x9f65e8, {FUTEX_OP_SET, 0,
FUTEX_OP_CMP_GT, 1}) = 1
futex(0x9f65c0, FUTEX_WAKE_PRIVATE, 1) = 1
write(35,
"\2\0\0\0\2\255\21\0002r\316N\0\0\0\0003r\316N\0\0\0\0\321\376\316N\0\0\0\0"...,
72) = 72
futex(0x9f65ec, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x9f65e8, {FUTEX_OP_SET, 0,
FUTEX_OP_CMP_GT, 1}) = 1
futex(0x9f65c0, FUTEX_WAKE_PRIVATE, 1) = 1
write(35,
"\2\0\0\0\3\255\21\0002r\316N\0\0\0\0003r\316N\0\0\0\0002\347\340N\0\0\0\0"...,
72) = 72
futex(0x9f65ec, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x9f65e8, {FUTEX_OP_SET, 0,
FUTEX_OP_CMP_GT, 1}) = 1
futex(0x9f65c0, FUTEX_WAKE_PRIVATE, 1) = 1
write(35,
"\2\0\0\0\371\254\21\0002r\316N\0\0\0\0002r\316N\0\0\0\0\377\377\377\377\377\377\377\377"...,
72) = 72
futex(0x9f65ec, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x9f65e8, {FUTEX_OP_SET, 0,
FUTEX_OP_CMP_GT, 1}) = 1
futex(0x9f65c0, FUTEX_WAKE_PRIVATE, 1) = 1
write(35,
"\2\0\0\0\n\255\21\0003r\316N\0\0\0\0003r\316N\0\0\0\0\263\245\257P\0\0\0\0"...,
72) = 72
futex(0x9f65ec, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x9f65e8, {FUTEX_OP_SET, 0,
FUTEX_OP_CMP_GT, 1}) = 1
futex(0x9f65c0, FUTEX_WAKE_PRIVATE, 1) = 1
write(35,
"\2\0\0\0\v\255\21\0003r\316N\0\0\0\0003r\316N\0\0\0\0\323\370\317N\0\0\0\0"...,
72) = 72
futex(0x9f65ec, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x9f65e8, {FUTEX_OP_SET, 0,
FUTEX_OP_CMP_GT, 1}) = 1
futex(0x9f65c0, FUTEX_WAKE_PRIVATE, 1) = 1
write(35,
"\2\0\0\0\f\255\21\0003r\316N\0\0\0\0003r\316N\0\0\0\0003\377\365N\0\0\0\0"...,
72) = 72
futex(0x9f65ec, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x9f65e8, {FUTEX_OP_SET, 0,
FUTEX_OP_CMP_GT, 1}) = 1
futex(0x9f65c0, FUTEX_WAKE_PRIVATE, 1) = 1
write(35,
"\2\0\0\0\334\251\21\0@\\\316N\0\0\0\0003r\316N\0\0\0\0\377\377\377\377\377\377\377\377"...,
72) = 72
read(165, "!", 256) = 1
close(379) = 0
epoll_ctl(6, EPOLL_CTL_DEL, 9525, {0, {u32=9525,
u64=25160128873375029}}) = 0
epoll_ctl(6, EPOLL_CTL_ADD, 9525, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=9525,
u64=13133094175291483445}}) = 0
epoll_ctl(6, EPOLL_CTL_MOD, 9525, {EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP,
{u32=9525, u64=23686740342482229}}) = 0
epoll_ctl(6, EPOLL_CTL_MOD, 4554, {EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP,
{u32=4554, u64=549755818442}}) = 0
close(68) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 68
fcntl(68, F_GETFD) = 0
fcntl(68, F_SETFD, FD_CLOEXEC) = 0
setsockopt(68, SOL_IP, 0x13 /* IP_??? */, [1], 4) = 0
bind(68, {sa_family=AF_INET, sin_port=htons(0),
sin_addr=inet_addr("95.141.62.17")}, 16) = 0
fcntl(68, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(68, F_SETFL, O_RDWR|O_NONBLOCK) = 0
setsockopt(68, SOL_TCP, TCP_NODELAY, [1], 4) = 0
getsockname(68, {sa_family=AF_INET, sin_port=htons(60303),
sin_addr=inet_addr("95.141.62.17")}, [16]) = 0
socket(PF_NETLINK, SOCK_RAW, 0) = 379
bind(379, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 0
getsockname(379, {sa_family=AF_NETLINK, pid=11594, groups=00000000},
[12]) = 0
sendto(379, "\24\0\0\0\26\0\1\3t\232\324N\0\0\0\0\0\0\0\0", 20, 0,
{sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 20
recvmsg(379, {msg_name(12)={sa_family=AF_NETLINK, pid=0,
groups=00000000},
msg_iov(1)=[{"0\0\0\0\24\0\2\0t\232\324NJ-\0\0\2\10\200\376\1\0\0\0\10\0\1\0\177\0\0\1"...,
4096}], msg_controllen=0, msg_flags=0}, 0) = 108
recvmsg(379, {msg_name(12)={sa_family=AF_NETLINK, pid=0,
groups=00000000},
msg_iov(1)=[{"@\0\0\0\24\0\2\0t\232\324NJ-\0\0\n\200\200\376\1\0\0\0\24\0\1\0\0\0\0\0"...,
4096}], msg_controllen=0, msg_flags=0}, 0) = 128
recvmsg(379, {msg_name(12)={sa_family=AF_NETLINK, pid=0,
groups=00000000},
msg_iov(1)=[{"\24\0\0\0\3\0\2\0t\232\324NJ-\0\0\0\0\0\0\1\0\0\0\24\0\1\0\0\0\0\0"...,
4096}], msg_controllen=0, msg_flags=0}, 0) = 20
close(379) = 0
connect(68, {sa_family=AF_INET, sin_port=htons(80),
sin_addr=inet_addr("2.22.50.65")}, 16) = -1 EINPROGRESS (Operation now
in progress)
epoll_ctl(6, EPOLL_CTL_ADD, 68, {EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=68,
u64=13133092173836714052}}) = 0
epoll_ctl(6, EPOLL_CTL_MOD, 3658, {EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP,
{u32=3658, u64=549755817546}}) = 0
sendto(10, "\33\221\1\0\0\1\0\0\0\0\0\0\nliveupdate\vgocyberl"..., 44,
0, {sa_family=AF_INET, sin_port=htons(53),
sin_addr=inet_addr("10.249.209.102")}, 16) = 44
sendto(10, "\275\251\1\0\0\1\0\0\0\0\0\0\2tc\2v3\6cache3\1c\4pack"...,
48, 0, {sa_family=AF_INET, sin_port=htons(53),
sin_addr=inet_addr("10.249.209.102")}, 16) = 48
sendto(10, "\267\355\1\0\0\1\0\0\0\0\0\0\3crl\6thawte\3com\0\0\34\0\1",
32, 0, {sa_family=AF_INET, sin_port=htons(53),
sin_addr=inet_addr("10.249.209.102")}, 16) = 32
sendto(10, "\312\237\1\0\0\1\0\0\0\0\0\0\2p4\rbhzfrwddqtb66\20cg"...,
79, 0, {sa_family=AF_INET, sin_port=htons(53),
sin_addr=inet_addr("10.249.209.102")}, 16) = 79
sendto(10, "\261\216\1\0\0\1\0\0\0\0\0\0\4apps\10facebook\3com\0\0"...,
35, 0, {sa_family=AF_INET, sin_port=htons(53),
sin_addr=inet_addr("10.249.209.102")}, 16) = 35
sendto(10, "\237\10\1\0\0\1\0\0\0\0\0\0\6sn121w\6snt121\4mail\4"..., 45,
0, {sa_family=AF_INET, sin_port=htons(53),
sin_addr=inet_addr("10.249.209.102")}, 16) = 45
sendto(10, "FR\1\0\0\1\0\0\0\0\0\0\3api\6js-kit\3com\0\0\1\0\1", 32, 0,
{sa_family=AF_INET, sin_port=htons(53),
sin_addr=inet_addr("10.249.209.102")}, 16) = 32
sendto(10, "\207$\1\0\0\1\0\0\0\0\0\0\1l\naddthiscdn\3com\0\0\1"..., 34,
0, {sa_family=AF_INET, sin_port=htons(53),
sin_addr=inet_addr("10.249.209.102")}, 16) = 34
sendto(10, "\265#\1\0\0\1\0\0\0\0\0\0\5usage\7hosting\7toolb"..., 60, 0,
{sa_family=AF_INET, sin_port=htons(53),
sin_addr=inet_addr("10.249.209.102")}, 16) = 60
sendto(10, "hZ\1\0\0\1\0\0\0\0\0\0\3www\ntargetgulf\3com\0"..., 36, 0,
{sa_family=AF_INET, sin_port=htons(53),
sin_addr=inet_addr("10.249.209.102")}, 16) = 36
sendto(10, "\355\345\1\0\0\1\0\0\0\0\0\0\10services\4apps\7condu"...,
43, 0, {sa_family=AF_INET, sin_port=htons(53),
sin_addr=inet_addr("10.249.209.102")}, 16) = 43
sendto(10, "O\254\1\0\0\1\0\0\0\0\0\0\0060-jw-w\7channel\10face"..., 45,
0, {sa_family=AF_INET, sin_port=htons(53),
sin_addr=inet_addr("10.249.209.102")}, 16) = 45
sendto(10, "\330\310\1\0\0\1\0\0\0\0\0\0\2t1\7gstatic\3com\0\0\1\0\1",
32, 0, {sa_family=AF_INET, sin_port=htons(53),
sin_addr=inet_addr("10.249.209.102")}, 16) = 32
sendto(10, "\16Z\1\0\0\1\0\0\0\0\0\0\4mail\3aol\3com\0\0\1\0\1", 30, 0,
{sa_family=AF_INET, sin_port=htons(53),
sin_addr=inet_addr("10.249.209.102")}, 16) = 30
sendto(10, "\327\216\1\0\0\1\0\0\0\0\0\0\3www\10bisara7a\3com\0\0\1"...,
34, 0, {sa_family=AF_INET, sin_port=htons(53),
sin_addr=inet_addr("10.249.209.102")}, 16) = 34
sendto(10, "o\201\1\0\0\1\0\0\0\0\0\0\003317\7channel\10faceboo"..., 42,
0, {sa_family=AF_INET, sin_port=htons(53),
sin_addr=inet_addr("10.249.209.102")}, 16) = 42
sendto(10, "\344?\1\0\0\1\0\0\0\0\0\0\7email03\fsecureserve"..., 42, 0,
{sa_family=AF_INET, sin_port=htons(53),
sin_addr=inet_addr("10.249.209.102")}, 16) = 42
sendto(21,
"\0\0\0\n\2\0\0|\0\0\0\4\0\0\0\0\0\1\0\30\0017\360\6\0\0\0002\0P\0\0"..., 132,
0, NULL, 0) = 132
sendto(21,
"\0\0\0\n\2\0\0|\0\0\0\4\0\0\0\0\0\1\0\30\1<\360\6\0\0\0\21\0P\0\0"...,
132, 0, NULL, 0) = 132
open("/cache2/pool6/squid/13/AB",
O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 379
getdents(379, /* 254 entries */, 32768) = 8112
getdents(379, /* 0 entries */, 32768) = 0
close(379) = 0
read(165, "!", 256) = 1
----------------------------------------------------------------------------------------------------------------------------------------------------
Squid freezes at this point.
Please advise.
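[The last calls before the pause open a cache directory and read its entries with getdents(), which points at a scan of the on-disk cache rather than network trouble. One way to confirm — commands shown only as a sketch; the worker PID and device names will differ on your system — is to watch per-disk utilisation across the hour mark and attach strace in counting mode:]

```
# watch per-disk utilisation once per second; look for a spike in
# %util on the cache disks at the moment the pause begins
iostat -x 1

# attach to the Squid worker (the second squid PID) in summary mode;
# interrupt with Ctrl-C after the pause to get the syscall counts
strace -c -p <worker-pid>
```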

Thank you
Elie
Received on Tue Nov 29 2011 - 08:44:08 MST
