RE: [squid-users] Squid Crashed - then ran out of file descriptors after restart

From: Justin Lawler <jlawler_at_amdocs.com>
Date: Mon, 7 Nov 2011 05:01:15 +0000

Thanks Amos.

Checked that ' mgr:filedescriptors' - wasn't aware of this functionality before.

95% of the file descriptors have the Description ' Waiting for next request'

---------------------------------------------------------------------------------
21 Socket 40 222* 5237 10.1.5.53:53279 Waiting for next request
  22 Socket 30 264* 5237 10.1.5.53:53029 Waiting for next request
  23 Socket 95 266* 540530 10.1.5.53:54489 Waiting for next request
  24 Socket 110 321* 539782 10.1.5.54:53322 Waiting for next request
  25 Socket 49 306* 5344 10.1.5.53:53435 Waiting for next request
  26 Socket 29 295* 130535 10.1.5.54:51444 Waiting for next request
---------------------------------------------------------------------------------

This is in a performance lab, with a load injector hitting squid, trying to simulate a production system. Could it be a problem with the load injector not closing connections? Could the load injector have opened many connections when squid was unresponsive (because ICAP was unresponsive), and those connections never got closed?

Thanks,
Justin


-----Original Message-----
From: Amos Jeffries [mailto:squid3_at_treenet.co.nz]
Sent: Monday, November 07, 2011 8:14 AM
To: squid-users_at_squid-cache.org
Subject: RE: [squid-users] Squid Crashed - then ran out of file descriptors after restart

 On Sun, 6 Nov 2011 10:52:41 +0000, Justin Lawler wrote:
> Thanks Amos for the explanation. Apologies for the lack of clarity.
>
> FYI - we have ICAP connection set up to be a 'critical' service.
>
> Do you know if the squid ICAP functionality has changed between 3.0 &
> 3.1? we were not seeing some of these issues previously. For instance,
> if the ICAP server went down previously - after the ICAP timeout
> (icap_io_timeout), squid clients would just receive 500 responses for
> all queued connections (as seen in the squid access logs), effectively
> limiting the number of connections to be queued. Are you saying that
> *all* client connections will now be queued - even after the timeout?

 All client requests are queued anyway. The only difference is the length of queue vs processing time vs duration of timeouts. Normally these are nowhere near colliding, with quick processing and long timeout. Under spike conditions the 500 happens when the queue fills or timeout gets hit. When the FD count runs low Squid leaves new connections in the TCP buffers and they can cannot connect errors instead.

 The only big change I can think of between 3.0 and 3.1 is that 3.0 queued as LIFO and serviced new traffic fastest. Leaving old to timeout and die. 3.1 fixed that and services FIFO queues of all clients, spreading the lag more evenly across the set.

 3.1 has additional capabilities in ICAP service so its possibly different. You will have to contact the ICAP developers (at
 measurement-factory.com) for particular details in this area.


> If ICAP server went down for a long period, and connections kept on
> being made to squid, the number of queued connections would be very
> big. This could easily stop squid from being responsive for a long
> time, or completely - would this be correct?

 Yes. After a long period you should see the 500's with a timeout error page start appearing in 3.1 as well AFAIK.

>
> Note - we are still seeing very high file descriptor usage in squid
> still - even 3 hours after the restarts. Currently file descriptor is
> still about 3.2k, and has never gone much below this number. Would it
> take this long to rebuild the journal?

 Maybe. Has cache.log mentioned swap rebuild completion or still periodically logging a "N done" messages? I've seen a 200GB caches take a day or so to rebuild.

>
> I'm just noticing the 'TIME_WAIT' connections between squid & ICAP
> server is also very high - above 2k - and has been like this for the
> last 30 minutes. Is this anything to worry about? The number of
> 'ESTABLISHED' connections never goes above 50.

 This could just be a sign that your services are processing more than
 40 requests per second over the last few minutes.
  Or that TCP is holding the sockets in TIME_WAIT a bit long (can be
 dangerous to touch that, so careful there).
  Or that Squid was holding a lot of persistent connections from the
 traffic spike and released them recently (~15 minutes in TIME_WAIT is
 normal IIRC). Although ESTABLISHED being low at the start indicates its
 probably not this.


>
> Queue Congestion Errors:
> 2011/11/06 08:03:07| squidaio_queue_request: WARNING - Queue
> congestion
>

 That is Disk I/O queue. From the cache scan.
 http://wiki.squid-cache.org/KnowledgeBase/QueueCongestion


 The squid "filedescriptors" management report (squidclient
 mgr:filedescriptors) lists what each FD is used for, you can check it
 for how many of those ~3-4K are for disk files and how many for network
 sockets.

 Amos


> -----Original Message-----
> From: Amos Jeffries
>
> On 6/11/2011 9:15 p.m., Justin Lawler wrote:
>> Hi,
>>
>> We're running squid 3.1.16 on solaris on a sparc box.
>>
>> We're running it against an ICAP server, and were testing some
>> scenarios when ICAP server went down, how squid would handle it. After
>> freezing the ICAP server, squid seemed to have big problems.
>
> For reference the exected behaviour is this:
>
> ** if squid is configured to allow bypass of the ICAP service
> --> no noticable problems. possibly faster response time for
> clients.
>
> ** if squid is configured not to bypass on failures (ie critical ICAP
> service)
> --> New connections continue to be accepted.
> --> All traffic needing ICAP halts waiting recovery, RAM and FD
> consumption rises until available resources are full.
> --> On ICAP recovery the traffic being held gets sent to it and
> service resumes as the results come back.
>
>>
>> Once it was back up again, it kept on sending OPTION requests to the
>> server - but squid itself became completely unresponsive. It wouldn't
>> accept any further requests, you couldn't use squidclient against it
>> or doing a squid reconfigure, and was not responding to 'squid -k
>> shutdown', so had to be manually killed with a 'kill -9'.
>
> This description is not very clear. You seem to use "it" torefer to
> several different things in first sentence of paragraph 2.
>
> Apparently:
> * "it" comes back up again. ... apparently refering to ICAP? >>>>>
> JL (yes)
> * "it" sends OPTION requests ... apparently referring to Squid now?
> or to some unmentioned backend part of the ICAP service? >>>>> JL
> (referring to squid)
> * squid itself is unresponsive .... waiting for queued requets to
> get through ICAP and the network fetch stages perhapse? noting that
> ICAP may be slowed as it faces teh spike or waiting traffic from
> Squid.
>
>
>>
>> We then restarted the squid instance, and it started to go crazy,
>> file
>> descriptors reaching the limit (4096 - previously it never went
>> above
>> 1k during long
>
> "kill -9" causes Squid to terminate before savign teh cache index or
> closing the journal properly. Thus on restart the journal is
> discovered corrupt and a "DIRTY" rebuild is begun. Scanning the
> entire
> disk cache object by object to rebuild the index and journa contents.
> This can consume a lot of FD, for a period of time proportional to
> the
> size of your disk cache(s).
>
> Also, clients can hit Squid with a lot of connections that
> accumulated during the outage. Which each have to be processed in
> full, including all lookups and tests. Immediately. This startup
> spike
> is normal immediately after a start/restart or reconfigure when all
> the active running state is erased and requires rebuilding.
>
> The lag problems and resource/queue overloads can be expected to drop
> away relatively quickly as the nromal running state gets rebuilt from
> the new traffic. The FD consumption from cache scan will disappear
> abruptly when that process completes.
>
>> stability test runs), and a load of 'Queue Congestion' errors in the
>> logs. Tried to restart it again, and it seemed to behave better then,
>> but still the number of file descriptors is very big (above 3k).
>
> Any particular queue mentioned?
>
>
> Amos



This message and the information contained herein is proprietary and confidential and subject to the Amdocs policy statement,
you may review at http://www.amdocs.com/email_disclaimer.asp
Received on Mon Nov 07 2011 - 05:01:27 MST

This archive was generated by hypermail 2.2.0 : Mon Nov 07 2011 - 12:00:03 MST