Re: ICAP connections under heavy loads

From: Alex Rousskov <rousskov_at_measurement-factory.com>
Date: Thu, 30 Aug 2012 08:29:31 -0600

On 08/30/2012 02:47 AM, Alexander Komyagin wrote:

> What makes me wonder is that HttpStateData async job is created in
> forward.cc using httpStart(fwd) function:
>
> void
> httpStart(FwdState *fwd)
> {
>     debugs(11, 3, "httpStart: \"" <<
>            RequestMethodStr(fwd->request->method) << " " <<
>            fwd->entry->url() << "\"");
>     AsyncJob::Start(new HttpStateData(fwd));
> }
>
> And I do not see any way for the request forwarder to control that job!

Yes, the relationship among client-side, FwdState, and server-side is
weird and wrong on many counts. With some minor exceptions, there is no
direct control of the server-side from the client-side and vice versa.

Client-side-driven termination of server-side jobs is usually triggered
via StoreEntry aborts, but HttpState does not receive StoreEntry abort
notifications (a design bug) -- it checks for them when something else
happens. This is something we should improve.
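
To make the distinction concrete, here is a minimal stand-alone sketch
(not Squid code; Entry, ServerJob, and the method names are all made up)
contrasting the current "check for the abort when something else
happens" behavior with the push-style abort notification that a fix
could add:

#include <functional>
#include <iostream>

struct Entry {
    bool aborted = false;
    std::function<void()> abortCallback;     // push-style notification hook

    void abort() {
        aborted = true;
        if (abortCallback)
            abortCallback();                 // tell the server-side job at once
    }
};

struct ServerJob {
    Entry &entry;
    explicit ServerJob(Entry &e) : entry(e) {}

    // Current style: the job only notices the abort when some other
    // event (a read, a write, a timeout) makes it look at the entry.
    void onSomeOtherEvent() {
        if (entry.aborted) {
            std::cout << "abort noticed late, quitting\n";
            return;
        }
        // ... otherwise keep forwarding ...
    }

    // Fixed style: the entry calls this as soon as the client side aborts.
    void onStoreAbort() {
        std::cout << "abort noticed immediately, quitting\n";
    }
};

int main() {
    Entry entry;
    ServerJob job(entry);
    entry.abortCallback = [&job] { job.onStoreAbort(); };
    entry.abort();                           // client-side-driven termination
    job.onSomeOtherEvent();                  // polling path, for comparison
}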

> For example, let's assume that our ICAP service successfully checked the
> request but we are unable to connect to ICAP in order to check the reply
> - just do a NOP in the Xaction openConnection() method if it was
> initiated by HttpStateData.
>
> This way, after a client timeout occurs, Squid properly detects it, but
> nothing is done to stop the HttpStateData job.

Which may be the desired outcome in some cases (e.g., quick_abort caching
and an "I want to see everything" ICAP service).

> In the real world, Xaction will use noteCommTimeout to handle this case
> and destroy the job; but it still looks strange to me that Squid doesn't
> abort the HttpStateData job when the client is gone.

It is complicated because Squid has to balance the needs of each single
client with the needs of ICAP services and other clients (including
future ones if caching is enabled).

>>> and meaninglessly switches icap status in minutes after the test.
>>
>> Why do you describe a "down" status of an overloaded ICAP service as
>> "meaningless"? The status sounds appropriate to me! Again, I do not know
>> much about httperf internals, but in the real world (or a corresponding
>> Polygraph test), the ICAP service may always be overwhelmed if there is
>> too much traffic, so a "down" state would be warranted in many such cases
>> (Squid, of course, cannot predict whether the service overload is
>> temporary).

> I think it's better to detect the ICAP "down" state as early as we can,
> and certainly not minutes after the load peak. The thing is that within
> those 6 minutes the ICAP service becomes responsive again, and then Squid
> eventually catches dozens of exceptions and turns ICAP down. And that is
> a mistake, because ICAP is OK now!

Some of the up/down decision logic is configurable, and if you have
suggestions on how to improve it, let's discuss them! The trick here is
to maximize the number of common ICAP service failures that the
algorithm can handle in a reasonable fashion, without adding 10 more
configuration options.
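
For example, one direction worth discussing is making the "down"
decision depend on how recent the failures are (which, if I recall
correctly, is roughly what the optional "in <time>" suffix of
icap_service_failure_limit is meant to help with), so that a brief
overload peak does not poison the status minutes later. A minimal
stand-alone sketch of such a sliding-window failure counter (not Squid
code; ServiceHealth and its methods are made-up names):

#include <chrono>
#include <deque>
#include <iostream>

class ServiceHealth {
public:
    ServiceHealth(int limit, std::chrono::seconds window)
        : limit_(limit), window_(window) {}

    void noteFailure(std::chrono::steady_clock::time_point now) {
        failures_.push_back(now);
        prune(now);
    }

    bool down(std::chrono::steady_clock::time_point now) {
        prune(now);
        return static_cast<int>(failures_.size()) >= limit_;
    }

private:
    void prune(std::chrono::steady_clock::time_point now) {
        // forget failures that happened outside the sliding window
        while (!failures_.empty() && now - failures_.front() > window_)
            failures_.pop_front();
    }

    int limit_;
    std::chrono::seconds window_;
    std::deque<std::chrono::steady_clock::time_point> failures_;
};

int main() {
    using namespace std::chrono;
    ServiceHealth health(10, seconds(30));
    const auto start = steady_clock::now();

    // ten failures during a short overload peak -> "down"
    for (int i = 0; i < 10; ++i)
        health.noteFailure(start + seconds(i));
    std::cout << "during peak: "
              << (health.down(start + seconds(10)) ? "down" : "up") << "\n";

    // minutes later the old failures have aged out -> "up" again
    std::cout << "minutes later: "
              << (health.down(start + minutes(5)) ? "down" : "up") << "\n";
}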

> Correct me if I'm wrong, but a proper way to fix the issue would be to
> deny Xaction activity until the connection is fully established, since
> the problem is not in the read/write operations but in the connection
> opening.

What do you mean by "deny Xaction activity"? I do not think current ICAP
transactions do much while they wait for the connection to be
established. They just accumulate virgin data and wait. What should they
do instead?
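
In other words, while the connection is pending, the only useful thing a
transaction can do is buffer. A minimal stand-alone sketch of that
"accumulate and wait" pattern (not Squid code; the class and method
names are made up):

#include <iostream>
#include <string>

class IcapXactionSketch {
public:
    void noteVirginBody(const std::string &bytes) {
        buffer_ += bytes;                    // always safe: just accumulate
        if (connected_)
            flush();
    }

    void noteConnected() {                   // called once the socket is up
        connected_ = true;
        flush();
    }

private:
    void flush() {
        if (!buffer_.empty()) {
            std::cout << "writing " << buffer_.size() << " buffered bytes\n";
            buffer_.clear();
        }
    }

    bool connected_ = false;
    std::string buffer_;
};

int main() {
    IcapXactionSketch x;
    x.noteVirginBody("GET / HTTP/1.1\r\n");  // arrives before connect completes
    x.noteConnected();                        // buffered data is flushed now
}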

> This way the Xaction object will detect the connection failure after
> the connection timeout, as it is supposed to.

If ICAP transactions do not detect connection timeout now, it is a bug
that we should fix.
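
The usual shape of such a fix is to arm a connect timeout when the
attempt starts and to route its expiry through the same cleanup path as
any other connect failure. A minimal stand-alone sketch (not Squid code;
ConnectAttempt and its callbacks are made-up names):

#include <functional>
#include <iostream>

struct ConnectAttempt {
    std::function<void()> onConnected;
    std::function<void()> onFailure;         // shared by errors and timeouts
    bool finished = false;

    void succeed() { if (!finished) { finished = true; onConnected(); } }
    void fail()    { if (!finished) { finished = true; onFailure(); } }

    // Called by a timer scheduled when the attempt started.
    void timeoutExpired() {
        std::cout << "connect timeout fired\n";
        fail();                              // same cleanup path as any other failure
    }
};

int main() {
    ConnectAttempt attempt;
    attempt.onConnected = [] { std::cout << "connected, start REQMOD/RESPMOD\n"; };
    attempt.onFailure   = [] { std::cout << "give up, bypass or report an error\n"; };

    attempt.timeoutExpired();                // simulate the timer firing first
    attempt.succeed();                       // late success is ignored
}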

Thank you,

Alex.
Received on Thu Aug 30 2012 - 14:30:07 MDT
