Re: ICAP connections under heavy loads

From: Alexander Komyagin <komyagin_at_altell.ru>
Date: Thu, 30 Aug 2012 12:47:02 +0400

On Wed, 2012-08-29 at 10:27 -0600, Alex Rousskov wrote:

>
>
> > Corresponding Xaction
> > objects are not destructed after client request timeout (I use 5 secs
> > for httperf requests)
>
> I am not intimately familiar with httperf (we use Web Polygraph), but I
> assume that httperf immediately closes any timed out connection and
> Squid client-side code promptly notices the timeout. Please correct me
> if my assumptions are wrong.

Exactly. Connections are closed and Squid notices that correctly.
What puzzles me is that the HttpStateData async job is created in
forward.cc using the httpStart(fwd) function:

void
httpStart(FwdState *fwd)
{
    debugs(11, 3, "httpStart: \"" <<
           RequestMethodStr(fwd->request->method) << " " <<
           fwd->entry->url() << "\"");
    AsyncJob::Start(new HttpStateData(fwd));
}

And I do not see any way for the request forwarder to control that job!

For example, let's assume that our ICAP service successfully checked the
request, but we are unable to connect to ICAP in order to check the
reply (to simulate this, just make the Xaction openConnection() method a
NOP if it was initiated by HttpStateData).

This way, after the client timeout occurs, Squid properly detects it,
but nothing is done to stop the HttpStateData job.

In the real world, Xaction will use noteCommTimeout to handle this case
and destroy the job; but it still looks strange to me that Squid doesn't
abort the HttpStateData job when the client is gone.
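
To illustrate the kind of control meant here, below is a minimal sketch
in which the forwarder keeps a weak handle to the job it started and
aborts it when the client goes away. The Job and Forwarder classes and
their methods are hypothetical stand-ins invented for this example; they
are not Squid's actual HttpStateData, FwdState, or AsyncJob APIs.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for an async job such as HttpStateData.
class Job {
public:
    // Stop the job; only the first abort reason is recorded.
    void abort(const std::string &reason) {
        if (!done_) {
            done_ = true;
            reason_ = reason;
        }
    }
    bool done() const { return done_; }
    const std::string &reason() const { return reason_; }

private:
    bool done_ = false;
    std::string reason_;
};

// Hypothetical stand-in for the request forwarder.
class Forwarder {
public:
    // Remember the started job without owning it.
    void start(const std::shared_ptr<Job> &job) {
        assert(job);
        job_ = job;
    }

    // Called when the client connection is gone.
    void noteClientGone() {
        // lock() yields nullptr if the job has already been destroyed,
        // so a late notification is harmless.
        if (const auto job = job_.lock())
            job->abort("client gone");
    }

private:
    std::weak_ptr<Job> job_;
};
```

The weak handle is a deliberate choice in this sketch: the forwarder can
reach the job while it exists, but does not keep it alive after it ends
on its own.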

>
> If ICAP Xaction stays alive long after the corresponding HTTP client
> transaction is gone, then this could be a bug or deficiency. However,
> please note that Squid tries to relay all available information to the
> ICAP service (in case the ICAP service is logging it or needs it for
> some other important reason -- not all services just check for viruses).
> If an ICAP transaction still has things to do, it may outlive the
> corresponding HTTP transaction.
>
> We could make this "try as hard as you can to relay information to ICAP"
> behavior conditional on whether the service is "essential" or "optional"
> but a separate, dedicated setting may be warranted: I can imagine an
> optional logging or leak detection ICAP service that does not want to
> kill transactions if it is not working but still wants to receive all
> HTTP messages even if their corresponding HTTP clients are gone.
>
>
> > causing a lot of noteCommRead and noteCommWrote
> > exceptions at the same time (in 2-4 mins after the test) - when r/w
> > operations on the socket time out.
>
> That sounds normal to me (because your ICAP service is not responsive in
> this case).
>
>
> > As a consequence, Squid leaks FD's (for a while)
>
> By "leaks FD's (for a while)" do you mean that Squid uses more FDs than
> it would if the ICAP service was working? Or that there is actually an
> FD loss? The former is expected. The latter would be a bug.

>
> > and meaninglessly switches icap status in minutes after the test.
>
> Why do you describe a "down" status of an overloaded ICAP service as
> "meaningless"? The status sounds appropriate to me! Again, I do not know
> much about httperf internals, but in a real world (or a corresponding
> Polygraph test), the ICAP service may be always overwhelmed if there is
> too much traffic so a "down" state would be warranted in many such cases
> (Squid, of course, cannot predict whether the service overload is
> temporary).
>

I think it's better to detect the ICAP "down" state as early as we can,
and certainly not minutes after the load peak. The thing is that within
those 6 minutes the ICAP service becomes responsive again, and then
Squid eventually catches dozens of exceptions and marks ICAP as down.
And that is a mistake, because ICAP is OK now!

Correct me if I'm wrong, but a proper way to fix the issue would be to
deny Xaction activity until the connection is fully established, since
the problem is not in the read/write operations but in the connection
opening.
This way the Xaction object will detect the connection failure after the
connection timeout, as it is supposed to.
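
As a rough sketch of what "fully established" could mean at the socket
level, the following uses plain POSIX calls: start a non-blocking
connect(), wait for writability with poll(), and then read SO_ERROR to
learn whether the handshake succeeded. The names startConnect,
waitEstablished and ConnState are made up for this example; this is not
how Squid's ConnOpener is implemented, only the general technique.

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

enum class ConnState { Established, Failed, TimedOut };

// Start a non-blocking TCP connect. Returns the socket FD, or -1 on an
// immediate error. A returned FD is NOT necessarily connected yet.
int startConnect(const sockaddr_in &addr)
{
    const int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    const int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        close(fd);
        return -1;
    }

    if (connect(fd, reinterpret_cast<const sockaddr *>(&addr),
                sizeof(addr)) == 0)
        return fd; // connected immediately
    if (errno == EINPROGRESS)
        return fd; // handshake still in progress (e.g. SYN_SENT)

    close(fd);
    return -1;
}

// Wait until the pending connect on fd either completes or fails.
// No read/write should be scheduled before this returns Established.
ConnState waitEstablished(int fd, int timeoutMsec)
{
    assert(fd >= 0);

    pollfd pfd{};
    pfd.fd = fd;
    pfd.events = POLLOUT;

    int rc;
    do {
        rc = poll(&pfd, 1, timeoutMsec);
    } while (rc < 0 && errno == EINTR);

    if (rc == 0)
        return ConnState::TimedOut; // still not established
    if (rc < 0)
        return ConnState::Failed;

    // The socket became writable or reported an error; SO_ERROR tells
    // which of the two happened.
    int err = 0;
    socklen_t len = sizeof(err);
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0 || err != 0)
        return ConnState::Failed;

    return ConnState::Established;
}
```

With this split, a connection that stays in SYN_SENT shows up as
TimedOut at connect time rather than as read/write timeouts minutes
later.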

>
>
> > In Xaction new connections are created with ConnOpener job. ConnOpener
> > sets connection FD iff he _really_ thinks the connection is now
> > established (comm_connect_addr returned COMM_OK). But according to
> > `netstat` that connection was always in SYN_SENT state. Maybe I just
> > missed the point where it became ESTABLISHED. I will try to check it
> > later.
>
> Since Squid does not work on TCP packet level, it may think that the
> connection is "open" when, in fact, it is not fully established. In
> Squid context, isOpen() means that we can do things like schedule I/O
> for that connection or extract peer address. It does not mean
> ESTABLISHED in TCP sense.
>
>
> HTH,
>
> Alex.
>

-- 
Best wishes,
Alexander Komyagin
Received on Thu Aug 30 2012 - 08:50:48 MDT
