Re: ICAP connections under heavy loads

From: Amos Jeffries <squid3_at_treenet.co.nz>
Date: Tue, 11 Sep 2012 11:51:44 +1200

On 11.09.2012 03:06, Alex Rousskov wrote:
> On 09/09/2012 02:30 AM, Amos Jeffries wrote:
>
>> IIRC the race is made worse by our internal code going timeout event
>> (async) then scheduling a call (async), so doubling the async queue
>> delay. Which shows up worst on heavily loaded proxies.
>
> True. Unfortunately, the alternative is to deal with ConnOpener's sync
> call reentering the ConnOpener object. We all know what that eventually
> leads to. I would rather have double async calls (and optimize the
> general queue handling) than deal with long chains of reentrant calls,
> especially when it comes to dealing with an exceptional event such as
> a timeout.
>
> I am sure this can be optimized further by writing reentrant code in
> performance-critical places, but we have much bigger fish to fry at
> the moment.
>
>
>> The patch would be this:
>>
>> === modified file 'src/comm/ConnOpener.cc'
>> --- src/comm/ConnOpener.cc 2012-08-28 19:12:13 +0000
>> +++ src/comm/ConnOpener.cc 2012-09-09 08:20:09 +0000
>> @@ -139,6 +139,10 @@
>> // it never reached fully open, so cleanup the FD handlers
>> // Note that comm_close() sequence does not happen for partially open FD
>> Comm::SetSelect(temporaryFd_, COMM_SELECT_WRITE, NULL, NULL, 0);
>> + if (calls_.earlyAbort_ != NULL) {
>> + calls_.earlyAbort_->cancel("Comm::ConnOpener::doneConnecting");
>> + calls_.earlyAbort_ = NULL;
>
> The above line is extra because we do that immediately below:
>
>> + }
>> calls_.earlyAbort_ = NULL;
>
>
>
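
Right. With the duplicate removed, that hunk collapses to just this
(untested sketch):

    // it never reached fully open, so cleanup the FD handlers
    // Note that comm_close() sequence does not happen for partially open FD
    Comm::SetSelect(temporaryFd_, COMM_SELECT_WRITE, NULL, NULL, 0);
    // cancel the earlyAbort callback still attached to the temporary FD
    if (calls_.earlyAbort_ != NULL)
        calls_.earlyAbort_->cancel("Comm::ConnOpener::doneConnecting");
    calls_.earlyAbort_ = NULL;
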
>>>> I think from all this discussion the ConnOpener::DelayedConnectRetry
>>>> handler needs to be investigated as well.
>
>>> Agreed. Why is that handler reusing the same failed socket? Should not
>>> we close the failed socket descriptor upon detecting a failure and then
>>> open a new one just before trying again?
>>
>> Hmm. Not sure. This is a straight copy of the older code.
>>
>> If you want a new socket FD, most of what start() does needs to be
>> re-done for it. Except that the start timer needs to be left unchanged,
>
> Exactly.
>
>> and the async calls need to be switched to the new FD (would that be a
>> reassignment, or a cancel + replace?), i.e. cancelled before replacing
>> on the new FD.
>
> Ideally, when we fix this, we should avoid attaching any callbacks that
> may outlive a temporary FD to that temporary FD. For example, the
> timeout callback should be converted into an event, which is not a
> trivial change. AFAICT we do not have any other such callbacks (the
> earlyAbort one is specific to the temporary FD and can stay), but I
> have not checked closely.

Agreed. Using eventAdd() for the timeout is a great idea IMO.
Unfortunately eventAdd() does not accept AsyncCalls (yet) and would
require another wrapper doing an async re-schedule. On the plus side,
that means both sides of the race face the same lag handicap.
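
To make the idea concrete, the wrapper I have in mind would look roughly
like this. Untested sketch only: ConnectTimeout and noteTimeout are
placeholder names, and connectTimeout_ stands for whatever timeout value
the job already holds:

    // hypothetical static EVH wrapper, matching the EVH signature (void*)
    void
    Comm::ConnOpener::ConnectTimeout(void *data)
    {
        // 'data' is the ConnOpener passed to eventAdd() below
        if (ConnOpener *cs = static_cast<ConnOpener *>(data)) {
            // re-enter the job via the async queue rather than calling the
            // timeout handling directly, so it stays a proper AsyncCall
            cs->noteTimeout(); // placeholder job method scheduling the call
        }
    }

    // ... and at connection-start time, instead of attaching the timeout
    // callback to temporaryFd_ (cbdata locking details glossed over):
    eventAdd("Comm::ConnOpener::ConnectTimeout",
             Comm::ConnOpener::ConnectTimeout, this,
             static_cast<double>(connectTimeout_), 0 /* weight */);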

>>> Here is the sketch for connect retries on failures, as I see it:
>>>
>>> default:
>>>     ++failRetries_; // TODO: rename to failedTries
>>>
>>>     // we failed and have to start from scratch
>>>     ... close the temporary socket ...
>>>
>>>     // if we exhausted all retries, bail with an error
>>>     if (failRetries_ > Config.connect_retries) {
>>>         ... send COMM_ERR_CONNECT to the caller ...
>>>         return;
>>>     }
>>>
>>>     // we can still retry after a small pause
>>>     // we might timeout before that pause ends, but that is OK
>>>     eventAdd("Comm::ConnOpener::DelayedConnectRetry", ...)
>>>
>>> More changes would be needed to make this work correctly:
>>>
>>> * the timeout handler should not be associated with the socket
>>> descriptor (since it may change). We can use eventAdd instead.
>>>
>>> * the Comm::ConnOpener::DelayedConnectRetry() wrapper should call a
>>> method that will open a new socket.
>>
>> Sounds okay to me.
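
Pulling that sketch and the new-socket point together, the wrapper side
is roughly what I picture below. Untested, and restartConnection() is a
placeholder name for a method that re-does the socket-opening half of
start() (fresh FD, fresh FD-attached handlers) while leaving the overall
timer and the failRetries_ counter alone:

    // placeholder sketch only, not committed code
    void
    Comm::ConnOpener::DelayedConnectRetry(void *data)
    {
        if (ConnOpener *cs = static_cast<ConnOpener *>(data)) {
            // do not reuse the failed descriptor: open a fresh socket and
            // re-attach its FD-specific handlers before connecting again
            cs->restartConnection(); // hypothetical method, see above
        }
    }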

I don't have strong views on this, but is it better to display a hard
ERR_CONNECT or a soft(er) TIMEOUT in these cases? i.e. is it actually a
good idea to remove the timeout if-case?

Amos