Re: ICAP connections under heavy loads

From: Alex Rousskov <rousskov@measurement-factory.com>
Date: Mon, 10 Sep 2012 09:06:46 -0600

On 09/09/2012 02:30 AM, Amos Jeffries wrote:

> IIRC the race is made worse by our internal code handling the timeout
> event (async) and then scheduling a call (async), doubling the async
> queue delay. This shows up worst on heavily loaded proxies.

True. Unfortunately, the alternative is to deal with ConnOpener's sync
call reentering the ConnOpener object. We all know what that eventually
leads to. I would rather have double async calls (and optimize the
general queue handling) than deal with long chains of reentrant calls,
especially when handling an exceptional event such as a timeout.
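
To make that two-hop sequence concrete, here is a minimal illustrative
sketch (the handler body below is a simplification, not the actual
ConnOpener code, and the 0.05s delay is arbitrary):

    // Hop 1: the timeout notification is itself delivered through the
    // async queue before this handler runs.
    void
    Comm::ConnOpener::timeout(const CommTimeoutCbParams &)
    {
        // Hop 2: instead of reentering this object synchronously, queue
        // the retry as another call; it waits in the queue a second time.
        eventAdd("Comm::ConnOpener::DelayedConnectRetry",
                 Comm::ConnOpener::DelayedConnectRetry, this, 0.05, 0, false);
    }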

I am sure this can be optimized further by writing reentrant code in
performance-critical places, but we have much bigger fish to fry at the
moment.

> The patch would be this:
>
> === modified file 'src/comm/ConnOpener.cc'
> --- src/comm/ConnOpener.cc 2012-08-28 19:12:13 +0000
> +++ src/comm/ConnOpener.cc 2012-09-09 08:20:09 +0000
> @@ -139,6 +139,10 @@
> // it never reached fully open, so cleanup the FD handlers
> // Note that comm_close() sequence does not happen for partially open FD
> Comm::SetSelect(temporaryFd_, COMM_SELECT_WRITE, NULL, NULL, 0);
> + if (calls_.earlyAbort_ != NULL) {
> + calls_.earlyAbort_->cancel("Comm::ConnOpener::doneConnecting");
> + calls_.earlyAbort_ = NULL;

The assignment above is redundant because we do the same thing
immediately below:

> + }
> calls_.earlyAbort_ = NULL;
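
With that assignment dropped, the resulting cleanup sequence (assembled
here from the diff above, for clarity) would read:

    // it never reached fully open, so cleanup the FD handlers
    // Note that comm_close() sequence does not happen for partially open FD
    Comm::SetSelect(temporaryFd_, COMM_SELECT_WRITE, NULL, NULL, 0);
    if (calls_.earlyAbort_ != NULL)
        calls_.earlyAbort_->cancel("Comm::ConnOpener::doneConnecting");
    calls_.earlyAbort_ = NULL;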

>>> I think from all this discussion the ConnOpener::DelayedConnectRetry
>>> handler needs to be investigated as well.

>> Agreed. Why is that handler reusing the same failed socket? Shouldn't
>> we close the failed socket descriptor upon detecting a failure and then
>> open a new one just before trying again?
>
> Hmm. Not sure. This is a straight copy of the older code.
>
> If you want a new socket FD most of what start() does needs to be
> re-done for it. Except that the start timer needs to be left unchanged,

Exactly.

> and the async calls need to be switched to the new FD (would that be a
> reassignment, or a cancel + replace on the new FD?).

Ideally, when we fix this, we should avoid attaching any callbacks that
may outlive a temporary FD to that temporary FD. For example, the
timeout callback should be converted into an event, which is not a
trivial change. AFAICT we do not have any other such callbacks (the
earlyAbort one is specific to the temporary FD and can stay), but I have
not checked closely.
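
For illustration, such a conversion might schedule the timeout the same
way the retry pause already is. This is only a sketch: ConnectTimeout (a
static EVH wrapper) and connectTimeout_ are assumed names, and a real
patch would also have to cancel or ignore the event once the connection
completes:

    // Sketch only: make the connect timeout an event owned by this
    // ConnOpener rather than a callback attached to temporaryFd_.
    // NB: a real patch must also protect `this` (e.g., via cbdata)
    // while the event sits in the queue.
    eventAdd("Comm::ConnOpener::ConnectTimeout",
             Comm::ConnOpener::ConnectTimeout, this,
             static_cast<double>(connectTimeout_), 0, false);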

Thank you,

Alex.

>> Here is the sketch for connect retries on failures, as I see it:
>>
>>     default:
>>         ++failRetries_; // TODO: rename to failedTries
>>
>>         // we failed and have to start from scratch
>>         ... close the temporary socket ...
>>
>>         // if we exhausted all retries, bail with an error
>>         if (failRetries_ > Config.connect_retries) {
>>             ... send COMM_ERR_CONNECT to the caller ...
>>             return;
>>         }
>>
>>         // we can still retry after a small pause
>>         // we might time out before that pause ends, but that is OK
>>         eventAdd("Comm::ConnOpener::DelayedConnectRetry", ...)
>>
>> More changes would be needed to make this work correctly:
>>
>> * the timeout handler should not be associated with the socket
>> descriptor (since it may change). We can use eventAdd instead.
>>
>> * the Comm::ConnOpener::DelayedConnectRetry() wrapper should call a
>> method that will open a new socket.
>
> Sounds okay to me.
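
To tie the pieces together, here is a consolidated approximation of that
retry path in C++. It is a sketch under stated assumptions, not a
finished patch: retryOrBail(), closeTemporarySocket(), and sendAnswer()
are hypothetical names, the 0.05s pause is arbitrary, and passing `this`
across the event queue would need cbdata protection in real code:

    void
    Comm::ConnOpener::retryOrBail()
    {
        ++failRetries_; // TODO: rename to failedTries

        // we failed and have to start from scratch
        closeTemporarySocket(); // hypothetical: close temporaryFd_, drop its handlers

        // if we exhausted all retries, bail with an error
        if (failRetries_ > Config.connect_retries) {
            sendAnswer(COMM_ERR_CONNECT); // hypothetical: notify the caller
            return;
        }

        // we can still retry after a small pause;
        // we might time out before that pause ends, but that is OK
        eventAdd("Comm::ConnOpener::DelayedConnectRetry",
                 Comm::ConnOpener::DelayedConnectRetry, this, 0.05, 0, false);
    }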