Re: [PATCH] Move timeout handling from ConnOpener::connect to ConnOpener::timeout

From: Amos Jeffries <squid3_at_treenet.co.nz>
Date: Thu, 17 Jan 2013 17:58:06 +1300

On 17/01/2013 7:31 a.m., Alex Rousskov wrote:
> On 01/15/2013 07:39 AM, Rainer Weikusat wrote:
>>>> internal method named 'stillConnecting' is introduced which
>>>> contains the check for an 'abort done by the parent' which is presently
>>>> in ConnOpener::connect,
>>>>
>>>> // our parent Jobs signal abort by cancelling their callbacks.
>>>> if (callback_ == NULL || callback_->canceled())
>>>> return;
>>>>
>>>> This method is called from both ::connect and ::timeout to determine
>>>> if the state of the connect request is still undecided by the time
>>>> either of both methods executes. This guards against the case that
>>>> both ::connect and ::timeout are scheduled before one of them
>>>> actually runs and detects when the 'other side' of a full
>>>> connection has already gone away or is in the process of going away
>>>> (and hence, the already existing 'result state' should be kept).
>>> The stillConnecting() mechanism should not be needed at all. The
>>> doneAll() method tests should be what is used to determine whether
>>> the Job is still running or not.
>> These two lines of code are presently at the top of
>> ConnOpener::connect. If it should really be ConnOpener::doneAll there,
>> than, the new method is obviously redundant, but that's something I
>> don't know (since I didn't wrote the code containing the two slightly
>> different 'still something to do' checks and if in doubt, I'd rather
>> assume that there was a reason for that).
>
> Amos, the done() protection does not replace stillConnecting() because
> the job does not know yet that it is done. The automatic done() check
> happens _after_ the async call into the job, not before that call.
> [Perhaps done() check should be performed before and after, but we
> currently do it after because, in part, some callbacks have side effects
> not covered by done() that require us to call them.]

There is a significant overlap between the ConnOpener doneAll() and the
new stillConnecting().

Whet I was trying to point out is that we have three race conditions
between timeout() and connect() Calls...

* connect() happens first and runs re-schedules while timeout() is queued.
   - this is the case we are now ignoring. I'm fine with that.

* connect() happens first and runs doneConnection() while timeout() is
queued.
  - in this case doneConnecting() will cancel the timeout() queued Call.
The Dialer skips cancelled Calls correct? so we have no need to check if
we are still connecting because we have to be for the timeout Call to be
non-canceled. Therefore use of stillConnecting() is a waste of cycles in
timeout().

  * timeout() happens first and runs doneConnecting() while connect() is
queued.
  - we are still connecting because the connect() Call has not run.
  - we have no handle to the connect() AsyncCall stored so we can't
cancel it directly (bug?).
  - Therefore the connect() call when it is run must still check for
doneAll() details (Dialer did not do it beforehand as you say). However,
note that the previous timeout() Call resulted in doneConnecting()
clearing all the state and doneAll() after the earlier timeout()
presented true.

So we have a choice of:
  a) hold a pointer in the calls_ struct to the latest connect() call
queued. And cancel it in doneConnecting() or when in-progress schedules
a new one.
or
  b) status-quo: check for doneAll() explicitly inside
ConnOpener::connect() function and abort/return if it says the job
should have exited already.

>
> Rainer, the stillConnecting() check should not be needed. Nothing should
> break if the code requesting a connection has quit while we were
> waiting. The bzr log message for adding that check is "Save come CPU
> cycles scheduling useless calls". Personally, I suspect we might be
> losing more cycles on this useless check than on an occasional useless
> call, but the common case may be environment-dependent.

It prevents connect() function scheduling further connect() Calls after
timeout() has already sent the callback and the related Job is dead/dying.

>
> Amos, are you OK with us removing that optimization during this rewrite?

No. I'm not clear on what happens if the Job is supposedly exiting by
one of the Calls keeps re-scheduling itself as an I/O handler. The need
for actually adding the optimization in the first place was IIRC that we
had a lot of Calls scheduled disregarding the Job status and hogging
memory the whole time.

I'm more in favour of this patch doing the (a) option above which will
make it clear that all Calls are cancelled explicitly and nothing bad
could possibly go wrong.

Amos
Received on Thu Jan 17 2013 - 04:58:14 MST

This archive was generated by hypermail 2.2.0 : Thu Jan 17 2013 - 12:00:08 MST