Re: [PATCH] Move timeout handling from ConnOpener::connect to ConnOpener::timeout

From: Alex Rousskov <rousskov_at_measurement-factory.com>
Date: Wed, 16 Jan 2013 22:53:09 -0700

On 01/16/2013 09:58 PM, Amos Jeffries wrote:

> Whet I was trying to point out is that we have three race conditions
> between timeout() and connect() Calls...
>
> * connect() happens first and runs re-schedules while timeout() is queued.
> - this is the case we are now ignoring. I'm fine with that.

Yes (a failed connect wins and retries).

> * connect() happens first and runs doneConnection() while timeout() is
> queued.

Yes (a failed connect wins and ends ConnOpener job).

> - in this case doneConnecting() will cancel the timeout() queued Call.

There is no need to cancel any pending calls to ConnOpener. The async
call will simply not be delivered to the destroyed ConnOpener job. What
we do need is to cleanup anything we have allocated (that cleanup should
not be in the timeout callback or, at least, that cleanup should not be
exclusive to the timeout callback because the callback may never be called).

> * timeout() happens first and runs doneConnecting() while connect() is
> queued.

Yes (timeout wins and ends ConnOpener job).

> - we are still connecting because the connect() Call has not run.

Yes, that is possible. We should tell Comm to stop connecting (if Comm
still does), to avoid wasting resources, but everything ought to work
correctly even if we do not tell Comm.

> - we have no handle to the connect() AsyncCall stored so we can't
> cancel it directly (bug?).

We do not need to cancel it (just like with the timeout call discussed
above). The async call will not be delivered to the destroyed ConnOpener
job).

> - Therefore the connect() call when it is run must ...

it will not run. Once the job ends, no callbacks can reach the
[destroyed or invalidated] job object. The ConnOpener job should end
when we call the callback or when there is no longer need to call the
callback. If the ConnOpener job does not end under any of these
conditions, the bug is in doneAll().

> So we have a choice of:
> a) hold a pointer in the calls_ struct to the latest connect() call
> queued. And cancel it in doneConnecting() or when in-progress schedules
> a new one.
> or
> b) status-quo: check for doneAll() explicitly inside
> ConnOpener::connect() function and abort/return if it says the job
> should have exited already.

Neither should be needed AFAICT, as discussed above.

>> Rainer, the stillConnecting() check should not be needed. Nothing should
>> break if the code requesting a connection has quit while we were
>> waiting. The bzr log message for adding that check is "Save come CPU
>> cycles scheduling useless calls". Personally, I suspect we might be
>> losing more cycles on this useless check than on an occasional useless
>> call, but the common case may be environment-dependent.
>
> It prevents connect() function scheduling further connect() Calls after
> timeout() has already sent the callback and the related Job is dead/dying.

ConnOpener will not exist or will not be reachable by async calls "after
timeout() has already sent the callback". Thus, there will be no further
connect calls originated by that job.

>> Amos, are you OK with us removing that optimization during this rewrite?
>
> No. I'm not clear on what happens if the Job is supposedly exiting by
> one of the Calls keeps re-scheduling itself as an I/O handler. The need
> for actually adding the optimization in the first place was IIRC that we
> had a lot of Calls scheduled disregarding the Job status and hogging
> memory the whole time.

If that was true, the bug was elsewhere. For example, perhaps we were
not using async calls to reach the job, and so we lacked the protection
that the AsyncJob API offers.

> I'm more in favour of this patch doing the (a) option above which will
> make it clear that all Calls are cancelled explicitly and nothing bad
> could possibly go wrong.

I agree that it is tempting to cancel everything explicitly, especially
given how old code worked, but it results either in us masking bugs
(when we think we are under AsyncJob protection, but we are not, as has
happened with ConnOpener at some point long time ago IIRC) or simply
increases code complexity and wastes CPU/developer cycles (and the
useless checks then get copied, spreading complexity/waste throughout
the code).

Ending the job is enough to guarantee that it will not be called via an
async job call:

> bool
> JobDialer<Job>::canDial(AsyncCall &call)
> {
> if (!job)
> return call.cancel("job gone");
...

The (!job) condition checks that the cbdata-protected job pointer is
still valid. The done() job cannot be valid because it is destroyed at
the end of the async call that finished it:

> void AsyncJob::callEnd()
> {
> if (done()) {
> swanSong();
> delete this; // this is the only place where the object is deleted
...

After the above, there is no job and no calls will get through because
the cbdata pointer will be invalidated by "delete this".

HTH,

Alex.
Received on Thu Jan 17 2013 - 05:53:19 MST

This archive was generated by hypermail 2.2.0 : Thu Jan 17 2013 - 12:00:08 MST