Re: [PATCH] Move timeout handling from ConnOpener::connect to ConnOpener::timeout

From: Alex Rousskov <rousskov_at_measurement-factory.com>
Date: Thu, 17 Jan 2013 14:23:21 -0700

On 01/17/2013 10:56 AM, Rainer Weikusat wrote:
> Alex Rousskov writes:
>> On 01/16/2013 09:58 PM, Amos Jeffries wrote:
>>> * timeout() happens first and runs doneConnecting() while connect() is
>>> queued.
>>
>> Yes (timeout wins and ends ConnOpener job).
>
> It is possible to 'do better' here fairly easily, see more detailed
> explanation below.

No, it is not. It is possible to spend a lot of time on writing,
reviewing, and fixing complicated code that optimizes a case that
virtually never happens. Correctly detecting this race and properly
handling it is not easy and is not useful.

>>> - we have no handle to the connect() AsyncCall stored so we can't
>>> cancel it directly (bug?).
>>
>> We do not need to cancel it (just like with the timeout call discussed
>> above). The async call will not be delivered to the destroyed ConnOpener
>> job.
>
> I'm rather in favour of cancelling the calls, at least where easily
> possible, because this makes the debugging/diagnostic output easier
> to understand.

We should, of course, clear the stored timeout (or I/O) callback after
that callback has been fired. Without that, swanSong() would try to
clear a fired callback, and that would be confusing.

I do not have the guts to object to us calling the cancel() method for
the non-fired callback that is no longer relevant (as long as it is done
consistently). However, I do not see how that improves debugging in
the ConnOpener case because the callback will know that the job is gone
without our explicit cancellation. And I am sure somebody will
eventually think that calling cancel has some useful side-effects such
as closing file descriptors or freeing some memory (it does not).
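
For illustration, the consistent pattern I could live with looks
roughly like this (a sketch only; the exact member spelling and the
reason string are my assumptions, not code from the patch):

    // this call never fired and is no longer relevant: cancel() only
    // marks it so that its dialer is skipped at delivery time; it does
    // not close descriptors and does not free job memory
    if (calls_.earlyAbort_ != NULL) {
        calls_.earlyAbort_->cancel("ConnOpener: no longer relevant");
        calls_.earlyAbort_ = NULL; // nothing is scheduled from now on
    }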

>> Ending the job is enough to guarantee that it will not be called via an
>> async job call:
>
> Taking this into account and after having spent some more time
> thinking on this, I think what the timeout handler should do is,
>
> 1. Clear calls_.timeout_.

Yes. calls.timeout must reflect whether there is a timeout notification
scheduled (from our job's point of view).
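
In other words, the first thing the handler should do is forget the
fired call. A minimal sketch (the handler signature is the usual Comm
one; everything beyond the quoted member name is illustrative):

    void
    ConnOpener::timeout(const CommTimeoutCbParams &)
    {
        // this call has fired; from the job's point of view, no
        // timeout notification is scheduled anymore
        calls_.timeout_ = NULL;

        // ... end the job here; swanSong() does the actual fd cleanup
    }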

> 2. Check if temporaryFd_ is still valid. At the moment, this is always
> the case, but it would be prudent wrt general 'defensive programming' and
> the check will become necessary for change 3.

I do not know what 'valid' means in this context, but temporaryFd_ and
earlyAbort must reflect whether there is Comm I/O scheduled (from our
job's point of view).

> 2a. If this was the case, check if the write handler of the
> corresponding fde is still set.

If I/O is scheduled, we should stop it, close the descriptor, and free
memory using Comm APIs. This condition should probably be detected only
in swanSong(), to avoid code duplication. The job should exit after a
timeout, of course.

Accessing Comm write handler pointers in some low-level table should be
avoided if at all possible. If Comm APIs need to be adjusted to handle
this better, we can adjust them.

Amos, do you remember what "partially open FD" in the following comment
meant, exactly? Why can't we just call comm_close() to clean up?

> // it never reached fully open, so cleanup the FD handlers
> // Note that comm_close() sequence does not happen for partially open FD

In your latest patch, please move fd cleanup from timeout() and from
doneConnecting() into a dedicated method and call that method from
swanSong() if temporaryFd_ is still set there. Do not call it from
doneConnecting() or anywhere else. Remember that a job may end for
reasons other than some timeout or I/O. swanSong() is the cleanup method.

Please do not forget to set calls.earlyAbort to NULL after canceling it
in the fd cleanup method.
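
Roughly, I would expect something along these lines (a sketch under
the assumptions above; the helper name and the exact low-level calls
are mine, not the patch's):

    // forgets all Comm registrations tied to the half-open socket
    void
    ConnOpener::cleanFd()
    {
        if (temporaryFd_ < 0)
            return; // no Comm I/O is scheduled from our point of view

        // stop waiting for connect(2) completion
        Comm::SetSelect(temporaryFd_, COMM_SELECT_WRITE, NULL, NULL, 0);

        // the earlyAbort close handler never fired; cancel and forget it
        if (calls_.earlyAbort_ != NULL) {
            comm_remove_close_handler(temporaryFd_, calls_.earlyAbort_);
            calls_.earlyAbort_->cancel("ConnOpener::cleanFd");
            calls_.earlyAbort_ = NULL;
        }

        // whatever closing a partially open FD requires goes here
        close(temporaryFd_);
        temporaryFd_ = -1;
    }

    void
    ConnOpener::swanSong()
    {
        if (temporaryFd_ >= 0) // the job ends with I/O still scheduled
            cleanFd();
        AsyncJob::swanSong();
    }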

> + ptr = static_cast<Pointer *>(F->write_data);
> + delete ptr;

This is not going to work well. F->write_data now points to deleted
memory. If you absolutely have to do this, please set it to NULL after
delete.
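
That is, something like:

    ptr = static_cast<Pointer *>(F->write_data);
    delete ptr;
    F->write_data = NULL; // no dangling pointer left in the fde entry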

Thank you,

Alex.