Re: [PATCH] Move timeout handling from ConnOpener::connect to ConnOpener::timeout

From: Rainer Weikusat <rweikusat_at_mobileactivedefense.com>
Date: Thu, 17 Jan 2013 17:56:49 +0000

Alex Rousskov <rousskov_at_measurement-factory.com> writes:
> On 01/16/2013 09:58 PM, Amos Jeffries wrote:

[...]

>> * timeout() happens first and runs doneConnecting() while connect() is
>> queued.
>
> Yes (timeout wins and ends ConnOpener job).

It is possible to 'do better' here fairly easily, see more detailed
explanation below.

[...]

>> - we have no handle to the connect() AsyncCall stored so we can't
>> cancel it directly (bug?).
>
> We do not need to cancel it (just like with the timeout call discussed
> above). The async call will not be delivered to the destroyed ConnOpener
> job).

I'm rather in favour of cancelling the calls, at least where easily
possible, because this makes the debugging/ diagnostic output more
easy to understand.

[...]

> Ending the job is enough to guarantee that it will not be called via an
> async job call:

Taking this into account and after having spent some more time
thinking on this, I think what the timeout handler should do is,

General interjection: I'm also of the opinion that there are really
three 'related-but-different' issues here, namely 1| fix timeouts, 2|
move timeout invocation from 'file descriptors' to 'events' in order
to be able to do 3| fix connect_retries. Mixing these three together
is messy, especially considering the effects on SCM commits/ commit
logs and it is (according to my experience), also more difficult to do
three things correctly at the same time than to do one after another,
with the option to reason about each change based on the code already
including the previous one. Because of this, I'm convinced that these
three things should be three (semi-)independent patches.

[back to 'what the timeout handler should ...]

1. Clear calls_.timeout_. This stops doneConnecting from cancelling a
timeout after it was already executed and would be beneficial for
connect, details below.

2. Check if temporaryFd_ is still valid. At the moment, this is always
the case, but it would prudent wrt general 'defensive programming' and
the check will become necessary for change 3|.

2a. If this was the case, check if the write handler of the
corresponding fde is still set.

2a1. If not, connect was scheduled but didn't run yet. In this case,
timeout should exit w/o doing anything else as 'overwriting' a
successful connect or a more specific error with an 'enforced
ETIMEDOUT' would be kind-of stupid. In case connect later detects a
transient error and would otherwise retry (change 3|), calls_.timeout_
could be tested: If it is NULL, the timeout handler ran and retries
should stop, otherwise, they should go on.

2a2. If the write handler was still set, the memory allocated for
write_data needs to be freed in order to avoid leaking a ConnOpener
object, at least with the present code (confirmed backwards and
forwards via valgrind, ie, that the leak exists and that this
addition causes it to go away).

2b. Cancel the early_abort call because this makes the diagnostics
easier to understand and the comment in front of doneConnecting says
it should be done.

3. Issue the 'traditional' 'took too long to respond' diagnostic.

4. Invoke doneConnecting to 'signal' the timeout.

Attached is a patch which does this (in addition to moving the timeout
code to ::timeout). Presently, it is comment-free: I tend to avoid
comments because they make the code itself more difficult to read and
I've been bitten by copied'n'pasted code where people changed the code
to do something more-or-less completely different while leaving the
comments as they were often enough already (this happens even in 'big
& reputable' open source projects like Linux). I would be willing to
add the 'rationale' above, though, if deemed feasible.

Received on Thu Jan 17 2013 - 17:57:05 MST

This archive was generated by hypermail 2.2.0 : Fri Jan 18 2013 - 12:00:07 MST