Re: [PATCH] Bug 3281

From: Amos Jeffries <squid3_at_treenet.co.nz>
Date: Fri, 02 Sep 2011 18:29:58 +1200

On 02/09/11 04:55, Alex Rousskov wrote:
> On 09/01/2011 05:47 AM, Amos Jeffries wrote:
>> This patch seeks to avoid race conditions between the AsyncCalls for
>> the pconn monitoring handlers and the lookups pulling a pconn up for use.
>
> ...
>
>> It adds checks for the timeout handler, as discussed in bug 3281,
>> preventing the related type of race from the timeout handler.
>
>> + // our connection timeout handler is scheduled to run already. unsafe
>> + if (fd_table[conn->fd].timeoutHandler == NULL)
>> + return false;
>
> I am not strongly against this, but it would be more elegant and may be
> more efficient to let the about-to-timeout connections be reused
> instead. This can be safely done, for example, by canceling the pending
> timeout notification callback in the connection-giving code.

Cancelling calls here would mean re-writing IdleConnList to store the
calls so they can be cancelled. As it stands right now, once a call is
scheduled there is no code outside the AsyncEngine with a pointer to it.
  This patch would still be needed after all that change, but it would
cancel and return true instead of just returning false.
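
For illustration, roughly what that would entail (the timeoutCalls_
member and the index handling below are hypothetical, just to show the
shape, not the current IdleConnList API):

    // hypothetical extra state: remember each scheduled timeout call,
    // kept in step with the idle connection list
    std::vector<AsyncCall::Pointer> timeoutCalls_;

    // when handing a connection out for re-use, cancel its pending
    // timeout call so the AsyncEngine drops it instead of firing it
    AsyncCall::Pointer &call = timeoutCalls_[index];
    if (call != NULL) {
        call->cancel("IdleConnList: connection handed out for re-use");
        call = NULL;
    }

With that bookkeeping in place the give-out code could cancel and carry
on, rather than having the availability check reject the connection.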

I'll post a new patch shortly. With isAvailable() protecting against
the unsafe re-use cases, the handlers on both read and timeout can now
unconditionally clean up and close the idle conn. The read callback may
not even need to bother with the CLOSING flag, but I think the forced
fd_table[] close cycle on shutdown (bug 2110) might still cause that case.
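
For reference, the guard ends up shaped roughly like this (member names
are approximate and the real patch may differ in detail):

    // sketch of the isAvailable() check for one idle entry
    bool
    IdleConnList::isAvailable(int i) const
    {
        const Comm::ConnectionPointer &conn = theList_[i];

        // connection already gone, or a close is in progress
        if (!Comm::IsConnOpen(conn) || fd_table[conn->fd].closing())
            return false;

        // comm clears timeoutHandler when it schedules the timeout
        // call, so NULL here means our timeout callback is already
        // in flight. unsafe to hand this connection out.
        if (fd_table[conn->fd].timeoutHandler == NULL)
            return false;

        return true;
    }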

>
> Thank you,
>
> Alex.
> P.S. Higher level rant: We have an increasing amount of code devoted to
> checking whether some Comm information is already "in flight" and trying
> to predict what it will be or trying to react to it before it comes to
> us. Most of that code accesses internal Comm structures. I understand
> that this code is being created to address specific problems, but the
> direction of this worries me.
>
> We need to think of a better (uniform and encapsulated) way to handle
> the asynchronous nature of Comm-Job relationship or we will end up with
> too many similar-but-different and half-working checks in all Comm-using
> Jobs. We have been there before with the old spaghetti code (for
> slightly different reasons); let's try not to fall into the same trap.
>
> I do not have a well-formulated suggestion for the above high-level
> problem, but I may be able to spend some time on it after Rock Store is
> in. Meanwhile, every time you write fd_table[] outside of Comm, please
> pause and ask yourself whether there is a better way of doing it :-).

Agreed. I'm pretty sure accessing fd_table all over the place is one
part of the CPU cache-busting performance problem we have.

I was experimenting with this earlier. The best solution does seem to
be pro-active and rigorous cancel() everywhere relevant, which means
speeding up the SourceLayout conversion to use polished Async calls
everywhere.
  Doxygen can help us there by showing where each member of fd_table is
used (provided we have set it to auto-define ALL the component macros).

  *** No more wrapper transitional functions. Code which requires them
has no access to the Call in order to cancel() it, nor enough Async
support to be optimized well. If you look closely, that is the code
leading to these races more often than not, followed by upgraded code
like IdleConnList which forgets the calls it has registered.
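
To make the contrast concrete (the handler names and the exact
commSetTimeout/JobCallback signatures below are from memory and partly
placeholders, so treat them as approximate):

    // transitional wrapper style: the AsyncCall is built deep inside
    // comm from a C function pointer, so the caller never holds a
    // Pointer it could cancel() later:
    commSetTimeout(conn->fd, idleTimeout, IdleConnList_Timeout, this);

    // polished Async style: we build the Call ourselves, keep the
    // Pointer, and can cancel() it when the conn is handed out:
    typedef CommCbMemFunT<IdleConnList, CommTimeoutCbParams> TimeoutDialer;
    AsyncCall::Pointer timeoutCall =
        JobCallback(48, 5, TimeoutDialer, this, IdleConnList::Timeout);
    commSetTimeout(conn->fd, idleTimeout, timeoutCall);
    // ... later, when the connection is handed out for re-use:
    timeoutCall->cancel("pconn handed out for re-use");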

I suspect, but am not sure, that if the cancel() code for AsyncCall
were also to have its Params drop any RefCount pointers etc. they hold,
we could reduce the in-flight RAM usage a bit essentially for free.
  That would also shift the CPU time spent in the relevant destructors
out of the AsyncEngine, where it drops the cancelled calls, into the
code doing the cancelling, possibly giving us a slightly more accurate
performance measure.
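
Very roughly, and purely as a sketch of the idea (the
dropDialerResources() hook below does not exist today):

    bool
    AsyncCall::cancel(const char *reason)
    {
        debugs(debugSection, debugLevel, HERE << "will not call " <<
               name << " because " << reason);
        isCanceled = reason;

        // hypothetical hook: let the dialer drop its RefCount'ed
        // Params (Comm::ConnectionPointer, buffers, ...) right now,
        // in the canceller's context, instead of waiting for the
        // AsyncCallQueue to pop and destroy the dead call later
        dropDialerResources();

        return false;
    }
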
  Looks like a major deep change though.

Amos

-- 
Please be using
   Current Stable Squid 2.7.STABLE9 or 3.1.15
   Beta testers wanted for 3.2.0.10