Re: ICAP connections under heavy loads

From: Alex Rousskov <rousskov@measurement-factory.com>
Date: Thu, 06 Sep 2012 09:40:58 -0600

On 09/05/2012 05:48 PM, Amos Jeffries wrote:
> On 06.09.2012 02:30, Alex Rousskov wrote:
>> On 09/05/2012 03:32 AM, Alexander Komyagin wrote:
>>> On Tue, 2012-09-04 at 09:16 -0600, Alex Rousskov wrote:
>>
>>>> Again, I hope that this trick is not needed to solve your problem,
>>>> and I am worried that it will cause more/different problems
>>>> elsewhere. I would recommend fixing ConnOpener instead. If _that_
>>>> is not sufficient, we can discuss appropriate low-level
>>>> enhancements.
>>>
>>> Both things shall be fixed, IMHO.
>>
>> If double-connect is not needed, we should not introduce it. AFAICT,
>> double-connect is an attempt to cope with a bug in higher-level code. We
>> should fix that bug and see if that is sufficient. If it is not
>> sufficient, we should evaluate why and only then add appropriate
>> low-level code if needed.
>
> It's there to fix an async race condition of Squid vs TCP:

I assume by "it" you mean the existing ConnOpener code rather than the
proposed comm.cc changes.

> 1) the timeout occurs ... timeout AsyncCall is scheduled.
> 2) the socket's TCP connect completes ... completed AsyncCall is
> scheduled.
> 3) timeout call is handled:
>   a) if the FD is ready, use it and terminate the job early as a
> success (side effect: cancel the pending completed Call);
>   b) else, terminate the job early as a failure.
> 4) the completed call is dropped when it is handled.
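
To make that sequence concrete, here is a toy, self-contained sketch
(hypothetical names and a plain FIFO queue, not the real AsyncCall API)
showing how both calls can be queued before either one runs:

    // Toy model of the race (hypothetical names; not the real
    // AsyncCall API): both calls can be queued before either runs.
    #include <deque>
    #include <functional>
    #include <iostream>

    int main() {
        std::deque<std::function<void()>> asyncQueue; // stand-in call queue
        bool tcpCompleted = false; // "would select() report the FD ready?"

        // step 1: the timeout fires first, so its call is queued first
        asyncQueue.push_back([&] {
            if (tcpCompleted)
                std::cout << "timeout handler: FD ready; claim success\n"; // 3a
            else
                std::cout << "timeout handler: fail the job\n"; // 3b
        });

        // step 2: TCP completes while the timeout call is still queued
        tcpCompleted = true;
        asyncQueue.push_back([&] {
            std::cout << "completed handler: dropped; job is gone\n"; // 4
        });

        // steps 3 and 4: queued calls are handled in FIFO order
        while (!asyncQueue.empty()) {
            asyncQueue.front()();
            asyncQueue.pop_front();
        }
        return 0;
    }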

If the above optimization(?) is the only reason for the current
ConnOpener behavior, then we should remove this complexity. I see no
good reason to handle this race condition specially -- the ConnOpener
timeout handler should inform the caller about the error and close. This
approach:

  * is simple
  * is correct (it honors the timeout setting)
  * does not slow down the 99.99% of cases where there is no race

Compare this to the current code, which:

  * is complex
  * is buggy (this email thread is essentially a bug report)
  * does not slow down the 99.99% of cases where there is no race

Or to the proposed code (after it is polished), which:

  * is complex
  * is correct
  * slows down the 99.99% of cases where there is no race
  * may add portability headaches if we trust Stevens' opinion

> If you can find a cleaner way to rewrite the ConnOpener::connect() and
> comm_connect_addr() call chain for the timeout case it should be
> duplicated and polished into the timeout() handler.

Yes, what I proposed (timeout handler times out; connect method
connects) is a cleaner way. However, I do not think it requires any code
duplication. In fact, it will _remove_ some code and make the remaining
code simpler.
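
Roughly, the shape I have in mind looks like this (a compilable toy
with hypothetical member names, not a patch against the current tree):

    // Sketch only: the timeout handler reports the error and closes;
    // the connect handler reports success; neither handles the other's
    // case, so the race needs no special code.
    #include <iostream>

    struct ConnOpenerSketch {
        void sendAnswer(const char *verdict) {
            std::cout << "caller notified: " << verdict << "\n";
        }
        void closeFd() { std::cout << "fd closed\n"; }

        void timeout() {       // honor the configured timeout, period
            sendAnswer("TIMEOUT");
            closeFd();         // any pending completed call is cancelled
        }

        void connectDone() {   // only ever a genuine completion
            sendAnswer("OK");
        }
    };

    int main() {
        ConnOpenerSketch opener;
        opener.timeout(); // the racy case: just an error, no FD rescue
        return 0;
    }

The point is that each handler deals only with its own event; the
timeout path needs no FD inspection at all.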

Do we have any evidence that these connect/timeout races are both (a)
happening frequently in typical environments and (b) best handled by
establishing the connection? If the races are frequent and the admin
wants the timed-out connections to be established, should not the
timeout be increased instead?

N.B. It is questionable whether Comm itself should even allow such
races. Comm may either cancel I/O handlers (with an ERRor) for
timed-out descriptors or fire all timeout callbacks before selecting
for I/O, but that is a bigger question/change than the one discussed here.
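
To illustrate the second option, here is a toy loop (hypothetical
names, not Squid's actual comm loop): if every expired timeout fires
before we select for I/O, a descriptor can never deliver both a
timeout and an I/O callback from the same wait:

    // Toy event loop: phase 1 fires expired timeouts and retires the
    // descriptor; phase 2 would select() only the survivors, so a
    // timed-out fd cannot also produce an I/O callback.
    #include <chrono>
    #include <functional>
    #include <map>

    using Clock = std::chrono::steady_clock;

    struct MiniLoop {
        std::map<int, std::function<void()>> timeoutHandlers; // fd -> handler
        std::map<int, Clock::time_point> deadlines;           // fd -> deadline

        void runOnce() {
            const auto now = Clock::now();
            // phase 1: fire every expired timeout first
            for (auto it = deadlines.begin(); it != deadlines.end();) {
                if (it->second <= now) {
                    timeoutHandlers[it->first](); // tell the owner it timed out
                    timeoutHandlers.erase(it->first);
                    it = deadlines.erase(it);     // fd no longer selected for I/O
                } else {
                    ++it;
                }
            }
            // phase 2: select()/poll() the remaining descriptors (omitted)
        }
    };

    int main() {
        MiniLoop loop;
        loop.deadlines[42] = Clock::now(); // already expired
        loop.timeoutHandlers[42] = [] { /* inform the job, close the fd */ };
        loop.runOnce(); // fd 42 times out; it can never get an I/O call
        return 0;
    }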

HTH,

Alex.