[PREVIEW] Retry requests that failed due to a pconn race

From: Alex Rousskov <rousskov_at_measurement-factory.com>
Date: Mon, 13 Feb 2012 23:03:09 -0700

Hello,

    This has been discussed before (see the old thread quoted below),
and the bump-ssl-server-first feature is being bitten by this especially
badly.

The attached patch may not apply to trunk "as is" right now, but the
changes should be clearly visible, and I would like to port it there and
get it committed if there are no better ideas for the fix.

Technical summary:

The ERR_ZERO_SIZE_OBJECT errors were visible to the client when the
destination had only one address because serverDestinations.shift()
made the list of destination empty and startConnectionOrFail() failed.

The new "wasIdle" state is remembered in the Connection object instead
of FwdState itself because we need to maintain that information for a
pinned connection as well (and FwdState is not preserved between a
request that pins a connection and a request that uses the pinned
connection).

This may need more work. I am not sure the FTP and Gopher cases are
correct. The current code assumes that those protocols do not have pconn
[races].

Please review.

Thank you,

Alex.

Old thread titled "server-side pconn races no longer retried after
r10057.1.8":

On 09/21/2011 08:27 AM, Alex Rousskov wrote:
> On 09/20/2011 11:43 PM, Amos Jeffries wrote:
>> On 21/09/11 16:56, Alex Rousskov wrote:
>>> On 09/20/2011 05:57 PM, Amos Jeffries wrote:
>>>>
>>>> Basing it on the error type/code/something would be better.
>>>>
>>>> There are three sets of cases feeding into this function (maybe more if
>>>> other races exist).
>>>> a) server error responses 5xx/403. new path required before
>>>> connectStart().
>>>> b) TCP connection failures. new path required before connectStart().
>>>> c) this pconn race. same path can be passed back to connectStart()
>
>
>>>> At this point it should have (local.GetPort()> 0) from a previously
>>>> open connection (c or a) [NP: fd will be -1 already]. So checking for
>>>> the zero-sized error AND a present local port we could unset the local
>>>> port, and call connectStart() without the shift().
>>>
>>> A positive local.GetPort() check only tells us whether we had used a
>>> connection, not whether we had reused a persistent connection, right?
>>
>> Yes it tells us the conn was previously successful at opening. So I
>> believe it should be the distinguishing difference between (a) and (b)
>> cases. The ZERO_SIZED error differentiates between (c) and (a|b).
>
>
> Not exactly. I believe a ZERO_SIZED error does not indicate a pconn
> race. It implies that there was a connection, so a positive
> local.GetPort() check would not even be needed if we have ZERO_SIZED
> error, but it is insufficient to indicate a pconn race.
>
>
>
>>> How about adding an explicit flag.avoidPconn[NextTime] or similar and
>>> setting it as sketched below?
>>>
>>> FwdState::connectStart():
>>> ...
>>> fwdPconnPool->pop(..., !flag.avoidPconn&& checkRetriable());
>>> flag.avoidPconn = we are reusing pconn;
>>
>> The problem was the last pconn, we may have N perfectly fine ones to use
>> on the list.
>
> Correct. However, we may also have N bad ones (imagine a server that was
> just rebooted, silently dropping all our N pconns) and we do not want to
> iterate any of them to find out: The user suffered (has been waiting)
> enough already. Other users will test those N-1 remaining pconns, if any.
>
> Moreover, even RFC 2616 implies that one should not reuse a second pconn
> for the same request if the first one failed:
>
>> Client software SHOULD reopen the
>> transport connection and retransmit the aborted sequence of requests
>> without user interaction
>
> (reopen implies opening a new connection)
>
>> The automatic retry SHOULD NOT
>> be repeated if the second sequence of requests fails.
>
> (the above implies that if the second pconn fails due to the same benign
> reason, we SHOULD NOT retry and would have to server the error to the
> user; we want to avoid that possibility)
>
>
>
>>> And any time we change destination:
>>>
>>> serverDestinations.shift();
>>> flag.avoidPconn = false;
>>>
>>> As the above sketch illustrates, the same flag can then be used in
>>> FwdState::connectStart() to _avoid_ using a pconn if we are retrying the
>>> same destination.
>>>
>>> I do not like that we would have to remember to reset this flag every
>>> time we shift the serverDestinations array. Perhaps that logic should be
>>> moved into a dedicated FwdState method.
>>
>> Yes, a flag is possible if we need to go that way. I think we can get
>> the same information from local port>0 for free without any state
>> maintenance overheads though. And we only need to use it at this one
>> point where the cases are clear.
>
> My understanding is that local port>0 does not mean we have used a
> pconn, with or without a ZERO_SIZED error. Please correct me if I am wrong.
>
>
> Thank you,
>
> Alex.

Received on Tue Feb 14 2012 - 06:03:34 MST

This archive was generated by hypermail 2.2.0 : Tue Feb 14 2012 - 12:00:06 MST