Re: server-side pconn races no longer retried after r10057.1.8

From: Alex Rousskov <rousskov_at_measurement-factory.com>
Date: Wed, 21 Sep 2011 08:27:10 -0600

On 09/20/2011 11:43 PM, Amos Jeffries wrote:
> On 21/09/11 16:56, Alex Rousskov wrote:
>> On 09/20/2011 05:57 PM, Amos Jeffries wrote:
>>>
>>> Basing it on the error type/code/something would be better.
>>>
>>> There are three sets of cases feeding into this function (maybe more if
>>> other races exist).
>>> a) server error responses 5xx/403. new path required before
>>> connectStart().
>>> b) TCP connection failures. new path required before connectStart().
>>> c) this pconn race. same path can be passed back to connectStart()

>>> At this point it should have (local.GetPort() > 0) from a previously
>>> open connection (c or a) [NP: fd will be -1 already]. So checking for
>>> the zero-sized error AND a present local port, we could unset the local
>>> port and call connectStart() without the shift().
>>
>> A positive local.GetPort() check only tells us whether we had used a
>> connection, not whether we had reused a persistent connection, right?
>
> Yes it tells us the conn was previously successful at opening. So I
> believe it should be the distinguishing difference between (a) and (b)
> cases. The ZERO_SIZED error differentiates between (c) and (a|b).

Not exactly. I believe a ZERO_SIZED error does not indicate a pconn
race. It implies that there was a connection, so a positive
local.GetPort() check would not even be needed if we have a ZERO_SIZED
error, but the error alone is insufficient to indicate a pconn race.

>> How about adding an explicit flag.avoidPconn[NextTime] or similar and
>> setting it as sketched below?
>>
>> FwdState::connectStart():
>> ...
>> fwdPconnPool->pop(..., !flag.avoidPconn && checkRetriable());
>> flag.avoidPconn = we are reusing pconn;
>
> The problem was the last pconn, we may have N perfectly fine ones to use
> on the list.

Correct. However, we may also have N bad ones (imagine a server that was
just rebooted, silently dropping all our N pconns), and we do not want to
iterate through any of them to find out: the user has suffered (has been
waiting) enough already. Other users will test those N-1 remaining
pconns, if any.

Moreover, even RFC 2616 implies that one should not reuse a second pconn
for the same request if the first one failed:

> Client software SHOULD reopen the
> transport connection and retransmit the aborted sequence of requests
> without user interaction

(reopen implies opening a new connection)

> The automatic retry SHOULD NOT
> be repeated if the second sequence of requests fails.

(the above implies that if the second pconn fails for the same benign
reason, we SHOULD NOT retry and would have to serve the error to the
user; we want to avoid that possibility)

>> And any time we change destination:
>>
>> serverDestinations.shift();
>> flag.avoidPconn = false;
>>
>> As the above sketch illustrates, the same flag can then be used in
>> FwdState::connectStart() to _avoid_ using a pconn if we are retrying the
>> same destination.
>>
>> I do not like that we would have to remember to reset this flag every
>> time we shift the serverDestinations array. Perhaps that logic should be
>> moved into a dedicated FwdState method.
>
> Yes, a flag is possible if we need to go that way. I think we can get
> the same information from local port>0 for free without any state
> maintenance overheads though. And we only need to use it at this one
> point where the cases are clear.

My understanding is that local port > 0 does not mean we have used a
pconn, with or without a ZERO_SIZED error. Please correct me if I am wrong.
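For concreteness, the flag-based approach sketched earlier could look
roughly like this. This is a minimal, self-contained sketch: the type
names, the avoidPconn member, and the useNextDestination()/notePconnReused()
helpers are hypothetical stand-ins for illustration, not actual Squid code.

```cpp
#include <cassert>
#include <deque>
#include <string>

// Hypothetical stand-ins for Squid internals; real FwdState members differ.
struct Flags {
    bool avoidPconn = false; // set after reusing a pconn; cleared on a new path
};

struct FwdState {
    std::deque<std::string> serverDestinations; // candidate paths, in order
    Flags flag;

    // Mirrors the pop() condition in the connectStart() sketch:
    // a pconn may be reused only if we are not retrying the same
    // destination after a pconn failure.
    bool mayReusePconn(bool retriable) const {
        return !flag.avoidPconn && retriable;
    }

    // Record that the connection we just obtained came from the pool,
    // so a retry of the same destination opens a fresh connection.
    void notePconnReused() { flag.avoidPconn = true; }

    // The dedicated method suggested above: advance to the next
    // destination and reset the flag in one place, so callers cannot
    // forget to do the reset when shifting serverDestinations.
    void useNextDestination() {
        serverDestinations.pop_front();
        flag.avoidPconn = false;
    }
};
```

With this shape, a pconn race on the current destination disables pconn
reuse for the retry, while moving to a new destination re-enables it.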

Thank you,

Alex.
Received on Wed Sep 21 2011 - 14:27:57 MDT