Re: server-side pconn races no longer retried after r10057.1.8

From: Amos Jeffries <squid3_at_treenet.co.nz>
Date: Wed, 21 Sep 2011 11:57:09 +1200

 On Tue, 20 Sep 2011 16:35:02 -0600, Alex Rousskov wrote:
> Hello,
>
> I am getting ERR_ZERO_SIZE_OBJECT during v3.2 performance tests.
> I
> believe it is caused by Squid server-side failing when the server
> closes
> a persistent connection before reading our second request. We used to
> retry in those cases, as RFC 2616 prescribes.
>
> It looks like we stopped retrying after the following change:
>
>> revno: 10057.1.8
>> branch nick: cleanup-comm
>> timestamp: Wed 2010-05-19 23:28:21 +1200
>> message:
>> Comm restructure part 2 - outbound connections
>
> The relevant change in FwdState::retryOrBail() logic is below. First,
> we
> unconditionally remove the server already tried:
>
>> + paths.shift(); // last one failed. try another.
>
> Then we are retrying only using the next hop, if any:
>
>> + if (paths.size() > 0) {
> ... retry using the next path ...
>> + }
>> + // else bail. no more paths possible to try.
>
> The old code used to retry using the current server if there were no
> other servers to try.
>
> When we are dealing with a pconn race condition, we should not throw
> the
> current server/path/next hop/whatever away. We should retry by
> opening a
> fresh connection the same server/path/next_hop/whatever.
>
> [ I suspect the old code was not exactly doing the right thing when
> multiple servers were present either, but it did work correctly in a
> typical "single server" case when a pconn race was encountered. ]

 The current code was not exactly a change in this area. Retrying the
 server was not actually doing anything previously either. The full
 peer-select cycle was repeating on every retry. Which reset the state
 about what had already been tried and hid a few of these cases.

>
> A simple change fixing pconn retrials in the new code may go along
> these
> lines:
>
> if (paths.size() > 1) // if we can
> paths.shift(); // try another.
> // else retry using the same path
>
> but I am not sure whether (a) that would be OK for all other kinds of
> retries and (b) would not cause excessive retries since paths will
> never
> be emptied in retryOrBail(). Is there a better way to fix this?

 Hmm, only if we have no other choice. In the worst-case this will make
 all other CANNOT_FORWARD cases "hang" until forward_timeout has passed.

 Basing it on the error type/code/something would be better.

 There are three sets of cases feeding into this function (maybe more if
 other races exist).
  a) server error responses 5xx/403. new path required before
 connectStart().
  b) TCP connection failures. new path required before connectStart().
  c) this pconn race. same path can be passed back to connectStart()
 _once_.

 Cycles of repetition trying the same path on any of the (a) and (b)
 cases is a very bad idea for performance.

 At this point it should have (local.GetPort() > 0) from a previously
 open connection (c or a) [NP: fd will be -1 already]. So checking for
 the zero-sized error AND a present local port we could unset the local
 port, and call connectStart() without the shift().

 Amos
Received on Tue Sep 20 2011 - 23:57:12 MDT

This archive was generated by hypermail 2.2.0 : Wed Sep 21 2011 - 12:00:03 MDT