Re: [PATCH] Do not send unretriable requests on reused pinned connections

From: Alex Rousskov <rousskov_at_measurement-factory.com>
Date: Sat, 01 Dec 2012 13:03:11 -0700

On 11/30/2012 08:27 PM, Amos Jeffries wrote:
> On 1/12/2012 1:31 p.m., Henrik Nordström wrote:
>> Fri 2012-11-30 at 15:30 -0700, Alex Rousskov wrote:
>>
>>> Squid is sending POST requests on reused pinned connections, and
>>> some of those requests fail due to a pconn race, with no possibility for
>>> a retry.
>> Yes... and we have to do so for NTLM, TPROXY, and friends, or they get
>> in a bit of trouble from connection state mismatches.
>>
>> If sending the request fails we should propagate this to the client by
>> resetting the client connection and let the client retry.
>
> It seems to me we are also forced to do this for ssl-bump connections.

The current code does not force it. From a user experience point of view,
resetting the client connection is usually the wrong choice in this
context (it breaks sites/transactions that work correctly with the
current code). Thus, if we disagree on whether this should be the default
behavior, somebody will need to add a knob to control it from squid.conf.
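
For concreteness, "resetting the client connection" here means closing
the socket so that the client sees a TCP RST rather than a clean FIN. A
minimal sketch of the standard SO_LINGER technique (the helper name is
illustrative, not Squid's actual code):

    #include <sys/socket.h>
    #include <unistd.h>

    // Illustrative helper, not Squid's actual API: abort a client
    // socket with a TCP RST instead of a normal FIN, so the client
    // knows the transaction failed and may decide to retry itself.
    static void resetClientConnection(int fd)
    {
        struct linger l;
        l.l_onoff = 1;   // enable lingering on close
        l.l_linger = 0;  // zero timeout: close() sends RST, not FIN
        setsockopt(fd, SOL_SOCKET, SO_LINGER, &l, sizeof(l));
        close(fd);
    }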

> * Opening a new connection is the wrong thing to do for server-first
> bumped connections, where the new connection MAY go to a completely
> different server than the one whose certificate the client was bumped
> with. We control the IP:port we connect to, but we cannot control
> whether IP-level load balancers exist.

Not exactly. Opening a new connection MIGHT be wrong (not IS wrong). In
all real use cases we know of today, it is actually the right thing to
do. So far, we have found no cases where a load balancer or a similar
device sends the request to a semantically different server. I am not
saying such cases do not exist, but they are definitely not the norm.

Please note that Squid checks the certificate of the new server when the
connection is repinned. We do not just trust that the new server
certificate is going to be similar. IIRC, this has been discussed in the
past, when bump-server-first patches were reviewed.
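
To make that concrete: the check can be as simple as comparing the
certificate presented on the new connection against the one the client
was bumped with. A sketch in plain OpenSSL terms (not Squid's actual
implementation, which may accept "similar" rather than strictly
identical certificates):

    #include <openssl/ssl.h>
    #include <openssl/x509.h>

    // Sketch only: after handshaking with the new TCP destination,
    // refuse to repin unless the server presents the same certificate
    // that the client connection was originally bumped with.
    static bool samePinnedServer(SSL *newSession, X509 *pinnedCert)
    {
        X509 *freshCert = SSL_get_peer_certificate(newSession);
        if (!freshCert)
            return false; // no certificate at all: not our server
        const bool same = (X509_cmp(pinnedCert, freshCert) == 0);
        X509_free(freshCert); // get_peer_certificate() adds a reference
        return same;
    }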

> * client-first bumped connections do not face the lag, *BUT* there is
> no way to identify them at forwarding time separately from server-first
> bumped ones.

The lag is not the primary issue here. Pconn races are. All SslBump
connections suffer from those.

We can discuss the lag separately. Since reconnecting after a 1-second
lag may face the same lag again, it may not really help. I would not
object to adding an optional feature to reconnect after an X-second lag,
but I recommend waiting for a real use case to justify that complication
(and to help us detail the fix correctly).

> * we are pretending to be a dumb relay - which offers the ironclad
> guarantee that the server at the other end is a single TCP endpoint
> (DNS uncertainty exists only during the initial setup; once connected,
> either all packets reach *an* endpoint or the connection dies).

Conceptually, yes. In practice, "we break this popular site" reasons
require us to make implementation adjustments to that nice theoretical
concept of a dumb relay.

> We can control the outgoing IP:port details, but have no control over
> the existence of IP-level load balancers which can screw with the
> destination server underneath us. Gambling on the destination not
> changing on an HTTPS outbound when retrying for intercepted traffic
> would re-open at least two CVE issues that 3.2 is supposed to be immune
> to (CVE-2009-0801 and CVE-2009-3555).

We do not gamble. We verify that the new destination has a similar
certificate.

> Races are also still very possible on server-bumped connections if for
> any reason it takes longer to receive+parse+adapt+reparse the client
> request than the server is willing to wait.

Yes, but that would not be a pconn race. Let's not enlarge the scope of
the problem in the context of this patch review. The proposed patch
fixes pconn races (with real use cases behind the fix!). It does not
pretend to fix something else.

> A straight usage counter is definitely the wrong thing to use to
> control this, whether or not you agree with us that retrying outbound
> connections is safe after guaranteeing the client (with an encryption
> certificate, no less) that a single destination has been set up. What
> is needed is a suitably long idle timeout and a close handler.

The usage counter is the right approach to detect pconn races, by
definition: A pconn race happens _after_ the first HTTP connection use.
The counter simply helps us detect when we have used the HTTP connection
at least once.
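
In code terms, the idea is trivial. A sketch with made-up names (the
actual patch may differ):

    // A pconn race is only possible after the connection has carried
    // at least one request, so a plain use counter is enough to tell
    // a fresh connection from a reused one at forwarding time.
    class ServerConnection
    {
    public:
        ServerConnection(): useCount_(0) {}
        void noteRequestSent() { ++useCount_; }
        bool everUsed() const { return useCount_ > 0; }
    private:
        unsigned int useCount_;
    };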

> The issue is not that the conn was used then pooled versus pinned. The
> issue is the asynchronous period between the last and the current
> packet on the socket - we have no way to tell whether the intervening
> duration has caused problems (crtd, adaptation, or ACL lag might be
> enough to die from some race with NAT timeouts), regardless of whether
> that past use was the SSL exchange (server-bump only) or a previous
> HTTP data packet.

I am not trying to fix all possible race conditions. The patch is trying
to fix just one: an HTTP pconn race. Other fixes would be welcomed, of
course, especially if backed by real use cases/need.

> I agree this is just as much true
> on bumped connections which were pinned at some unknown time earlier as
> it is for connections pulled out of a shared pool and last used some
> unknown time earlier. Regardless of how the persistence was done, they
> *are* essentially reused idle persistent connections. All the same
> risks/problems apply, but whether a retry or an alternative connection
> setup is possible differs greatly between the traffic types - with
> intercepted traffic (of any source) the retry is more dangerous than
> informing the client via an aborted connection.

Retry works in the use cases we know about (and there are many of
them!). Aborting the client does not. Retry is in the current code. If
somebody wants to make the behavior configurable, let them do it,
especially if there is some real world need for it, but that is a
completely separate issue from the proposed fix.

The proposed fix does not introduce retries/reconnects for bumped
connections. It just avoids sending unretriable requests on idle pconns.
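
The shape of that logic, continuing the ServerConnection sketch above
(the Request type and the two forwarding helpers are hypothetical):

    // Only send a request on a used (and therefore race-prone) pinned
    // connection if we could safely resend it after a pconn race.
    void forwardRequest(ServerConnection &conn, const Request &request)
    {
        if (conn.everUsed() && !request.retriable()) {
            // A race here would be unrecoverable: the server may close
            // the idle pconn just as we write, and we must not blindly
            // replay a non-idempotent request such as a POST.
            forwardOnFreshConnection(request); // hypothetical helper
            return;
        }
        writeRequest(conn, request); // hypothetical helper
        conn.noteRequestSent();
    }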

Hope this clarifies,

Alex.