Re: [PATCH] Cleanup comm connection setup layering

From: Amos Jeffries <squid3_at_treenet.co.nz>
Date: Fri, 25 Jun 2010 16:21:11 +1200

Alex Rousskov wrote:
> On 06/09/2010 05:44 AM, Amos Jeffries wrote:
>> (since last submission I've implemented most of Alex suggestions
>
>
> At the design level, the following problems need to be addressed, IMO:
>
> Connection dual-use: Replace unconnected connections with some sort of
> connection destination class and use it to configure the connection
> opener (ConnectStateData) object for solo and paths cases. The
> connection opener should return a single working Connection object on
> success (among other things).

Um, one of the middling-important goals here is to reduce copying of the
comm data properties from one alloc'd object to another and stack.
Preserving the handles (refcount pointer) already stored by callers
(FTP, DNS for starters).
  I'm happy to make a typedef to differentiate the scopes for
readability. Though it seems to me that will just make more complexity.

>
> ConnectStateData nature: The connection opener class should become an
> AsyncJob and treated as such because it is.

Okay. Underway.

>
> Either separate or merge single- and multi-destination cases in
> ConnectStateData: The current code still merges some but not all
> "single-destination" code with "handling one of the multi-destinations"
> code, resulting in bugs. We should fully merge the two similar cases.
>
>
> Specific questions, suggestions, and lower-level comments are below.
>

Done. Dropped multi-path code from comm layer.
Forwarding has retryOrBail logics that are used to restart an early
aborted paths. Just re-using that to loop over the vector for
single-path opening attempts looks okay.
Building and will need to re-test later.

>
>> Peer selection has been extended to do DNS lookups on the peers chosen
>> for forwarding to and produce a vector<> of possible connection
>> endpoints (squid local IP via tcp_outgoing_address or tproxy) and remote
>> server.
>>
>> ConnectStateData has been moved into src/comm/ (not yet namespaced) and
>> had all its DNS lookup operations dropped. To be replaced by a looping
>> process of attempting to open a socket and join a link as described by
>> some Comm::Connection or vector<> of same.
>> Limited by connect_timeout in its running time.
>> Extended by connect_retries number of attempts to open each path.
>
> Can ConnectStateData handle DNS lookups? Does the peer selection
> algorithm need tight control over those lookups before we try to open a
> connection to a peer?

No, ConnectStateData doing its own DNS was the v4/v6 block on
split-stack IPv6 support for BSD. It blocks outgoing address selection
based on higher scoped HttpRequest details.

Pre-seeded peer DNS results would probably be a better performance boost
than doing new lookups each time. I've not found it to be a particularly
slow lookup, since the peers are most certain to be static or constantly
looked up the DNS is kept fresh in ipcache.

>
>> ConnectStateData::connect() will go away and do some async work. Will
>> come back at some point by calling the handler with COMM_OK,
>> COMM_ERR_CONNECT, COMM_TIMEOUT and ptrs to the Comm::Connection or
>> vector (whichever were passed in).
>> On COMM_OK the Comm::ConnectionPointer or the first entry of the vector
>> will be an open Comm::ConnectionPointer which we can now use.
>> On COMM_ERR_CONNECT the vector will be empty (all tried and
>> discarded), the single Comm::ConnectionPointer will be !isOpen() or NULL.
>> On COMM_TIMEOUT their content is as per COMM_ERR_CONNECT but the vector
>> may have untried paths still present but closed.
>>
>> FD opening, FD problems, connection errors, timeouts, early remote
>> TCP_RST or NACK closure during the setup are all now wrapped out of
>> sight inside ConnectStateData.
>
> Who is responsible for closing the opened connection descriptor if the
> requestor disappears?
>
> The old requestor code could close the FD on job exit because it had
> access to the descriptor. With the new code, the descriptor should not
> be available to the requestor until the connection is established.
>
> [ I would appreciate if we do not argue exactly which requestors may
> disappear -- it is bound to happen sooner or later, and I am tired of
> fixing hard-to-find bugs in the code that simply assumed otherwise
> because it seemed safe at the time. Let's at least write the new code
> the right way. If it is an async job, we should never assume it is
> around after AsyncStart. ]

Yes. 100% agreed.

If the requestor disappears and everything goes pear shaped the
Comm::Connection has just enough smarts to close the FD and prevent
leaks in its own destructor.

In the case where the original requestor has died then ConnectStateData
or dialer parameter details is left the only holder of the
ConnectionPointer and FD closure will be actioned when that state gets
dropped by the dialer.

We still have the case where an object holding a Comm::Connection
pointer leaks at a higher level. But there is nothing extra comm can do
about that.

>
> * Comm::Connection assignment operator does not handle assignment to
> self. Please either fix the operator or make the assignment impossible
> (if it is not needed).

Done.

>
> * Comm::Connection assignment operator should return non-const reference
> to follow typical C/C++ assignment expectations.
>

Done.

>
> * Do we really need to support Comm::Connection copying and assignment?
> Seems like a very dangerous operation given the fact that Connection
> closes it descriptor upon destruction. For example, the following code
> is currently possible but is totally wrong:
>
> void foo(Connection c) { ... }
>
> Connection c;
> c.fd = openex...
> foo(c);
>
> If we do not use Comm::Connection copying and assignment now, I would
> strongly recommend prohibiting it. Otherwise, please point me to the
> code that does.

Um. Ident needs to duplicate the received Comm::Connection before
playing with some of its fields and using the raw received data.
I'll make that a memcpy() and unset the required fields instead of
assignment copy.

>
> * Do you plan on converting idle pconn pools to use Comm::Connection?
>

Yes. pconn and pinned FD are next on the list to convert after this bit
is merged.

>
> * Please use "const reference" types for Pointer methods parameters by
> default. This avoids needless locking/unlocking during parameter
> copying. For example,
>
> ConnectStateData(Comm::PathsPointer paths, AsyncCall::Pointer handler)
>
> can be optimized as
>
> ConnectStateData(const Comm::PathsPointer &paths, const
> AsyncCall::Pointer &handler)

Methods okay I understand.
That example is a bad choice though, the constructor needs to lock
references.

>
>
> * Do not free the host before deleting self because the destructor will
> do it for you:
>
>> callback = NULL;
>> safe_free(host);
>> delete this;
>

Done.

>
> * The ConnectStateData code still has slightly different code executed
> for solo and path cases, code where solo/path decision is tested in
> slightly different ways, etc.
>
> I have already commented on that and you responded that fixing this bug
> is too inefficient. I disagree. There is no need to allocate paths if
> they are not used.
>
> If you provide a simple currentDestination() accessor for the current
> path/solo destination and a goingSolo() test method, the code will not
> become less efficient but may become correct or at least clearer.
>
> Another efficient option is to have two classes. One opens a connection
> for a single destination. The other one uses the first class to open a
> connection for one of several destinations. This is an arguably a
> cleaner design because it clearly separates two different problems at
> the expense of creating two jobs instead of one.
>
> I think on of these two changes changes must be done. I will point a few
> specific problems that the lack of uniform solo/paths treatment has
> caused below, but this is not an exhaustive list.

Now dumped the multi-path code in comm layer.

>
>
> * Who creates the descriptor for the solo case if the requestor did not
> open it? There is such code for the paths case but not for the solo
> case. For example, comm_openex() call was removed from the ICAP code,
> but I do not see the corresponding call in ConnectStateData::connect.

Um, okay I've seen what you mean. Making ConnectStateData a pure solo-path.
Forwarding has retryOrBail logics which are now used to loop over paths.
But tunnel needs some equivalent added at a later date to fix the
CONNECT retry bugs.

>
> If paths code can create descriptors on its own, without requestor
> participation, the solo code should be able to do the same.
>
>
> * The assertion below seems to contradict the code logic
>
>> /* catch the error case. */
>> assert(paths != NULL && solo != NULL);
>

Dead with paths.

>
> * Why does the code below is not executed for the solo case?
>
>> if (paths != NULL && paths->size() > 0) {
>> fd = (*paths)[0]->fd;
>> debugs(5, 3, HERE << "FD " << fd);
>> comm_remove_close_handler(fd, ConnectStateData::EarlyAbort, this);
>> commSetTimeout(fd, -1, NULL, NULL);
>> }
>

Fixed.

>
> * Why does the code below is not executed for the solo case?
>
>> if (connstart == 0) {
>> connstart = squid_curtime;
>> }

Fixed.

>
>
>
> * Who sets the ConnectStateData::EarlyAbort handler? What is the handler
> for?

Ouch. Should have been ConnectStateData::connect(). Fixed.
It's to catch the partially-open FD cases.

>
>
> * ConnectStateData is not owned by anyone, deletes self, and is
> scheduling/receiving async calls. It should be converted into an
> AsyncJob with proper start/done methods. As a free bonus, this will
> greatly help with debugging.

Sweet. Underway.

When I get my head around the Dialer parameters usage, those
ConnectCbFunPtr will die as well.

>
>
> * Do not pass a pointer to the data member to an async job for long-term
> use. For example, when the Forward object disappears, the copy of the
> &paths pointer in ConnectStateData will point to garbage:
>
>> ConnectStateData *cs = new ConnectStateData(&paths, call);
>
> I did not check for similar bugs elsewhere.
>

Fixed with the dropping of ConnectStateData::paths.

>
> * You need to prohibit copy-construction, not parameter-less creation
> via a default constructor:
>
>> /* These objects may NOT be created without connections to act on. Do not define this operator. */
>> ConnectStateData();
>
> No default constructor is generated if there are explicit ones.
>
Meh. Thanks. Done.

>
> * The return type of the assignment operator should be a non-const
> reference. This is not really a runtime bug, but please fix:
>
>> const ConnectStateData operator =(const ConnectStateData &c);
>

Done.

>
> * Why take a mini pause and schedule an zero-delay event here? Can we
> just proceed with the next step instead? We are inside an async call
> handler so I do not see any need to make us "more async":
>
>> eventAdd("ConnectStateData::Connect", ConnectStateData::Connect, this, 0.0, 0);
>

The old code said to prevent super-fast loops. I didn't change it.
Pre-async cruft probably. Changed.

>
> * Here is a suggestion for the ConnectStateData description, repeated
> from the previous email, just to make sure it is not lost in the noise:
>
> /// Async-opener of a Comm connection.
> /// Can find the first "working" among multiple destinations.
>
> If you agree with the above scope, let's s/ConnectStateData/ConnOpener/g
>

Ah agree. A nicer name too. Changed.
Though the multi-paths is now dead. Taking first line of the two.

>
> * The new Must() below does not seem to match the new allocation code.
> If the connection is not NULL, we will have a memory leak so we should
> not imply that it may not be NULL in Must():
>
>> + Must(connection == NULL || !connection->isOpen());
>>
>> const Adaptation::Service &s = service();
>>
>> if (!TheConfig.reuse_connections)
>> disableRetries(); // this will also safely drain pconn pool
>>
>> + connection = new Comm::Connection;
>
>
> * Please replace
>
>> + // TODO: where do we get the DNS info for the ICAP server host ??
>> + // Ip::Address will do a BLOCKING lookup if s.cfg().host is a ...
>
> with
>
> // TODO: Avoid blocking lookup if s.cfg().host is a ...
>

Done.

>
> * Can you add a host() getter/setter for ConnectStateData? Not a big
> deal, but it bothers me that we are duplicating so many xstrdup() calls
> (all users except Ident!) and leaving the host member open to abuse.
>

Done. Added a NULL wrapper check as well.

>
> * In the code below, can we set the connection pointer to NULL instead
> of dragging it (and the connection object) around after closeConnection()?
>
>> writer = NULL;
>> reader = NULL;
>> connector = NULL;
>> - connection = -1;
>> }
>> }

Particular references can be set to NULL if no longer needed. Just be
careful of logging, which often needs things long after the active code.

Setting an AccessLogEntry field and unsetting connection local field is
probably the way to go.

>
> * Can the new code below be moved after the Xaction state is checked and
> properly synced? The current implementation leaves Xaction in an
> inconsistent state: we are connecting and the connection have timed out.
>
>> + if (io.flag == COMM_TIMEOUT) {
>> + handleCommTimedout();
>> + return;
>> + }
>> +
>> Must(connector != NULL);
>> connector = NULL;

Inconsistent? That is the state.
We are connecting and the connection attempt has taken too long and
already closed.

If that is the wrong handler for a timed out comm action, then please
point me at the right one.

This is as far as I can get today. Will try for more later.
>
>
> * Note how the dual-use Connection prompts you to allocate and then get
> rid of a "temporary" connection in the ICAP code below:
>
>> + connection = io.conn;
>
> This waste would not exist if we did not use Connection objects as
> destination detail holders.
>
>
> * Adaptation::Icap::Xaction::fillPendingStatus and
> Adaptation::Icap::Xaction::fillDoneStatus will now crash when connection
> pointer is NULL. Please fix.
>
>
> * Please move (connection != NULL && connection->isOpen()) into a
> dedicated haveConnection() const method instead of duplicating the same
> code many times.
>
>
> * Xaction is not the only ICAP code using the connection member. Please
> fix all the affected code. For example:
>
>> void Adaptation::Icap::ModXact::startReading()
>> {
>> Must(connection >= 0);
>
>
>
> * Please shorten the comment below to "///< ICAP server connection" even
> though it is currently not very accurate if we are not reusing a pconn
> and have not connected yet (due to dual-use of Connection).
>
>> - int connection; // FD of the ICAP server connection
>> + Comm::ConnectionPointer connection; // Handle to the ICAP server connection
>
>
>
> * Does connect_retries apply to ICAP connections? If yes, the
> "forwarding" description below is a little misleading.
>
>> + This sets the maximum number of connection attempts for each
>> + potential host address selected by forwarding.
>
> Can we clarify how one can get multiple potential host addresses here?
> And whether we are talking about multiple domain names or multiple IP
> addresses?
>
>
> * Not a big deal, but I would rather see us _removing_ such personal
> artifacts than adding more of them:
>
>> + * AUTHOR: Amos Jeffries
>> + * Copyright (c) 2010, Amos Jeffries <amosjeffries_at_squid-cache.org>
>
> Technically, we can all add our names to the majority of Squid files
> because any active developer added at least something there, but it only
> wastes ink and time, IMO.
>
>
> * Open paren in the comment below. Why not use C++ // one-line comments
> for one line comments?
>
>> /* set the new one (unless it is NULL */
>
>
>
> * Please remove Comm::PathsPointer typedef. It is very misleading
> because we name refcounting pointers that way. Most of my review, I
> assumed it is a refcounting pointer... I probably missed bugs because of
> that.
>
> If you do want to have a typedef for Paths pointer (even though the
> typedef name is longer than using the raw pointer type!), consider
> something like
>
> typedef Paths* PathsRawPtr;
>
>
> I have not fully reviewed all the code but I hope the above is
> sufficient to help you move forward. Please CC me if you post a new
> version for review.
>
>
> Thank you,
>
> Alex.

Amos

-- 
Please be using
   Current Stable Squid 2.7.STABLE9 or 3.1.4
Received on Fri Jun 25 2010 - 04:21:33 MDT

This archive was generated by hypermail 2.2.0 : Sat Jun 26 2010 - 12:00:07 MDT