Re: [PATCH] Cleanup comm connection setup layering

From: Alex Rousskov <rousskov_at_measurement-factory.com>
Date: Thu, 24 Jun 2010 17:22:24 -0600

On 06/09/2010 05:44 AM, Amos Jeffries wrote:
> (since last submission I've implemented most of Alex suggestions

At the design level, the following problems need to be addressed, IMO:

Connection dual-use: Replace unconnected connections with some sort of
connection destination class and use it to configure the connection
opener (ConnectStateData) object for solo and paths cases. The
connection opener should return a single working Connection object on
success (among other things).

ConnectStateData nature: The connection opener class should become an
AsyncJob and treated as such because it is.

Either separate or merge single- and multi-destination cases in
ConnectStateData: The current code still merges some but not all
"single-destination" code with "handling one of the multi-destinations"
code, resulting in bugs. We should fully merge the two similar cases.

Specific questions, suggestions, and lower-level comments are below.

> Peer selection has been extended to do DNS lookups on the peers chosen
> for forwarding to and produce a vector<> of possible connection
> endpoints (squid local IP via tcp_outgoing_address or tproxy) and remote
> server.
>
> ConnectStateData has been moved into src/comm/ (not yet namespaced) and
> had all its DNS lookup operations dropped. To be replaced by a looping
> process of attempting to open a socket and join a link as described by
> some Comm::Connection or vector<> of same.
> Limited by connect_timeout in its running time.
> Extended by connect_retries number of attempts to open each path.

Can ConnectStateData handle DNS lookups? Does the peer selection
algorithm need tight control over those lookups before we try to open a
connection to a peer?

> ConnectStateData::connect() will go away and do some async work. Will
> come back at some point by calling the handler with COMM_OK,
> COMM_ERR_CONNECT, COMM_TIMEOUT and ptrs to the Comm::Connection or
> vector (whichever were passed in).
> On COMM_OK the Comm::ConnectionPointer or the first entry of the vector
> will be an open Comm::ConnectionPointer which we can now use.
> On COMM_ERR_CONNECT the vector will be empty (all tried and
> discarded), the single Comm::ConnectionPointer will be !isOpen() or NULL.
> On COMM_TIMEOUT their content is as per COMM_ERR_CONNECT but the vector
> may have untried paths still present but closed.
>
> FD opening, FD problems, connection errors, timeouts, early remote
> TCP_RST or NACK closure during the setup are all now wrapped out of
> sight inside ConnectStateData.

Who is responsible for closing the opened connection descriptor if the
requestor disappears?

The old requestor code could close the FD on job exit because it had
access to the descriptor. With the new code, the descriptor should not
be available to the requestor until the connection is established.

[ I would appreciate if we do not argue exactly which requestors may
disappear -- it is bound to happen sooner or later, and I am tired of
fixing hard-to-find bugs in the code that simply assumed otherwise
because it seemed safe at the time. Let's at least write the new code
the right way. If it is an async job, we should never assume it is
around after AsyncStart. ]

* Comm::Connection assignment operator does not handle assignment to
self. Please either fix the operator or make the assignment impossible
(if it is not needed).

* Comm::Connection assignment operator should return non-const reference
to follow typical C/C++ assignment expectations.

* Do we really need to support Comm::Connection copying and assignment?
Seems like a very dangerous operation given the fact that Connection
closes it descriptor upon destruction. For example, the following code
is currently possible but is totally wrong:

  void foo(Connection c) { ... }

  Connection c;
  c.fd = openex...
  foo(c);

If we do not use Comm::Connection copying and assignment now, I would
strongly recommend prohibiting it. Otherwise, please point me to the
code that does.

* Do you plan on converting idle pconn pools to use Comm::Connection?

* Please use "const reference" types for Pointer methods parameters by
default. This avoids needless locking/unlocking during parameter
copying. For example,

ConnectStateData(Comm::PathsPointer paths, AsyncCall::Pointer handler)

can be optimized as

ConnectStateData(const Comm::PathsPointer &paths, const
AsyncCall::Pointer &handler)

* Do not free the host before deleting self because the destructor will
do it for you:

> callback = NULL;
> safe_free(host);
> delete this;

* The ConnectStateData code still has slightly different code executed
for solo and path cases, code where solo/path decision is tested in
slightly different ways, etc.

I have already commented on that and you responded that fixing this bug
is too inefficient. I disagree. There is no need to allocate paths if
they are not used.

If you provide a simple currentDestination() accessor for the current
path/solo destination and a goingSolo() test method, the code will not
become less efficient but may become correct or at least clearer.

Another efficient option is to have two classes. One opens a connection
for a single destination. The other one uses the first class to open a
connection for one of several destinations. This is an arguably a
cleaner design because it clearly separates two different problems at
the expense of creating two jobs instead of one.

I think on of these two changes changes must be done. I will point a few
specific problems that the lack of uniform solo/paths treatment has
caused below, but this is not an exhaustive list.

* Who creates the descriptor for the solo case if the requestor did not
open it? There is such code for the paths case but not for the solo
case. For example, comm_openex() call was removed from the ICAP code,
but I do not see the corresponding call in ConnectStateData::connect.

If paths code can create descriptors on its own, without requestor
participation, the solo code should be able to do the same.

* The assertion below seems to contradict the code logic

> /* catch the error case. */
> assert(paths != NULL && solo != NULL);

* Why does the code below is not executed for the solo case?

> if (paths != NULL && paths->size() > 0) {
> fd = (*paths)[0]->fd;
> debugs(5, 3, HERE << "FD " << fd);
> comm_remove_close_handler(fd, ConnectStateData::EarlyAbort, this);
> commSetTimeout(fd, -1, NULL, NULL);
> }

* Why does the code below is not executed for the solo case?

> if (connstart == 0) {
> connstart = squid_curtime;
> }

* Who sets the ConnectStateData::EarlyAbort handler? What is the handler
for?

* ConnectStateData is not owned by anyone, deletes self, and is
scheduling/receiving async calls. It should be converted into an
AsyncJob with proper start/done methods. As a free bonus, this will
greatly help with debugging.

* Do not pass a pointer to the data member to an async job for long-term
use. For example, when the Forward object disappears, the copy of the
&paths pointer in ConnectStateData will point to garbage:

> ConnectStateData *cs = new ConnectStateData(&paths, call);

I did not check for similar bugs elsewhere.

* You need to prohibit copy-construction, not parameter-less creation
via a default constructor:

> /* These objects may NOT be created without connections to act on. Do not define this operator. */
> ConnectStateData();

No default constructor is generated if there are explicit ones.

* The return type of the assignment operator should be a non-const
reference. This is not really a runtime bug, but please fix:

> const ConnectStateData operator =(const ConnectStateData &c);

* Why take a mini pause and schedule an zero-delay event here? Can we
just proceed with the next step instead? We are inside an async call
handler so I do not see any need to make us "more async":

> eventAdd("ConnectStateData::Connect", ConnectStateData::Connect, this, 0.0, 0);

* Here is a suggestion for the ConnectStateData description, repeated
from the previous email, just to make sure it is not lost in the noise:

/// Async-opener of a Comm connection.
/// Can find the first "working" among multiple destinations.

If you agree with the above scope, let's s/ConnectStateData/ConnOpener/g

* The new Must() below does not seem to match the new allocation code.
If the connection is not NULL, we will have a memory leak so we should
not imply that it may not be NULL in Must():

> + Must(connection == NULL || !connection->isOpen());
>
> const Adaptation::Service &s = service();
>
> if (!TheConfig.reuse_connections)
> disableRetries(); // this will also safely drain pconn pool
>
> + connection = new Comm::Connection;

* Please replace

> + // TODO: where do we get the DNS info for the ICAP server host ??
> + // Ip::Address will do a BLOCKING lookup if s.cfg().host is a ...

with

// TODO: Avoid blocking lookup if s.cfg().host is a ...

* Can you add a host() getter/setter for ConnectStateData? Not a big
deal, but it bothers me that we are duplicating so many xstrdup() calls
(all users except Ident!) and leaving the host member open to abuse.

* In the code below, can we set the connection pointer to NULL instead
of dragging it (and the connection object) around after closeConnection()?

> writer = NULL;
> reader = NULL;
> connector = NULL;
> - connection = -1;
> }
> }

* Can the new code below be moved after the Xaction state is checked and
properly synced? The current implementation leaves Xaction in an
inconsistent state: we are connecting and the connection have timed out.

> + if (io.flag == COMM_TIMEOUT) {
> + handleCommTimedout();
> + return;
> + }
> +
> Must(connector != NULL);
> connector = NULL;

* Note how the dual-use Connection prompts you to allocate and then get
rid of a "temporary" connection in the ICAP code below:

> + connection = io.conn;

This waste would not exist if we did not use Connection objects as
destination detail holders.

* Adaptation::Icap::Xaction::fillPendingStatus and
Adaptation::Icap::Xaction::fillDoneStatus will now crash when connection
pointer is NULL. Please fix.

* Please move (connection != NULL && connection->isOpen()) into a
dedicated haveConnection() const method instead of duplicating the same
code many times.

* Xaction is not the only ICAP code using the connection member. Please
fix all the affected code. For example:

> void Adaptation::Icap::ModXact::startReading()
> {
> Must(connection >= 0);

* Please shorten the comment below to "///< ICAP server connection" even
though it is currently not very accurate if we are not reusing a pconn
and have not connected yet (due to dual-use of Connection).

> - int connection; // FD of the ICAP server connection
> + Comm::ConnectionPointer connection; // Handle to the ICAP server connection

* Does connect_retries apply to ICAP connections? If yes, the
"forwarding" description below is a little misleading.

> + This sets the maximum number of connection attempts for each
> + potential host address selected by forwarding.

Can we clarify how one can get multiple potential host addresses here?
And whether we are talking about multiple domain names or multiple IP
addresses?

* Not a big deal, but I would rather see us _removing_ such personal
artifacts than adding more of them:

> + * AUTHOR: Amos Jeffries
> + * Copyright (c) 2010, Amos Jeffries <amosjeffries_at_squid-cache.org>

Technically, we can all add our names to the majority of Squid files
because any active developer added at least something there, but it only
wastes ink and time, IMO.

* Open paren in the comment below. Why not use C++ // one-line comments
for one line comments?

> /* set the new one (unless it is NULL */

* Please remove Comm::PathsPointer typedef. It is very misleading
because we name refcounting pointers that way. Most of my review, I
assumed it is a refcounting pointer... I probably missed bugs because of
that.

If you do want to have a typedef for Paths pointer (even though the
typedef name is longer than using the raw pointer type!), consider
something like

    typedef Paths* PathsRawPtr;

I have not fully reviewed all the code but I hope the above is
sufficient to help you move forward. Please CC me if you post a new
version for review.

Thank you,

Alex.
Received on Fri Jun 25 2010 - 00:42:07 MDT

This archive was generated by hypermail 2.2.0 : Fri Jun 25 2010 - 12:00:08 MDT