Re: [PATCH] Comm cleanup (Async ConnOpener)

From: Amos Jeffries <squid3_at_treenet.co.nz>
Date: Wed, 21 Jul 2010 01:53:16 +1200

Alex Rousskov wrote:
> On 07/19/2010 05:26 AM, Amos Jeffries wrote:
>> Alex Rousskov wrote:
>>> On 07/18/2010 04:52 AM, Amos Jeffries wrote:
>>>> This updated my previous submission for the comm connection opener
>>>> model.
>>>>
>>>> diff since last time:
>>>>
>>>> - removed the multi-path pathways. They are now isolated at the upper
>>>> level. ConnOpener now handles only one Comm::Connection.
>>>>
>>>> - implemented ConnOpener as an AsyncJob child.
>>>>
>>>>
>>>> What has changed from HEAD:
>>>>
>>>> ConnectionDetails objects have been renamed Comm::Connection and been
>>>> extended to hold the FD and Squids' socket flags.
>>> ...
>>>> On COMM_ERR_CONNECT and COMM_TIMEOUT the conn will be !isOpen() or NULL
>>>> and should not be relied on.
>>> If the connection pointer is NULL, a job opening multiple connections
>>> would not be able to tell which connection just failed. Let's not hide
>>> connection pointer from the initiator.
>>>
>>>
>>>
>>> * Can Adaptation::Icap::Xaction::closeConnection set connection to NULL?
>> In the reuseConnection == false case, it can after the connection->close().
>
>
> Why not in both reuse and no-reuse cases?

Um... okay yes. In both. Fixed.

>
>>> * ConnOpener::start(), doneAll(), and swanSong() should remain protected
>>> as they are in the AsyncJob.
>>>
>> Done. Apart from doneAll(). It is required to be accessible to the
>> ConnectRetry wrapper until comm write callbacks are turned into methods.
>
> ConnectRetry is broken in several ways: It partially and incorrectly
> emulates async call protections, it calls start() directly, and it
> results in at least two calls to start() for on job (which is like
> having two calls to main() in a program).
>
> Instead of this mess, please have ConnectRetry wrapper to schedule an
> async call for ConnOpener::tryConnecting() or a similar job method and
> move your existing comm_connect_addr() code there. No deleteThis or
> doneAll access would be necessary.

Ew. A call to schedule a call?

We do gain a lot though. Done.

>
>>> * Should ConnOpener::swanSong() call comm_close instead of just close()?
>> No. Until ConnOpener() has completed successfully there is no other
>> valid user of the conn->fd. The only handlers to deal with are
>> ConnOpeners' earlyAbort and timeout handlers, which are explicitly
>> cancelled and removed.
>> comm_close does a lot of additional handler and callback state updating
>> over multiple callbacks. This is all unnecessary waste of cycles on a
>> closed or maybe half-open FD.
>
> I doubt the "we are the only fd user so we are going to bypass Comm when
> we want to" is a good approach. ConnOpener calls many Comm methods.
> Those methods may create some state in Comm internal tables. We should
> call comm_close if we call comm_open instead of assuming that Comm has
> no state associated with our descriptor.
>
> If calling comm_close results in too many useless callbacks in our case,
> we should remove the no-longer-needed callbacks _before_ closing the
> descriptor.

Sorry, I responded to that without taking another look at the code.

ConnOpener::swanSong() actually calls the shared and public
Connection::close() which calls comm_close() and unsets the FD.

The behaviour you describe as should be happening was in fact happening.
Upper level removing its callbacks, lower level being told to close its FD.

>>> * ConnOpener::callCallback() clears the close handler and timeout
>>> depending on solo state, which is kind of wrong, but the best thing to
>>> do is to remove that cleanup code from callCallback() completely because
>>> the same code is already present in swanSong() and because it is
>>> unrelated to calling the callback.
>> Ideally yes. I'm not certain enough about the state of the Job on the
>> second (ConnectRetry) legacy entrance to remove either at this point.
>
> Do not let one bug cause others. I have discussed ConnectRetry fixes
> above, but swanSong and callCallback should not depend on what kind of
> bugs we have in ConnectRetry or we will be going in circles.
>

Okay. Fixed. And the crashes which resulted from that fix have in turn
been fixed... by rolling Comm::Connection out as the server FD handle
used by http.cc.
  One commit early, but its worked and was not as large as I had
thought. Though the non-http servers have yet to have their rollout to
fix the same problems over there.

Do you want to audit the code at this stage or wait till I finish that
next level of rollout into the Servers' ?

>
> * Do not call start() directly. Only AsyncStart should call it. And it
> should not be called more than once per job object lifetime. It is
> similar to main() in C/C++.
>

Fixed. Start now does comm_openex(), creates the calls and schedules the
connect one. Which contains the switch statement from the old code and
loops async.

>
>>> * It may not be a good idea to separate calls.earlyabort and
>>> calls.timeout existence from the existence of the corresponding Comm
>>> close or timeout handler. Set them when setting the Comm handler. Clear
>>> them when clearing the handler.

Fixed.

>>> * FwdState::unregister() used to prevent FwdState from closing the
>>> unregistered connection. The new code does not. Was the old code buggy?
>>>
>> Only in Comm::Connection semantics. Added an alternative version and
>> deprecated the old unregister(int) with a hack to keep it going on the
>> old semantics.
>
>
> I will try to remember to revisit this when you post the patch with the
> changes described above, but I may forget.
>

New code is this pair of methods:

void
FwdState::unregister(Comm::ConnectionPointer conn)
{
     debugs(17, 3, HERE << entry->url() );
     assert(serverConnection() == conn);
     assert(conn->isOpen());
     comm_remove_close_handler(conn->fd, fwdServerClosedWrapper, this);
}

// Legacy method to be removed in favor of the above as soon as possible
void
FwdState::unregister(int fd)
{
     debugs(17, 3, HERE << entry->url() );
     assert(fd == serverConnection()->fd);
     assert(fd > -1);
     comm_remove_close_handler(fd, fwdServerClosedWrapper, this);
     serverConnection()->fd = -1;
}

Yes, I do see that the fd = -1 was critical to the old code semantics.

This is a problem. unregister() can prevent the fwdServerClosedWrapper
being called from that point on. From a days running of Squid that seems
to be sufficient.

The better solution may be to not call ->close() or comm_close()
directly anymore from FwdState or any of the Servers. Relying on the
Comm::Connection to do its own closing on destruct.

>
>>> And why is it called "paths"? Each paths element (i.e., a path) only
>>> contains one point (the next hop). A path is a sequence of points.
>>> Should we call the "paths" member "hops", "peers", or "servers"?
>> Sequence of two location data points are stored in each conn:
>> (local,remote)
>>
>> The paths vector<> holds a series of such.
>>
>> Though I agree paths is not a good name, and I am leaning towards
>> remoteDestinations or somesuch now.
>
>
> "destinations" is fine with me though other choices are shorter. I would
> drop the "remote" prefix as it adds no meaning and may be confusing when
> a remote address is "localhost".
>
>
>
>>> Overall, exposing connection pointer to Servers seems wrong, especially
>>> because the pointer may change from one call to another. A Server that
>>> needs that pointer, should store it at initialization time, just like it
>>> stores fd in the current code.
>>>
>>> You have "TODO: store Comm::Connection instead of FD". This is an XXX
>>> and not just a TODO because of the above concerns.
>>>
>> This auditing task is scoped to do the connection setup and get
>> split-stack IPv6 a chance of going operational. Which means as little
>> upper layer change in the Servers as we can cope with.
>
> I appreciate the goal of minimizing the upper-level changes, but when
> the connection pointer change is outside Server control, it becomes
> dangerous to offer the Server access to it. It already caused one bug
> and we could easily miss more hidden ones. At the very least, let's add
> a "current connection, may change between calls" comment for the conn()
> method.
>
>
>> Is there any reason why the CommCalls have unique params classes and
>> dialers for CommTimeoutCb* and CommCloseCb* ? I'd kind of like to roll
>> them and CommConnectCb* into passing CommCommonCb* if thats possible,
>> and cut ~25% of the CommCalls code on the next cleanup step.
>
>
> You can merge CommTimeoutCb and CommCloseCb because they have the same
> arguments and such. The merge will remove a few lines of simple code,
> but you may lose the ability to know which callback is segfaulting based
> on call object alone, which is sometimes useful when dealing with core
> dumps. I doubt it is worth it, but have no strong objections.
>
> If the cleanup you have mentioned focuses on removing old function-style
> callbacks instead, then you would be able to remove a lot of CommCalls
> complexity and improve debugging.

Yes.

> Converting old-style function calls to use CommCommonCbParams is fine,
> but, again, it would be better to get rid of those calls instead.
>
>
>> All changed. Running my checks again now before commit and re-submit.
>
> It would be nice if you could run it with valgrind. A few bugs I was
> lucky to spot (e.g., double-destruction and peer leak) would have been
> identified by valgrind.
>

Will do.

Amos

-- 
Please be using
   Current Stable Squid 2.7.STABLE9 or 3.1.5
Received on Tue Jul 20 2010 - 13:53:24 MDT

This archive was generated by hypermail 2.2.0 : Tue Jul 20 2010 - 12:00:09 MDT