Re: [PATCH] Comm::TcpReceiver - part 1, 2, and 3

From: Amos Jeffries <squid3_at_treenet.co.nz>
Date: Sun, 09 Feb 2014 00:00:59 +1300

On 8/02/2014 8:05 p.m., Alex Rousskov wrote:
> On 02/07/2014 09:26 PM, Amos Jeffries wrote:
>> On 8/02/2014 1:54 p.m., Alex Rousskov wrote:
>>> On 02/05/2014 07:55 AM, Amos Jeffries wrote:
>>>> * Kept the TcpReceiver name for now since Alex is disagreeing with
>>>> callign the interface an "agent".
>
>>> I do not recall objecting to Agent.
>
>> You suggested Agent. Then went on to describe an object which was
>> self-contained. TcpReceiver has pure virtuals so cannot be what you seem
>> to describe as Agent (its childs would be that).
>
> I do not recall describing Agent as a self-contained object either.
> Agent will have pure virtual functions, of course. Event the existing
> ClientAgent (i.e., ServerStateData) does!

Oh you are looking at Agent as part of the hard-coded inheritence chain
of the protocol manager class?
 I have been looking at it as a dynamicaly assigned member of the
manager class. So that we can pass an Agent between say ConnStateData
and TunnelStateData. Or switch a ConnStateData Agent pointer between
TcpAgent and TlsAgent when doing ssl-bump on a client connection.

The server-facing Agents already are separate as you noted. They are
decended from ServerStateData ... but the manager over there is called
FwdState rather than ConnStateData/ClientSocketContexxt.

>
>>>> + MemBuf inBuf;
>>>
>>> The above undocumented TcpReceiver data member may be critical to
>>> defining TcpReceiver and related classes correctly. What does that
>>> buffer contain? Raw TCP bytes? I do not think so. Do we read raw TCP
>>> bytes into that buffer? Not quite. Those bytes did not necessarily got
>>> there from a TCP connection. They may have been through an SSL decoding
>>> layer using ssl_read_method() and Comm I/O API (at least).
>>>
>>
>> The intention is to have a separate TlsWhatever class with same API as
>> this one for the SSL connections reading and writing.
>
> I still think that would be a wrong design. The complex pieces of code
> you have moved to TcpReceiver have nothing to do with TCP or SSL. If you
> implement your plan, those pieces would be identical in TcpWatever and
> TlsWhatever. That's wrong.
>
> Let other layers handle the TCP and SSL specifics. The class you are
> writing now does not have any TCP- or SSL-specific code (unless I missed
> some minor bits) and that is good! Why make it TCP specific?
>

TLS is always used as a protocol over-layer on TCP connections. So in my
idea model Comm::TlsWhatever inherits the complex algorithms and
send/receive parts from TcpReceiver and replaces only the parts it needs
to and extends with API for TLS-specific needs.

>
>> As you know one cannot safely or cleanly mix randomly read(2) and
>> ssl_read_method() on one connection. Any solution we need to do soo
>> would be a third interface class.
>
> The actual read(2) and ssl_read_method() calls will be done by other
> layers and classes, orthogonal to the class you are working on now. The
> existing Comm API is already doing that (for now; Peek and Splice needs
> will require more work on those layers, but that is out of scope here).
>

From what I know of Peek-n-Splice you will need to be reading the data
in explicitly (as TCP?) and passing it to the BIO handlers, yes?
 That is easier to handle with a distinct TlsWhatever agent than relying
on the comm_*() abstraction layer to determine a read(2) / write(2)
equivalent (from the global fd_table BTW).

>
>>> That buffer actually contains "final protocol" bytes, for whatever
>>> protocol the kid agent is implementing (HTTP, Gopher, etc.). The whole
>>> TcpReceiver class is not about TCP at all!
>>
>> It's about bytes going to/from a TCP connection.
>
> Nope. The transport and any encodings/encryptions on top of it are
> pretty much irrelevant to the current (and future) TcpReceiver code. If
> you disagree, please show me important TCP-specific code in your current
> TcpReceiver implementation.
>
>
>> The bufer has nothing
>> to do with the format of the bytes inside it.
>
> Actually, the data format or protocol are critical. Virtually no code
> would be able to use that buffer for something useful without knowing
> what it contains! More on that below, with examples.

Thats the childs or managers knowledge. Not the TcpReceiver or read(2)
knowledge.

>
>> '''Common base for classes reading (and writing) from (and to) a single
>> TCP Comm Connection. Contains basic connection management code and
>> complex algorithms for coordinating I/O activity with message receiving
>> (and sending) states.'''
>
> Please show me the TCP specific code in your TcpReceiver implementation.
>

The algorithm model behind readSomeData() assumes a streaming protocol
with single input stream with serialized read(2).
 - eliminating direct use by SCTP, FTP, or other multi-connection protocols.

Use of comm_read() and Comm::Write() relies on a socket with TCP_STREAM
socket type, "Comm::Connection tcp;" is likewise a TCP_STREAM connection
assuming direct read(2)/write(2) access is possible.
 - all of these eliminate its use for UDP or UDS based protocol
receiving/sending.

The only crossover we seem to have is between TCP and TLS connection
types. And it just so happens that TLS is a wrapper layer over TCP
anyways. So the **emulation layer** we have in comm_*() for calling SSL
wrapper functions as if they were read(2) and write(2) works for this class.

 We could just as easily (for the model, not the code) have the child
class doing TLS decryption of encrypted bytes read(2) directly off the
socket in to TcpReceiver. Given the plans I've seen about
Peek-and-Splice using BIO handlers we will probably end up doing exactly
that in some code.

>
>> "protocol messages" are nothing to do with this class (they are the
>> child business and may not even exist for cases like tunnel.cc, or may
>> be several abstraction levels away).
>
> I agree that we _can_ remove messages from the specs of the current
> TcpReceiver class. The current implementation deals with a stream of
> bytes only. I see no important code that deals with message boundaries
> and such.
>
> I doubt that the class will stay that way because all the child classes
> I know about do deal with messages and connection persistency (in a
> general sense). I would expect the notion of a message to trickle down
> to TcpReceiver class sooner or later because the kids will want to share
> that code.

Yes that is why its a child class and we are not for example parsing
HTTP requests directly in this class with a type-code switch on the
read92) handler between HTTP and IDENT parsing.

It should already be capable enough for dealing with message-based
protocols via sendSomeData() and readSomeData(). The child / caller can
send an entire message in sendSomeData() buffer, and attempts to find
whole messages in readSomeData().

This class has no need to go nearer to messages than those generic calls.

>
> Please note that supporting a concept of a message and connection
> persistency across messages is different from knowing what the message
> protocol is. I am not advocating for adding HTTP or FTP code to the
> TcpReceiver, of course.
>
> In other words (and simplifying a bit):
>
> /// raw TCP bytes (WRONG: could be SSL-decrypted TCP)
> MemBuf inBuf;
>
> /// opaque received bytes (OK but a little vague)
> MemBuf inBuf;

Explicitly accurate for what is in there though. Which is why I left it
undocumented. "inBuf" seems clear enough.

>
> /// received bytes that are yet-unprocessed by our kids;
> /// opaque for us, but kids know the protocol these bytes
> /// are using (e.g., HTTP, FTP, etc) (Better)
> MemBuf inBuf;
>
> The actual final description may also mention what we expect the kids to
> do with the buffer (copy, leave just the unparsed bytes, empty on exit,
> etc.).

We expect them to process it. Whatever that may mean to the child. This
class places no restriction or expectation on that.

>
>>>> +bool
>>>> +Comm::TcpReceiver::doneAll() const
>>>> +{
>>>> + return stoppedSending() && stoppedReceiving() && !inBuf.hasContent() && AsyncJob::doneAll();
>>>> +}
>>>
>>> If we are done sending and receiving why would having buffered content
>>> (the third guard above) matter? What can we do with that content (in the
>>> scope of this class) that does not involve sending or receiving?
>>
>> Because until the child class has cleared/consumed the content out of
>> our buffer we cannot guarantee that it is finished with.
>
> That is kids' problem/scope, not ours. Just like we cannot guarantee
> that kids are finished with the buffer, we cannot guarantee that they
> are _not_ finished with it.
>
>
>> I am trying to
>> make the buffer private, so the child class doneAll() does not have
>> direct access to it except when passed in.
>
> I think it would be a mistake to deprive kids from input buffer access.
> They may need to have access to the buffer during events not triggered
> by I/O that TcpReceiver controls or knows about. For example, Store,
> ICAP, or even an async eCAP adapter need to be able to get more body
> bytes without any network I/O happening.
>
>
>> - For now I've added a note about that and put a matching inBuf.clean()
>> in keepaliveNextRequest() to clear it once the active requests queue has
>> been drained.
>
> This is just one bad side effect of the layering violation. Let the kids
> determine whether they are done with the data or not, without forcing
> them to clear the buffer to indicate that they are done (a pretty
> meaningless action for them since an empty buffer on its own often does
> not mean that they are done; just that they do not have any unprocessed
> input right now).
>

Okay. Moved the check on buffer finished to ConnStateData::doneAll().

>
>>>> +void
>>>> +Comm::TcpReceiver::releaseConnection(const char *reason)
>>>> +{
>>>> + // Used by clients to release the connection before
>>>> + // storing it in a Pconn pool for reuse.
>>>> + comm_remove_close_handler(tcp->fd, closed_);
>>>> + closed_->cancel(reason);
>>>
>>> s/clients/callers/ or s/clients/kids/ to avoid the impression that this
>>> code is specific to Client agents. It is not.
>>>
>>> "release" is kind of undefined in this context. I think you are talking
>>> about the need to stop monitoring connection for I/O. Consider
>>> rephrasing the comment to be more precise.
>>
>> I have been considering making this return the Comm::Connection object
>> and set the tcp member to NULL. Somewhat like pop() semantics. Since it
>> is used to *move* control of the TCP connection from this object to
>> somewhere else (ie to a pconn pool or maybe a TlsWhatever agent on
>> ssl-bump or Upgrade).
>>
>> Does that sound good?
>
> Does not sound bad, but I am not sure that fully resetting the tcp
> member itself is a good idea, unfortunately. Since your Comm::Connection
> class has two roles (storage of endpoints with other info AND,
> optionally, an active socket descriptor), deleting access to that first
> "info" part may not work well, even if we no longer control the second
> "socket" part. However, you can try and see what happens if you prefer,
> of course. Perhaps that info is already copied elsewhere...

Good point.

>>>> +bool
>>>> +Comm::TcpReceiver::injectFakeRequestXXX(MemBuf &fake)
>>>> +{
>>>> + if (inBuf.hasContent())
>>>> + fake.append(inBuf.content(), inBuf.contentSize());
>>>> +
>>>> + inBuf.reset();
>>>> + inBuf.append(fake.content(), fake.contentSize());
>>>> +
>>>> + return processReadBuffer(inBuf);
>>>> +}
>>>
>>> Let's not combine injection with processing. Let the caller call
>>> processReadBuffer() if/when they are ready for that. It would make the
>>> method more powerful but simpler. And you would not have to fix its
>>> documentation to describe the return value, etc.
>>>
>>> Another layer of problems with this method are two similar asserts: One
>>> when the "fake" buffer is too small and the other one is when the inBuf
>>> is too small to accommodate the combined content. I will focus on the
>>> former.
>>>
>>> We should not assume that the fake MemBuf can accommodate the entire
>>> inBuf. MemBufs have limited capacity and it is too much to ask the
>>> caller to give us the right buffer. If you disagree, please adjust the
>>> method description to explain the requirements on the buffer parameter.
>>> If you agree, make the buffer parameter constant and use a temporary
>>> buffer (easy) or some kind of new MemBuf method with memmove() to inject
>>> content.
>>
>> I was trying to reduce data copies here.
>
> Note that you are already increasing the number of data copies: The
> official code does not have any copies in this context AFAICT, but
> suffers from many other problems. Some copying may be necessary to fix
> those problems. It will go away when/if we start supporting multi-buffer
> parsing and I/O.
>
>
>> Oh well we will have to go with
>> 3x data copy and two re-allocations on the common case for intercepted
>> HTTPS.
>
> You do not have to use three copies if you do not want to. You can add a
> method to MemBuf to insert a prefix with one memmove() followed by one
> memcpy().
>
> On the other hand, I suspect these copies are all minor compared to
> other SSL overheads so it may make sense to keep this code simple rather
> than make it fast.
>
>
>>> The code should not assert (because I might be able to trigger such an
>>> assert by stuffing the input buffer with a lot of content). A throwing
>>> Must() would be OK in this hypothetical case IMO because it will just
>>> terminate a single transaction.
>>
>> Dangerous to call Must() as these are possibly outside of AsyncJob
>> scope. IDENT and WHOIS lookup etc will likely not be AsyncJob by the
>> time this could be rolled out there.
>
> Neither IDENT nor WHOIS are going to use fake message injection, I hope.
>
> More importantly, Must() is never more dangerous than assert. More
> inconvenient if triggered outside of job protections (no stack trace),
> yes, but not more dangerous.
>
>
>>> Overall, I am not sure this method is needed. The caller can do the
>>> manipulations on its own, right? It may be easier for the caller to add
>>> the required capacity checks than implement this general code correctly.
>>>
>>
>> Ideally not. I am trying to move the inBuf to a private member so the
>> childs should only be playing with the buffer they get passed by
>> processReadBuffer(MemBuf &)
>
> As you know by now, I do not think that hiding input buffer from kids is
> going to work unless you force kids to copy it. TcpReceiver does not
> control all the events in a transaction life and, hence, cannot prohibit
> access to the buffered message which may be waiting for those events.

By making this a private/protected member we restrict access to it from
objects outside the child class and restrict parsing/processing actions
on it to the code where it is passed as parameter to
processReadbuffer(Membuf &).

Amos
Received on Sat Feb 08 2014 - 11:01:28 MST

This archive was generated by hypermail 2.2.0 : Sat Mar 01 2014 - 12:00:14 MST