[PATCH] HTTP Parser upgrade

From: Amos Jeffries <squid3_at_treenet.co.nz>
Date: Tue, 20 May 2014 23:17:26 +1200

On 20/05/2014 4:28 a.m., Alex Rousskov wrote:
> On 05/19/2014 12:29 AM, Amos Jeffries wrote:
>
>> Updated patch attached. PREVIEW since the file renaming has been
>> omitted, as have incomplete Tokenizer related changes.
>
> I have not reviewed most of the details, but the preview looks good to
> me overall.

Okay. Attached patch then for (final?) audit, includes all the changes
below and does the file renaming polish.

NP: I've manually cropped the Tokenizer branch bits imported to the
parser-ng branch. Please ignore the "src/parser/Makefile" bit in
configure.ac unless you are patching for a build test, in which case you
will need to cut out post-patching for this all to build.

>
> New temporary SBuf allocations for message parts will be removed as you
> polish the parsing code, right? Or are you proposing to commit those
> performance regressions first?

The temporary allocations for message parts should already be gone. Now
that trunk I/O buffer is an SBuf the message parts should be just
buf.substr() references to that I/O MemBlob backing store.

In the polishing now done I've now also been through and removed several
uses of c_str() causing temporary copies elsewhere. There are some (2
AFAICS) remaining where the SBuf content gets passed out to char* code.
Those are transitional and will disappear as we roll later SBuf
conversion or later parser patches into trunk.

The regression marked in prepareAcceleratedURL() is awaiting either
Tokenizer or Url class completion. Either one will obsolete the
strstr("//") URL parsing action to find path bytes.

>
> At some point, please check for The Big Three rule violations in new or
> rewritten classes. Removing _unused_ copying is fine if it is easier
> than correctly supporting all three.

Checked. There is no need for any of the Big Three AFAICT. Added
destructor and removal of copying. But this is I think a case for using
the default destructor.

>
> Please add "explicit" to any new/modified single-required-parameter
> constructor or document why we want to support silent type conversion.
> Parser(SBuf) is the one I noticed but there may be others.
>

Fixed. With the parse() API change discussed below this constructor
disappears and the other new ones are already "explicit".

>
>> /** HTTP protocol parser.
>> *
>> * Works on a raw character I/O buffer and tokenizes the content into
>> + * either an error state or HTTP procotol major sections:
>
> It is not possible to tokenize something into an error state (unless you
> are parsing errors), and major protocol sections are not exactly
> "tokens" either. Suggestion:
>
> /** Parses raw buffer to extract major HTTP message sections: ...
>
> (the existence of an error state is obvious for any parser and does not
> need to be documented in the class description IMO, but you can add it
> if you prefer, of course).

Re-written as:

/** HTTP/1.x protocol parser
 *
 * Works on a raw character I/O buffer and tokenizes the content into
 * the major CRLF delimited segments of an HTTP/1 procotol message:
 *
 * \item first-line (request-line / simple-request / status-line)
 * \item mime-header 0*( header-name ':' SP field-value CRLF)
 */

and
/** HTTP/1.x protocol request parser
 *
 * Works on a raw character I/O buffer and tokenizes the content into
 * the major CRLF delimited segments of an HTTP/1 request message:
 *
 * \item request-line (method, URL, protocol, version)
 * \item mime-header (set of RFC2616 syntax header fields)
 */

>
>
> Please consider s/isDone()/done()/ -- see the last blob in "Member
> naming" item #3 at http://wiki.squid-cache.org/Squid3CodingGuidelines

Renamed needsMoreData() as per below.

>
> Consider s/doneBytes/parsedBytes/ although I think that method is not
> needed right now, as discussed further below.

Removed as discussed below.

>
>> + /** Whether the parser is already done processing the buffer.
>> + * Use to determine between incomplete data and errors results
>> + * from the parse.
>> + */
>> + bool isDone() const {return parsingStage_==HTTP_PARSE_DONE;}
>
> s/determine/distinguish/

Fixed.

>
> However, it is not clear how and why isDone() should be used for
> distinguishing errors from "need more data". I suspect that this needs
> to be renamed or redescribed but I am not sure about your intent here.
> Perhaps this would work?
>
> /// Whether we finished parsing (successfully or otherwise).
>
> Personally, I find the "reverse" methods like "needsMoreData()" easier
> to work with, but that is a matter of taste:
>
> if (parser->needsMoreData())
> return; // wait for more bytes (clear and simple)
>
> if (!parsedAll)
> ... does not need more and failed to parse: must be an error ...
>
> versus:
>
> if (!parser->isDone())
> What does isDone() return on errors? And we are negating, so ...
>
> if (!parsedOk)
> ... handle error or does it need more data? ...
>

Going with needsMoreData().
But parsedOk looks right to me, by the time its checked it indicates
parse error/okay state.

>
>
>> + if (hp.doneBytes()) {
>> + // we are done with some of the buffer. update the ConnStateData copy now.
>> + csd->in.buf = hp.buf;
>> + }
>
> I recommend removing the doneBytes() condition to simplify code and
> minimize the number of different paths we need to worry about:

Done.

>
>
>> +namespace Http {
>> +namespace One {
> ...
>> /** HTTP protocol parser.
>
> Placement inside Http::One suggests that this is an HTTP/1 protocol
> parser, not a generic HTTP protocol parser. Please document as such if
> the placement is correct.

Done. "HTTP 0.9 and 1.x protocol parser"

>
>
>> + Parser(const SBuf &aBuf) { reset(aBuf); }
>
>> - void reset(const char *aBuf, int len);
>> + void reset(const SBuf &aBuf);
>
> Have you tried removing the above constructor and reset() method while
> supplying the buffer directly to the parse method as a parameter
> instead? I suspect it may be better than doing this direct manipulation
> of the parser buffer:

Seems to be better as it allows several other simplificatiosn to take
place. Done.

Amos

Received on Tue May 20 2014 - 11:17:42 MDT

This archive was generated by hypermail 2.2.0 : Tue May 20 2014 - 12:00:14 MDT