Re: [PATCH] log received body size

From: Alex Rousskov <rousskov_at_measurement-factory.com>
Date: Tue, 03 Aug 2010 10:39:54 -0600

On 08/03/2010 03:27 AM, Amos Jeffries wrote:
> Tsantilas Christos wrote:
>> Added %http::<bs and %icap::<bs logformat codes to HTTP and ICAP body
>> sizes received from the next HTTP hop or the ICAP server.
>>
>> Logging "received message body" is useful because Squid may receive a
>> lot more or a lot less than it serves to the client or than the
>> original resource size, which may happen when handling Range requests
>> and partial responses, when adapting bodies, and for other reasons.
>>
>> For HTTP, we define "received message body" as message body bytes that
>> Squid stores, merges, adapts, and/or forwards. In most cases, they are
>> the same as body bytes sent by the server to Squid. However, the two
>> bodies may differ for reasons such as errors (where the start of the
>> body was not found), HTTP transfer encodings (where Squid strips
>> chunked encoding to find the message body), and generated FTP
>> directory listings (that were received in a completely different
>> format on a control connection).
>>
>> For ICAP, the "received message body" is the Encapsulated sections,
>> after the encapsulated HTTP body, if any, is dechunked.

> src/Server.cc:
>
> ServerStateData::addVirginReplyBody(const char *data, ssize_t len)
> {
> + adjustBodyBytesRead(+len);
>
> whats with the +?

Hi Amos,

     I will respond to this specific question because the plus sign is
my fault. The adjustBodyBytesRead() function accepts zero, negative, and
positive deltas. The plus emphasizes that this specific delta is
[probably] positive. I believe the posted patch version does not even
use negative deltas, making the plus sign even more confusing. We will
remove the plus sign.

> * I've been a bit uncomfortable since the beginning about the
> de-chunking happening before the body bytes are accounted. This
> introduces some loss into the protocol accounting apparently being done.
> Are those lots bytes accounted for as headers? or is the total reply
> bytes value going to be larger than the sum of headers+body ?
> Do we stop at de-chunking? what about un-zipping? etc...
>
> It's your implementation but I'm of the opinion this should log the
> on-wire data size of the segment used for body. Including the chunking
> bytes, and excluding the Squid-added markup bytes on FTP/Gopher listings.

I am glad you raised this question because we have struggled with these
issues a lot ourselves. Earlier patch versions did try to count
on-the-wire body bytes before dechunking and faced many unforeseen
obstacles. After many patch versions, I can identify the biggest problem
as follows:

     Identifying header and body boundaries is protocol/encoding/etc.
specific. Counters that are specific to protocols/encodings/etc. should
not use "on the wire" approach. And, vice versa, "on the wire" counters
should not pretend to know where the "header", "body", or other
protocol/encoding/etc-dependent parts are.

Squid generates a lot of content, under many different conditions. When
I talk about protocol/encoding/etc counters, I consider
generation-specific counters to belong to that "etc" category.

We may need more "one the wire" counters. We may need more
protocol/encoding/etc-specific counters. The current set of counters is
incomplete and inconsistent, but this patch is not meant to solve this
problem.

The scope of this patch is narrow: log the size of what Squid perceives
as "received response body". This is needed to better account for
[small] 206 Partial Content responses [with huge resource variant lenghts].

Does this patch provide enough information to reconstruct all sizes of
all protocol/encoding/etc. components? No. However, such full
reconstruction is rather difficult, and it has not been requested yet. I
am not sure, but I suspect that logging sizes for every possible
protocol/encoding/etc. part will be impractical. The advantage of this
patch is that it provides usable information at a common point for all
Server-using protocols/encodings/etc.

For example, if we want to track chunked encoding overheads, we can
compare raw "bytes read from the wire" counter with a sum of HTTP
headers and body counters. This comparison will be accurate in most
cases, but there will be exceptions such as multipart responses or
generated FTP listings. Smart script can probably exclude these
exceptional cases, given enough access.log fields, but it will always be
difficult unless we add a specific "chunked encoding overhead" counter.

In the foreseeable future, we will add gzip encoding support. This patch
will still work as intended when we do that, regardless of whether we
unzip before and/or after Store. However, we may need to add more
counters to track effects and effectiveness of gzip-related algorithms.

Another complication is 1xx control messages. This patch does not
account for them. Some other counter, especially a "received
one-the-wire bytes" counter will. I would certainly recommend adding
such a counter (we already have one for ICAP), but (a) that addition is
outside this patch scope and (b) that counter should not try to identify
where the message body is.

If there are no strong objections, I recommend preserving the counting
point of this patch (Server::addVirginBody) while acknowledging that the
new counter does not count everything we might want to count. I am happy
to change the logformat code from %<bs to something else if you can come
up with a better name.

Thank you,

Alex.
Received on Tue Aug 03 2010 - 16:40:03 MDT

This archive was generated by hypermail 2.2.0 : Tue Aug 03 2010 - 12:00:04 MDT