Re: [PATCH] Request URI logging for malformed requests from Tsantilas Christos on 2011-03-04 (squid-dev)

From: Tsantilas Christos <chtsanti_at_users.sourceforge.net>
Date: Fri, 04 Mar 2011 19:31:16 +0200

Looks good Amos, I will make a new patch based on your
rfc1738_do_escape. Will you commit it to trunk?

On 03/04/2011 03:28 PM, Amos Jeffries wrote:
> On 04/03/11 05:23, Tsantilas Christos wrote:
>> On 03/02/2011 03:00 AM, Amos Jeffries wrote:
>>> On Tue, 01 Mar 2011 20:29:51 +0200, Tsantilas Christos wrote:
>>>> This patch is forgotten too.
>>>> Any comments?
>>>>
>>>
>>> Just doing a quick audit now...
>>>
>>> ......
>>> "
>>>
>>> src/client_side.cc:
>>> - the logic is wrong
>>> - in setLogUri() you do not need to call stringHasCntl(uri). The
>>> unescape call will do a one-pass scan and check these as it goes.
>>> also an old bug is now visible, stringHasCntl() does not check the full
>>> set of characters that need encoding :(
>>
>> Well, does not check for rfc1738_unsafe_chars, is not difficult to fix
>> it. I can include a fix in this patch if required.
>>
>>>
>>> With that fixed its clear that stringHasWhitespace() needs to be
>>> performed *before* any escaping and duplicating is done. Some shuffling
>>> can avoid doing that scan pass entirely on most config setups.
>>
>> This is true, the rfc1738_escape_unscaped, escapes the whitespace so
>> user whitespace policy are not applied.
>>
>> Your patch is not correct too, because the uris must always escaped if
>> they contain non printable characters, before logged. Your code only
>> take cares for whitespaces...
>
> Arg. Right.
>
>>
>> A better solution requires 2 strdup/malloc if there are both whitespaces
>> and ctl characters in the uri. I do not know how often are URIs with
>> whitespaces and ctl characters but I do not like ...
>>
>> The problem was to find a way to handle the cases you have unescaped
>> URIs with whitespaces which causes problem on log analysis.
>> In my original patch if there are cntl characters escape them and escape
>> the whitespaces too. In the case there are not cntl characters apply
>> user defined whitespace policy....
>> Yes the above is not the best, but I think it is fair for most cases.
>>
>>
>> The overall source code which is related with http client request
>> parsing is a little confused, and I believe that we are spending a lot
>> of CPU cycles checking and parsing again and again the same strings.
>>
>> My opinion is that a better solution is to rewrite or add a new
>> rfc1738_escape_unescaped library function as follows:
>>
>> rfc1738_escape_unescaped_uri(const char *uri,
>> const char *escaped_uri, size_t escaped_uri_len,
>> int whitespace_policy = URI_WHITESPACE_ENCODE)
>>
>> uri: the original URI
>> escaped_uri: the buffer to copy the URI
>> escaped_uri_len: the length of the buffer
>> whitespace_policy: method to handle the whitespaces
>> (URI_WHITESPACE_ALLOW, URI_WHITESPACE_ENCODE etc.... )
>
> No. our local whitespace policy has no place in the encoding algorithm.
> It is used by many places for many non-URL things.
>
> The current separation of a whitespace policy function setLogUri() which
> happens to use rfc1738 escaping is good. The problem is limits on the
> algorithm API.
>
> Attached is a patch for the rfc1738 API which fixes its flags to enable
> omitting the character groups separately.
>
> I noticed there was no short-circuit to speed up the scan over the very
> common alpha-numeric chars. Added. That should speed up the encoding a lot.
>
> There are two extra performance boosts possible here but outside the
> scope of any of this work,
> * using malloc instead of calloc. If necessary at all memset() is only
> needed on the final unfilled part of the array.
> * update the API to receive an optional buffer to be filled, OR to pass
> out the alloc'd buffer for the caller to deal with.
>
>>
>> The above will save at list 2 full string passes (stringHasWhitespace()
>> and stringHasCntl() functions) for every URI to be logged and in worst
>> case 4 string passes and 2 strdup/mallocs
>
> Yes, neither of those are needed at all.
>
>>
>> Regards,
>> Christos
>>
>>>
>>> What I think is needed is:
>>>
>>> if (!cleanUrl)
>>> ...
>>> else {
>>> switch (Config.uri_whitespace) {
>
> After the rfc1738 API is fixed this can become:
>
> int flags = 0;
> switch (Config.uri_whitespace) {
> case URI_WHITESPACE_ALLOW:
> flags |= RFC1738_ESCAPE_NOSPACE;
>
> case URI_WHITESPACE_ENCODE:
> flags |= RFC1738_ESCAPE_UNESCAPED;
> http->log_uri = xstrndup(rfc1738_do_escape(uri, flags), MAX_URL);
> break;
>
>>>
>>> case URI_WHITESPACE_CHOP:
>>> http->log_uri = xstrndup(uri, min(MAX_URL,strcspn(http->log_uri,
>>> w_space)));
>>> break;
>>>
>>> case URI_WHITESPACE_DENY:
>>> case URI_WHITESPACE_STRIP:
>>> default:
>>> {
>>> char *q = static_cast<char*>(xmalloc(strlen(uri)*3 + 1));
>>> const char *t = uri;
>>> while (*t) {
>>> if (!xisspace(*t))
>>> *q++ = *t;
>>> t++;
>>> }
>>> *q = '\0';
>>> http->log_uri = xstrndup(rfc1738_escape_unescaped(q), MAX_URL);
>>> xfree(q);
>>> }
>>> }
>
> Amos
Received on Fri Mar 04 2011 - 17:31:35 MST

This archive was generated by hypermail 2.2.0 : Sat Mar 05 2011 - 12:00:02 MST