Re: [PATCH] regular expression optimisation patch for squid 3.1.14

From: Alex Rousskov <rousskov_at_measurement-factory.com>
Date: Mon, 11 Jul 2011 18:30:46 -0600

On 07/05/2011 01:14 PM, Marcus Kool wrote:
> Attached is a patch for optimisation of REs.
> This is the second submission of the patch and the comments from
> Amos' review are addressed.
>
> This patch is inspired by the work that I did for ufdbGuard and a few
> emails with Amos.
>
> The new code optimises lists of regular expressions.
> The optimisations are:
> * initial .* is stripped

Initial '.'s are stripped as well. If that code survives the audit, the
change should be documented. More on that change below.

> * RE-1 RE-2 ... RE-n are joined into one large RE: (RE-1)|(RE-2)|...|(RE-n)
> * -i ... -i options are optimised: the second one is ignored, same for +i

> + while (*t == '.') {
> + t++;

In the above code, you remove leading "."s which may change the regular
expression. For example, the following RE will match "01234" but will
not match "1234" or "234":

    ..234

After your patch, that RE will be converted to "234" and match all three
strings. If you agree that this is a bug, please enhance your test cases
before you fix the bug.

> + if (*t == '\0') {
> + debugs(28, DBG_IMPORTANT, "" << cfg_filename << " line " << config_lineno << ": " << config_input_line);
> + debugs(28, DBG_IMPORTANT, "WARNING: regular expression '" << orig << "' has only wildcards and matches all strings." );
> + return orig;
> + }

The above feels a little inconsistent. The code adjusts RE in all other
cases but returns orig in this specific case. Should we return "^.*" or
a similar simplified RE instead?

> + if (t != orig) {
> + debugs(28, DBG_IMPORTANT, "" << cfg_filename << " line " << config_lineno << ": " << config_input_line);
> + debugs(28, DBG_IMPORTANT, "WARNING: regular expression '" << orig << "' has unnecessary wildcard(s)" );
> + }

Please report what the simplified RE will look like so that the admin
can see their mistake (and can check our simplification logic).

The same comment may apply to compileOptimisedREs(): I am not sure, bu
perhaps we should report (at level 1) any changes we apply to
user-supplied REs?

Needless to say, any mistakes here may lead to hard-to-find access
control bugs in real configurations. Let's inform the user that we are
fiddling with their configurations. It is probably better to come back
and remove some of the warnings than have to come back and remove bugs
that went undiscovered for a long time.

> + debugs(28, DBG_IMPORTANT, "WARNING: optimisation of regular expressions failed; using fallback method without optimisation" );

Please report the optimized RE that failed to compile if possible.

Please try to post a patch that includes RegexData.cc changes and new
test cases. To do that, you would need to write test cases to call your
new (and possibly some old) functions directly, without running Squid.
There are many such test examples in Squid sources.

Thank you,

Alex.

> attached are the patch (RegexData.cc.patch) and files for a unit test:
> squidtest.conf
> re.4lines - used in squidtest.conf; contains REs
> re.200lines - used in squidtest.conf; contains REs
> unittest_re_optim_wget - script with wget commands to trigger squid to
> evaluate REs
>
> unittest_re_optim_wget contains instructions on how to setup and perform
> a unit test
>
> I tried to get a member of the squid-dev mailing list but are not yet
> so comments should also go to my email address directly.
>
> Marcus Kool
>
>
>
> Marcus Kool wrote:
>>
>> Amos Jeffries wrote:
>>> > Amos Jeffries wrote:
>>> >> Hi Marcus,
>>> >> Did my audit feedback on this make it to you? I've just noticed my
>>> >> mailer has not marked the thread as responded.
>>> >>
>>>
>>> On 01/07/11 00:52, Marcus Kool wrote:
>>>> No, it did not.
>>>
>>> Okay. My mailer seems to have screwed up badly. There were a few
>>> little minor bits.
>>>
>>> * the patch being reversed. Just order the files the other way
>>> around on next patch.
>>>
>>> compileOptimisedREs/compileUnoptimisedREs have duplicate code
>>> checking for (RElen > BUFSIZ+1) case on the wordlist key. They are
>>> already checked for that criteria by aclParseRegexList before adding.
>>>
>>> debugs() WARNING to the user should be DBG_IMPORTANT in the second
>>> parameter.
>>>
>>> The major problem debugs() need DBG_CRITICAL in parameter #2 and
>>> "ERROR:" instead of the function name.
>>>
>>> The >100 messages only need to be shown when checking the config for
>>> problems. ie.
>>> debugs(28, (opt_parse_cfg_only?DBG_IMPORTANT:2), ....
>>
>> Thanks for the feedback, I will make a new patch. I was not able to
>> do it to be included in the next releases but it will be soon.
>>
>>>
>>> None else has mentioned anything, so with these style tweaks it can
>>> go in. The next releases are planned to happen tomorrow. If you want
>>> to submit a new patch in the next 12hrs I'll use that.
>>>
>>>>
>>>> I tried to subscribe to the squid-dev mailing list the other day
>>>> but got no reply yet. But in the list archives I did not see any
>>>> response/feedback either.
>>>
>>> I saw that arrive. So whoever was moderating this week appears to
>>> have has okayed you for posting. If you went through the regular
>>> ezmail subscription process (mail to
>>> squid-dev-subscribe_at_squid-cache.org) you should have been receiving
>>> list mail for a few days?
>>
>> I have not yet received emails from squid-dev. Should I resend
>> the application ?
>>
>>> Amos
>>
>> Marcus
>>
Received on Tue Jul 12 2011 - 00:32:23 MDT

This archive was generated by hypermail 2.2.0 : Tue Jul 12 2011 - 12:00:03 MDT