Re: [PATCH] regular expression optimisation patch for squid 3.1.14

From: Amos Jeffries <squid3_at_treenet.co.nz>
Date: Tue, 12 Jul 2011 17:12:18 +1200

On 12/07/11 12:30, Alex Rousskov wrote:
> On 07/05/2011 01:14 PM, Marcus Kool wrote:
>> Attached is a patch for optimisation of REs.
>> This is the second submission of the patch and the comments from
>> Amos' review are addressed.
>>
>> This patch is inspired by the work that I did for ufdbGuard and a few
>> emails with Amos.
>>
>> The new code optimises lists of regular expressions.
>> The optimisations are:
>> * initial .* is stripped
>
> Initial '.'s are stripped as well. If that code survives the audit, the
> change should be documented. More on that change below.
>
>> * RE-1 RE-2 ... RE-n are joined into one large RE: (RE-1)|(RE-2)|...|(RE-n)
>> * -i ... -i options are optimised: the second one is ignored, same for +i
>
>
>> + while (*t == '.') {
>> + t++;
>
> In the above code, you remove leading "."s which may change the regular
> expression. For example, the following RE will match "01234" but will
> not match "1234" or "234":
>
> ..234
>
> After your patch, that RE will be converted to "234" and match all three
> strings. If you agree that this is a bug, please enhance your test cases
> before you fix the bug.
>

Yes. The first version of that loop was correct:
   while (*t =='.' && *(t+1) == '*') t+=2;

>> + if (*t == '\0') {
>> + debugs(28, DBG_IMPORTANT, ""<< cfg_filename<< " line "<< config_lineno<< ": "<< config_input_line);
>> + debugs(28, DBG_IMPORTANT, "WARNING: regular expression '"<< orig<< "' has only wildcards and matches all strings." );
>> + return orig;
>> + }
>
> The above feels a little inconsistent. The code adjusts RE in all other
> cases but returns orig in this specific case. Should we return "^.*" or
> a similar simplified RE instead?

Right. ".*" should probably be returned instead.

The catenation assembly could then detect this case and drop all other
OR'ed patterns from the compound. Though I'm not asking for that in this
patch, it can be an extra tuning optimization later.

>
>> + if (t != orig) {
>> + debugs(28, DBG_IMPORTANT, ""<< cfg_filename<< " line "<< config_lineno<< ": "<< config_input_line);
>> + debugs(28, DBG_IMPORTANT, "WARNING: regular expression '"<< orig<< "' has unnecessary wildcard(s)" );
>> + }
>
> Please report what the simplified RE will look like so that the admin
> can see their mistake (and can check our simplification logic).
>
> The same comment may apply to compileOptimisedREs(): I am not sure, bu
> perhaps we should report (at level 1) any changes we apply to
> user-supplied REs?
>

FYI: I'm using the syntax "WARNING: blah. Using XYZ instead." when
adding these type of things to the config parser.
  With the word "Use" if Squid is not making a change but instructing
the admin. The word "Using" if Squid has auto-dropped the config and
taken our alternative.

> Needless to say, any mistakes here may lead to hard-to-find access
> control bugs in real configurations. Let's inform the user that we are
> fiddling with their configurations. It is probably better to come back
> and remove some of the warnings than have to come back and remove bugs
> that went undiscovered for a long time.
>
>
>> + debugs(28, DBG_IMPORTANT, "WARNING: optimisation of regular expressions failed; using fallback method without optimisation" );
>
> Please report the optimized RE that failed to compile if possible.
>
>
> Please try to post a patch that includes RegexData.cc changes and new
> test cases. To do that, you would need to write test cases to call your
> new (and possibly some old) functions directly, without running Squid.
> There are many such test examples in Squid sources.
>
>
> Thank you,
>
> Alex.
>
>

Amos

-- 
Please be using
   Current Stable Squid 2.7.STABLE9 or 3.1.14
   Beta testers wanted for 3.2.0.9
Received on Tue Jul 12 2011 - 05:12:25 MDT

This archive was generated by hypermail 2.2.0 : Tue Jul 12 2011 - 12:00:03 MDT