Re: Squid is becoming slow with 3000 ACL's ... from David Luyer on 1998-09-08 (squid-dev)

From: David Luyer <luyer@dont-contact.us>
Date: Wed, 09 Sep 1998 13:56:18 +0800

Duane wrote:
> > A regular expression comparison only retusn equal, or not-equal.
> >
> >You could sort them by match frequency, No? The most frequently matching
> >patterns are then near the root of the tree.
>
> I don't think so.
>
> The way the tree searching works is that if a comparison at one
> node returns "less than", then we know the match, if any, must
> be down one side of the tree, and cannot be down the other side.

However a group of regular expressions may be put into a tree in an ugly sort
of way, given some restrictions.

A "node" would have to be able to contain multiple entries, for the case where
two regular expressions have the same fixed prefix. There would have to be
a separate "node" not in the tree for things which were with no prefix. It
would be ugly, but you could conceivably work out something like this;

LIST "fuzzy": regexps where neither end contains any "firm" text
eg: [Xx][Xx][Xx]

TREE "left-firm": tree sorted by prefix of lists of regexps (those with
                  the same fixed prefix have to end up at the same node)
                  with a "firm" prefix. this tree could include all
                  plain words as well.
                  eg: free.*sex

TREE "right-firm": tree sorted by suffix of lists of regexps ending in "firm"
text (again, lists if saim suffix for multiples).
eg: [Xx][Xx][Xx].*free

TREE "left-anchored": regexps starting with $ followed by at least one
"firm" character (multiples with same prefix -> same
"node" list)

TREE "right-anchored": regexps ending in $ where the last character is
fixed (again, list of regexps if multiples with the
same suffix)

Now searching for a match becomes a case of...

* search for a match on regexps on the "fuzzy" list
* simple tree search for matches on left-anchored and right-anchored
* for every start/end position, tree search for matches on left-firm and
right-firm (!!!)

This last one might be more expensive than intelligent text-search algorithms
if there are a small number of large bare words to search for. However
in the case of thousands of ACL's it should be a win if the ACL's are well
formed and URLs are assumed to be typically quite short. I assume blasting
long URLs at a cache with 3000 regexps in its ACLs is already quite a good
DoS.

If there's enough ACLs, a hash table of trees might give benefit for the
left-firm and right-firm case - say, 256 trees, one for each start char,
then at each position you know immediately which tree to search on.

Of course there's a lot of possibilities and a lot of code to do them.
Maybe a good regexp package already does some amount of this, but last
time I looked at GNU egrep they said they had a good speedup for the
a|b|c|d-type regexp case (essentially what we're doing, combining many
regexps with or), which just happened to be exponential on memory
requirements in some cases - I don't think we really want that one. :)

Since I don't believe in URL blocking, I'm not going to actually try
any code for this. I was just pointing out that using trees for regexp
ACLs isn't impossible, simply somewhat more complex than for normal
ACLs. And in the worst case it simply collapses down to the same efficiency
as a list implementation.

David.
Received on Tue Jul 29 2003 - 13:15:53 MDT

This archive was generated by hypermail pre-2.1.9 : Tue Dec 09 2003 - 16:11:55 MST