Re: [squid-users] Access control : How to block a very large number of domains

From: Marcus Kool <marcus.kool_at_urlfilterdb.com>
Date: Fri, 26 Jun 2009 13:55:41 -0300

Hi Hims,

I am the author of ufdbGuard which is based on squidGuard.
ufdbGuard is free software which does 50000 URL lookups/sec
on a recent CPU and has no problems with large databases.

-Marcus

hims92 wrote:
> hello,
> I performed the tests (to block sites using squidGuard) with somewhat
> fewer domains, but Squid did not respond properly; the network became slow.
>
> squid-2.5.STABLE11.tar
> squidGuard-1.2.10.tar
> Berkeley DB 4.2.52
>
> Number of domains in the blacklist: 656490 (0.6 million); URLs: 141581
> (0.1 million)
> Peak-time requests: 200/sec
>
>
> Amos Jeffries-2 wrote:
>> On Mon, 15 Jun 2009 12:26:16 -0700 (PDT), hims92
>> <himanshu.singh.cse07_at_itbhu.ac.in> wrote:
>>> Hi,
>>> As far as I know, SquidGuard uses Berkeley DB (which is based on BTree
>> and
>>> Hash tables) for storing the urls and domains to be blocked. But I need
>> to
>>> store a huge amount of domains (about 7 millions) which are to be
>> blocked.
>>> Moreover, the search time to check if the domain is there in the block
>>> list,
>>> has to be less than a microsecond.
>>>
>>> So, Will Berkeley DB serve the purpose?
>>>
>>> I can search for a domain using PATRICIA Trie in less than 0.1
>>> microseconds.
>>> So, if Berkeley Trie is not good enough, how can I use the Patricia Trie
>>> instead of Berkeley DB in Squid to block the url.
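[For illustration: the kind of lookup described above can be sketched with a plain trie keyed on domain labels in reverse order (com -> example -> ...). This is a simplified stand-in, not a real PATRICIA implementation (which additionally collapses single-child chains), and all names in it are made up.]

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <sstream>
#include <string>
#include <vector>

// Split "ads.example.com" into labels in reverse order: {"com","example","ads"}.
static std::vector<std::string> reverseLabels(const std::string& domain) {
    std::vector<std::string> labels;
    std::stringstream ss(domain);
    std::string label;
    while (std::getline(ss, label, '.'))
        labels.push_back(label);
    return std::vector<std::string>(labels.rbegin(), labels.rend());
}

// A plain label trie: each node maps one domain label to a child node.
struct DomainTrie {
    struct Node {
        bool blocked = false;
        std::map<std::string, std::unique_ptr<Node>> children;
    };
    Node root;

    void insert(const std::string& domain) {
        Node* n = &root;
        for (const auto& l : reverseLabels(domain)) {
            auto& child = n->children[l];
            if (!child)
                child = std::make_unique<Node>();
            n = child.get();
        }
        n->blocked = true;
    }

    // True if the domain itself or any parent domain is on the blocklist.
    bool isBlocked(const std::string& domain) const {
        const Node* n = &root;
        for (const auto& l : reverseLabels(domain)) {
            if (n->blocked)
                return true;  // a parent domain matched
            auto it = n->children.find(l);
            if (it == n->children.end())
                return false;
            n = it->second.get();
        }
        return n->blocked;
    }
};
```

Lookup cost is proportional to the number of labels in the host, independent of the blocklist size, which is why trie-family structures are attractive for multi-million-entry lists.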
>> To do tests with such critical timing you would be best to use an
>> internal ACL, which eliminates network transfer delays to an external
>> process.
>>
> Can you be a bit more specific about how to do that? I am pretty new to
> Squid.
>
>
>> Are you fixed to a certain version of Squid?
>>
> No, I am not. But presently, my institution has:
> squid-2.5.STABLE11.tar
> squidGuard-1.2.10.tar
> Berkeley DB 4.2.52
>
> And I would like to find a solution, if possible, for these versions only.
>
>
>
>> Squid-2 is not bad to tweak, but not very easy to add ACLs to either.
>>
>> The Squid-3 ACLs are fairly easy to implement and drop a new one into.
>> You can create your own version of dstdomain and have Squid do the test.
>> At present dstdomain uses an unbalanced splay tree on full reverse-string
>> matches, which is good but not as good as it could be for large domain
>> lists.
>>
> How do we create our own version of dstdomain?
> Do the earlier versions (2.x) of Squid also use an unbalanced splay tree
> for searching a URL/domain, or do they use linear search, binary search,
> or some other search technique?
> Is it possible to store all the domains and URLs (0.7 million approx) in
> an STL vector and then perform binary_search to find the result of the
> query?
> I tested binary_search in a standalone C++ program, and the query time
> was pretty satisfactory for me.
>
> How does Squid handle requests for domain IPs? Does it store all domain
> IPs somewhere, or does it first perform a DNS lookup for the domain name
> and then check whether it is in the deny/allow list before giving access?
Received on Fri Jun 26 2009 - 16:55:49 MDT

This archive was generated by hypermail 2.2.0 : Fri Jun 26 2009 - 12:00:04 MDT