Re: looking for the easiest way to filter url's

From: Kevin Fink <kevin@dont-contact.us>
Date: Mon, 28 Sep 1998 17:41:26 -0700 (PDT)

On Mon, 21 Sep 1998, Detlef Mauritz wrote:

> what is the easiest way to filter unwanted url's with squid? We want to
> connect some schools to the internet and they should not see every
> webpage.

Squid can't do an effective job of blocking unwanted URLs without some
fairly substantial code changes. Beyond roughly 3000 ACL entries, the
ACL code bogs down and the proxy becomes unreasonably slow. 3000 entries
is nowhere near enough to block even a small percentage of the porn
sites out there, much less anything else you might want to block.

It is possible to rewrite the Squid ACL code (or augment it with new ACL
types). This can remove the performance problem, but leaves the problem
of finding and tracking the unwanted URLs. This is a much harder
problem...
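
To illustrate the kind of change meant here, the following is a minimal
sketch in Python, not Squid's actual ACL code and not any vendor's
implementation. It assumes the slowdown comes from testing entries one
after another, and replaces that with a hash-set lookup whose cost depends
on the number of labels in the host name rather than on the size of the
list. The names build_blocklist and is_blocked are hypothetical.

```python
def build_blocklist(domains):
    """Store blocked domains in a set for average O(1) membership tests."""
    # Normalize case and drop any trailing dot so lookups are consistent.
    return {d.lower().rstrip(".") for d in domains}


def is_blocked(host, blocklist):
    """Return True if the host or any of its parent domains is listed."""
    labels = host.lower().rstrip(".").split(".")
    # For "a.b.example.com" check "a.b.example.com", "b.example.com",
    # "example.com" and "com": one set lookup per label, regardless of
    # how many entries the blocklist holds.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in blocklist:
            return True
    return False


if __name__ == "__main__":
    blocklist = build_blocklist(["bad.example", "ads.example.net"])
    for host in ("www.bad.example", "example.net", "ads.example.net"):
        print(host, is_blocked(host, blocklist))
```

A set was chosen only because it is the simplest structure with this
property; a trie or a sorted array with binary search would serve the same
purpose.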

As a point of reference, finding and tracking "objectionable" URLs is
exactly what my company does. We have reviewed several million sites over
the past three years. Our ACL lists currently contain about 360,000
entries. These cover portions of about 110,000 sites (all of about half
of them, portions of the rest). These are all sites which contain
information generally considered inappropriate for kids - pornography,
extreme violence, bomb-and-drug-making recipes, etc. This list is not
comprehensive - our staff of approximately 50 reviewers finds thousands
more sites every day. So you can see that it is not a trivial problem.
Doing a reasonably good job requires lots of people, some pretty
sophisticated search technology (we work with Inktomi, which runs the
back-end behind HotBot, Yahoo, Microsoft's search engine, etc.), a
database with a good interface for collecting and storing the ratings,
and a lot of back-end scripts to ensure that the ratings remain
up-to-date, that sites don't move without being followed, etc.
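
The distinction above between entries that cover all of a site and entries
that cover only a portion of it could be represented roughly as follows.
This is a sketch under assumptions: the entry format ("host" for a whole
site, "host/path/" for a portion) and the names build_index and is_listed
are invented for illustration and do not describe any real list format.

```python
from urllib.parse import urlsplit


def build_index(entries):
    """Group entries by host; a bare host yields the prefix "/" (whole site)."""
    index = {}
    for entry in entries:
        host, _, path = entry.partition("/")
        index.setdefault(host.lower(), []).append("/" + path)
    return index


def is_listed(url, index):
    """Return True if the URL's path starts with a listed prefix for its host."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path or "/"
    # Only the prefixes recorded for this one host are examined.
    return any(path.startswith(prefix) for prefix in index.get(host, []))


if __name__ == "__main__":
    index = build_index(["bad.example", "mixed.example/adult/"])
    print(is_listed("http://bad.example/anything.html", index))
    print(is_listed("http://mixed.example/adult/page.html", index))
    print(is_listed("http://mixed.example/news/", index))
```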

Incidentally, we haven't yet licensed our list to anyone because we have
not yet found a partner with ACL code fast enough to handle a list as
large as ours. This is why we sell a turn-key solution that includes our
own ACL code, which can handle the list. (We filter the Internet access for
about 5 million school kids, mostly in the US, but also in Canada, the UK,
Australia, and Mexico.)

Kevin

------------------------------------------------------------------------------
 Kevin Fink <kevin@n2h2.com>          N2H2, Creators of Bess and Searchopolis
 Chief Technical Officer              900 Fourth Avenue, Suite 3400
 http://www.n2h2.com/                 Seattle, WA 98164
 VOICE: 206-336-1501 / 800-971-2622   FAX: 206-336-1556
------------------------------------------------------------------------------
Received on Mon Sep 28 1998 - 17:42:21 MDT
