Re: acl, http_access, and urlpath_regex

From: <josh@dont-contact.us>
Date: Tue, 4 Jan 2000 12:06:13 -0500

I know this is documented in squid.conf, but I keep making little
mistakes in regular expressions and having expectations that turn out
to be wrong. Perhaps I should set up a little file with more examples
that could be added to the FAQ; I know I'm not the only one who has
problems with these. I hope no one minds my working through this on
the list. I think I understand how to test url_regex, which takes the
whole URL, but I'm uncertain about the other ACL types. I should
probably start looking at the code to see where the different elements
get parsed. I think my only problems come from failing to understand
where the string being matched begins and ends from Squid's point of
view, and from using only one field of the log output for testing
whereas Squid uses more information. This is a little too long to be
posting. I could probably work out a Perl or shell script that would
answer my own questions, if anybody is interested.

1. This is a problem in and of itself. I grab URLs from the log, but I
don't know how to grab the pieces of each URL as Squid would parse them
before the check.

First let's grab some URLs from the log:
tail -20 /var/log/squid/access.log | awk '{print $7}' >/tmp/urls

2. Let's save some regular expressions.
[josh@saratoga:~]$ cat >/tmp/regex
$htt.*
^http?
com$
\.pl

3. Let me see what they match
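#!/bin/sh
# Run every saved expression against the saved URLs.
urls=/tmp/urls
regex=/tmp/regex

# Read one expression per line. Quoting "$re" keeps the shell from
# expanding characters such as * before grep sees them.
# (grep -E is the same as egrep.)
while IFS= read -r re; do
    echo "$re"
    grep -En -e "$re" "$urls"
done < "$regex"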

or simply type them by hand

[josh@saratoga:~]$ egrep -n '^http://www\..*\.jpg$' /tmp/urls
8:http://www.intellicast.com/images/icons/87_wtext.jpg
14:http://www.intellicast.com/images/icons/82_wtext.jpg
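
When typing patterns by hand it is worth quoting them, because an unquoted backslash is consumed by the shell before egrep ever sees the expression. A small sketch of the difference, using the pattern above (the variable names are made up for the illustration, and this assumes a POSIX sh or bash with default globbing settings):

```shell
#!/bin/sh
# Capture what the shell actually passes along, unquoted versus quoted.
# shown_unquoted / shown_quoted are illustrative names only.
shown_unquoted=$(printf '%s' ^http://www\..*\.jpg$)
shown_quoted=$(printf '%s' '^http://www\..*\.jpg$')

echo "unquoted: $shown_unquoted"
echo "quoted:   $shown_quoted"
```

The unquoted form loses both backslashes, so each dot goes back to meaning "any character". It still matches the two lines shown, but it is a looser expression than the one typed.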

This only tests url_regex, since these are entire URLs. So what happens
if I just put 'www.example.com' as the first line in the file?

Now ^w+\. will match that (^[w+]\. will not: inside brackets the + is
literal, so [w+] is a single 'w' or '+'). But I'm not sure whether that
will work for a dstdom_regex, because I'm still unsure whether Squid
takes www as the beginning of the string. On a practical level it
almost makes sense to ignore "www" at the beginning, since it is used
so commonly that it is hardly worth checking. Of course it might be fun
to set up a site that does not use 'www' and then allow only sites that
don't start with www using something like ^[^w]
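
To see how the bracketed and unbracketed forms differ, here is a small check with grep -E. It is only a sketch: "check" is a helper name invented for it, and it assumes grep's extended syntax behaves like the regex library Squid is built with, which I have not verified.

```shell
#!/bin/sh
# check RE STRING: report whether the extended regex RE matches STRING.
# "check" is a hypothetical helper, not part of Squid or grep.
check() {
    if printf '%s\n' "$2" | grep -Eq -e "$1"; then
        echo "match:    $1  $2"
    else
        echo "no match: $1  $2"
    fi
}

check '^[w+]\.'  'www.example.com'  # [w+] is ONE character: 'w' or '+'
check '^w+\.'    'www.example.com'  # one or more 'w', then a literal dot
check '^[^w+]\.' 'www.example.com'  # one character that is neither 'w' nor '+'
check '^[^w+]\.' 'example.com'      # 'e' is allowed, but 'x' is not a dot
check '^[^w+]\.' 'a.example.com'    # a single non-w character, then a dot
```

Only the second and the last lines report a match. The bracketed forms consume exactly one character before the dot, so they say nothing useful about a three-letter "www" prefix.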

On Mon, Jan 03, 2000 at 08:33:10PM +0100, Henrik Nordstrom wrote:
> josh@saratoga.lib.ny.us wrote:
>
>
> urlpath_regex only matches the path after the server name
> Let's take a URL like
> http://www.example.com:8080/some/path/to/a/file.txt
>
> Divided up on the different ACL types:
> protocol matches http
Is
acl noht proto ^h[t+]p?
a valid acl for blocking http and https?
> dstdomain and dstdom_regex matches www.example.com
Is

acl no-w dstdomain ^[w+]
http_access deny no-w

valid for denying access to every site with 'www' at the start of its domain?
> dst matches the IP address of www.example.com
> port matches 8080
Is it matching the port number as text in the URL, or detecting the port
being requested? Can I simply block port 80?
> urlpath_regex matches /some/path/to/a/file.txt

Does ^\/s match?

> url_regex matches the whole URL
>
>
> All regex ACL types use so-called "extended" regular expressions,
> matched against the string in the scope of that ACL type.
> Squid hacker
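
Henrik's breakdown above can be approximated in shell to get, from each logged URL, roughly the string each ACL type would be tested against. This is a sketch and not how Squid itself parses URLs: split_url is a made-up helper, it assumes URLs of the form scheme://host[:port]/path, and the default port of 80 is an assumption for URLs without an explicit port. The dst (IP address) part needs a DNS lookup and is left out.

```shell
#!/bin/sh
# split_url URL: print the pieces of URL, labeled by the ACL type
# that (per the breakdown above) is matched against each piece.
# Hypothetical helper; only handles scheme://host[:port]/path.
split_url() {
    url=$1
    proto=${url%%://*}          # text before "://"
    rest=${url#*://}            # everything after "://"
    hostport=${rest%%/*}        # up to the first "/"
    case $rest in
        */*) path=/${rest#*/} ;;
        *)   path=/ ;;          # no path given
    esac
    host=${hostport%%:*}
    case $hostport in
        *:*) port=${hostport##*:} ;;
        *)   port=80 ;;         # assumed default when no port is written
    esac
    printf 'proto=%s dstdomain=%s port=%s urlpath=%s\n' \
        "$proto" "$host" "$port" "$path"
}

split_url 'http://www.example.com:8080/some/path/to/a/file.txt'
```

For the example URL this prints proto=http dstdomain=www.example.com port=8080 urlpath=/some/path/to/a/file.txt. Running each line of /tmp/urls through it would give separate strings to test dstdom_regex and urlpath_regex expressions against, instead of only whole URLs.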

On the bright side

acl chatkiller urlpath_regex chat.html$ \/Chat\/ \/chat\/ \/echat\/ chat.htm$ chat.pl$ chat.js$

is successfully stopping most chat sites with no trouble, and is not
blocking too much else.

-- 
Josh Kuperman                       
josh@saratoga.lib.ny.us
slowly learning more about regular expressions than I intended.
Received on Tue Jan 04 2000 - 10:15:16 MST
