Re: [squid-users] SQUID store_url_rewrite from Amos Jeffries on 2011-05-31 (squid-users)

From: Amos Jeffries <squid3_at_treenet.co.nz>
Date: Wed, 01 Jun 2011 12:39:11 +1200

On Tue, 31 May 2011 20:47:13 +0300, Ghassan Gharabli wrote:
> I'm sorry again for the last email, but I also have something to ask
> about.
>
>
> (m/^http:\/\/(.*?)(\.[^\.\-]*?\..*?)\/([^\?\&\=]*)\.([\w\d]{2,4})\??.*$/)
>
> Now I'm talking about the element ([\w\d]{2,4}), which seems to match
> .ex, .ext or .exte, for example .mp3.
>
> I understand that \w matches an alphanumeric character, including "_",
> the same as [A-Za-z0-9_] in ASCII.
>
> So I know it matches digits and letters, including underscore, which
> is correct here. The thing that is confusing to me is that we have
> also used \d, which matches a digit, the same as [0-9] in ASCII. So we
> have used 0-9 twice! Any comment about it?

No idea why it was written that way. As you say, it is redundant: \d is
already a subset of \w, so [\w\d] matches the same characters as \w.
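
A quick way to confirm the redundancy, as a minimal sketch in Perl. The
loop and the variable names are only for illustration, and it checks the
ASCII range only:

```perl
use strict;
use warnings;

# Collect every ASCII code point where [\w\d] and \w disagree.
my @differ = grep {
    my $c = chr($_);
    ($c =~ /^[\w\d]$/ ? 1 : 0) != ($c =~ /^\w$/ ? 1 : 0)
} 0 .. 127;

# An empty list means the two classes accept exactly the same characters.
print @differ ? "differ\n" : "identical\n";
```

If the list comes back empty, the \d can be dropped without changing
what the helper matches.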

>
> I'm also seeing these URLs again:
>
> #generic http://variable.domain.com/path/filename."ex", "ext" or
> "exte"
> #http://cdn1-28.projectplaylist.com
> #http://s1sdlod041.bcst.cdn.s1s.yimg.com
>
> ^ means that we match the beginning of a line or string.
> In m/^http:\/\/ ... we used (.*?) at the start, which seems to match
> anything!

Yes.

>
> If we look at this URL:
> #http://s1sdlod041.bcst.cdn.s1s.yimg.com
>
> If I'm correct, (.*?) matches "s1sdlod041", and then the second
> element (\.[^\.\-]*?\..*?) moves to the "." after "s1sdlod041", so now
> we have "http://s1sdlod041.". But I want to know about
> "[^\.\-]*?\..*?": inside [] we used ^ for \. and \-, is that because
> we are also finding dashes or dots? After that we used "*" (anything!)
> and then a question mark "?". Something else that is confusing to me
> is "\.." or "\..*?".

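(.*?) is non-greedy, so it takes the shortest prefix that still lets the
rest of the pattern match. For the URL above that is "s1sdlod041", as
you say. Nothing restricts it to hostname characters though, so on a
crafted URL such as "evil.com/?url=http://blah" the match can run
straight through the "/" and "?". Then...

Maybe a bug: this should probably be: ([\w\-\.]*?) so that it can only
match hostname characters.

(\.[^\.\-]*?\..*?) matches a "." and the rest of the hostname, here
".bcst.cdn.s1s.yimg.com". The trailing .*? is just as unrestricted, so
it can also run into the path: ".yimg.com/blah/blah". Then...

Maybe a bug: this should probably be: (\.[^\.\-]*?\.[\w]*?) so that its
last part cannot run into the path, which makes it match only
".yimg.com" here and makes the next bit match the whole path instead of
just the filename.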

\/ matches a "/". Then...

([^\?\&\=]*) matches "filename" or nothing. Then...

\. matches a ".". Then...

([\w\d]{2,4}) matches 2 to 4 alphanumeric characters (or "_"). Then...

\?? matches a '?' or nothing. Then...

.*$ matches anything else.

Maybe a bug: these last two should probably be: (\?.*)?$ to avoid a
lot more evilness.
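
To see what each group actually captures, here is a small sketch that
runs the original pattern and a tightened variant over two URLs. The
tightened pattern is an assumption about how the three "maybe a bug"
changes combine, and the "/path/file.mp3" suffix, the crafted URL and
the helper name captures() are made up for illustration:

```perl
use strict;
use warnings;

# Original pattern from the helper, and a variant with the three
# suggested changes applied (assumed combination, not tested in Squid).
my $orig  = qr/^http:\/\/(.*?)(\.[^\.\-]*?\..*?)\/([^\?\&\=]*)\.([\w\d]{2,4})\??.*$/;
my $tight = qr/^http:\/\/([\w\-\.]*?)(\.[^\.\-]*?\.[\w]*?)\/([^\?\&\=]*)\.([\w\d]{2,4})(\?.*)?$/;

# Return the first four captures joined with " | ", or "no match".
sub captures {
    my ($re, $url) = @_;
    my @c = $url =~ $re;
    return "no match" unless @c;
    return join " | ", @c[0 .. 3];
}

my $plain = 'http://s1sdlod041.bcst.cdn.s1s.yimg.com/path/file.mp3';
my $evil  = 'http://evil.com/?url=http://blah.example.com/file.mp3';

for my $url ($plain, $evil) {
    print "URL:   $url\n";
    print "orig:  ", captures($orig,  $url), "\n";
    print "tight: ", captures($tight, $url), "\n";
}
```

With the original pattern the crafted URL still matches, and the second
group swallows "/?url=http://blah" along with the hostname. Only this
one crafted URL is checked here. [^\.\-] still accepts "/", so this is
not evidence that the tightened pattern rejects every such URL.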

>
> Another question, about ([^\?\&\=]*): I think this one is for
> folders, or what?
>
> I saw the slash \/ before it, which seems to catch
> /?url=blah&C=blah2, and the "*" matches "blah" and "blah2".
>
> But please, if you don't mind, can you explain or illustrate more
> about (\.[^\.\-]*?\..*?), or maybe you can explain it well

see above.

>
> in your own way, as I'm sure you are a good teacher, hehehe.
>
> Please explain the whole match to me:
>
> (m/^http:\/\/(.*?)(\.[^\.\-]*?\..*?)\/([^\?\&\=]*)\.([\w\d]{2,4})\??.*$/)
>

above.

>
> I was eager to ask you all these questions from the start, but I was
> afraid you would not help anyway.
>
> What I have been trying to handle so far is the FileHippo domain:
>
>
> http://fs34.filehippo.com/6574/058e5771e07c467cb38d70ab6fbed3c0/Opera_1150b1_int_Setup.exe
>
> In this case we have to change the URL into
> "cdn.filehippo.com/6574/Opera_1150b1_int_Setup.exe", because we
> removed the hashed folder!
>
> It's okay, I have the script for it:
>
>
> #cdn, variable 1st path
> } elsif (($u =~ /filehippo/) &&
>     (m/^http:\/\/(.*?)\.(.*?)\/(.*?)\/(.*)\.([a-z0-9]{3,4})(\?.*)?/)) {
>     @y = ($1,$2,$4,$5);
>     $y[0] =~ s/[a-z0-9]{2,5}/cdn./;
>     print $x . "http://" . $y[0] . $y[1] . "/" . $y[2] . "." . $y[3] . "\n";
>
> And it's working 100%. I can get it from the cache too. What if I want
> to add wlxrs.com, as in ($u =~ /filehippo|wlxrs/)?
>
> Does that match this URL?
>
> http://css.wlxrs.com/HGjlAVvMlW6-1!iEEpuBkgo2TZKpU8RH!W4mH-UPgteZ8OD6Oxte!sCQWfQ1OB7A6B-NZoBS1jrItq7zq!v10A/OOB_30_IllustratedKai/15.40.1211/img/Kai_Sunny_thumbnail.jpg
> I don't think so, as it has "!". Where should I add this to match a
> folder like
>
> "/HGjlAVvMlW6-1!iEEpuBkgo2TZKpU8RH!W4mH-UPgteZ8OD6Oxte!sCQWfQ1OB7A6B-NZoBS1jrItq7zq!v10A/"

It will. The "([^\?\&\=]*)" pattern does not exclude '!' or any other
unusual but valid URL characters.
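
This can be checked directly, as a sketch that uses the generic pattern
from earlier in this thread and the URL from your question. The variable
names and the two short test strings are only for illustration:

```perl
use strict;
use warnings;

# The wlxrs URL from the question, matched against the generic pattern.
my $url = 'http://css.wlxrs.com/HGjlAVvMlW6-1!iEEpuBkgo2TZKpU8RH!W4mH-UPgteZ8OD6Oxte!sCQWfQ1OB7A6B-NZoBS1jrItq7zq!v10A/OOB_30_IllustratedKai/15.40.1211/img/Kai_Sunny_thumbnail.jpg';
my $generic = qr/^http:\/\/(.*?)(\.[^\.\-]*?\..*?)\/([^\?\&\=]*)\.([\w\d]{2,4})\??.*$/;

my @c = $url =~ $generic;
my $result = @c ? "match, extension: $c[3]" : "no match";
print "$result\n";

# The character class on its own: '!' is allowed, '?' is excluded.
my $class = qr/^[^\?\&\=]*$/;
my $bang  = ('a!b' =~ $class) ? "accepted" : "rejected";
my $query = ('a?b' =~ $class) ? "accepted" : "rejected";
print "a!b $bang, a?b $query\n";
```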

>
> Sometimes the CDN folder comes as the 1st folder, or the 2nd or 3rd.
> It depends on the website.

Yes. This comes back to knowing the fine details of what the individual
website or CDN does. The changes have to be customised to individual
sites. If they change anything you have to alter the patterns.

>
> Can you point me to where I should edit this script to handle
> WLXRS.COM?

The second maybe-bug I pointed out above, once fixed, should make $3
hold the whole file path for you to play with.

Amos
Received on Wed Jun 01 2011 - 00:39:17 MDT
