Re: [squid-users] SQUID store_url_rewrite

From: Amos Jeffries <squid3_at_treenet.co.nz>
Date: Tue, 31 May 2011 15:41:15 +1200

On 31/05/11 11:54, Ghassan Gharabli wrote:
> Hello again,
>
> #generic http://variable.domain.com/path/filename."ex", "ext" or "exte"
> #http://cdn1-28.projectplaylist.com
> #http://s1sdlod041.bcst.cdn.s1s.yimg.com
> #} elsif (m/^http:\/\/(.*?)(\.[^\.\-]*?\..*?)\/([^\?\&\=]*)\.([\w\d]{2,4})\??.*$/)
> {
> # @y = ($1,$2,$3,$4);
> # $y[0] =~
> s/([a-z][0-9][a-z]dlod[\d]{3})|((cache|cdn)[-\d]*)|([a-zA-A]+-?[0-9]+(-[a-zA-Z]*)?)/cdn/;
> # print $x . "storeurl://" . $y[0] . $y[1] . "/" . $y[2] . "."
> . $y[3] . "\n";
>
>
> Why did we have to use arrays in this example?
> I understood that m/ indicates a regex match operation, "\n" breaks
> the line, and we assigned @y as an array which has
> 4 values; we call each one, for example $1, the first
> record, as $y[0] .. till now it's fine for me,
> and we assign a value with $y[0] =~
> s/([a-z][0-9][a-z]dlod[\d]{3})|((cache|cdn)[-\d]*)|([a-zA-A]+-?[0-9]+(-[a-zA-Z]*)?)/cdn/;
> ...
>
> Please correct me if I'm wrong here. I'm still confused about those
> values $1, $2, $3 ..
> how does the program know where to locate $1 or $2, as there are no
> values or $strings anyway?
> I have noticed that $1 means an element; for example
> http://cdn1-28.projectplaylist.com can be grouped as elements .. Hope
> I'm correct on this one:
> http://(cdn1-28) . (projectplaylist) . (com) should be http:// $1 . $2 . $3
>

m// produces $1, $2, ... $9, one for each () group in the pattern.

s// will produce a different $1, $2, ... etc. You have to save the ones
from m// somewhere if you want to use them after s//. The person who
wrote that saves them in the array @y.
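
As a tiny standalone illustration (the host string is just taken from your
cdn1-28.projectplaylist.com example, and the patterns are simplified):

  # the () captures from m// live in $1, $2, ... only until the next match,
  # so they are copied into @y before the s// below runs
  if ("cdn1-28.projectplaylist.com" =~ m/^([\w-]+)\.([\w.]+)$/) {
    my @y = ($1, $2);             # $y[0]="cdn1-28", $y[1]="projectplaylist.com"
    $y[0] =~ s/(cdn[-\d]*)/cdn/;  # this s// sets its own $1, clobbering the old one
    print $y[0] . "." . $y[1] . "\n";   # prints "cdn.projectplaylist.com"
  }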

> Then let me see if I can solve this one to match this URL
> http://down2.nogomi.com.xn55571528exgem0o65xymsgtmjiy75924mjqqybp.nogomi.com/M15/Alaa_Zalzaly/Atrak/Nogomi.com_Alaa_Zalzaly-3ali_Tar.mp3
>
> so I should work around the FQDN and leave the rest as is; please
> correct me if you find anything wrong here.
> #does that match
> http://down2.nogomi.com.xn55571528exgem0o65xymsgtmjiy75924mjqqybp.nogomi.com/M15/Alaa_Zalzaly/Atrak/Nogomi.com_Alaa_Zalzaly-3ali_Tar.mp3
> ??
> elsif (m/^http:\/\/(.*?)(\.[^\.\-]*?\..*?)\/([^\?\&\=]*)\.([\w\d]{2,4})\??.*$/)
> {
> @y = ($1,$2,$3,$4);
> $y[0] =~ s/[a-z0-9A-Z\.\-]+/cdn/
> print $x . "storeurl://" . $y[0] . $y[1] . "/" . $y[2] . "." .
> $y[3] . "\n";
>
>
> does this example match the Nogomi.com domain correctly?
>
> and why did you use s/[a-z0-9A-Z\.\-]+/cdn/
>
> I only understood that you are making sure to find small letters,
> capital letters, numbers, but I believe \. is to search
> for one dot only .. how about if there are 2 dots, or more than 3 dots
> in this case! .. another thing: you are finding a dash ..

That pattern ends with "+", to search for "one or more" of the listed
safe domain characters.

It matches all of the $y[0] content:
"down2.nogomi.com.xn55571528exgem0o65xymsgtmjiy75924mjqqybp".

While also not-matching bad things like
   "http://evil.com?url=http://nogomi..."

(the one I gave will convert "http://evil.com?url=http://nogomi..." -->
"cdn://evil.com?url=http://nogomi..." )

>
> The only thing I'm confused about is why we have added /cdn/ since the
> URL doesn't have the word "cdn"?

This is a "s//" operation ('s' meaning 'substitute'). *IF* the $y[0] value
matches the pattern for a domain, s// will place "cdn" instead of that
matched piece.

So what this does is change *.nogomi.com --> "cdn.nogomi.com"

If there is any bad stuff like my evil.com example going on it will
screw with those URLs as well. BUT the bits there will not map to
"cdn.nogomi.com" so will not corrupt the actual CDN content.

Thinking about it a bit more, I should have been more careful and told you:
   s/^[a-z0-9A-Z\.\-]+$/cdn/

which will ONLY match if $y[0] as a whole is valid host-name text.
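
A quick side-by-side of the two forms (the strings are invented just for the
comparison, with the long random label shortened):

  my $host  = "down2.nogomi.com.x12345abcd";        # ordinary host text
  my $nasty = "http://evil.com?url=http://nogomi";  # the evil.com style trick

  (my $a = $nasty) =~ s/[a-z0-9A-Z\.\-]+/cdn/;    # unanchored: $a is "cdn://evil.com?url=http://nogomi"
  (my $b = $nasty) =~ s/^[a-z0-9A-Z\.\-]+$/cdn/;  # anchored: no match, $b stays unchanged
  (my $c = $host)  =~ s/^[a-z0-9A-Z\.\-]+$/cdn/;  # anchored: whole string is host text, $c is "cdn"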

>
> Why have we used storeurl://? Because I can see some of the examples are
> print $x . "http://" . $y[0] . $y[1] . "/" . $y[2] . "." . $y[3] . "\n";
>
> can you give me an example to add the portion of $y[1] please..

  elsif (m/^http:\/\/(.*?)(\.[^\.\-]*?\..*?)\/([^\?\&\=]*)\.([\w\d]{2,4})\??.*$/)
  {
    @y = ($1,$2,$3,$4);

    if ($y[1] =~ m/nogomi\.com/) {
      # semi-friendly CDN: all sub-domains serve the same content,
      # so collapse the whole host prefix to "cdn" (anchored, as above)
      $y[0] =~ s/^[a-z0-9A-Z\.\-]+$/cdn/;
    } else {
      # nasty CDN: only replace the known CDN-style sub-domain patterns
      $y[0] =~ s/([a-z][0-9][a-z]dlod[\d]{3})|((cache|cdn)[-\d]*)|([a-zA-Z]+-?[0-9]+(-[a-zA-Z]*)?)/cdn/;
    }

    print $x . "storeurl://" . $y[0] . $y[1] . "/" . $y[2] . "." . $y[3] . "\n";
  }

>
> Which one interests you more: writing a script to match the most
> similar examples in one rule, or writing a separate script for each FQDN?

The example you started with had some complex details built into its s//
matching, so that particular CDN syntax would be detected and replaced.
  This is useful when the CDN is only some sub-domains of the main site,
and there are other non-CDN sub-domains to be avoided. These are the nasty CDN.

The one I've just put above is for use when the site just uses all its
subdomains as CDN for the same content. These are the semi-friendly CDN.
  You can extend that for other CDN by adding their base domains to the
m// test, i.e. if (m/(nogomi|example)\.com|example\.net/) ...

  This is only safe when the subdomain portion has meaning to the CDN
operator as:
  a) their client account token
  b) their data center routing tagging
  c) their load-balanced server hostname
  .. or similar internal *routing* details.

  If there is any content clash on the URL-path portion, it cannot be
done like this. As I said at the start, you MUST BE CERTAIN of the
meaning of the bits you are removing.

The horribly nasty ones (looking at akamai here) need the other end of
the CDN domain stripped off: example.com.akamai.com --> example.com

None of these patterns we have talked about so far is suitable for those
ones.
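
Just to show the direction such a strip would take (a sketch only, using the
example form above; the real akamai host patterns must be checked first):

  # assumed form: <origin-domain>.<cdn-suffix>; strip the assumed suffix
  (my $host = "example.com.akamai.com") =~ s/\.akamai\.com$//;
  # $host is now "example.com"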

>
> for example sometimes we see
> http://down2.xn55571528exgem0o65xymsgtmjiy75924mjqqybp.example.com/folder/filename.ext
> or
> http://cdn.xn55571528exgem0o65xymsgtmjiy75924mjqqybp.example2.com/xn55571528exgem0o65xymsgtmjiy75924mjqqybp/folder/filename.ext
>
> really that is interesting to me; that is why I would love to match
> this too, but the thing is, if I knew all of these things,
> everything would be fine for me

That is getting extra complex.

I think you have so far been obfuscating the actual URLs and details for
examples. Domain bits are relatively easy due to their limited character
set.

  Path bits must be coded for particular instances, often with exacting
knowledge; hello.txt and hello.Txt are not necessarily the same file, for
example.
  Be sure what you "know" is correct, and that your patterns work. Then
test, re-test, and test again. Then, when it's working, re-test.
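
One easy way to do that outside Squid is a small standalone script that runs
your candidate URLs through the pattern and prints what comes out. This is
only a sketch (the file name is made up, and it applies just the anchored
"semi-friendly" rewrite from above):

  #!/usr/bin/perl
  # usage: perl test-rewrite.pl < urls.txt   (one URL per line)
  use strict;
  use warnings;

  while (my $url = <STDIN>) {
    chomp $url;
    my $out = $url;    # default: leave the URL untouched
    if ($url =~ m/^http:\/\/(.*?)(\.[^\.\-]*?\..*?)\/([^\?\&\=]*)\.([\w\d]{2,4})\??.*$/) {
      my @y = ($1, $2, $3, $4);
      $y[0] =~ s/^[a-z0-9A-Z\.\-]+$/cdn/;
      $out = "storeurl://" . $y[0] . $y[1] . "/" . $y[2] . "." . $y[3];
    }
    print "$url\n  -> $out\n";
  }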

>
> Again I want to thank you for answering my questions; I feel like I'm
> writing a magazine heheheh

And you get a book back :)
Welcome. Though this is about as far as I can go on examples and
generics. My own skill with regex is not that great. (10 years in and
I'm still learning "common knowledge" details about it.)

Amos

-- 
Please be using
   Current Stable Squid 2.7.STABLE9 or 3.1.12
   Beta testers wanted for 3.2.0.7 and 3.1.12.1