Re: Store_url_rewrite for squid 3+ from Eliezer Croitoru on 2012-09-09 (squid-dev)

From: Eliezer Croitoru <eliezer_at_ngtech.co.il>
Date: Sun, 09 Sep 2012 11:06:44 +0300

On 09/08/2012 01:36 AM, Alex Rousskov wrote:
> On 09/07/2012 01:32 PM, Eliezer Croitoru wrote:
>> On 09/07/2012 05:10 PM, Alex Rousskov wrote:
>>
>> I am not sure what "option" you are referring to in the above. The
>> Store::get(key) API I have described is not optional -- it is the
>> primary way of detecting a hit.
>>
>> +1 to use the api.
>> are we talking about the "Store::Root().get" ?
> That is where it starts, but other Store kids implement the get() method
> as well IIRC. From the caller perspective, you should also look at
> storeGetPublic*() and similar functions.
basic review Done already.
>> The URL rewriting must happen before or during Store key calculation.
>>
>> The problem is that if I am rewriting the url, unlike
>> url_rewrite\redirect the connection should be initiated based on the
>> original url and not the rewritten one.
> Sure, but that is not a problem you need to worry about much. Since you
> are not going to alter the original URL, the connection initiation code
> will not be affected by your changes (except for some minor adjustments
> if some data members get renamed or Store::get() API-related calls get
> adjusted as discussed below).
Got my answers during the review.

>
>> What I do not know yet is what and where and based on what the request
>> is done.
> For this project, you do not need to know much about connection
> initiation code. You need to make a copy of the original URL, rewrite
> that copy using the new helper, and put it in a new HttpRequest data
> member (for example). This should probably happen (or be initiated from)
> client_side.cc or client_side_request.cc.
Well after the review it seems like there is no need to copy the request
but to use the char * storeurl.
it makes more sense since you never touch any other part of the object
while you are getting it.
> You then need to make sure that every time Store calculates a store key,
> the new member (if any) is used instead of the original URL.
this is the original reason I wanted to know\find about the initiation
code.
but I managed to find what I need for now.

> Since the current code is using the original request URL where you want
> the rewritten/store URL to be used, I recommend _renaming_ all members
> that keep or extract the original/request URL (e.g., MemObject::url).
> This will force you to go through the relevant callers and make sure
> that the right URL is used, depending on the caller context.
Exactly what I was getting into.
> Once you are done, we can review whether the vast majority of changes in
> the diff are a simple renaming of "url" to "requestUrl". If yes, you can
> revert that renaming change. If there are many cases where old "url"
> became "storeUrl" and many cases where it became "requestUrl" then the
> change should stay.
>
> Please note that the above is just a sketch. I am not suggesting any
> specific new names, and I am using existing class names only as an example.
Will look into it more.
as far as I can tell most of the current code MemObject::url calls are
from debug.

>
>> so we need to store the original url and the store_url at least for the
>> period of time the request is being served.
> Yes, we need to keep both URLs around during the transaction lifetime.
> When I was talking about storing a URL, I was only talking about Store
> (i.e., storing something in the cache).
A reason I can think of is to make it possible to use
mgr:store_url_option that will maybe give some data about in cache
objects such as urls etc.
For most cases the access.log is what will be used but if there is a
need it's for it it's for that.

by the way the Purge tool uses the URL part of the metadata to find urls
and purge them in a ufs cache_dir.
since I stored objects in my cache with URL meta data as the rewritten
one I could do lookups on the cache objects using purge.
It gives you the ability to find out exactly at this moment what objects
urls are in the cache and in my case also rewritten ones.
>
>
>> (answer about storing or not the store_url in meta)
> That is a separate question. I cannot answer it, unfortunately. I do not
> know why Squid2 stored either of those URLs. I know Squid3 does not
> store the original/request URL.
It does in ufs cahe_dir as a fact. (see the below quotation from a cache
file)
>
>
>> url_rewrite api gets the http requests as the background to any action
>> so it cannot be used for store_url since any change to the request will
>> lead to the change we are trying to not do exactly.
>>
>> mem_object contains:
>> {...
>> char *url;
>> HttpRequest *request;
>> ...}
>> and the url is being used to make the hash but I still dont know what is
>> being taken from the mem_object to do the request.
> HttpStateData::buildRequestPrefix() builds request headers. It may use
> mem_object::url. I do not think you should change that code (apart from
> renaming and Store::Get() API changes discussed elsewhere).
>
> You should only be concerned with calls to storeGetPublic*() and such.
> Make sure the callers do not use mem_object::url without checking
> whether there is a "rewritten" store URL as well. You will probably want
> to change profile of most of those calls so that the caller does not
> need to worry about picking the right URL to use (the decision would
> happen inside storeGetPublic() and friends).
Got it on the review.
> How is that metadata used in Squid2? What is it used for? Nobody is
> suggesting that Squid2 does something wrong here, of course. We just
> need to understand _what_ it does so that we can either mimic what it
> does in Squid3 or do something better.
Better +1
since squid3 has another approach to the code and I can see it.

even a mimic for me is more then just "cp /.unknown
/.superunknown_partly_known"

>> since couple hours ago I noticed that it was done in order to make
>> things possible and not really to think in a more wide angle about the
>> effect of it.
> What things does it make possible?
what I meant was to solve the problem of de-duplication.
It was the soul purpose of the specific project\feature and it was late
added.
I can assume that maybe then the approach was to store in meta data
thing for a purpose.
if somebody do knows why the original url was embedded into the meta
data then it would maybe answer on the possible reason for adding the
storeurl into it.

> I hope we will not have to commit some code without understanding its
> effects.
Not a chance we will commit anything with this effect unless I will
actually understand what it effects.
I dont like to commit untested things.
> IIRC, when Squid rebuilds in-memory index from swap.state or from the
> directory directly, it does not need URLs. It needs keys which are
> stored in swap.state and in directory entries.
Good.
and means no helper in any part of the way should be except request
from: http,icp,htcp.(all methods)

> If a corrupted swap.state entry is detected, it is removed/ignored. If
> the whole swap.state is missing or its header is corrupted, then we
> rebuild the in-memory index from disk instead. Please note that not all
> cache_dir types use swap.state.
OK, but if the data is in the meta data it answers all the questions
needed from many angles of the glass.
1. we do not in any was need to store in the meta data the store url
beacuse all that is needed is seems in it already.
I tested the thing about the meta data and the results for the
trunk(updated for 2\3 days ago)

> f = File.open('../root/var/cache/squid/00/00/0000002A')
>l1 = f.gets
=>
"\u0003\xAE\u0000\u0000\u0000\u0003\u0010\u0000\u0000\u0000\xFAiUQ|\x918\xFFs\u007F\x86-7p;^\t,\u0000\u0000\u00001\xCAIP\u0000\u0000\u0000\u00001\xCAIP\u0000\u0000\u0000\u0000\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\v\xC0\xFCL\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0001\u0000`\u0004\u0004B\u0000\u0000\u0000http://walla.yad2.co.il/walla.co.il/js/function.js?rnd=1624874682\u0000\u0004\n"
> l2 = f.gets
=> "\u0000\u0000\u0000STORETEST\u0000\n"
# the above lines is what originally stored in the meta data of a ufs
cache_dir while testing with a simple small addition of STORETEST to
make sure what the fate of this meta data.

the next is from a 3.head original code:
> f = File.open('../root/var/cache/squid/00/00/00000000')
> l1 = f.gets
=>
"\u0003\xA5\u0000\u0000\u0000\u0003\u0010\u0000\u0000\u0000+\eH\u001Ay\\\xEBC\x86\xBE[\xDBq\u001E\u0005\xEA\t,\u0000\u0000\u0000\xB2\xE4JP\u0000\u0000\u0000\u0000\xB3\xE4JP\u0000\u0000\u0000\u0000\x87\xF4\u001Fr\u0000\u0000\u0000\u0000\xD2KJP\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0001\u0000`\u0004\u0004H\u0000\u0000\u0000http://l1.yimg.com/dh/ap/default/120907/yangtze_river_red_getty_uni.jpg\u0000\n"
> l2 = f.gets
=>
"\b\u0000\u0000\u0000\x98U\u0000\u0000\u0000\u0000\u0000\u0000HTTP/1.0
200 OK\r\n"

so it's clear that also there the original url is stored in meta data.
again dont know about other stores form.

in the storeSwapMetaBuild() at store_swapmeta.cc it's clearly used to
store the original url.
about the other stores I dont know since I dont use them.

> HTH,
>
> Alex.
>
Sure it is!

Eliezer
Received on Sun Sep 09 2012 - 08:30:03 MDT

This archive was generated by hypermail 2.2.0 : Sun Sep 09 2012 - 12:00:05 MDT