MD5 and URL validation (continue to other very old thread)

From: Eliezer Croitoru <eliezer_at_ngtech.co.il>
Date: Wed, 21 Nov 2012 21:06:02 +0200

A long time ago there was a discussion about storing the original
request URL in the swap file metadata.

Now it struck me again while testing something.
The code in question is at:
http://bazaar.launchpad.net/~squid/squid/trunk/view/head:/src/StoreMetaURL.cc#L39
   (25 lines of code)

##start
bool
StoreMetaURL::checkConsistency(StoreEntry *e) const
{
     assert (getType() == STORE_META_URL);

     // Unconditional early return: the comparison below is never reached.
     debugs(20, DBG_IMPORTANT, "storeClientReadHeader: URL checkConsistency wasn't used ");
     return true;

     if (!e->mem_obj->original_url)
         return true;

     if (strcasecmp(e->mem_obj->original_url, (char *)value)) {
         debugs(20, DBG_IMPORTANT, "storeClientReadHeader: URL mismatch");
         debugs(20, DBG_IMPORTANT, "\t{" << (char *) value << "} != {" << e->mem_obj->original_url << "}");
         return false;
     }

     return true;
}

##end

This code is responsible for checking the consistency of a cached
file/object URL against the currently requested URL.
It used to live in store_client.cc and was moved from there in newer
revisions.
In the old revision 4338 the purpose of this code is stated as:
"Check the meta data and make sure we got the right object."

The problem is that the check only runs when a file is fetched from
UFS (that is the case I have verified), while objects served from RAM
are not checked.
The result is that when the store_url_rewrite feature is used, the
check reports an inconsistency between the request URL and the object
in cache_dir (naturally).

Disabling this check makes my life easier with store_url, taking it
from "not working" to "works".

So I have a couple of options for how to "fix" the issue:
1. disable this check.
2. disable this check only for store_url_rewritten requests.
3. add the store_url meta object to the cache file and use it to
identify the expected URL.
4. add an on/off switch to disable this check.
5. others?
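
To make options 2 to 4 a bit more concrete, here is a minimal standalone
sketch. It is not Squid code: RequestUrls, CheckOptions and
urlIsConsistent are hypothetical names invented for illustration, and it
assumes the URL stored in the swap metadata is the store URL whenever a
rewrite happened.

```cpp
#include <string>
#include <strings.h>  // strcasecmp (POSIX), as used in the quoted code

// Hypothetical, simplified stand-ins for the real Squid structures.
struct RequestUrls {
    std::string originalUrl;  // URL as requested by the client
    std::string storeUrl;     // rewritten URL; empty if no rewrite happened
};

struct CheckOptions {
    bool enabled = true;      // hypothetical on/off switch (option 4)
};

// Returns true when the URL found in the swap metadata matches the URL
// expected for this request.
bool urlIsConsistent(const std::string &storedUrl,
                     const RequestUrls &req,
                     const CheckOptions &opts = CheckOptions())
{
    if (!opts.enabled)
        return true;  // check switched off entirely

    // Assumption: for rewritten requests the metadata holds the store URL,
    // so that is the value to compare against (option 3).
    const std::string &expected =
        req.storeUrl.empty() ? req.originalUrl : req.storeUrl;

    if (expected.empty())
        return true;  // nothing to compare against, as in the quoted code

    // Case-insensitive comparison, mirroring the strcasecmp() call above.
    return strcasecmp(expected.c_str(), storedUrl.c_str()) == 0;
}
```

With this shape a rewritten request is compared against its store URL
instead of the client URL, so a real mismatch is still caught while the
"natural" one goes away.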

After a short talk with Alex I sat down and made some calculations about
MD5 collision risks.
The input used to build the index hash is a string made of "byte + url".
For most caches that I know of, the probability of a collision is very
low considering the number of objects and URLs.

Yes, we are talking about a great many objects, and a collision is
possible. But it is not only the URL hash: other unknowns, like request
and response headers, are involved as well, which puts an actual hit
even further from reality, taking it from a 2^64 chance of collision to
more than 2^124.
It seems to me that it will take quite some time until I see a hash
collision (I have never seen one).
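
For a rough feel of the numbers, the usual birthday-bound approximation
can be sketched as below. This is an illustration only: it assumes hash
outputs are uniformly random and the input is not adversarial, and
collisionProbability is a hypothetical helper, not part of Squid. MD5
digests are 128 bits wide.

```cpp
#include <cmath>

// Approximate probability of at least one collision among n uniformly
// random hashBits-bit values (birthday bound):
//   p ~= 1 - exp(-n * (n - 1) / 2^(hashBits + 1))
double collisionProbability(double n, int hashBits)
{
    const double pairs = n * (n - 1.0) / 2.0;        // number of object pairs
    const double space = std::ldexp(1.0, hashBits);  // 2^hashBits
    // expm1 keeps precision when the probability is extremely small.
    return -std::expm1(-pairs / space);
}
```

Under these assumptions a cache holding a billion objects,
collisionProbability(1e9, 128), comes out on the order of 1e-21, and the
probability only approaches one half around 2^64 objects.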

What do you think?
Have you seen a real-world collision scenario?

Eliezer
Received on Wed Nov 21 2012 - 19:06:26 MST
