Re: MD5 and URL validation (continued from another very old thread)

From: Amos Jeffries <squid3_at_treenet.co.nz>
Date: Thu, 22 Nov 2012 15:58:45 +1300

On 22.11.2012 08:06, Eliezer Croitoru wrote:
> A while ago there was a discussion about storing the original
> request URL in the swap_file meta data.
>
> Now it strikes me again while testing something.
> The code at:
>
> http://bazaar.launchpad.net/~squid/squid/trunk/view/head:/src/StoreMetaURL.cc#L39
> (25 lines of code)
>
> ##start
> bool
> StoreMetaURL::checkConsistency(StoreEntry *e) const
> {
>     assert(getType() == STORE_META_URL);
>
>     debugs(20, DBG_IMPORTANT, "storeClientReadHeader: URL checkConsistency wasn't used ");
>     return true;    /* note: this early return makes the check below unreachable */
>
>     if (!e->mem_obj->original_url)
>         return true;
>
>     if (strcasecmp(e->mem_obj->original_url, (char *)value)) {
>         debugs(20, DBG_IMPORTANT, "storeClientReadHeader: URL mismatch");
>         debugs(20, DBG_IMPORTANT, "\t{" << (char *) value << "} != {" << e->mem_obj->original_url << "}");
>         return false;
>     }
>
>     return true;
> }
>
> ##end
>
> This code is responsible for checking the consistency of a cached
> file/object's URL against the currently requested URL.
> It is used in store_client.cc and was moved from there in newer
> revisions.
> In the old revision 4338, the stated purpose of this code is:
> "Check the meta data and make sure we got the right object."
>
> The problem is that the check is only performed while a file is being
> fetched from UFS (which is what I have checked); objects served from
> RAM won't be checked.
> The result is that when the store_url_rewrite feature is in use, the
> check reports an inconsistency between the request URL and the object
> in cache_dir (naturally).
>
> Disabling this check would make my life easy with store_url, taking it
> from "not working" to "working".
>
> So I have a couple of options for how to "fix" the issue:
> 1. disable this check.
> 2. disable this check only for store_url_rewritten requests.
> 3. add the store_url meta object to the cache file and use it to
> identify the expected URL.
> 4. add an on/off switch to disable this check.
> 5. others?
>
> After a short talk with Alex, I sat down and made some calculations
> about MD5 collision risks.
> The hash used to build the index is computed from the string
> "byte + url".
> For most caches that I know of, the probability of a collision is very
> low considering the number of objects and URLs.
>
> Yes, we are talking about very many objects, so a collision is
> possible, but it is not only the URL hash: other unknowns, such as
> request and response headers, push this whole calculation even further
> from a real-world hit, taking the chance of collision from 2^64 to
> more than 2^124.
> It seems to me it will take quite some time before I see a hash
> collision (I never have).
>
> What do you think?

I think the usual methods of calculating hash collisions are a little
biased towards an even distribution of bytes, whereas real-world URL
space is a lot tighter - with far greater similarity between any two
similar-length URLs than in normal text of the same length.
  I'm not certain what effect this has on the hash, or how best to
compensate, though.

> Have you seen real world scenario of collision?

No.

I'm leaning towards the opinion that we should try (2), given that
non-UFS stores are skipping the check anyway.
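If (2) is pursued, one hypothetical shape for it could look like the sketch
below. This is pseudocode, not a compilable patch: the `rewritten_store_url`
member is an assumed placeholder, not an existing MemObject field.

```cpp
/* Hypothetical sketch of option (2): skip the consistency check for
 * entries stored under a rewritten store-URL, where the request URL is
 * expected to differ from the stored one by design.
 * "rewritten_store_url" is an assumed placeholder, not current code. */
bool
StoreMetaURL::checkConsistency(StoreEntry *e) const
{
    assert(getType() == STORE_META_URL);

    if (!e->mem_obj->original_url)
        return true;

    if (e->mem_obj->rewritten_store_url)    /* hypothetical flag */
        return true;    /* store_url_rewrite in effect: URLs differ by design */

    if (strcasecmp(e->mem_obj->original_url, (char *)value)) {
        debugs(20, DBG_IMPORTANT, "storeClientReadHeader: URL mismatch");
        return false;
    }

    return true;
}
```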

Amos
Received on Thu Nov 22 2012 - 02:58:50 MST

This archive was generated by hypermail 2.2.0 : Fri Nov 23 2012 - 12:00:08 MST