Re: thoughts on memory usage...

From: Brian Denehy <B-Denehy@dont-contact.us>
Date: Thu, 21 Aug 1997 12:27:49 +1000


| On Wed, 20 Aug 1997, Michael Pelletier wrote:
| >Given MD5 code in Squid, it might also open the door to MD5 hashing of
| >document/image contents, so that mirror sites would not wind up with
| >redundant data in the cache.
|
| Worth thinking about. My cache currently shows zero Content-MD5 headers
| received in the 200,000 parsed today, even though Apache now supports
| it, so clearly nobody is using it yet on their servers to be helpful to
| caches.

Looking at numbers from a large set of caches, it looks as if there is about
one Content-MD5 header per 100,000 responses. That rate does not seem to have
changed much in the past year or so.
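For anyone counting these themselves, the header value is defined by RFC 1864 as the base64 encoding of the MD5 digest of the entity body. A minimal sketch (in Python, which is purely my choice here; the `content_md5` name is made up for illustration):

```python
import base64
import hashlib

def content_md5(body: bytes) -> str:
    """Compute a Content-MD5 header value per RFC 1864:
    the base64 encoding of the 128-bit MD5 digest of the body."""
    return base64.b64encode(hashlib.md5(body).digest()).decode("ascii")

# The empty body gives the well-known empty-string MD5, base64-encoded:
print(content_md5(b""))  # 1B2M2Y8AsgTpgAmY7PhCfg==
```

A cache scanning its logs just has to check whether that header is present at all, which is how the one-per-100K figure above was obtained.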
|
| It would certainly be nice if Squid could remove duplicate objects in
| its own cache by calculating their content checksums (maybe without the
| headers?), and most caches have CPU to spare. Imagine having only one
| stored copy of that "Netscape Now" button. You'd still need the metadata
| of course - and the headers as well, to be able to reproduce exactly.
| And you'd still have to fetch the object if the server didn't give you
| Content-MD5.
|
| Similarly, it could be used for ICP, particularly when doing IMS-type
| queries. Isn't this what ETags are all about in HTTP 1.1?
|
| Ian Redfern (redferni@logica.com).
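The dedup scheme Ian describes amounts to content-addressed storage: bodies keyed by their own digest, with per-URL metadata kept separately so responses can still be reproduced exactly. A rough sketch of the idea, assuming an in-memory store (the `DedupStore` class and its layout are hypothetical, not anything in Squid):

```python
import hashlib

class DedupStore:
    """Hypothetical content-addressed cache store: many URLs may
    share one stored body, keyed by the MD5 of the body alone.
    Headers are kept per-URL so each response reproduces exactly."""

    def __init__(self):
        self.bodies = {}  # md5 hex digest -> body bytes, stored once
        self.meta = {}    # url -> (headers, md5 hex digest)

    def put(self, url, headers, body):
        digest = hashlib.md5(body).hexdigest()
        self.bodies.setdefault(digest, body)  # duplicates share storage
        self.meta[url] = (headers, digest)

    def get(self, url):
        headers, digest = self.meta[url]
        return headers, self.bodies[digest]

store = DedupStore()
gif = b"GIF89a..."  # stand-in for e.g. the "Netscape Now" button
store.put("http://a.example/now.gif", {"Content-Type": "image/gif"}, gif)
store.put("http://b.example/now.gif", {"Content-Type": "image/gif"}, gif)
assert len(store.bodies) == 1  # one stored copy serves both URLs
```

As Ian notes, you still have to fetch the object once per URL unless the server sends Content-MD5; the saving is purely in storage.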

A problem with using MD5 or any other hashing algorithm is that there is no
one-to-one mapping between URLs and file content.

I need to dig out and rerun some work that Martin Hamilton and I both ran
about fifteen months ago looking at duplicate objects in the cache. At the
time I concluded the savings were not worth removing the duplicate objects.
Martin and I each had caches of about 1 GB, and the MD5 distributions were
very similar, though both were samples from a larger distribution, since not
all our URLs were the same even though the replicated objects were very
close. The checksums of the objects duplicated more than ten times are at the
end of this message, if anyone wants to see whether they come from the same
universe. When I've updated the scripts Martin wrote for the current Squid
I'll make them available.
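Until those scripts surface, the survey is easy to approximate. A sketch of the idea (assumed here: objects as plain files under a directory; the real Squid on-disk format stores metadata with each object, so the scripts would have to strip that first):

```python
import collections
import hashlib
import os

def duplicate_report(cache_dir):
    """Hypothetical duplicate survey: MD5 every object file under
    cache_dir and print digests seen more than ten times, one
    'count digest' line each, in ascending order of count --
    the same shape as the table at the end of this message."""
    counts = collections.Counter()
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            with open(os.path.join(root, name), "rb") as f:
                counts[hashlib.md5(f.read()).hexdigest()] += 1
    for digest, n in sorted(counts.items(), key=lambda kv: kv[1]):
        if n > 10:
            print("%4d %s" % (n, digest))
```

Comparing two caches is then just a matter of intersecting the digest columns of the two reports.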

Another hash function, if you are worried about collisions in MD5, is SHA,
developed by NIST in the US and also freely available. However, I'm not
worried at the moment by the probability that two URLs will hash to the same
value - the digest space, even allowing for the identified weakness, is many,
many orders of magnitude larger than the space of URLs.
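A quick back-of-the-envelope check of that claim, using the standard birthday bound (the figure of a billion objects is an assumption for illustration, far beyond any 1996-era cache):

```python
import hashlib

# Birthday bound: among n random 128-bit digests, the chance of any
# two distinct bodies colliding is roughly n^2 / 2^129.
n = 10 ** 9  # assume a billion cached objects
p = n * n / 2.0 ** 129
print("accidental MD5 collision probability ~ %.1e" % p)  # ~1.5e-21

# SHA's 160-bit output shrinks this by another 2^32:
data = b"example object"
print("md5:", hashlib.md5(data).hexdigest())   # 32 hex chars (128 bits)
print("sha:", hashlib.sha1(data).hexdigest())  # 40 hex chars (160 bits)
```

So for the purpose of spotting duplicate cache objects, accidental collisions are a non-issue with either digest.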

Brian

Top MD5s from a 1 GB cache on 10 May 1996. I recall that the Netscape logo
was the most duplicated. Almost all the objects were small GIFs.

  10 2769fa9b2155f0a42a63c876b82e8d10
  10 27e9e8ba45509fed6f40de1b8f1b2b17
  10 2b27fb579414d5c61e69a63af6028387
  10 3a9d67e56fec9ab92741b4d100f58f9c
  10 5b4f0398d4b2d46ed872ad334ff788f2
  10 6ff188efe9b73b337871ef8c7903c344
  10 b499b26a19ca8909022d456cd8d6ad93
  10 b84635dd6f1ff7af9badd668d9c3eb90
  10 bebf1b143e88576fde8562a820f7eb13
  10 e4a35288ed24375c2c9f66097b59c514
  10 ec1136d634a198e0db546a2892586b41
  11 36455f51e18bfeba991442c86576bca7
  11 53a889816de8d0f993bd3f268785a1ab
  11 8fb2ce2d330524f74720d1749e9bfe82
  11 ac214400189eb0e43ec3c71ea2b7986e
  11 f117fc35a2aaf45a04e4593c35c85e67
  12 7e95bcd117c8ddd88081b793251fd70d
  12 e2e5e9259cee4e91cded809772d901a4
  12 e32cdd943549de7f9e2e011d3a73b313
  13 1bd8b873df7c8579bb1921ed543a8b16
  13 5a05441d883254fc4fa2a0883caffced
  13 ab8f17eaad3afae03c121cbb8e136a68
  13 d8e1ca9d958e2c9347e5c3a8655b1488
  14 1e9f59d09bb0cf7897da61f18fd02198
  14 9d3b0822649d9aacec3d2c0a01844b7b
  14 b1d7db9b0275cea23283bb30c69caa1b
  14 e73075aa40c0cb4c3b8dc341ddba1163
  15 89fa565c9dc3dbbc190b61e8c47c81c7
  17 3cbdaa8efb8462c695da01da25cc0b80
  17 554965cb0bc2d005aa90590469f77b44
  17 b0081aef58d6eb99504a3bb456285845
  18 c4b26b4cba08e4efe6f73efaa8e8e4a1
  20 40aa4ed379670f1715948824a707ba8a
  22 947cf6f8eb16027f25dc55c066019e4c
  23 be5c80cf8c943581f63180a260590db7
  25 325472601571f31e1bf00674c368d335
  25 9ec7985192402f0374efc05614b1b382
  27 452c20d867de79f9eae20849e9968e9f
  28 924c759436bed5fc2e3b3ba8b2250b66
  30 c04f15ec0f07f640e7eaff429c3b265d
  31 ffd7a00954442494e78f8149267b6a13
  44 15a50eae26a3a7b0798d2ee4dbf3c801
  45 0e964ee8f1d4d816feb619a938dfe018
  46 9b73c96644cd10324c6baa402c473f1e
  52 2200dbf28eede16cca5bbdb38dd6e955
  52 d41d8cd98f00b204e9800998ecf8427e
  64 96a37c019ecfc2aac5a1a0b1debaa6cf
  64 d17b37630c0e26c25bc25e73b69f3c3a
  96 753700f4205a7216690a9b9dc1f56894
 129 860f61500e2ebfea82d06d12910ae749
 

---
Brian Denehy,			   Internet: B-Denehy@adfa.oz.au
Information Services Division 	   MHSnet:   B-Denehy@cc.adfa.oz.au
Australian Defence Force Academy   UUCP:!uunet!munnari.oz.au!cc.adfa.oz.au!bvd
Northcott Dr. Campbell ACT Australia 2600  +61 2 6268 8141 Fax +61 2 6268 8150