Re: read HTML text

From: Henrik Nordstrom <henrik@dont-contact.us>
Date: Mon, 24 Apr 2006 16:27:49 +0200

On Mon, 2006-04-24 at 11:04 -0300, wellington ricardo gasparin wrote:

> I want to read the body of a web page so that I can build a semantic
> vector model. The semantic distance will then decide which objects
> stay in the cache.
> This is for a new replacement policy.

Then you probably should hook into StoreAppend. All data going into the
object is seen here. But as always you only see a small fragment at a
time, never the complete object.
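
As a rough illustration of the fragment-at-a-time constraint, here is a
minimal C sketch of an accumulator that can be fed one buffer per call and
still counts a word correctly when it is split across two fragments. All
names (term_acc, term_acc_feed, TERM_BUCKETS and so on) are hypothetical
and not part of Squid; the sketch also treats the data as plain text and
does not strip HTML markup.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TERM_BUCKETS 64 /* hypothetical size of the hashed term vector */
#define MAX_WORD 32     /* longer words are truncated to this length */

/* Hypothetical per-object state; this is NOT a Squid structure. */
typedef struct {
    unsigned counts[TERM_BUCKETS]; /* hashed term frequencies */
    char partial[MAX_WORD];        /* word still open at the end of the last fragment */
    size_t partial_len;
} term_acc;

static void term_acc_init(term_acc *acc)
{
    memset(acc, 0, sizeof(*acc));
}

/* FNV-1a hash of a word, reduced to a bucket index. */
static unsigned bucket_of(const char *w, size_t len)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)w[i];
        h *= 16777619u;
    }
    return (unsigned)(h % TERM_BUCKETS);
}

/* Count the pending word, if any, and reset the carry-over buffer. */
static void flush_word(term_acc *acc)
{
    if (acc->partial_len > 0) {
        acc->counts[bucket_of(acc->partial, acc->partial_len)]++;
        acc->partial_len = 0;
    }
}

/* Call once per fragment, in arrival order. A word that runs up to the
 * end of the buffer stays in acc->partial until a later call ends it. */
static void term_acc_feed(term_acc *acc, const char *buf, size_t len)
{
    assert(acc != NULL);
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)buf[i];
        if (isalnum(c)) {
            if (acc->partial_len < MAX_WORD)
                acc->partial[acc->partial_len++] = (char)tolower(c);
        } else {
            flush_word(acc);
        }
    }
}

/* Call when the object is complete, so a trailing word is counted too. */
static void term_acc_finish(term_acc *acc)
{
    flush_word(acc);
}
```

Feeding "hello wor" and then "ld hello" should leave the same counts as
feeding "hello world hello" in one call, which is the property you need
when the data only ever arrives in pieces.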

Please also note that Squid adds the object to the removal policy as
soon as the HTTP headers are known, before the complete reply has been
seen. So you also need to add a method whereby the information you have
collected is communicated to the removal policy. The simplest would be
to keep the information attached to the StoreEntry and to somehow extend
the swap.state format with your data to have it preserved across
restarts.
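
To make the ordering problem concrete, here is a small self-contained C
sketch. It is a toy model, not Squid code: toy_entry, toy_policy,
policy_add, policy_rescore and policy_pick_victim are all hypothetical
names, and the policy is a plain linear scan. The entry is added with a
neutral score when the headers arrive, and a separate call delivers the
real score once the body has been analysed. Persisting the score in
swap.state is not covered.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ENTRIES 16 /* hypothetical capacity of the toy policy */

/* Hypothetical stand-in for a cache object; NOT Squid's StoreEntry. */
typedef struct {
    const char *url;
    double semantic_score; /* kept on the entry itself */
    int scored;            /* 0 until the body has been analysed */
} toy_entry;

/* Hypothetical removal policy: an unordered list of entries. */
typedef struct {
    toy_entry *items[MAX_ENTRIES];
    size_t n;
} toy_policy;

/* Called when the headers are known. The body has not been seen yet,
 * so the entry only gets a neutral default score. */
static int policy_add(toy_policy *p, toy_entry *e, double default_score)
{
    assert(p != NULL && e != NULL);
    if (p->n == MAX_ENTRIES)
        return -1; /* policy is full */
    e->semantic_score = default_score;
    e->scored = 0;
    p->items[p->n++] = e;
    return 0;
}

/* The extra method: called once the body analysis has finished.
 * A heap-based policy would also have to re-sort the entry here;
 * the linear scan below has no ordering to repair. */
static void policy_rescore(toy_policy *p, toy_entry *e, double score)
{
    (void)p;
    e->semantic_score = score;
    e->scored = 1;
}

/* The entry with the lowest score is evicted first; ties go to the
 * entry that was added earliest. */
static toy_entry *policy_pick_victim(const toy_policy *p)
{
    toy_entry *victim = NULL;
    for (size_t i = 0; i < p->n; i++) {
        if (victim == NULL ||
            p->items[i]->semantic_score < victim->semantic_score)
            victim = p->items[i];
    }
    return victim;
}
```

Until policy_rescore has been called for an entry, the policy can only
rank it by the default score, which is why the late update path matters.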

Regards
Henrik

Received on Mon Apr 24 2006 - 08:28:02 MDT
