Re: Inline content modification?

From: Robert Collins <robert.collins@dont-contact.us>
Date: Tue, 16 Jan 2001 10:42:02 +1100

I've been working on several things in this area, Joe.

From a 'where to make the change' angle you have two choices: client_side.c, where you make the change on every request (read: CPU
hog), or http.c (aka server_side!), where you can modify the data coming into Squid.

The no_anim patch gives you example code to alter outbound data. Note that you should remove the Content-Length header unless your
rewritten URLs are the same length as the originals. You probably don't want to buffer the whole page just to recreate the header, though.
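
A minimal sketch of that header rule, in Python rather than Squid's C to keep it short. The function name and the dict-of-headers
representation are assumptions for illustration only, not anything from the no_anim patch:

```python
def adjust_headers(headers, old_url, new_url):
    """Return a copy of `headers`, dropping Content-Length when the rewrite
    changes the body size.

    Hypothetical helper: `headers` is a plain dict of header name -> value.
    """
    if len(old_url) == len(new_url):
        # Same-length substitution: the advertised length is still correct.
        return dict(headers)
    # Header names are case-insensitive, so compare in lower case.
    return {k: v for k, v in headers.items() if k.lower() != "content-length"}
```

Without a Content-Length an HTTP/1.0 reply is delimited by closing the connection, so the price of dropping the header is losing the
persistent connection for that reply.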

You can look at the changes to http.c in the te branch on squid.sourceforge.net to see how to alter incoming data. The filter model
that Patrick McManus put together for te codings would also make sense for in-squid data modifications (process the data received
chunk by received chunk). (Although it wouldn't be marked as a te coding :-]).
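
To make the chunk-by-chunk idea concrete, here is a rough Python sketch of a filter chain. It is not the te branch code: the
feed()/flush() interface, the class names and run_filters() are all assumptions made up for this example.

```python
class ByteCounter:
    """Pass-through filter that records how many bytes went by."""

    def __init__(self):
        self.total = 0

    def feed(self, chunk):
        self.total += len(chunk)
        return chunk

    def flush(self):
        return b""  # nothing is ever held back


class Translate:
    """Stateless per-byte substitution filter."""

    def __init__(self, src, dst):
        self.table = bytes.maketrans(src, dst)

    def feed(self, chunk):
        return chunk.translate(self.table)

    def flush(self):
        return b""


def run_filters(chunks, filters):
    """Push each received chunk through every filter in order."""
    for chunk in chunks:
        for f in filters:
            chunk = f.feed(chunk)
        if chunk:
            yield chunk
    # At EOF, flush each filter and pass what it held back through the
    # filters that sit after it in the chain.
    for i, f in enumerate(filters):
        rest = f.flush()
        for g in filters[i + 1:]:
            rest = g.feed(rest)
        if rest:
            yield rest
```

A filter that needs to hold data back between chunks only has to return less from feed() and hand over the remainder in flush().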

The advantage of altering the incoming data is that a) the modifications get cached, and b) after the first retrieval, you can
recalculate the content-length for future requests, keeping http/1.0 persistent conns happy. I don't suggest you touch the te code just
yet, unless this is a medium term project :-]
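
A toy Python model of those two advantages. Everything here (the RewritingCache class, its dict store, the callables passed in) is
hypothetical and far simpler than Squid's store; it only shows that the rewrite runs once and that a length is available on later hits.

```python
class RewritingCache:
    """Toy cache that stores the already-rewritten body."""

    def __init__(self, fetch_origin, rewrite):
        self.fetch_origin = fetch_origin  # callable: url -> bytes
        self.rewrite = rewrite            # callable: bytes -> bytes
        self.store = {}
        self.rewrites = 0                 # how often the rewrite actually ran

    def get(self, url):
        if url in self.store:
            body = self.store[url]
            # The whole modified object is on hand, so its length is known.
            return "HIT", {"Content-Length": str(len(body))}, body
        # First retrieval: in a real proxy the body would be streamed and
        # rewritten chunk by chunk, so no length could be promised up front.
        body = self.rewrite(self.fetch_origin(url))
        self.rewrites += 1
        self.store[url] = body
        return "MISS", {}, body
```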

In the filter code you could use callbacks if you need external helpers (I'm already considering the need for that), but you'll have
to split the http function that calls perform_te (for me - for you performurlrewrite/...). If you want to head down that path let me
know and I'll split it up for you (save duplicate work).

As far as setting ACLs etc. to do it, that sounds like it'd be fairly flexible. An in-process rewrite will introduce less latency for
MISSes though. (I'm assuming you alter incoming data.)

----- Original Message -----
From: "Joe Cooper" <joe@swelltech.com>
To: "Squid Dev" <squid-dev@squid-cache.org>
Sent: Tuesday, January 16, 2001 10:10 AM
Subject: Inline content modification?

> Hello all,
>
> I have a couple of queries that I hope someone here can answer for me,
> since I don't have much familiarity with this part of the code.
>
> First a bit of background. I'd like to add a small hack (and possibly a
> completely configurable squid.conf option) to provide content
> modification of pages as they pass through the cache. (I know, Red
> Flag, "Danger Will Robinson!", Copyright issues!) This is for a website
> accelerator that will be located in India and accelerating several sites
> in other countries. Because of the layout of the network and this sites
> partner sites (who serve users all over the world) they are not able to
> modify the links on the partner sites pages, and some of them are
> absolute links--so those links will cause the client to be bumped off of
> the proxy and onto the much slower and more distant origin server. We
> can pretty easily get the entry page through the cache, but from there,
> I need to be able to modify http links to direct through the cache--then
> a redirector will direct the cache back to the origin server.
> Complicated, I know.
>
> Here is what I envision doing:
>
> Provide Squid a list of URL ACL's to match and rewrite, like so...
>
> acl_dstdomain somesite http://www.somesite.com/
>
> And in keeping with the current ACL style, a rewrite rule...
>
> url_rewrite somesite http://www.accelhost.com/www.somesite.com/
>
> This URL could then be redirected via Squirm, or similar, to translate
> it to the actual origin server, including whatever comes after the
> domain name.
>
> So...Now to the questions:
>
> Am I an idiot? It appears to me that it is possible to read and work on
> all of the object in client_side.c, and the noanim patch posted here a
> few weeks ago does just that without problems. But I very well could be
> missing something.

It doesn't cache the results. String matching on blocked data isn't the cheapest operation, and doing it n times (once per request)
without caching the results seems silly to me.

>
> Assuming it is possible, can I use the ACL interface to generate the
> match lists, or do I need to come up with a method to handle the match
> string and the replacement string? It would be nice to have a named ACL
> for the match strings, and it seems reasonable that this would work. So
> can I run /anything/, including whole html pages, through a regex or
> string matching ACL? Anyone have pointers for how to tackle this one?

I think you need a new ACL. The data comes in as blocks, and a string you want to match may be sitting on a block boundary.
For example, with s/jobloggs/johnloggs/ the first block of data you receive may be

asdasdasdasdasdjob
and the second block
loggs is a strange person

You will need to buffer the possible string hit 'job' and not send it on until you've seen the second block, or until you hit EOF,
at which point you flush any buffered data.
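
A minimal Python sketch of that hold-back logic, for a single fixed search string. The class and method names are assumptions for
illustration; a regex or multi-pattern matcher would need more care than this.

```python
class StreamingReplacer:
    """Replace `old` with `new` in a stream delivered as arbitrary blocks."""

    def __init__(self, old, new):
        if not old:
            raise ValueError("search string must be non-empty")
        self.old = old
        self.new = new
        self.pending = b""  # held-back tail that might be the start of a hit

    def feed(self, block):
        data = self.pending + block
        out = []
        pos = 0
        # Replace every complete hit in what we have so far.
        while True:
            hit = data.find(self.old, pos)
            if hit < 0:
                break
            out.append(data[pos:hit])
            out.append(self.new)
            pos = hit + len(self.old)
        tail = data[pos:]
        # Hold back the longest end of the tail that is a proper prefix of
        # the search string (e.g. 'job' for 'jobloggs').
        keep = 0
        for k in range(min(len(self.old) - 1, len(tail)), 0, -1):
            if tail.endswith(self.old[:k]):
                keep = k
                break
        cut = len(tail) - keep
        out.append(tail[:cut])
        self.pending = tail[cut:]
        return b"".join(out)

    def flush(self):
        """Call at EOF: whatever is still held was not a hit after all."""
        out, self.pending = self.pending, b""
        return out
```

Fed the two blocks above, the first call sends on 'asdasdasdasdasd' and keeps 'job' back; the second call sends
'johnloggs is a strange person'.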

>
> Finally, I welcome comments and suggestions for how best to proceed
> and/or anything to look out for when implementing this.
>
> Thanks.
> --
> Joe Cooper <joe@swelltech.com>
> http://www.swelltech.com
>
>
Received on Mon Jan 15 2001 - 16:30:32 MST
