*Recording* active content via squid from John Stewart on 1997-01-26 (squid-users)

From: John Stewart <jns@dont-contact.us>
Date: Sun, 26 Jan 1997 16:44:14 -0800

Squid v1.1.5
SunOS 4.1.3_U1 Rev B

For reasons unimportant, a division in Cisco is trying to *track*
usage of active content (for statistical purposes, nothing more).
Since we were already releasing squid onto the various geographic
sites for this division, I began wandering through the code.

I've spent the last coupla days making modifications to store.c (and
others) so that it can record something like this:

ace.cisco.com - - [26/Jan/1997:15:58:31 -0800] "JavaScript http://www.netscape.com/index.html" - -

into a separate log file I've called "content.log". The log entry is
made for JavaScript, ActiveX, and Java when the appropriate tags are
"found" in the document. Two things I'd like opinions, and help, on.

1. Is it safe just to look for the following regexps:

<[ \t]*script[ \t]*language[ \t]*=[ \t]*[\'\"]*javascript
<[ \t]*applet[ \t]*code[ \t]*=[ \t]
<[ \t]*object[ \t]*id

Please know that before handing the contents off to the regexp parser,
I switched them to lowercase. (Does the regexp parser have some smart
option so that when you build up the regexp you can tell it to ignore
case?) So case sensitivity isn't an issue.

Also please know that I've removed \r and \n from the buffer I am
working with, since they are annoying side cases and can be dealt with
as a " " just as easily during a regexp search.
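On the ignore-case question, here is a minimal sketch, assuming the
regex library in use exposes the POSIX regcomp()/regexec() interface
with the REG_ICASE flag (standard POSIX; whether the GNUregex copy in
lib is built that way is an assumption here). The helper names
flatten_newlines() and matches_icase() are hypothetical and are not
part of Squid. With REG_ICASE the lowercase pass over the buffer
should no longer be needed.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: replace CR and LF with spaces in place, so a
   tag that wraps across lines is still covered by the [ \t]* runs. */
static void flatten_newlines(char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (buf[i] == '\r' || buf[i] == '\n')
            buf[i] = ' ';
    }
}

/* Hypothetical helper: return 1 if pattern matches anywhere in the
   NUL-terminated text, ignoring case; 0 if it does not match; -1 if
   the pattern fails to compile. */
static int matches_icase(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    /* REG_ICASE asks the regex library to ignore case;
       REG_NOSUB says we only want a yes/no answer. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_ICASE | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Called as matches_icase("<[ \t]*script[ \t]*language[ \t]*=[ \t]*['\"]*javascript", buf),
this takes the first regexp above unchanged, apart from C string
escaping.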

2. I've been changing storeComplete in store.c. My big concern
is something which I'm hoping another of you can answer for me -- is
there any way to get at the entire stored document in that routine,
without calling storeCopy to copy the contents into a buffer?

The struct has a "data" element which I'm getting nothing out of.
Since I cannot watch the flow *as it happens* (the tag could be split
across two chunks of the flow, which sounds bad...), I need to review
the entire doc after it completes loading (i.e. in storeComplete).
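For comparison, one alternative to reviewing the whole document is to
carry the tail of each chunk over to the next one, so a tag split
across a chunk boundary is still seen. The sketch below is not what
the patch does and is not Squid code: struct chunk_scanner,
scanner_init() and scanner_feed() are hypothetical names, it assumes
the chunks have already been lowercased, and it only handles a fixed
literal ("<script"). The regexps above allow arbitrarily long runs of
whitespace, so applying this to them would need an assumed upper bound
on tag length.

```c
#include <string.h>
#include <stddef.h>

#define NEEDLE     "<script"
#define NEEDLE_LEN (sizeof(NEEDLE) - 1)

/* Hypothetical scanner state: the last bytes of the data seen so far. */
struct chunk_scanner {
    char tail[NEEDLE_LEN - 1];
    size_t tail_len;
    int found;
};

/* Byte-wise search; chunks are not assumed to be NUL-terminated. */
static int contains(const char *hay, size_t hay_len,
                    const char *needle, size_t n)
{
    size_t i;

    for (i = 0; i + n <= hay_len; i++) {
        if (memcmp(hay + i, needle, n) == 0)
            return 1;
    }
    return 0;
}

static void scanner_init(struct chunk_scanner *s)
{
    s->tail_len = 0;
    s->found = 0;
}

static void scanner_feed(struct chunk_scanner *s, const char *chunk, size_t len)
{
    char window[2 * (NEEDLE_LEN - 1)];
    size_t head = len < NEEDLE_LEN - 1 ? len : NEEDLE_LEN - 1;
    size_t wlen, keep;

    /* Check the seam: saved tail followed by the start of this chunk. */
    memcpy(window, s->tail, s->tail_len);
    memcpy(window + s->tail_len, chunk, head);
    wlen = s->tail_len + head;
    if (contains(window, wlen, NEEDLE, NEEDLE_LEN))
        s->found = 1;

    /* Check the chunk itself. */
    if (contains(chunk, len, NEEDLE, NEEDLE_LEN))
        s->found = 1;

    /* Save the last NEEDLE_LEN - 1 bytes seen so far for the next call. */
    if (len >= NEEDLE_LEN - 1) {
        memcpy(s->tail, chunk + len - (NEEDLE_LEN - 1), NEEDLE_LEN - 1);
        s->tail_len = NEEDLE_LEN - 1;
    } else {
        keep = wlen < NEEDLE_LEN - 1 ? wlen : NEEDLE_LEN - 1;
        memcpy(s->tail, window + wlen - keep, keep);
        s->tail_len = keep;
    }
}
```

The trade-off is a few bytes of state per object instead of a full
copy of the document, at the cost of giving up general regexps.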

My concern is this part. In order to use the GNUregex stuff (found in
lib) I've had to copy the entire document in memory for a period of
time (*blech*) -- so I'm hoping there is instead a char* that I could
hand over to the regular expression parser -- but I can't *find* it.

Once this "mystery" is solved, I'll drop the entire patch on our
external ftp site and begin writing about it. Until then tho, this
patch still bugs me 'cuz I'm not convinced it is the Right Thing
(tm).

does such a char* beast exist? thx -- John

------=------=------=------=------=------=------=------=------=------

Annoying block
/* Are we looking at HTML? If not, don't bother. */

    if (!strcmp(e->mem_obj->reply->content_type, "text/html")) {
        bufit = (char *)xcalloc(1, e->mem_obj->e_current_len + 2);

        len = e->mem_obj->e_current_len;
        /* Get a copy of the buffer -- the *entire* buffer -- since
           the tag could be split across the boundaries of small
           buffer pieces. This part really bugs me, but I can't
           think of another way. */
        storeCopy(e,
                  0,
                  e->mem_obj->e_current_len,
                  bufit,
                  &len);

------=------=------=------=------=------=------=------=------=------
Received on Sun Jan 26 1997 - 17:15:42 MST
