Re: Log web page TITLE to access.log

From: Amos Jeffries <squid3_at_treenet.co.nz>
Date: Thu, 05 Jan 2012 11:43:43 +1300

 On Wed, 28 Dec 2011 13:02:14 +0300, bsl wrote:
> Hello.
>
> I want to add the page title to the squid log to view the user's
> surfing history.
> Thanks to Henrik Nordstrom and his reply in 2006 about this :-)
>
> http://www2.tr.squid-cache.org/mail-archive/squid-dev/200603/0009.html
>
> Following his idea I parse web page content in function sendMoreData
> of client side routines (client_side_reply.cc)
> I found the page title and log it to access.log using new logformat
> token (for example "<tp").
>
> But I have a problem:
> The page title is not always logged.
> For example, when I visit www.godaddy.com I see its page title in the
> log. When I visit www.nasa.gov I don't see the title in the log :(
> What did I do wrong? Maybe not all pages are given to the client through
> the client_side_reply::sendMoreData function?

 Doing things with the body in the middle is not as easy as you seem to
 think...

  1) the body is often compressed for transfer.

  2) the body may be missing entirely on HTTP/1.1 revalidation
 transfers (a 304 Not Modified reply carries no body).

  3) Consider what you would have to do to display the TITLE tag when
 the response body is bytes 20-50 of a 150 byte compressed object. Those
 bytes could very well be the "<title>hello</title>" part of an HTML
 page, but to decompress it while printing the log line is a difficult
 problem.

  4) The <title> and </title> may be split between packets, either each
 in separate packets, or a packet boundary inside the tag itself (i.e.
 "<ti" then "tle>" as two packets). Squid does its best not to buffer and
 delay the body contents, so a scanner which only looks at one buffer at
 a time will not detect this type of response.

  5) There are some compression types (SDCH done by Chrome for example)
 which are binary diff patches on top of a particular representation of
 an object, which itself may be a result of applying a series of previous
 patches.

  6) There are objects transferred which are not HTML but which contain
 TITLE tag look-alikes. XML, JSON, and AJAX responses for example. All of
 these will add false entries in your log unless you are careful to check
 for content types.
  ** www.nasa.gov is sending out XML objects which contain several
 nested copies of an HTML page. TITLE appears multiple times. They have
 the nasty bug of calling it "text/html" though, so even content type
 checks will fail here.

  7) TITLE can contain anything. Including binary codes. This will screw
 your log unless you have URL-encoding defined as the quoting style.
   ** www.godaddy.com pages wrap their title in binary bytes.
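
 To illustrate (7), here is a minimal sketch of URL-style (percent)
 encoding applied to a title before it reaches a log line. The name
 encodeTitleForLog is made up for this example and is not a Squid
 function; Squid has its own quoting code, so treat this only as a
 picture of the idea. It assumes a C++17 compiler.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper, not part of Squid: percent-encode every byte that is
// not printable ASCII, plus the few printable characters that would break a
// quoted log field (double quote, percent sign, backslash). Space is encoded
// too, so the title stays a single whitespace-free token.
std::string encodeTitleForLog(std::string_view title)
{
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(title.size());
    for (unsigned char c : title) {
        const bool printable = c > 0x20 && c < 0x7F;
        const bool special = c == '"' || c == '%' || c == '\\';
        if (printable && !special) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back('%');
            out.push_back(hex[c >> 4]);
            out.push_back(hex[c & 0x0F]);
        }
    }
    return out;
}
```

 With this, a title such as "a b" would be logged as "a%20b", and stray
 control or high-bit bytes can no longer split or corrupt the log line.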

 I suggest making an eCAP adapter instead of a patch against Squid.
 Squid is being designed in such a way as to make access to the body
 content through eCAP/ICAP easy. Adapters still receive body data in
 snippets as described in (4), but have the option of buffering it if
 they need to.
 You can also do things like scan the first N bytes then speedily skip
 the rest of the object by instructing Squid to bypass the scanner for
 the rest of the object.

 NOTE: there is a registered header "Title:" you can log. Or, if it is
 missing, add it with an adapter which scans the body for the details.

>
> Thanks for any ideas.
>
> I made the following changes: (squid 3.1.10, freebsd 8.2 stable,
> amd64)
>
> AccessLogEntry.h:
> + added char *title; to AccessLogEntry class definition (public
> section, line 54);
>
> access_log.cc:
> + added LFT_REPLY_PAGE_TITLE to end of enum logformat_bcode_t
> definition
> + added element "<tp" for LFT_REPLY_PAGE_TITLE to struct
> logformat_token_table
> + added new case to function accessLogCustom():
> case LFT_REPLY_PAGE_TITLE:
>     if (al->title) {
>         out = al->title;
>         quote = 1;
>         dofree = 1;
>     }
>     break;
>
> client_side_reply.cc:
> In function sendMoreData() line 2078 I added block for parsing
> buffer:
> if (http->al.title == NULL) {
>     // search TITLE tag
>     const char *tag1 = "<title>";
>     const char *tag2 = "</title>";
>     // search open tag in buf (length in result.length minus length of tag)
>     char *ans1 = strstr(buf, (char *)tag1, result.length-7);
>     if (ans1) {
>         // search close tag in rest of buffer
>         char *ans2 = strstr(ans1+7, (char *)tag2, result.length - (ans1-buf)-7);
>         if (ans2) {
>             int titlelen = ans2 - ans1 - 7; // title length
>             http->al.title = (char *)xcalloc(titlelen + 1,1);
>             xstrncpy(http->al.title, &ans1[7], titlelen);
>         }
>     }
> }
>
> Realisation of strstr function:
> char * strstr (char *haystack, char *needle, int strlen)

 What you define here is an implementation of strnstr(), *not* strstr().

 Your search is also case-sensitive. HTML tags are case-insensitive by
 definition. <TITLE> and <Title> are two common variations you will miss.
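
 A length-bounded, case-insensitive search could look roughly like the
 sketch below. The name strncasestr_sketch is invented for this example
 (it is not a libc or Squid function), and it assumes that len counts the
 readable bytes of the haystack, which does not need to be
 NUL-terminated.

```cpp
#include <cctype>
#include <cstddef>
#include <cstring>

// Hypothetical helper: return a pointer to the first case-insensitive match
// of the NUL-terminated needle within the first len bytes of haystack, or
// nullptr when there is no complete match inside that range.
const char *strncasestr_sketch(const char *haystack, const char *needle,
                               std::size_t len)
{
    const std::size_t needleLen = std::strlen(needle);
    if (needleLen == 0)
        return haystack; // an empty needle matches at the start

    // i + needleLen <= len guarantees the match never reads past the buffer.
    for (std::size_t i = 0; i + needleLen <= len; ++i) {
        std::size_t j = 0;
        while (j < needleLen &&
               std::tolower(static_cast<unsigned char>(haystack[i + j])) ==
                   std::tolower(static_cast<unsigned char>(needle[j])))
            ++j;
        if (j == needleLen)
            return haystack + i;
    }
    return nullptr;
}
```

 Called as strncasestr_sketch(buf, "<title>", result.length), it would
 match <TITLE> and <Title> as well as <title>, provided the whole tag
 lies inside the buffer.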

 Amos
Received on Wed Jan 04 2012 - 22:43:47 MST
