RE: Deny access for non standard browsers

From: Dave J Woolley <DJW@dont-contact.us>
Date: Fri, 23 Jul 1999 16:17:23 +0100

> From: Benarson Behajaina [SMTP:Benarson.Behajaina@swh.sk]
>
> I reconfigured my Squid to deny access for non standard
> browsers (GetRight, Wget, Lynx, GetSmart etc ...)
>
        Why? Is your HTML that badly broken? Lynx is probably
        more standard (in the sense of HTML/HTTP, etc.,
        compliance) than Netscape 4.x, and is actually
        recommended on the squid home page!

        The only real effect of discriminating against Lynx is
        a lot of mail hostile to you on the lynx-dev mailing
        list and an increase in the number of people who override
        the User-Agent header. wget users have the same option.
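        I'm guessing the block is done with a browser ACL, along
        these lines (the ACL name is my invention):

            acl PowerTools browser GetRight|Wget|Lynx|GetSmart
            http_access deny PowerTools

        Defeating it is a one-liner for exactly the users you are
        trying to stop (option spellings from memory):

            wget -U "Mozilla/4.0 (compatible; MSIE 4.01)" http://host/page
            lynx -useragent="Mozilla/4.0 (compatible; MSIE 4.01)" http://host/page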

        If this is really an attempt to exclude crawlers (and
        IE 4 is a crawler, although I think it changes its user
        agent string when crawling), then I admit Lynx is weak in
        not supporting robots.txt, but wget, when actually crawling,
        is certainly compliant - it is also used as a front end for
        other tools. You can force wget to be badly behaved, but
        then you could just as easily write your own crawler or
        modify the source code.
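        (With wget, for instance, being badly behaved already means
        putting something like

            robots = off

        in your .wgetrc - if I remember the directive right - so the
        extra step to a home-grown crawler is a small one.)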

        From what I've heard of IMDB's attempts to control crawlers
        and pre-fetchers, the main problem is from ones that do
        not identify themselves in the user agent. IMDB analyse the
        log files, presumably looking for the typical access patterns.
        Most of these will not be configurable like the power users'
        tools, Lynx and wget, but will be typical Windows plug-and-play
        shareware.

        Also, some people suppress user agent in their proxies for
        privacy reasons.

        The first thing to do if you don't want to be crawled is to
        make sure that you:

        - have a policy that makes sense to the users;

        - have a robots.txt file that accurately implements that policy
          and is commented to explain it (a sketch follows this list);

        - explain the policy clearly in a way accessible to interactive
          browsers.
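
        As a sketch, a commented robots.txt for such a policy might
        look like this (the paths are invented):

            # Interactive readers are welcome; bulk downloads of the
            # database put too much load on the server, so robots are
            # asked to stay out of these areas.
            User-agent: *
            Disallow: /cgi-bin/
            Disallow: /database/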

        Specifically with Lynx, you should donate code to support
        robots.txt when operating in crawling mode. People may still
        disable it, but those people will try to frustrate any attempt
        you make to shift the balance in favour of your advertisers,
        etc.

        (Incidentally, someone is getting quite heavily flamed by
        most of the Lynx developers at the moment for trying to
        defeat IMDB's measures - most of the developers are sympathetic
        to the wishes of content providers.)

        Hope I've read correctly between the lines here.
Received on Fri Jul 23 1999 - 09:14:53 MDT
