[squid-users] Re: International Domain Names

From: Henrik Nordstrom <hno@dont-contact.us>
Date: 29 Jan 2003 15:43:17 +0100

On Wed 2003-01-29 at 12:16, Joel Rowbottom wrote:

> I represent a company called Characterisation which is providing an interim
> IDN solution - Verisign are also implementing their own system which is
> similar. Both require 8-bit clean to be passed from the resolver, which
> Squid doesn't do.

One problem is that Squid receives requests over HTTP, and in HTTP only
ASCII characters are allowed in host names. This is defined both in the
standard for URLs and in the standard for the HTTP protocol itself. To
properly parse and understand the HTTP request and URL, Squid must know
the character encoding used. The only standardized character encoding
for URLs within HTTP is a limited subset of ASCII, so this is what Squid
assumes is used.

The other problem is that, because Squid is a proxy, it does not only
pass data along like a router but also needs to understand how to read
the data. To correctly understand and use the data, an understanding of
the encoding used is often required.

This said, see also
http://www.squid-cache.org/Versions/v2/2.5/bugs/#squid-2.5.STABLE1-hostnames for a workaround making Squid ignore most of this. However, this is far from perfect and opens a new can of worms until there is a standard on how to handle IDN names.

> Surely a proxy request should be transparent, rather than imposing its own
> rules? If not proxied then the requests work fine, but if through Squid
> then it whinges?

Squid aims at being semantically transparent for requests within the
defined standards. Binary FQDN hostnames are not part of any standard or
even Internet Draft (including DNS).

Note: Squid needs to understand the structure of FQDN hostnames for many
purposes.

* Parsing of HTTP, to be able to isolate the hostname component and its
structure in domain labels.
* Access controls, comparing labels and pattern matching within FQDN
names.
* Logging
* To convert hostnames to DNS labels when resolving into IP addresses
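
As a rough illustration of the parsing and access control points above,
here is a small Python sketch that splits a host name into labels and
checks each one against the letters, digits and hyphen rule. This is not
Squid's code; the function names are made up for the example, and the
63-character label limit is taken from the DNS specifications rather
than from anything stated in this message.

```python
import re

# Letters, digits and hyphen only; a label may not start or end with a hyphen.
_LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def split_labels(hostname):
    """Split a host name into its domain labels, ignoring one trailing dot."""
    if hostname.endswith("."):
        hostname = hostname[:-1]
    return hostname.split(".")


def is_ldh_hostname(hostname):
    """Return True if every label is 1-63 characters of a-z, 0-9 and '-'."""
    labels = split_labels(hostname)
    return all(
        len(label) <= 63 and _LDH_LABEL.fullmatch(label) is not None
        for label in labels
    )
```

A name containing anything outside that set, such as an ISO-8859-1 or
UTF-8 encoded character, fails the check, which is roughly the situation
a strict parser is in when it meets a "binary" host name.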

> I'd be interested in the "standard" which states "The current Internet
> standards is very strict on what is an acceptable hostname and only accepts
> A-Z a-z 0-9 and - in Internet hostname labels. Anything outside this is
> outside the current Internet standards and will cause interoperability
> issues such as the problems seen with such names and Squid." -- an RFC
> would be ideal ;)

Here are the most obvious ones in this scope, but there are many, many
more if you care to study the subject:

STD0003 Requirements for Internet Hosts
RFC2616 Hypertext Transfer Protocol -- HTTP/1.1
RFC1738 Uniform Resource Locators (URL)
RFC2396 Uniform Resource Identifiers (URI): Generic Syntax

To summarise the current situation:

1. The DNS protocol allows (and has always allowed) any data to be used
in DNS labels. This is because the DNS protocol as such is application
neutral and not limited to resolving Internet host names. However, it is
assumed the data is ASCII when comparing domain labels, as labels are
case-insensitive.
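
To make the comparison rule concrete, here is a minimal Python sketch of
a case-insensitive label comparison that folds only the ASCII letters
A-Z and leaves every other octet untouched. The helper names are
hypothetical and the sketch is not taken from any resolver.

```python
def ascii_lower(label: bytes) -> bytes:
    """Fold only the ASCII letters A-Z to lower case; keep other octets."""
    return bytes(b + 32 if 0x41 <= b <= 0x5A else b for b in label)


def labels_equal(a: bytes, b: bytes) -> bool:
    """Compare two DNS labels the way an ASCII-only comparison would."""
    return ascii_lower(a) == ascii_lower(b)
```

With this rule b"WWW" and b"www" match, but the ISO-8859-1 octets for an
upper and lower case A-ring (0xC5 and 0xE5) do not, since the comparison
knows nothing about case outside ASCII.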

2. All standards documents which refer to Internet host names or
Internet domains (including their namespace structure within DNS) limit
such names to labels using a-z 0-9 -, case-insensitive. There are
Internet Draft documents discussing various approaches on how to get
beyond this, but to my knowledge none of these has yet been assigned or
even proposed for RFC status. The IDN IETF working group
<url:http://www.ietf.org/html.charters/idn-charter.html> is assigned the
task of defining international domain names, but very little progress
seems to have been made in the last few years (unfortunately a common
symptom of most IETF working groups these days, it seems; too much
politics involved, I think).

3. Further, there has been very little activity in addressing how the
upper-layer Internet protocols such as HTTP and SMTP should be
handled, but there are two clear paths. Neither involves what you call
"8-bit clean".
 a) Application encoding of UTF-8 characters using the allowable
character sets until each protocol has been updated. A range of
different encodings has been proposed, but it seems only
 b) Direct use of UTF-8 within the protocols.

Approach 'a' requires no change in the protocols or infrastructure such
as proxies (or even DNS servers), only in the user interfaces and how
DNS names are registered. To the protocols, international domain labels
are just strange-looking sequences of normal a-z 0-9 - characters. (There
is also a proposal for using %hh URL-escaped syntax for URIs; I am not
sure how much attention this will receive, however.)
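
To show what approach 'a' looks like on the wire, here is a short Python
sketch. This message does not name a specific encoding; the sketch
assumes the Punycode-based "xn--" form (RFC 3490), which Python's
standard library exposes as the "idna" codec, purely as one example of
such an encoding. The function names are made up for the illustration.

```python
from urllib.parse import quote


def to_ace(hostname: str) -> str:
    """Encode each label into an ASCII-only form using the 'idna' codec."""
    return hostname.encode("idna").decode("ascii")


def to_percent_escaped(hostname: str) -> str:
    """Encode the name as UTF-8 and escape non-ASCII octets as %hh."""
    return quote(hostname.encode("utf-8"))
```

For a name with a u-umlaut, "b\u00fccher.example", the first function
gives "xn--bcher-kva.example" and the second gives
"b%C3%BCcher.example". Either way the protocol only ever sees ordinary
ASCII characters.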

Approach 'b' requires the whole infrastructure to be updated to use
UTF-8 encoding, and each of the Internet protocols to be redefined with
respect to which characters are allowable, reserved or forbidden in host
names in the scope of UTF-8.

4. As a result of very little or no progress in the standardization
efforts of the IDN IETF WG, most DNS registrars have grown tired and are
starting to allow registration of "binary" DNS labels, which happen to
work in many browsers by accident using various national character
encodings (ISO-8859-X, UTF-8, ...), and not because such names are truly
allowed for use on the Internet. While this is not strictly forbidden
by DNS standards, the use of such domain names within Internet
application protocols such as HTTP or SMTP is.
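
A small Python sketch of why "binary" labels only work by accident: the
same visible name becomes different octets depending on which encoding
the sender happens to use, and the receiver has no way to tell which one
was meant. The helper names are hypothetical.

```python
def encode_variants(name: str) -> dict:
    """Return the octets for a name under two common encodings."""
    return {
        "iso-8859-1": name.encode("iso-8859-1"),
        "utf-8": name.encode("utf-8"),
    }


def decodes_as_utf8(raw: bytes) -> bool:
    """Report whether the octets form valid UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

For "b\u00fccher" the ISO-8859-1 form is b"b\xfccher" and the UTF-8 form
is b"b\xc3\xbccher". The two will not compare equal as DNS labels, and
the ISO-8859-1 octets are not even valid UTF-8.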

I have a memory of seeing a kind of official document stating that UTF-8
should be used in all new Internet protocols and that the long-term goal
is to allow UTF-8 to be used anywhere by updating the existing Internet
protocols to support UTF-8 where possible, but now I cannot seem to find
this document. Also, I do not remember if it was an IETF, IAB or ISOC
document.

Good in-depth reading on the subject of International Domain Names and
their associated problems within the Internet protocols is RFC2825, A
Tangled Web: Issues of I18N, Domain Names, and the Other Internet
protocols.

You are welcome to correct me if you find any errors in the above.

Regards
Henrik

-- 
Henrik Nordstrom <hno@squid-cache.org>
MARA Systems AB, Sweden
Received on Wed Jan 29 2003 - 07:43:33 MST