String/Buffer split & encoding

From: Henrik Nordstrom <henrik_at_henriknordstrom.net>
Date: Wed, 21 Jan 2009 11:41:03 +0100

My opinions on the subject:

Buffer operates on octets and nothing but octets, where each octet is an
8-bit unsigned integer.

String is encoding aware, decomposing those octets into characters.

But I don't see why we would ever need to support UCS-2 or other
encodings with code units wider than one octet. Within the scope of HTTP
and related protocols, strings are either US-ASCII, UTF-8 or Latin-1,
which all fit nicely in the octet world. We also do not need
encoding-aware upper/lower case distinction, only US-ASCII case awareness.
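
To illustrate what US-ASCII-only case awareness could look like, here is a
minimal sketch. The helper names asciiLower and asciiCaseEqual are
hypothetical and not taken from any existing code; the only assumption is
that octets outside 'A'..'Z' are compared as-is, so UTF-8 and Latin-1 data
passes through untouched.

```cpp
#include <cstddef>

// Hypothetical helper: fold only US-ASCII 'A'..'Z' to lower case.
// Every other octet, including 0x80..0xFF, is returned unchanged.
inline unsigned char asciiLower(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? static_cast<unsigned char>(c + ('a' - 'A')) : c;
}

// Hypothetical helper: case-insensitive equality of two octet regions,
// where "case" means US-ASCII letters only.
inline bool asciiCaseEqual(const char *a, std::size_t alen,
                           const char *b, std::size_t blen)
{
    if (alen != blen)
        return false;
    for (std::size_t i = 0; i < alen; ++i) {
        if (asciiLower(static_cast<unsigned char>(a[i])) !=
            asciiLower(static_cast<unsigned char>(b[i])))
            return false;
    }
    return true;
}
```

Because no locale or encoding table is consulted, the same routine works
unchanged on header names and on arbitrary binary regions.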

I also agree with Alex that there is no need for < > or == in buffers as
such. A string is fully capable of holding a binary blob. These
operators should in our context always map to memcmp().

The only difference between binary regions and strings for the < > ==
class of operators is the possibility of case-insensitive comparison.
But case-insensitive comparison should use other operators, which leaves
< > == the same in both contexts.
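
A rough sketch of operators that map onto memcmp(), for regions that may
differ in length. The function name regionCompare is hypothetical, and
the rule that a shorter region sorts before a longer one sharing its
prefix is an assumption of this sketch, not something settled in the
discussion.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>

// Hypothetical helper: three-way octet comparison of two regions.
// Returns <0, 0 or >0 in the style of memcmp().
// Assumption: on a common prefix, the shorter region sorts first.
inline int regionCompare(const void *a, std::size_t alen,
                         const void *b, std::size_t blen)
{
    const std::size_t n = std::min(alen, blen);
    // Skip memcmp() entirely for empty regions so null pointers are never passed.
    const int r = n ? std::memcmp(a, b, n) : 0;
    if (r != 0)
        return r;
    if (alen == blen)
        return 0;
    return alen < blen ? -1 : 1;
}
```

With this, == is regionCompare(...) == 0 and < is regionCompare(...) < 0,
whether the region holds text or a binary blob with embedded NUL octets.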

From the discussion it's apparent to me that the current naming
convention isn't the best. Buffer should be String.

The design I'd like to see is

- Low-level refcounted memory area (address, size, refcount).

- Memory area "splitter". (memory area, current used offset). Helper
class for producing Buffer regions. This is the primary interface for
producing Buffer regions.

- Buffer, region of a memory area. (memory area, offset, size).

- String, subclass of buffer adding < > == and strstr operators, plus
case-insensitive variants of == and strstr operators (and maybe < > as
well). No additional data members.

- A StringV container class allowing large strings to be built from a list
of Strings, supporting vector access (for I/O), incremental strstr
searches (with a separate state class) and extracting regions as String
or StringV. Extracting as String may need a copy if the requested area
is not linear in memory.
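
The layering above can be sketched roughly as follows. This is only an
illustration under stated assumptions: the names MemArea, MemAreaPointer
and MemSplitter are hypothetical, std::shared_ptr stands in for whatever
refcounting the memory area would really use, and strstr, the
case-insensitive variants and StringV region extraction are left out.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>
#include <vector>

// Low-level memory area. Assumption: std::shared_ptr provides the refcount.
struct MemArea {
    explicit MemArea(std::size_t n) : data(n) {}
    std::vector<char> data;
};
typedef std::shared_ptr<MemArea> MemAreaPointer;

// Buffer: a region (memory area, offset, size) of octets.
class Buffer {
public:
    Buffer() : offset_(0), size_(0) {}
    Buffer(const MemAreaPointer &area, std::size_t offset, std::size_t size)
        : area_(area), offset_(offset), size_(size) {}
    const char *data() const { return area_ ? area_->data.data() + offset_ : nullptr; }
    std::size_t size() const { return size_; }
private:
    MemAreaPointer area_;
    std::size_t offset_;
    std::size_t size_;
};

// Splitter: (memory area, current used offset), producing Buffer regions.
class MemSplitter {
public:
    explicit MemSplitter(std::size_t capacity)
        : area_(std::make_shared<MemArea>(capacity)), used_(0) {}
    // Copies len octets into the area and returns that region,
    // or an empty Buffer if the area has no room left.
    Buffer append(const char *src, std::size_t len) {
        if (len > area_->data.size() - used_)
            return Buffer();
        if (len)
            std::memcpy(area_->data.data() + used_, src, len);
        Buffer b(area_, used_, len);
        used_ += len;
        return b;
    }
private:
    MemAreaPointer area_;
    std::size_t used_;
};

// String: Buffer plus comparison operators, no additional data members.
class String : public Buffer {
public:
    String() {}
    explicit String(const Buffer &b) : Buffer(b) {}
    // memcmp()-based three-way comparison; shorter sorts first on a common prefix.
    int compare(const String &o) const {
        const std::size_t n = size() < o.size() ? size() : o.size();
        const int r = n ? std::memcmp(data(), o.data(), n) : 0;
        if (r != 0)
            return r;
        return size() == o.size() ? 0 : (size() < o.size() ? -1 : 1);
    }
};
inline bool operator==(const String &a, const String &b) { return a.compare(b) == 0; }
inline bool operator<(const String &a, const String &b) { return a.compare(b) < 0; }
inline bool operator>(const String &a, const String &b) { return a.compare(b) > 0; }

// StringV: a large string kept as a list of String pieces.
class StringV {
public:
    void append(const String &s) { pieces_.push_back(s); length_ += s.size(); }
    std::size_t length() const { return length_; }
    std::size_t pieces() const { return pieces_.size(); }
private:
    std::vector<String> pieces_;
    std::size_t length_ = 0;
};
```

In this sketch a Buffer or String is just a pointer plus two integers, so
copying one never copies the underlying octets; only the memory area
itself is shared and refcounted.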

Of these, only the low-level memory area and perhaps StringV need to be
refcounted. Buffer and String are small enough to be copied, or passed as
a const reference in most function/method calls.

I am not 100% sure on the placement of String. It's possible this should
be fully merged into Buffer, but I think it provides a good separation.
It's entirely possible we will end up using String all over the place
with Buffer just being used internally to String and some very low-level
I/O stuff.

Regards
Henrik
Received on Wed Jan 21 2009 - 10:41:43 MST
