Re: pseudo-specs for a String class

From: Henrik Nordstrom <henrik_at_henriknordstrom.net>
Date: Wed, 27 Aug 2008 00:03:49 +0200

On tis, 2008-08-26 at 09:35 +0200, Kinkie wrote:
> In my opinion there is not that much of a difference between Strings
> and Buffers, and the latter could use the services of the former to
> delegate the issues of memory management, while concentrating on
> different aspects - joining, chaining, vector I/O come to mind.

Agreed. And as I gave this considerable thought some years ago
(including a simple implementation) I'll give my view of "strings" and
buffers:

The classes involved should be

MemoryBlob, a chunk of memory, reference counted with a high water mark
on current use. Exists in various different sizes depending on the use
(creator defined).

MemoryRegion, references a region of a MemoryBlob by keeping a reference
to the MemoryBlob, and location of the region within that blob.

String, subclass of MemoryRegion adding string semantics where needed.

MemoryRegions (and Strings) can be created from a MemoryBlob in
append-like fashion only, where each new MemoryRegion immediately follows
the previous. In addition there is low level access to the current tail of
the raw buffer and the amount of free space, only to be used by the
owner who populates the MemoryBlob (i.e. I/O read function etc) before
it is known how much data will actually be placed there.

Strings can be created from a MemoryRegion by specifying a subrange of
that region, or by certain String operations which split a string into
components (e.g. a parser of some kind splitting the data into tokens).

Maybe MemoryRegion and String should be one and the same, but the
implementation is probably easier to follow if String is a subclass with
the string methods, and in the using code it also makes some sense to
differentiate the two, making it more visible what kind of data is being
processed. But the internal data of the two is exactly the same
(reference to a MemoryBlob, pointer to the data within the blob, length).
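A minimal sketch of the three classes described above, to make the layout concrete. Only the class names come from this mail. The member names, the use of std::shared_ptr for the reference count, the offset in place of a raw pointer, and the appendString() helper are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical sketch: a fixed-size chunk of memory with a high water mark.
class MemoryBlob {
public:
    explicit MemoryBlob(std::size_t capacity) : data_(capacity), used_(0) {}

    // Low level access for the owner that fills the blob (e.g. an I/O read).
    char *tail() { return data_.data() + used_; }
    std::size_t freeSpace() const { return data_.size() - used_; }
    // Advance the high water mark once the amount of data is known.
    void commit(std::size_t n) { assert(n <= freeSpace()); used_ += n; }

    const char *base() const { return data_.data(); }
    std::size_t used() const { return used_; }

private:
    std::vector<char> data_; // never resized, so pointers into it stay valid
    std::size_t used_;       // high water mark
};

// A region is a blob reference plus a location and a length.
class MemoryRegion {
public:
    MemoryRegion(std::shared_ptr<MemoryBlob> blob, std::size_t offset, std::size_t length)
        : blob_(std::move(blob)), offset_(offset), length_(length) {}

    const char *data() const { return blob_->base() + offset_; }
    std::size_t length() const { return length_; }

protected:
    std::shared_ptr<MemoryBlob> blob_; // reference count lives here
    std::size_t offset_;
    std::size_t length_;
};

// Same internal data as MemoryRegion, with string methods added.
class String : public MemoryRegion {
public:
    using MemoryRegion::MemoryRegion;

    // A subrange shares the blob; nothing is copied.
    String substr(std::size_t pos, std::size_t len) const {
        assert(pos + len <= length_);
        return String(blob_, offset_ + pos, len);
    }
    bool equals(const char *s) const {
        return std::strlen(s) == length_ && std::memcmp(data(), s, length_) == 0;
    }
};

// Append-like creation: the new region starts at the current high water mark.
inline String appendString(const std::shared_ptr<MemoryBlob> &blob, const char *src, std::size_t n) {
    assert(n <= blob->freeSpace());
    const std::size_t start = blob->used();
    std::memcpy(blob->tail(), src, n);
    blob->commit(n);
    return String(blob, start, n);
}
```

A region here is three words plus the shared_ptr control block pointer, so passing it by value stays cheap.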

The big and complex architectural question regarding strings is whether \0
termination of String should be kept. Memory management and casting
get a bit easier if \0 is not used, but string operations and debugging
get a little bit easier with the \0 (though also less secure if
there is a risk of \0 in the string data). I think we are at the point
where we can fully drop the \0 without too much headache, but it's
also true that in all cases where we tokenise a string there are
separators we can nuke and replace by \0's. However, with the \0,
casting between MemoryRegion and String is tricky (needs to copy if
there is no \0) and tokenising gets destructive, as it destroys the
original string by replacing separators by \0.
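To illustrate the non-destructive variant, here is a small sketch of tokenising without \0 termination. It uses std::string_view (C++17) as a stand-in for the (pointer, length) pair of a String; that substitution and the tokenize() name are assumptions, not part of the proposal.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Split `input` on `sep` without writing to the underlying buffer.
// Each token is a (pointer, length) view into the original data,
// so no separator is overwritten and no '\0' is needed.
inline std::vector<std::string_view> tokenize(std::string_view input, char sep) {
    std::vector<std::string_view> tokens;
    std::size_t start = 0;
    while (start <= input.size()) {
        std::size_t end = input.find(sep, start);
        if (end == std::string_view::npos)
            end = input.size();
        tokens.push_back(input.substr(start, end - start));
        start = end + 1; // skip the separator
    }
    return tokens;
}
```

The original line stays intact after splitting, which is what the strtok-style approach of replacing separators by \0 cannot offer.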

The append operation on String/MemoryRegion objects is easy in this model,
but if the region is not at the end of the MemoryBlob or if the result
gets too large then it will need to trigger a copy to a new MemoryBlob of
sufficient size.

A special case to the above is if the appended data already follows the
first linearly. It's then a simple merge operation of the two regions.
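The three append cases above (merge, append in place at the tail, copy to a new blob) could look roughly like this. The Blob and Region structs are simplified stand-ins for MemoryBlob and MemoryRegion, and the growth policy (twice the combined length) is an arbitrary assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <vector>

// Simplified stand-ins for MemoryBlob / MemoryRegion (hypothetical names).
struct Blob {
    std::vector<char> bytes; // fixed capacity, never resized
    std::size_t used = 0;    // high water mark
    explicit Blob(std::size_t cap) : bytes(cap) {}
};

struct Region {
    std::shared_ptr<Blob> blob;
    std::size_t off;
    std::size_t len;
    const char *data() const { return blob->bytes.data() + off; }
};

enum class AppendKind { Merged, InPlace, Copied };

// Append-like creation of a region at the blob's current tail.
inline Region makeRegion(const std::shared_ptr<Blob> &b, const char *src, std::size_t n) {
    assert(b->used + n <= b->bytes.size());
    std::memcpy(b->bytes.data() + b->used, src, n);
    Region r{b, b->used, n};
    b->used += n;
    return r;
}

inline AppendKind append(Region &dst, const Region &src) {
    // Special case: src already follows dst linearly, so just extend dst.
    if (dst.blob == src.blob && dst.off + dst.len == src.off) {
        dst.len += src.len;
        return AppendKind::Merged;
    }
    Blob &b = *dst.blob;
    // Easy case: dst ends at the high water mark and the blob has room.
    if (dst.off + dst.len == b.used && b.used + src.len <= b.bytes.size()) {
        std::memcpy(b.bytes.data() + b.used, src.data(), src.len);
        b.used += src.len;
        dst.len += src.len;
        return AppendKind::InPlace;
    }
    // Otherwise copy both parts into a new blob of sufficient size.
    const std::size_t total = dst.len + src.len;
    auto fresh = std::make_shared<Blob>(std::max<std::size_t>(2 * total, 16));
    std::memcpy(fresh->bytes.data(), dst.data(), dst.len);
    std::memcpy(fresh->bytes.data() + dst.len, src.data(), src.len);
    fresh->used = total;
    dst = Region{fresh, 0, total};
    return AppendKind::Copied;
}
```

Other regions that reference the old blob are unaffected in all three cases, since existing bytes below the high water mark are never rewritten.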

Other modifications of String/MemoryRegion content generally require a
COW operation.

As you already noted MemoryRegion is sufficiently small to be passed
around by value just like if it was a plain pointer.

Another question to ask is if there is need for a vectorized String
built of many non-linear segments. But my gut feeling is the same as
yours, that this should be a separate class. Main use is in writev kind
operations of composed data.

Users needing string-like operations (other than append) on such
vectorized data are probably best served by linearising the data first.
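As a rough sketch of such a separate vectorized class: it keeps a list of non-contiguous segments, hands them to POSIX writev() in one call, and offers a linearising copy for callers that need ordinary string operations. The SegmentedString name and its methods are hypothetical, and std::string_view again stands in for a MemoryRegion.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>
#include <sys/uio.h> // POSIX: struct iovec, writev()
#include <unistd.h>  // POSIX: pipe(), read()

// Hypothetical vectorized string: an ordered list of non-contiguous segments.
// It does not own the bytes; the segments must outlive this object.
class SegmentedString {
public:
    // Append is cheap: only the (pointer, length) pair is recorded.
    void append(std::string_view seg) {
        if (!seg.empty()) {
            segs_.push_back(seg);
            total_ += seg.size();
        }
    }

    std::size_t length() const { return total_; }

    // Build the iovec array expected by writev().
    std::vector<iovec> toIovec() const {
        std::vector<iovec> v;
        v.reserve(segs_.size());
        for (std::string_view s : segs_) {
            iovec e;
            e.iov_base = const_cast<char *>(s.data()); // writev() only reads it
            e.iov_len = s.size();
            v.push_back(e);
        }
        return v;
    }

    // Write all segments with a single system call (may be a short write).
    ssize_t writeTo(int fd) const {
        std::vector<iovec> v = toIovec();
        return ::writev(fd, v.data(), static_cast<int>(v.size()));
    }

    // Copy everything into one contiguous buffer for string-like operations.
    std::string linearize() const {
        std::string out;
        out.reserve(total_);
        for (std::string_view s : segs_)
            out.append(s.data(), s.size());
        return out;
    }

private:
    std::vector<std::string_view> segs_;
    std::size_t total_ = 0;
};
```

A real implementation would also have to handle short writes and the IOV_MAX limit on the number of segments per call.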

Regards
Henrik

Received on Tue Aug 26 2008 - 22:03:57 MDT
