Re: pseudo-specs for a String class

From: Kinkie <gkinkie_at_gmail.com>
Date: Wed, 27 Aug 2008 00:24:49 +0200

On Wed, Aug 27, 2008 at 12:03 AM, Henrik Nordstrom
<henrik_at_henriknordstrom.net> wrote:
> On tis, 2008-08-26 at 09:35 +0200, Kinkie wrote:
>> In my opinion there is not that much of a difference between Strings
>> and Buffers, and the latter could use the services of the former to
>> delegate the issues of memory management, while concentrating on
>> different aspects - joining, chaining, vector I/O come to mind.
>
> Agreed. And as I gave this considerable thought some years ago
> (including a simple implementation) I'll give my view of "strings" and
> buffers:
>
>
> The classes involved should be
>
> MemoryBlob, a chunk of memory, reference counted with a high water mark
> on current use. Exists in various different sizes depending on the use
> (creator defined).
>
> MemoryRegion, references a region of a MemoryBlob by keeping a reference
> to the MemoryBlob, and location of the region within that blob.
>
> String, subclass of MemoryRegion adding string semantics where needed.
>
>
> MemoryRegions (and Strings) can be created from a MemoryBlob in append
> like behavior only, where each new MemoryRegion immediately follows the
> previous. In addition there is low level access to the current tail of
> the raw buffer and the amount of free space, only to be used by the
> owner who populates the MemoryBlob (e.g. an I/O read function) before
> it's known how much data actually gets placed there.

This is quite different from my current approach, in which Strings get
created first and drive the instantiation of Bufs (MemoryRegions).
I feel that you'd be trying to reimplement parts of the memory
manager. Maximum efficiency, at the expense of quite a bit of
flexibility.
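
To make the quoted model concrete, here is a rough C++ sketch of an
append-only blob with a high-water mark. The method names (tail,
freeSpace, commit, appendRegion) and the use of std::shared_ptr for the
reference count are assumptions made for illustration; nothing here is
taken from an existing implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <vector>

// Hypothetical sketch: a fixed-size chunk of memory with a high-water mark.
// Reference counting is delegated to std::shared_ptr (an assumption).
class MemoryBlob {
public:
    explicit MemoryBlob(std::size_t capacity) : data_(capacity), used_(0) {}

    // Low-level access for the owner that fills the blob (e.g. an I/O read).
    char *tail() { return data_.data() + used_; }
    std::size_t freeSpace() const { return data_.size() - used_; }

    std::size_t used() const { return used_; }
    const char *base() const { return data_.data(); }

    // Advance the high-water mark once n bytes have been written at tail().
    void commit(std::size_t n) {
        assert(n <= freeSpace());
        used_ += n;
    }

private:
    std::vector<char> data_;
    std::size_t used_;
};

// A region is a reference to the blob plus a location within it.
struct MemoryRegion {
    std::shared_ptr<MemoryBlob> blob;
    std::size_t offset;
    std::size_t length;
    const char *data() const { return blob->base() + offset; }
};

// Append-like creation: each new region immediately follows the previous one.
MemoryRegion appendRegion(const std::shared_ptr<MemoryBlob> &blob,
                          const char *src, std::size_t n) {
    assert(n <= blob->freeSpace());
    const std::size_t start = blob->used();
    std::memcpy(blob->tail(), src, n);
    blob->commit(n);
    return MemoryRegion{blob, start, n};
}
```

A reader would typically write into tail(), call commit() with the byte
count actually received, and only then carve regions out of the blob.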

> Strings can be created from a MemoryRegion by specifying a subrange of
> that region, or by certain String operations which split a string into
> components (e.g. a parser of some kind splitting the data into tokens).
>
>
> Maybe MemoryRegion and String should be one and the same, but
> implementation is probably easier to follow if String is a subclass with
> the string methods, and in the using code it also makes some sense to
> differentiate the two making it more visible what kind of data is being
> processed.. But the internal data of the two is exactly the same
> (reference to a MemoryBlob, pointer to the data within the blob, length)

Hm... interesting for annotation purposes, but is it really significant?
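
For reference, the "same internal data" point can be sketched as below:
String derives from MemoryRegion and adds only methods, so the two have
identical layout. The blob is stood in for by a shared, immutable
std::string, and substr/equals are hypothetical method names; both are
assumptions for the sake of a short example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <string>
#include <utility>

// Stand-in for MemoryBlob: shared, immutable bytes (assumption for brevity).
using Blob = std::shared_ptr<const std::string>;

struct MemoryRegion {
    Blob blob;
    std::size_t offset;
    std::size_t length;

    MemoryRegion(Blob b, std::size_t off, std::size_t len)
        : blob(std::move(b)), offset(off), length(len) {
        assert(offset + length <= blob->size());
    }
    const char *data() const { return blob->data() + offset; }
};

// String adds string semantics but no data members of its own.
struct String : MemoryRegion {
    using MemoryRegion::MemoryRegion;
    explicit String(const MemoryRegion &r) : MemoryRegion(r) {}

    // Subrange of this string; shares the blob, copies no bytes.
    String substr(std::size_t pos, std::size_t n) const {
        assert(pos + n <= length);
        return String(blob, offset + pos, n);
    }

    bool equals(const char *s, std::size_t n) const {
        return n == length && std::memcmp(data(), s, n) == 0;
    }
};

static_assert(sizeof(String) == sizeof(MemoryRegion),
              "String carries exactly the same data as MemoryRegion");
```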

> The big and complex architectural question regarding strings is if \0
> termination of String should be kept.. memory management and casting
> gets a bit easier if \0 is not used, but string operations and debugging
> do get a little bit easier with the \0... (but also less secure if
> there is a risk of \0 in the string data..).

My thoughts: \0 is special, and would only be significant when strings
need to be exported from the memory-managed code to non-managed code.
Generally speaking, the safest way to do so is by copy rather than by
reference, but I'd rather also keep the ability to export by reference
- hoping the caller knows what they're doing. In that case the \0 is a
must-have safeguard, which in some cases might require copying.
Unfortunate but unavoidable.
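
A minimal sketch of the two export paths, assuming a region is an
(offset, length) pair over a shared std::string. The names Region,
exportCopy and exportRef are hypothetical. Export by reference is only
offered when a \0 already sits right after the region and none is
embedded in it; otherwise the caller has to fall back to the copy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <string>

struct Region {
    std::shared_ptr<const std::string> blob;  // stand-in for the managed buffer
    std::size_t offset;
    std::size_t length;
};

// Export by copy: always safe, the result owns its own \0-terminated bytes.
std::string exportCopy(const Region &r) {
    assert(r.offset + r.length <= r.blob->size());
    return std::string(r.blob->data() + r.offset, r.length);
}

// Export by reference: returns a pointer into the blob only if the bytes are
// already usable as a C string, otherwise nullptr (caller must copy).
const char *exportRef(const Region &r) {
    assert(r.offset + r.length <= r.blob->size());
    const char *p = r.blob->data() + r.offset;
    // std::string guarantees data()[size()] == '\0', so p[r.length] is
    // readable even when the region ends at the end of the buffer.
    if (p[r.length] != '\0')
        return nullptr;                    // no terminator after the region
    if (std::memchr(p, '\0', r.length) != nullptr)
        return nullptr;                    // embedded \0 would truncate it
    return p;
}
```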

> I think we are at the point
> where we can fully drop the \0 without too much headache, but it's
> also true that in all cases where we tokenise a string there are
> separators we can nuke and replace by \0's... However, with the \0
> casting between MemoryRegion and String is tricky (needs to copy if
> there is no \0) and tokenising gets destructive as it destroys the
> original string by replacing separators by \0..

Well, tokenising should really be replaced by substringing, which could
mean having to drop strtok().
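
As a sketch of what substring-based tokenising could look like: tokens
are recorded as (offset, length) pairs into the original buffer, which is
never modified, whereas strtok() overwrites each separator with \0.
Token, splitTokens and tokenText are hypothetical names, and a plain
std::string stands in for the managed buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// A token is just a subrange of the source buffer; no bytes are copied.
struct Token {
    std::size_t offset;
    std::size_t length;
};

// Split buf on sep without touching buf. Runs of separators are skipped,
// which matches how strtok() treats consecutive delimiters.
std::vector<Token> splitTokens(const std::string &buf, char sep) {
    std::vector<Token> out;
    std::size_t pos = 0;
    while (pos < buf.size()) {
        if (buf[pos] == sep) {
            ++pos;
            continue;
        }
        std::size_t end = buf.find(sep, pos);
        if (end == std::string::npos)
            end = buf.size();
        out.push_back(Token{pos, end - pos});
        pos = end;
    }
    return out;
}

// Materialise one token as an independent string (copies the bytes).
std::string tokenText(const std::string &buf, const Token &t) {
    assert(t.offset + t.length <= buf.size());
    return buf.substr(t.offset, t.length);
}
```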

> Append operation on String/MemoryRegion objects is easy in this model,
> but if the region is not at the end of the MemoryBlob or if the result
> gets too large then it will need to trigger a copy to a new MemoryBlob of
> sufficient size.

Yes.

> A special case to the above is if the appended data already follows the
> first linearly. It's then a simple merge operation of the two regions.

Yes.
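
The three append cases discussed above (merge of adjacent regions,
in-place growth at the tail of the blob, copy to a new blob) might be
sketched as follows. Blob, Region, AppendKind and append are hypothetical
names, the growth policy for the new blob is an arbitrary assumption, and
dst and src are assumed to be distinct objects.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <vector>

struct Blob {
    std::vector<char> bytes;  // fixed capacity
    std::size_t used = 0;     // high-water mark
    explicit Blob(std::size_t cap) : bytes(cap) {}
};

struct Region {
    std::shared_ptr<Blob> blob;
    std::size_t offset;
    std::size_t length;
    const char *data() const { return blob->bytes.data() + offset; }
    std::size_t end() const { return offset + length; }
};

enum class AppendKind { Merged, InPlace, Copied };

AppendKind append(Region &dst, const Region &src) {
    assert(&dst != &src);
    // Special case: src already follows dst linearly in the same blob.
    if (src.blob == dst.blob && src.offset == dst.end()) {
        dst.length += src.length;
        return AppendKind::Merged;
    }
    Blob &b = *dst.blob;
    // dst ends at the high-water mark and the blob has room: grow in place.
    if (dst.end() == b.used && src.length <= b.bytes.size() - b.used) {
        std::memcpy(b.bytes.data() + b.used, src.data(), src.length);
        b.used += src.length;
        dst.length += src.length;
        return AppendKind::InPlace;
    }
    // Otherwise copy both parts into a new, larger blob.
    const std::size_t total = dst.length + src.length;
    auto fresh = std::make_shared<Blob>(2 * total + 16);  // arbitrary policy
    std::memcpy(fresh->bytes.data(), dst.data(), dst.length);
    std::memcpy(fresh->bytes.data() + dst.length, src.data(), src.length);
    fresh->used = total;
    dst = Region{fresh, 0, total};
    return AppendKind::Copied;
}
```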

> Other modifications of String/MemoryRegion content generally require a
> COW operation.

It depends: I expect a rather common case to be when only one String
owns a Buf/MemoryBlob. In that case modifications are cheap.
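
A minimal sketch of that fast path, assuming the reference count is
visible to the String: the buffer is only cloned when somebody else still
shares it. CowString and its methods are hypothetical, std::shared_ptr
stands in for the real reference counting, and use_count() is only a
reliable ownership test in single-threaded code.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

class CowString {
public:
    explicit CowString(const std::string &s)
        : buf_(std::make_shared<std::string>(s)) {}

    // Copies share the buffer (compiler-generated copy constructor).

    char at(std::size_t i) const {
        assert(i < buf_->size());
        return (*buf_)[i];
    }

    // Modify one byte. Cheap when this object is the only owner; otherwise
    // detach first by copying the buffer (the COW step).
    void set(std::size_t i, char c) {
        assert(i < buf_->size());
        if (buf_.use_count() > 1)
            buf_ = std::make_shared<std::string>(*buf_);
        (*buf_)[i] = c;
    }

    bool sharesBufferWith(const CowString &o) const { return buf_ == o.buf_; }
    const std::string &str() const { return *buf_; }

private:
    std::shared_ptr<std::string> buf_;
};
```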

> As you already noted MemoryRegion is sufficiently small to be passed
> around by value just like if it was a plain pointer.
>
> Another question to ask is if there is a need for a vectorized String
> built of many non-linear segments. But my gut feeling is the same as
> yours, that this should be a separate class. Main use is in writev kind
> operations of composed data.
>
> Users needing string-like operations (other than append) on such
> vectorized data are probably best served by linearising the data first.

Yes.
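
A rough sketch of such a separate class, assuming a POSIX system for
struct iovec: segments are kept as (pointer, length) pairs, can be handed
to writev() as an iovec array, and are linearised into one contiguous
string when string-like operations are needed. SegmentList and its
methods are hypothetical names, and the segments do not own their bytes
here, which a real implementation would need to handle.

```cpp
#include <sys/uio.h>  // struct iovec (POSIX)
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One non-owning piece of the composed data.
struct Segment {
    const char *data;
    std::size_t length;
};

class SegmentList {
public:
    // Append is cheap: just remember where the bytes live.
    void append(const char *data, std::size_t length) {
        assert(data != nullptr || length == 0);
        if (length > 0)
            segs_.push_back(Segment{data, length});
    }

    // Build the array that writev(fd, v.data(), v.size()) would take.
    std::vector<iovec> toIovec() const {
        std::vector<iovec> v;
        v.reserve(segs_.size());
        for (const Segment &s : segs_) {
            iovec io;
            io.iov_base = const_cast<char *>(s.data);  // writev only reads
            io.iov_len = s.length;
            v.push_back(io);
        }
        return v;
    }

    // Copy all segments into one contiguous string.
    std::string linearise() const {
        std::string out;
        for (const Segment &s : segs_)
            out.append(s.data, s.length);
        return out;
    }

private:
    std::vector<Segment> segs_;
};
```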

-- 
 /kinkie
Received on Tue Aug 26 2008 - 22:24:56 MDT
