Re: StringNg review request: SBuf.h

From: Kinkie <gkinkie_at_gmail.com>
Date: Mon, 27 Sep 2010 00:47:22 +0200

On Wed, Sep 22, 2010 at 3:25 AM, Alex Rousskov
<rousskov_at_measurement-factory.com> wrote:
> On 09/21/2010 11:29 AM, Kinkie wrote:
>>>
>>> Review is based on lp:~kinkie/squid/stringng r9471.
>>
>>>> #define SQUIDSBUFPRINT(s) s.length(),s.exportRawRef()
>>>
>>> Missing parens around s.
>>
>> It's not causing iesues, but won't hurt either.
>
> It is a bug.
>
>
>> SquidString has the same issue in trunk.
>
> Fixed.
>
>
>>>>     u_int64_t qset; ///number of assigns/appends without content copy
>>>>     u_int64_t sset; ///number of assigns/appends with content copy
>>>
>>> I am not sure what the actual intent here is, but the description does
>>> not
>>> match use. qset is used in clear() and other methods unrelated to
>>> assigns/appends.
>>
>> How about reworking the stats so that they actually count the
>> performed operations? (e.g. assign fast/slow, append fast/slow,
>> compare fast/slow etc.)?
>
> My recommendation is to remove most of the stats until we know what we
> actually need. Leaving either (constructor, destructor) stats or alive
> instances stats may be a good idea. I do not see how the rest can be useful
> for performance analysis right now.

To get an idea about how effective SBuf useage is to callers.

> I do not insist on removing any stats though. You may have some
> well-justified needs for all of them. My only requirement is that all stats
> are correct (which is impossible without a good description).

>>>>  * Notice that users MUST NOT hold pointers to SBuf, as doing so
>>>>  * moots the whole point of using refcounted buffers in the first place.
>>>>  */
>>>
>>> Remove as technically inaccurate. Besides, users MUST NOT do many things.
>>
>> ok. How about adding some text discouraging pointers and references?
>> (const ref is ok unless non-const operations are needed)
>
> I am against it. SBuf.h is not the right place for a "Good Coding Practices"
> tutorial and your attempts to summarize the rules in a couple of sentences
> are likely continue to fail because there are many cases and many
> exceptions. If you feel like writing about general rules for proper use of
> refcounted objects, add a RefCount.dox file.

Ok. If such a need arises, a wiki article would be probably a better idea.

>>>> class SBuf {
>>>>     public:
>>>>     typedef signed int size_type;
>>>
>>> I think we should support 3GB offsets and such. Could be handy for
>>> referencing memory cache areas without copying. Is there any reason to
>>> limit
>>> this to int? How about using size_t or ssize_t?
>>
>> 3Gb monolithic buffers are not a very good idea imvho (see the maxSize
>> tuneable).
>
> Why not?! When I am writing an mmapped shared-memory cache, I do not have
> the luxury of allocating small buffers. I may want to maximize the size of
> the segment to minimize the number of segments used. Whether that would be
> the best design is debatable, but it seems wrong to decide that debate by
> limiting the maximum SBuf size like that.

The limit is aribtrary, and there is no code that could hit it now.
I'm all for changig it when there's a need. I just can't see the need
now.

>> Given this, the best datatype is probably int32_t - helps
>> for printf-style portability.
>
> You already have a printf-friendly methods. I doubt we need to limit the
> buffer size just because printf() is difficult to deal with, especially when
> new code should avoid printf().
>
>>>>     ///  Flag used by search methods to define case sensitivity
>>>>     static const bool caseInsensitive = false;
>>>>     static const bool caseSensitive = true;
>>>
>>> Convert to enum?
>>
>> What is the benefit? (my ignorance)
>
> Groups common names together, allows missing switch() value checks, may
> allow for safer/aggressive compiler optimization, does not force you to make
> up constant values, assures unique values by default, disables bad behavior
> such as taking caseSensitive address, etc. Not a big deal in this context
> though.

Implemented via typedef enum SBufCaseSensitive in SBuf.h

>
>
>>>>      *
>>>>      * A SBuf can be empty, but can never be undefined. In other words
>>>> there is no equivalent
>>>>      * of NULL. defined/undefined semantics need to be implemented by
>>>> the
>>>> caller.
>>>>      */
>>>
>>> Remove: Avoid documenting what SBuf cannot be. If you insist on
>>> discussing
>>> the non-existent APIs, move that discussion to the SBuf.dox file.
>>
>> How about moving a simplified version of it that to the class
>> documentation?
>> I have committed such an attempt.
>
> You are still documenting negatives, which is counter-productive. I know
> what you mean by "can never be undefined" but a new developer will not know
> how to interpret that because there is no defined() function in C/C++.
>

Is
"A SBuf can be empty (zero-length), but always points to a valid
memory address."
clearer?

>>>>     /** Constructor: import c-style string
>>>>      *
>>>>      * Create a new SBuf containing a COPY of the contents of the
>>>>      * c-string
>>>>      * \param S the c string to be copied
>>>>      * \param Ssize the size of S. If it is SBuf::npos or unspecified,
>>>>      *              S must be null-terminated
>>>
>>> Remove Ssize parameter. You have too many sizes. If you want a
>>> two-parameter
>>> conversion from c-string, define another constructor with just two
>>> parameters. See std::string.
>>
>> std::string API does not cover specifying pos and length when
>> importing c-strings.
>> While I agree that mimicking that API is good, isn't this variant more
>> effective than copy-all-then-substring (or even worse
>> manipulate-in-c-then-import) ?
>
> I am not against specifying pos and n when importing. I am against
> specifying size, pos, and n. The size argument is not needed (and you made
> mistakes when trying to tie all three together).

Ok, removed.

>
>
>>>>      * It is the caller's duty to free the imported string, if needed.
>>>>      * \param S the c string to be copied
>>>>      * \param Ssize the size of S. If it is SBuf::npos or unspecified,
>>>>      *              S must be null-terminated
>>>>      * \param pos how many bytes to skip at the beginning of the
>>>> c-string
>>>>      * \param n how many bytes to import into the SBuf. If it is
>>>> SBuf::npos
>>>>      *              or unspecified, imports to end-of-cstring
>>>>      */
>>>>     SBuf&  assign(char const *S, size_type Ssize = npos, size_type pos =
>>>> 0,
>>>> size_type n=npos);
>>>
>>> Again, remove Ssize parameter. It is not needed.

Done.

>> Are you sure? How about the case of non-null-terminated strings (e.g.
>> stuff coming out of read(2)? I'd have to offer two variants of the
>> API, one without Ssize and one with, and then there'd by type clashes
>> for specifying pos and n, which in turn means that a single call turns
>> into import-all-then-substring..
>
> You need neither size nor two variants of the API. Everything you can do
> with (size, pos, n), you can do with just (pos, n).
>
>
>>>>     /** clear the SBuf as if it was just created.
>>>>      *
>>>>      */
>>>>     void clear();
>>>
>>> s/clear/reset/ in the description. Document whether the already allocated
>>> memory stays allocated. That fact is important for performance-sensitive
>>> caller code.
>>
>> Fixed. Changed code behaviour to free memory (it's a #define in .cc),
>> documented.
>
> Freeing memory is the wrong default because it introduces significant
> performance overheads that the callers cannot avoid. It also hurts the
> common case of reusing the same buffer.
>
> When non-freeing is the default (as it used to be), if the caller does not
> need the buffer any longer but has to keep it for a long time (an unusual
> condition!), she can just assign SBuf() to it.

Reverted change. Memory freeing is now lazy.

>>>>     SBuf&  append(const char * S, size_type Ssize = npos, size_type pos
>>>> =
>>>> 0, size_type n = npos);
>>>
>>> Remove Ssize.
>>
>> See above: same question as for constructor and assign
>
> Same answer.

Same change.

>>>>     /** Assignment operation with printf(3)-style definition
>>>>      * \note arguments may be evaluated more than once, be careful
>>>>      *       of side-effects
>>>>      */
>>>>     SBuf&  Printf(const char *fmt, ...);
>>>
>>> Add "Avoid this and other ... methods. For transition only."?
>>> We should be using ostream formatting instead.
>>
>> Agreed, if only I could think of a sane way to do so (my ignorance).
>
> If you point me to a specific (but hopefully rather common) use case where
> you cannot use ostream, I will try to come up with a sane way of using it.
> Most likely, it will require adding a FixedSizeStream class of sorts.

My ignorance is in a way to implement an ostream interface with a
custom backing store.
Amos has shed some light for me (StoreEntryStream), I'll need help or
some more detailed documentation to do it on my own.

>>>>     /** direct-access set a byte at a specified operation.
>>>>      * \param pos the position to be overwritten
>>>>      * \param toset the value to be written
>>>>      * \throw OutOfBoundsException when pos is of bounds
>>>>      * \note performs a copy-on-write if needed.
>>>>      */
>>>>     void setAt(size_type pos, char toset);
>>>
>>> Perhaps document whether setting the next-to-last character is allowed?
>>> Sometimes string APIs allow for that...
>>
>> Any character can be set. I do not understand why there may be a
>> distinction..
>
> Because next-to-last character exceeds string length: s[s.size()+1]
> Some string APIs allow that and treat it as append(). I doubt we should, but
> I did not check what SquidString and std::string do.

Ok. Then no, it's not allowed. Use append(&char,0,1) for that. Could
make an append(char) method if needed.

> You may want to document what "in bounds" mean exactly to avoid such
> questions: 0 <= pos < size()

Documented.

>>> It would be better to avoid bare "this" in the above comments. It is
>>> awkward
>>> to read.
>>
>> I can't think of any better way to call it. I've removed the
>> reference, but I'm not sure it's clearer.
>> Copy-paste:
>>     /** compare to other SBuf, strcmp-style
>>      *
>>      * Compare two SBufs. behaves like strcmp(3) or strncmp(3) depending
>>      * on the value of the case_sensitive parameter
>
> Avoid repetition. You already document what the parameter does, below. And I
> think you meant strcasecmp(3). Consider:

Yes

>
> /** compare to another SBuf, strcmp-style
>
>>      * \param case_sensitive one of SBuf::caseSensitive or
>> SBuf::caseInsensitive
>>      * \retval>0 argument is greater
>>      * \retval<0 argument is smaller
>>      * \retval 0 argument has the same contents
>>      */
>
> Now the wording is much better but the meaning is wrong :-). We should
> return a positive number if the argument is smaller. "This" is the implicit
> _first_ parameter of strcmp(). Your "argument" is the second parameter...

Tried another wording. But I really need help from some native speaker
in order to really define it well.

>>>>     void terminate();
>>>
>>> Remove. Too dangerous if the effect is not persistent. You do not
>>> implement
>>> persistent termination, and we probably do not need a persistent
>>> termination
>>> anyway.
>>
>> Moved to private member function. I like the idea of it being a
>> visible action (as opposed to hand-inlining it)
>
> Agreed.
>
>>>>     /** Import a c-style string into the memory-managed world
>>>>      *
>>>>      * The char* MUST have been dynamically allocated via new,
>>>
>>> c-strings are usually allocated using new[] rather than new.
>>
>> changed the documentation; since it's freed via xfree, it must be
>> allocated via xmalloc.
>> The implementation is transitional, what it SHOULD do is construct a
>> zero-copy MemBlob out of the user-supplied c-string, and it would be
>> freed as a MemBlob (currently xfree, when implementation is finalized
>> memFreeString).
>> But I agree that it is a dangerous API and the benefits probably do
>> not balance the risks, so I will not object to removing it.
>
> If the old classes do not have a similar method or it is not used often, I
> suggest removing it. If you could not keep new/free consistent while
> confined to SBuf, I am sure we will get it wrong in the user code.

Ok.

>>>>     /** Copy SBuf contents into user-supplied C buffer.
>>>>      *
>>>>      * Export a copy of the SBuf's contents into the user-supplied
>>>>      * buffer, up to the user-supplied-length.
>>>>      * The exported buffer is guarranteed to have a terminating \0, but
>>>>      * may truncate contents if the SBuf is too big.
>>>>      * \return 0 if all is OK
>>>>      * \return num the number of actually-copied chars, INCLUDING the \0
>>>>      * \note uses xstrncpy as low-level function
>>>>      */
>>>>     size_type copy(char *buf, size_type buflen) const;
>>>
>>> Do we really need to zero-terminate the above? This low-level method may
>>> be
>>> handy for I/O which does not require zero-termination. We should minimize
>>> the number of APIs that terminate, IMO.
>>
>> Zero-termination is performed by xstrncpy(), AFTER the copy is done.
>> It doesn't cow().
>> Should I mention this in the documentation instead?
>
> My point is quite different. Do we really need to zero-terminate the copied
> data? I suspect that we do not and should not.
>
> The fact that your implementation currently does terminate because it uses
> xstrncpy() is irrelevant.
>
> Moreover, your implementation is buggy because xstrncpy does not guarantee a
> full copy for binary data. Our xstrncpy also writes beyond buflen and the
> standard strncpy does not guarantee termination. Get us out of this snake
> pit!

Gah! Right. Moving to xmemcpy()

> The correct implementation should probably use memcpy().
>
> I am guessing we do not care about overlapping regions in this specific
> method because copying from SBufs to SBuf is only possible if the two have
> different MemBlobs. The implementation should document this assumption or
> even assert it.

Copy among SBufs doesn't really copy data, so there is no issue there.

>>>>     char* exportRawRef() const;
>>>
>>> I do not like the Ref() suffixes. You are not really exporting a
>>> reference
>>> to internal storage. You are just providing a raw pointer to it. Call
>>> this
>>> "raw()" or, following std::string convention, "data()".
>>
>> how about "internalStorage", or "getRawDataPtr"? I'd like to keep very
>> visible also in calling code the fact that this is a dangerous call.
>> Also, this call should be mostly usedful during the transition.
>
> internalStorage() would be better than exportRawRef(). IMO, raw() is visible
> enough though. How about rawStorage() as a compromise?

As discussed on IRC, will move to
const char * rawContent() const
(-> raw access to memory for READING)
char * rawSpace()
(-> raw access to memory for WRITING, does cow(). Return a pointer to
the first unused byte in the backend store)

There is a problem with this approach though. If we write to the SBuf
outside its import methods, we also need to signal it how much we
wrote (e.g. a method "haveAppended"). But it's lots of layer-breaking
:(

> Why is this method const if we are giving writeable access to our internals?
> See below for a summary table related to this question.

Writing to the backing store must be avoided; unfortunately there is
no way I am aware of to prevent it. This is a dangerous method, and
must be used with caution. At the same time, there are a lot of
transitional scenarios (and quite a few non-transitional ones) where
reading the storage without copying it is worth the risk IMVHO.

> BTW, this is another case where cow() is missing. Without cow(), it is not
> possible to use exportRawRef() safely, in general:
>
>    b1 = "foo";
>    b2 = b1;
>    ...
>    read(b1.exportRawRef(), 3); // BAD: corrupts b2 but should not

exportRawRef was not designed for this. absorb() was designed (with
faults) for this, as in:
SBuf b1;
char *buf=xmalloc(bsize);
rd=read(buf,bsize);
b1.absorb(buf,bsize,rd);

> Another missing thing here is an API to set the length after a read() or
> similar call fills the raw storage:
>
>    b.reserve(1024); // allocate
>    sz = read(b.exportRawRef(), b.capacity()); // fill
>    b.length(sz); // sync
>    // now b is usable again
>
> As the above example illustrates, we also need SBuf::capacity().
>
> All of this string-unrelated, direct-access, low-level stuff is the price
> for merging String and Buffer into SBuf, BTW.

It's more risky than without, but I am against procecting us from
ourselves too much. Rope does have some uses sometimes.

>>>>     char* exportTRef();
>>>
>>> See above. Similar comments apply here.
>>>
>>> Let's call this c_str(), following std::string convention.
>>
>> done.
>
> Can we make c_str() return a "const char *" pointer? Have you found many
> uses where the caller would need to modify the returned zero-terminated
> string? If you are not sure, make it const.

Done.
It actually uncovered a couple of bugs.

> To summarize and polish both for const:
>
>    // not 0-terminated, writeable internal storage; AVOID
>    char *rawStorage();
>
>    // terminated c-string; may not be internal storage; SLOW
>    const char *c_str();
>
> Eventually, we should make c_str() a const method but let's not go there
> now. You may add a TODO about that.

Both are now const. Easier to make them non-const if needed than the opposite.

>>>>     /** slicing method
>>>>      *
>>>>      * extracts a SBuf containing the portion from<i>from</i>  to
>>>> <i>to</i>,
>>>>      * intersected with the actual SBuf length.
>>>
>>> Awkward description, especially because we do not extract but erase.
>>> Consider: Removes SBuf prefix and suffix, leaving a sequence of bytes
>>> from
>>> <i>from</i>  to<i>to</i>.
>>
>> Fixed.
>>
>>> However, I think we should follow std::string lead here and change the
>>> parameters to be "pos" and "n". I do not know why they picked those, but
>>> let's be consistent with them and with our other methods. They also do
>>> not
>>> clash with "to" and "from" words in the description :-).  I would also
>>> rename this to leave(), which will help you to check that all callers use
>>> the right params after the parameters change :-).
>
> You do not want to use pos and n for chop(), do you? Using the same (pos,n)
> interface may make the user code easier to write and review. Imagine
> constantly thinking: "is the second parameter 'n' or 'to'?". I think we
> should be consistent here.

I agree. Done. It also actually makes error handling clearer and simpler.

>>>>  u_int16_t nreallocs; /// number of reallocs this SBuf has been subject
>>>> to
>>>
>>> Since nreallocs is copied on assignment, it's true meaning is rather
>>> difficult to define. However, it is _not_ the number of reAlloc() calls
>>> that
>>> _this_ SBuf has seen.
>>
>> nreallocs is meant to be a hint to the memory management routines to
>> define the likeness of needing a grow operation, and thus to define
>> how much headroom to leave. It's currently unused in favour of a
>> static (20% headroom + rounding) approach.
>> Any feedback on optimizing the heuristic and when to propagate the
>> hint is welcome.
>
> My feedback is to remove the unused member until somebody properly defines
> and uses it.

I've tried to define it instead. The algorithm needs tweaking, but
this should already
give an idea about what it may be about. The useage scenario it is
designed for is:
SBuf s;
while (s.append(something())) {
   SBuf read=s.consume(chunk)
   doSomethingWith(read);
}
s will get realloc'ed more and more; by giving more headroom we ensure
that it will adapt to the useage pattern.

SBuf::size_type SBuf::estimateCapacity (SBuf::size_type desired) const
{
    if (nreallocs < 10)
            return 2*desired;
    if (nreallocs < 100)
            return 4*desired;
    if (nreallocs < 1000)
            return 8*desired;
    return 16*desired;
}

>>>>     static const size_type malloc_overhead = 8; //tuneable.
>>>
>>> Missing description. What is it?
>>
>> Another yet-unused optimization for memory management.
>> The general code to define how much RAM to allocate for a SBuf
>> requesting "bytes" storage would be something similar to:
>> round_to_boundary(bytes + malloc_overhead + bytes *
>> multiplier_function(nreallocs)) - malloc_overhead
>>
>> where:
>> - round_to_boundary (MemBlob::memAllocationPolicy) aims making the
>> memory block page-aligned as much as possible
>> - malloc_overhead is a constant (generally os- and libc- dependent)
>> which takes into account libc's malloc overhead
>> - mutliplier_function (SBuf::estimateCapacity) would try to give more
>> headroom to long-lived SBufs with lots of append operations
>
> There is a good change the actual implementation will be very different, and
> that the optimizations you are considering will be done elsewhere. Please
> save us all time and remove unused code. Your ideas for future optimizations
> do not become less valuable if they are not included in SBuf.* as unused
> code. In fact, their current net value increases!

Removed. IMVHO we should however ensure that those optimizations are
actually made somewhere. This is one area where I understand that
Varnish does things right.

>>>>     _SQUID_INLINE_ char * buf() const;
>>>>     _SQUID_INLINE_ char * bufEnd() const;
>>>
>>> Why const if they allow SBuf modification?
>>
>> They don't allow modifications. They are private functions to get
>> pointers and end of "our" store, for interfacing with the legacy C
>> world.
>
> They should return "const char *" if they do not allow modification.

I believe I wasn't clear: those are simply private mehtods meant to
help with recurring activities.

>>>>     _SQUID_INLINE_ size_type estimateCapacity(size_type desired);
>>>
>>> Should be const if not static.
>>
>> Done. Should be inline, really.
>
> I doubt it should be inlined, but both const and static can be inlined.

Const. Not static as it relies on nreallocs.

>>> I noticed that you print "this" and similar raw pointers in debugging
>>> messages. In my experience, those quickly become unusable in non-trivial
>>> logs because when old objects are deleted, the new ones reuse their
>>> pointers. Consider adding unique SBuf identifiers instead. We can always
>>> remove them later if the overhead of extra 32 or 64 bits becomes too
>>> great
>>> after the transition is over.
>>
>> Sounds a good idea.
>
> If you are up for it, I will try to contribute an Id class that will
> implement this so that you can just add a printable member of the right type
> and be done with it. Many things need unique IDs. Let me know.

While I'm at it, I may just do it.
How about a signature like this?:
class InternalObjectId {
  InternalObjectId(const char * classname);
  uint32_t getOid;
  private:
  uint32_t oid;
}
ostream & operator << (ostream &)

>> How about making this depend on a SBUF_DEBUG or
>> some other preprocessor macro?
>
> I want to be able to ask users for logs without asking them to recompile. I
> think a few extra bytes are worth it in most contexts, but others may
> disagree.

Feedback anyone?

>>>> int SBuf::plength() const
>>>
>>> Do we need to place the return type on its own line in .cci?
>>
>> Should get fixed by running the formatter tool.
>
> AFAIK, the formatter tool does not fix this, unfortunately. I see Polygraph
> and other copied code in Squid that is not formatted correctly because of
> this.

Done in .cci and .cc

>>> For SBuf.cc:
>
>>>> SBuf::SBuf(const char *S, size_type Ssize, size_type pos, size_type n)
>>>>     : off_(0), nreallocs(0) //len_ and store_ defined inside the call
>>>> {
>>>>     //very similar to SBuf::append(const char*...). Can be refactored?
>>>
>>> Same as above.
>>
>> I can't really think how. Most of the code adjusts locals, and would
>> need to be almost-duplicate-but-slightly-differetn anywys.
>
>
> We will need to come back to this after Ssize is gone and grow() is fixed
> then. I cannot offer a good implementation sketch when the API is broken.
>
>
>>>> void SBuf::grow(size_type minsize)
>>>> {
>>>>     debugs(SBUF_DEBUGSECTION, 7,
>>>>         MYNAME<<  "this="<<(void *)this<<
>>>>         ",target="<<  minsize);
>>>>     if (minsize>  maxSize)
>>>>         throw SBufTooBigException(__FILE__,__LINE__);
>>>>     if (minsize<  length()) { //nop. We're already big enough.
>>>
>>> The length() is irrelevant here. You need to check available space size,
>>> not
>>> allocated content size. More on the grow() checks below.
>>
>> Hm.. length() is the safe choice. Otherwise we also have to take care
>> of offsets in the MemBlob.
>
> While indeed very safe, length() is the wrong choice. We must grow if we do
> not have enough allocated space. We must not grow if we have enough
> allocated space. How much content we currently have in that space is not a
> factor:
>
>    SBuf b; // empty buffer
>    b.grow(100); // OK, will realloc because length() is zero
>    b.grow(100); // BAD: will realloc again
>    b.grow(100); // BAD: will realloc again
>    b.grow(100); // BAD: will realloc again
>
> Please fix.

Changed to query the store.

> (This and several related problems probably come from your internal-only use
> of public grow(). Unfortunately, it confused things quite a lot for me, but
> now there is a clear path forward).
>
>
>>>
>>>>         debugs(SBUF_DEBUGSECTION, 7, HERE<<  "not growing");
>>>>     }
>>>>     reAlloc(estimateCapacity(minsize));
>>>> }
>>>
>>> Optimization: We should probably memmove() before we reAlloc() if it
>>> helps
>>> to satisfy minsize.
>>
>> If we realloc often, it probably means "this" SBuf needs more
>> headroom. see the allocation algorihtm above.
>> So reAlloc-ing is IMVHO not a bad thing per se.
>
> The simple optimization I suggested does not preclude more complex
> optimizations in the future and does not make them less effective. In other
> words, "allocate more" may be a valuable addition in the future, but if we
> can postpone a memory allocation now, we probably should.
>
> I certainly do not insist on you implementing the memmove optimization if
> you do not want to. I do not even insist on adding a comment that it should
> be done. If you can remove the unused code related to the other, far more
> complex and uncertain optimizations, that would be nice.

Uncertain, I agree :)
Far more complex, I don't 100% agree: you have to remember that a SBuf
doesn't own a MemBlob, so a memmove choice would have to include
checks on:
- is the SBuf the sole owner of the Blob
- is it worth it? probably needs some relatively complex calculations
on how much data there is in the SBuf, and how much unused space there
is at the head of the MemBlob.

Adding a comment, I agree, and have done so.

>>>> void SBuf::clear()
>>>> {
>>>> #if 0
>>>>     //enabling this code path, the store will be freed and reinitialized
>>>>     store_=getStorePrototype(); //uncomment to actually free storage
>>>> upon
>>>> clear()
>>>> #else
>>>>     //enabling this code path, we try to release the store without
>>>> deallocating it.
>>>>     // will be lazily reallocated if needed.
>>>>     if (store_->RefCountCount() == 1)
>>>>         store_->bufUsed=0;
>>>
>>> Please remove this bufUsed optimization. It appears to be broken (you
>>> increment bufUsed inconsistently), it is complex, and we can always add
>>> it
>>> later.
>>
>> Done. I've switched the #if 0 around
>
> Which was the wrong change, IMO (see above). I just asked for the
> bufUsed-related lines to be removed, and not the correct algorithm to be
> replaced with the wrong one.
>
> In general, please avoid unrelated changes to the code under review until
> the code is accepted. Otherwise, you create a moving target for the
> reviewers.
>
>
> In fact, I like the bufUsed optimization a lot now, after I realized what
> your grow() and canAppend() hacks actually do!  We should restore it as it
> may be very important in common cases. However, we should not be accessing
> refCount or bufUsed directly like that (here and elsewhere in SBuf). MemBlob
> should have an API to keep used portion size up-to-date. We need to come
> back to this after the other bugs are fixed. Please remind me if I forget or
> add a TODO.

Unfortunately it stems from the fact that MemBlob is meant to be a
proveate-ish class.
We can add an API that performs these operations, but it's going to be
a layer-breaker of sorts anyways.
Agreed on reworking this later. Added Todo.

>
>
>>>> SBuf&  SBuf::append(const SBuf&S)>>>> {
>>>>     debugs(SBUF_DEBUGSECTION,7,
>>>>         MYNAME<<  "this="<<  (void *)this<<
>>>>         "appending="<<(void *)&S<<")");
>>>>     if (!store_->canAppend(bufEnd(),S.length())) {
>>>>         grow(length()+S.length());
>>>>     }
>>>
>>> Please remove all such canAppend() guards. Just call grow(). It is much
>>> safer than risking to forget a guard. Your grow() should not reAlloc if
>>> growth is not needed. You may rename grow() to assureSpace() or similar
>>> if
>>> you prefer but do not split the growth need check and the realloc call in
>>> the "user" code.
>>
>> How about refactoring it? A prepareForAppend that does the guards and
>> grows if needed (grow() still needs to be a public API, the caller may
>> know more than the low-level does)
>
> You do not need specialized prepareForAppend() because grow(), being a
> public API, must do the same checks and grow() is used not just before
> append().
>
> Here is the summary of steps, just in case something is still not clear:
>
> 1. Remove all canAppend() calls, leaving unprotected grow() calls.
> 2. Implement grow() correctly. It may use canAppend() and
>   memmove() optimizations but it does not have to. See std::reserve()
>   for what grow() has to do.
> 3. Rename grow() to reserve() and keep it public.

I see your point now. Done; I've changed the semantics of reserve so
that it makes sure that the store can grow ON TOP of what's currently
used It was almost always used like that anyways.

>>>>     debugs(SBUF_DEBUGSECTION,8,
>>>>         HERE<<  "appending");
>>>>     store_->append(S.buf(),S.length());
>>>>     len_+=S.length();
>>>
>>> This is missing cow(). And there are many similar bugs elsewhere. The
>>> right
>>> fix, IMO, is to hide store_ and provide two inlined protected methods
>>> similar to these:
>>
>> No, it's not missing cow(). store_->canAppend guarrantees that by
>> directly appending we're not trampling on anyone else and that the
>> store has enough space to hold what we want to append. grow() will cow
>> for us otherwise.
>
> You are right, sorry. I should have picked a different suspect. There is no
> cow() needed in append() because of the canAppend optimization. I will check
> other suspects after the remaining bugs are fixed.
>
>
>>> MemBlob&store() { cow(); return *store_; } /// memory access for writing
>>> const MemBlob&store() const { return *store_; } /// read-only memory
>>> access
>>>
>>> This way, you do not need to remember or think about when to call cow(),
>>> which could be tricky at times. The beginning of cow() can be inlined to
>>> make initial "do we need to copy?" check very cheap.
>>
>> cow() doesn't know what we're trying to do, whether we're appending or
>> overwriting and if so how much or what.
>
> True. I did not understand the tricks you were doing with canAppend. I
> cannot think of a good way to make the above general protection compatible
> with canAppend right now. We will just need to be careful and call cow()
> when we need write access BUT not write access limited to the yet-unused
> portion of MemBlob.

This would probably go to a design document; the main trick is "if the
tail of the SBuf is at the tail of the MemBlob and the MemBlob has
space, then we can append without needing a cow. And by moving the
tail of the MemBlob, we're claiming ownership of the tail. If someone
else appends, they'll need to cow()".
This optimizes quite a lot the case of one-writer many-readers which I
expect to be quite common for e.g. protocol parsers: they get to
search-and-slice the read-buffer however they want, and they can
extract const-ish portions of the read-buffer for internal operations,
without copying the data.

>>>> int SBuf::compare(const SBuf&S, bool case_sensitive) const
>>>> {
>>>
>>> I think we should add an "n" parameter to limit the comparison scope to
>>> the
>>> first n characters. This will allow for much cheaper implementation of
>>> startsWith() and may be useful elsewhere as well. You are computing a
>>> similar parametr below anyway.
>>
>> addingthe param can be done, I wouldn't touch startsWith however..
>
>
> Why not? StartsWith is a prime candidate for using this parameter, is it
> not? It should also be a common call in Tokenizers so optimizing it is
> important, IMO.

Done.

>>>> SBuf SBuf::consume(SBuf::size_type howmuch)
>
>>> Do we really need to return consumed content?
>>
>> It is pretty handy; see how SBufUtil.cc:SBufSplit exploits it; and it
>> is a very cheap object to build.
>
> I would need a real/higher level code example to see the need (is not
> SBufUtil something you just added for the higher level code to use?), but
> this is not a big deal.
>
>
>> While waiting for the next round, I'll move on to MemBlob.
>
> Sounds good. I will probably look at Tokenizer soon, but I am worried about
> running out of time before at least something is completed and committed.
> Please do not add new stuff for now (unless requested) and focus on fixing
> the reviewed code. It will require at least one more iteration.

Ok, first iteration done.

Thanks!

-- 
    /kinkie
Received on Sun Sep 26 2010 - 22:47:30 MDT

This archive was generated by hypermail 2.2.0 : Tue Sep 28 2010 - 12:00:13 MDT