Re: [PATCH] Make SBuf::find_first_of O(N) and implement find_first_not_of

From: Alex Rousskov <rousskov_at_measurement-factory.com>
Date: Mon, 16 Dec 2013 09:40:37 -0700

On 12/15/2013 04:54 AM, Kinkie wrote:

> attached patch is from branch lp:~squid/squid/characterset/

> + std::vector<bool> chars_; //std::vector<bool> is optimized

According to STL documentation for std::vector<bool> specialization:

> These changes provide a quirky interface to this specialization and
> favor memory optimization over processing (which may or may not suit
> your needs).

This is kind of the opposite of what we want. We want access speed, not
RAM or copying speed savings. I suspect we do not want std::vector<bool>
here! The same docs suggest using a different element type (char,
unsigned char, or custom) if a different optimization trade-off is
needed. I think we should follow that advice if we want to optimize this
(which was the primary point of doing a custom CharacterSet).

> It extracts some of the work done in parser-ng to implement Alex'
> Tokenizer API, bringing O(N) find-any{,-not}-of matching to SBuf.

> Note: in IRC discussions, Amos mentioned find_first_of and
> find_first_not_of not being naming convention compliant (that's right,
> they are modeled after std::string). I'm fine with changing them as a
> follow-up commit if this is the collegial decision; I'd avoid do to it
> as part of this merge.

I agree that the two should be renamed to use camelCase [as a follow up
commit].

> + * \return length() if all characters in the SBuf are from set

and the implementation matches that documentation:

> +SBuf::size_type
> +SBuf::find_first_not_of(const CharacterSet &set, size_type startPos) const
> +{
...
> + debugs(24, 7, "not found");
> + return length();
> +}

but STL docs say:

> Return Value
> The position of the first character that does not match.
> If no such characters are found, the function returns string::npos.

which makes more sense than returning length() IMO.

> + CharacterSet(const char *label, const char * const c) : name(label), chars_(std::vector<bool>(256,false)) {
> + size_t clen = strlen(c);
> + for (size_t i = 0; i < clen; ++i)
> + chars_[static_cast<uint8_t>(c[i])] = true;
> + }

Since we are using set.name [for debugging] without checking, please set
name to "anonymous" or similar if the label parameter is NULL.

> + size_t clen = strlen(c);

This variable can be const.

> + size_t clen = strlen(c);
> + for (size_t i = 0; i < clen; ++i)
> + chars_[static_cast<uint8_t>(c[i])] = true;

Please use add() to add characters to a set.

> + /// add a given char to the character set. c must be >= 0.
> + CharacterSet & add(const unsigned char c) {chars_[static_cast<uint8_t>(c)] = true; return *this; }

The "c must be >= 0" comment is redundant given that "c" is unsigned. If
you are worried about folks adding negative signed chars, try adding a
corresponding add(const signed char) method with a check.

> + /// name of this character set
> + const char * name;

The comment does not really document the data member (the data member
name itself says exactly the same thing already). Consider:

  /// optional set label for debugging ("anonymous" by default)

> +#include <vector>
> +
> +class CharacterSet

Missing class description. Consider:

  /// An optimized set of C chars,
  /// with a quick membership test and merge support.

> + /// characters defined in this set
> + std::vector<bool> chars_; //std::vector<bool> is optimized

s/defined/present/ or s/defined/included/

* CharacterSet creation and merging (+=) are heavy operations that do
not benefit from being inlined. I know it is a pain to deal with
Makefile dependencies, and I do not insist on this change, but please do
seriously consider placing them in .cc file.

FWIW, the following problems can be solved easier or better in a .cc
file. With a little bit of additional work, you can even remove #include
<vector> from CharacterSet users that way!

> + //precondition: src.chars_.size() == chars_.size()

There is no such precondition. I should be able to merge ALPHA and
NUMERIC sets, for example. Rewrite the += operator loop to simply
this->add() every character in the src set. Use std::for_each or another
<algorithm> for that if possible.

> + CharacterSet(const char *label, const char * const c) : name(label), chars_(std::vector<bool>(256,false)) {

> + std::vector<bool>::const_iterator s = src.chars_.begin();

> + const std::vector<bool>::const_iterator e = src.chars_.end();

> + std::vector<bool> chars_; //std::vector<bool> is optimized

Please avoid code duplication by typedef-ing std::vector<bool> as Chars
or similar and using the corresponding type name everywhere.

HTH,

Alex.
Received on Mon Dec 16 2013 - 16:41:11 MST

This archive was generated by hypermail 2.2.0 : Tue Dec 17 2013 - 12:00:11 MST