Re: [RFC] Tokenizer API

From: Amos Jeffries <squid3_at_treenet.co.nz>
Date: Tue, 10 Dec 2013 12:13:56 +1300

On 2013-12-10 08:00, Alex Rousskov wrote:
> Hello,
>
> The promised Tokenizer API proposal is attached. Compared to earlier
> proposals, this interface is simplified by focusing on finding tokens
> (requires knowledge of allowed and prohibited character sets), not
> parsing (requires knowledge of input syntax) and by acknowledging that
> real parsing rules are often too complex to be [efficiently] supported
> with a single set of delimiters. The parser (rather than a tokenizer) is
> a better place to deal with those complexities.
>
> The API supports checkpoints and backtracking by ... copying Tokenizers.
>
> I believe the interface allows for an efficient implementation,
> especially if the CharacterSet type is eventually redefined as a boolean
> array, providing us a constant time lookup complexity.

Agreed. +1 for going with this design.
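
For anyone following along, the checkpoint-by-copy idea can be sketched
roughly as below. SketchTokenizer, skip(), remaining() and parseGetSlash()
are made-up names for illustration only, not the attached API, and
std::string stands in for whatever buffer type the real Tokenizer wraps
(a real implementation would want a buffer that is cheap to copy).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the proposed Tokenizer; NOT the attached API.
class SketchTokenizer
{
public:
    explicit SketchTokenizer(const std::string &input) : buf_(input), pos_(0) {}

    /// consume the literal if the remaining input starts with it
    bool skip(const std::string &literal) {
        assert(pos_ <= buf_.size());
        if (buf_.compare(pos_, literal.size(), literal) != 0)
            return false;
        pos_ += literal.size();
        return true;
    }

    /// the input that has not been consumed yet
    std::string remaining() const { return buf_.substr(pos_); }

private:
    std::string buf_;   // assumption: the real buffer is cheaper to copy
    std::size_t pos_;   // offset of the next unparsed character
};

/// Consume "GET " followed by "/". On failure tok is left untouched.
bool parseGetSlash(SketchTokenizer &tok)
{
    SketchTokenizer attempt(tok);   // checkpoint: work on a copy
    if (!attempt.skip("GET ") || !attempt.skip("/"))
        return false;               // backtrack: tok was never modified
    tok = attempt;                  // commit the progress made by the copy
    return true;
}
```

The caller never has to undo anything: a failed attempt simply throws the
copy away, and a successful one is assigned back.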

Two requests for additional scope:
* can we place this in a separate src/parse/ library please?
  - we have other generic parse code that deserves to be bundled up
together instead of spread out. Might as well start that collection
process now.

* Let's do the charset boolean array earlier rather than later. The
existing ones are rather nasty but they do "work" right now. Doing it now
makes this project an optimization from start to finish.

CharacterSet.h:

#include <cstddef>
#include <cstdint>
#include <cstring>

namespace Parser {

class CharacterSet
{
public:
   /// build a set from the first len characters of c
   CharacterSet(const char * const c, std::size_t len) {
     std::memset(match_, 0, sizeof(match_));
     for (std::size_t i = 0; i < len; ++i) {
       match_[static_cast<std::uint8_t>(c[i])] = true;
     }
   }

   /// whether a given character exists in the set
   bool operator[](char c) const {
     return match_[static_cast<std::uint8_t>(c)];
   }

   /// add all characters from the given CharacterSet to this one
   void merge(const CharacterSet &src) {
     for (std::size_t i = 0; i < 256; ++i) {
       if (src.match_[i])
         match_[i] = true;
     }
   }

private:
   bool match_[256]; ///< one flag per possible octet value
};

} // namespace Parser

NP: most of the time we will want to define these CharacterSets as
global once-off objects. So I'm not sure the merge() method is
useful, but it is shown here for completeness in case we want it for
generating composite character sets.
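
If we do keep merge(), building a composite set once at startup could look
something like the sketch below. It repeats a compact CharacterSet so that
it compiles on its own; Alpha, Digit, AlphaNum and makeUnion() are
hypothetical names for illustration, not existing Squid symbols.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

namespace Parser {

class CharacterSet
{
public:
    CharacterSet(const char * const c, std::size_t len) {
        assert(c != nullptr || len == 0);
        std::memset(match_, 0, sizeof(match_));
        for (std::size_t i = 0; i < len; ++i)
            match_[static_cast<std::uint8_t>(c[i])] = true;
    }

    /// whether a given character exists in the set (one array lookup)
    bool operator[](char c) const {
        return match_[static_cast<std::uint8_t>(c)];
    }

    /// add all characters from the given CharacterSet to this one
    void merge(const CharacterSet &src) {
        for (std::size_t i = 0; i < 256; ++i) {
            if (src.match_[i])
                match_[i] = true;
        }
    }

private:
    bool match_[256];
};

} // namespace Parser

// Hypothetical helper: returns a new set holding the characters of a and b.
static Parser::CharacterSet
makeUnion(const Parser::CharacterSet &a, const Parser::CharacterSet &b)
{
    Parser::CharacterSet result(a); // copy, so the inputs can stay const
    result.merge(b);
    return result;
}

// Hypothetical global once-off sets; sizeof() - 1 drops the trailing NUL.
static const char AlphaChars[] =
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
static const char DigitChars[] = "0123456789";

static const Parser::CharacterSet Alpha(AlphaChars, sizeof(AlphaChars) - 1);
static const Parser::CharacterSet Digit(DigitChars, sizeof(DigitChars) - 1);
static const Parser::CharacterSet AlphaNum = makeUnion(Alpha, Digit);
```

The globals stay const after construction, and the three definitions sit in
one translation unit so they are initialized in declaration order.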

Amos
Received on Mon Dec 09 2013 - 23:14:04 MST
