Revised approach to fixing configuration syntax

From: Alex Rousskov <rousskov_at_measurement-factory.com>
Date: Thu, 26 Sep 2013 12:14:08 -0600

On 08/30/2013 11:39 AM, Alex Rousskov wrote:

> One of the key ideas behind "strict syntax" is that the token "type" can
> be and is determined by the parser (simple token, quoted string, regular
> expression, number, etc.) and not the caller. This is not the only way
> to support better syntax rules, but it is overall simpler and more
> flexible than the caller-determines-token-type alternatives (because the
> code can manipulate tokens with respect to comments, macros, line
> continuation, whitespace, and such without relying on callers to parse
> each character).

Hello,

    While I still believe that the above is correct, after many email
exchanges with Christos and several patches, I suspect we should not
implement the above. My concern is that if we do, the fixed syntax will
never become the default because there will be just too much work for
admins to rewrite their configurations, parts of which they may not even
control. Initially, I thought we can remain mostly backwards compatible
with the fixed syntax, but I think I lost that important goal along the
way to the strict syntax.

To void any misunderstanding, these are my thoughts. Christos has
supplied valuable insight and very useful code that brought me here, but
he has not reviewed and may not share my conclusions.

Hopefully, this discussion will help us find the right balance between
the needed syntax improvements and the need to remain backwards
compatible in the vast majority of cases.

I think the key problem here is our (or my) desire to use a simple
grammar definition for the strict syntax. I think we have to relax that
by using Squid context-aware parsers more aggressively. Also, we have to
make a couple of new syntax constructs less beautiful to minimize
collision chances. Initial deployment of v3.4 code exposed a few
unexpected (but unfortunately common) collisions.

Specifically, I propose the following:

A) Define Squid syntax as described in the grammar quoted at the end of
this email. Some adjustments would be necessary, I am sure, but I think
it is important to have a grammar definition as a starting point. They
key in that grammar is the ambiguous rule after "The caller must ask"
comment. If it works, it will allow us to support most of the old
configurations, especially when dealing with old options.

B) Adjust the low-level parsing code to provide the following methods,
each of which extract the requested thing and return true (on success)
or false (on failures). If they fail to extract what was requested, they
leave the parsing stream unchanged:

   NextChars(const char *); // the exact sequence of characters given
   NextToken(String &t); // next name or token as defined in the grammar
   NextValue(String &v, kind); // next value as defined in the grammar
   NextSpace(String *v); // Optional White Space (OWS) as in grammar
   NextLineEnd(); // end of line; may be \0 if preprocessor strips \n

   Mark CurrentPosition(); // current parsing position
   void RevertTo(Mark mark); // reverts low-level parsing state to mark

Also add the following class to properly manage "undo". Note that we
cannot just put parsed elements on stack any more because the next
caller may use a different syntax to parse the same sequence of characters.

   // remembers CurrentPosition() upon construction and undoes damages
   // using RevertTo upon destruction unless .success() has been called
   class ParseProgress {
   public:
       ParseProgress(): start(CurrentPosition()), succeeded(false) {}
       ~ParseProgress() { if (!succeeded) RevertTo(start); }
       bool success() { return succeeded = true; }
       bool failure() const { return false; } // for clarity

   private:
       Mark start; ///< where the parsing started
       bool succeeded; ///< whether our owner successfully parsed
   };

C) Add parameter parsing helpers sketched below, using the above API:

bool ParamWithValue(const char *name, String &value, bool required)
   ParseProgress progress;

   if (!NextChars(name))
       return progress.success();

   if (!NextChars("=")) {
       if (required) ... failure: missing value for name
       return false;
   }

   // we may need to make strictOrLegacy a fourth ParamWithValue()
   // parameter so that callers can control what kind of values
   // are allowed
   if (NextValue(value, strictOrLegacy))
       return progress.success();

   if (required)
       failure: missing value for name

   return false;
}

// directive parameter without a value
bool ParamWoutValue(const char *name)
{
   if (!NextChars(name))
       return false;

   if (NextChars("="))
       failure: unexpected value for name

   return true;
}

bool MoreParams()
{
    return NextSpace();
    // the line end will be checked for and absorbed
    // by the directive parser, not directive parameters parser
}

// called when all supported switches and parameter names
// have been exhausted; we do not expect a name=value thing next
// kills Squid on errors; otherwise returns false
bool UnsupportedParamName()
{
  ParseProgress progress;
  String name;
  if (NextToken(name) && NextChars("="))
      failure: unknown directive parameter name or switch: name
  return progress.failure(); // undo name extraction
}

Here is how a simple directive parameters parser would look like:

    do {
        String v;
        if (ParamWithValue("opt1", v, true)) ...
        else if (ParamWithValue("opt2", v, false)) ...
        else if (ParamWoutValue("opt3")) ...
        else if (!UnsupportedParamName()) {
            NextValue(v, ...) // anonymous parameter
            ...
        }
    } while (MoreParams());

The above can be rearranged so that the directive parameters parsing
code first extracts the parameter name and then compares it to supported
parameters (like most of the current code does). However, I suspect it
would create more problems than it would solve in some cases: The
extracted thing may not be a parameter name but an [anonymous] value,
and when you extract a value, you may want to tell the parser whether
using relaxed syntax is allowed, but you cannot tell the parser that
until you are sure that the next thing is not a parameter name (because
the relaxed value parser is going to eat the '=' sign)! Thus, you should
not extract a value until you know it is not a parameter name.

I am omitting some details, of course, but do you think this approach
with more explicit whitespace and relaxed value handling will cover
nearly all current use cases(*), allowing the new parsing mode to be
enabled by default soon? Or am I missing something again?

(*) except quoted strings with unsupported %macros in old configs, of
course, but we are willing to sacrifice those, I guess. The syntax also
allows comments where spaces are allowed, but we can disable that if
there are too many clashes.

Thank you,

Alex.

Squid configuration syntax:

  config = *( OWS directive OWS )

  directive = name *( RWS parameter ) OWS <new line>

  name = token
  parameter = anonymous_parameter / named_parameter

  anonymous_parameter = value
  named_parameter = name <=> value ; No spaces around = allowed!

  ; This is where the grammar is ambiguous!
  ; The caller must ask for either strict_value, relaxed_value, or
  ; "strict_value if possible and relaxed_value otherwise"
  value = strict_value / relaxed_value

  strict_value = token / percent / RE / fcall /
                 single_quoted_string / double_quoted_string

  ; any non-whitespace sequence that does not start with one of
  ; the two reserved prefixes: "squid::" and "regex::"
  relaxed_value = xchar *xchar ; with prefix restrictions

  ; The "squid::" prefix avoids fcall clashes with relaxed_values
  ; and allows us to add more functions later.
  ; We may require that fcall is the only or last directive parameter.
  fcall = "squid::" name <(> OWS farguments OWS <)>

  ; we can support just one required argument for now
  farguments = fargument *( OWS <,> OWS <fargument> )
  fargument = strict_value

  ; TBD, but reserve "regex::" now to avoid clashes with relaxed_values
  RE = seven characters "regex::" followed by a self-delimiting
       sequence of characters

  token = tchar *tchar
  percent = alphanumeric *alphanumeric <%> ; add <.>?
  single_quoted_string := <'> *[sqchar / escaped-pair] <'>
  double_quoted_string := <"> *[dqchar / escaped-pair] <">

  tchar = alphanumeric / <_> / <-> / <.>
  xchar = any char except ascii_whitespace_char and new line
  sqchar = any char except single quote and backslash
  dqchar = any char except double quote and backslash
  escaped-pair = backslash followed by any char except new line

  OWS = *[whitespace] ; optional whitespace
  RWS = whitespace OWS ; required whitespace

  whitespace = *ascii_whitespace_char / comment
  comment = <#> followed by all characters until the end of the line

Please note that the above grammar is applied after the preprocessing
stage which handles lines continuations using backslashes, conditionals
and ${macros}. No changes there. Unfortunately, our preprocessor cannot
tokenize because tokenization should remain context-dependent for
backward compatibility reasons.
Received on Thu Sep 26 2013 - 18:14:27 MDT

This archive was generated by hypermail 2.2.0 : Fri Sep 27 2013 - 12:00:11 MDT