Re: Revised approach to fixing configuration syntax

From: Amos Jeffries <squid3_at_treenet.co.nz>
Date: Fri, 27 Sep 2013 18:20:08 +1200

On 27/09/2013 6:14 a.m., Alex Rousskov wrote:
> On 08/30/2013 11:39 AM, Alex Rousskov wrote:
>
>> One of the key ideas behind "strict syntax" is that the token "type" can
>> be and is determined by the parser (simple token, quoted string, regular
>> expression, number, etc.) and not the caller. This is not the only way
>> to support better syntax rules, but it is overall simpler and more
>> flexible than the caller-determines-token-type alternatives (because the
>> code can manipulate tokens with respect to comments, macros, line
>> continuation, whitespace, and such without relying on callers to parse
>> each character).
>
> Hello,
>
> While I still believe that the above is correct, after many email
> exchanges with Christos and several patches, I suspect we should not
> implement the above. My concern is that if we do, the fixed syntax will
> never become the default because there will be just too much work for
> admins to rewrite their configurations, parts of which they may not even
> control. Initially, I thought we can remain mostly backwards compatible
> with the fixed syntax, but I think I lost that important goal along the
> way to the strict syntax.
>
> To void any misunderstanding, these are my thoughts. Christos has
> supplied valuable insight and very useful code that brought me here, but
> he has not reviewed and may not share my conclusions.
>
> Hopefully, this discussion will help us find the right balance between
> the needed syntax improvements and the need to remain backwards
> compatible in the vast majority of cases.
>
>
> I think the key problem here is our (or my) desire to use a simple
> grammar definition for the strict syntax. I think we have to relax that
> by using Squid context-aware parsers more aggressively. Also, we have to
> make a couple of new syntax constructs less beautiful to minimize
> collision chances. Initial deployment of v3.4 code exposed a few
> unexpected (but unfortunately common) collisions.
>
>
> Specifically, I propose the following:
>
> A) Define Squid syntax as described in the grammar quoted at the end of
> this email. Some adjustments would be necessary, I am sure, but I think
> it is important to have a grammar definition as a starting point. They
> key in that grammar is the ambiguous rule after "The caller must ask"
> comment. If it works, it will allow us to support most of the old
> configurations, especially when dealing with old options.
>
> B) Adjust the low-level parsing code to provide the following methods,
> each of which extract the requested thing and return true (on success)
> or false (on failures). If they fail to extract what was requested, they
> leave the parsing stream unchanged:
>
> NextChars(const char *); // the exact sequence of characters given
> NextToken(String &t); // next name or token as defined in the grammar
> NextValue(String &v, kind); // next value as defined in the grammar
> NextSpace(String *v); // Optional White Space (OWS) as in grammar
> NextLineEnd(); // end of line; may be \0 if preprocessor strips \n
>
> Mark CurrentPosition(); // current parsing position
> void RevertTo(Mark mark); // reverts low-level parsing state to mark
>
> Also add the following class to properly manage "undo". Note that we
> cannot just put parsed elements on stack any more because the next
> caller may use a different syntax to parse the same sequence of characters.

This implies that we require StringNG terminator-agnostic ability for
the String& in all the above functions - OR to copy each potential line
into a temporary String which can be dropped on unwind. That
terminator-agnostic is provided by StringArea, but IIRC kinkie warned
against relying on StringArea for parser usage due to its lack of cow()
and pointer safety.

> // remembers CurrentPosition() upon construction and undoes damages
> // using RevertTo upon destruction unless .success() has been called
> class ParseProgress {
> public:
> ParseProgress(): start(CurrentPosition()), succeeded(false) {}
> ~ParseProgress() { if (!succeeded) RevertTo(start); }
> bool success() { return succeeded = true; }
> bool failure() const { return false; } // for clarity
>
> private:
> Mark start; ///< where the parsing started
> bool succeeded; ///< whether our owner successfully parsed
> };
>
>
>
> C) Add parameter parsing helpers sketched below, using the above API:
>
> bool ParamWithValue(const char *name, String &value, bool required)
> ParseProgress progress;
>
> if (!NextChars(name))
> return progress.success();
>
> if (!NextChars("=")) {
> if (required) ... failure: missing value for name
> return false;
> }
>
> // we may need to make strictOrLegacy a fourth ParamWithValue()
> // parameter so that callers can control what kind of values
> // are allowed
> if (NextValue(value, strictOrLegacy))
> return progress.success();
>
> if (required)
> failure: missing value for name
>
> return false;
> }
>
> // directive parameter without a value
> bool ParamWoutValue(const char *name)
> {
> if (!NextChars(name))
> return false;
>
> if (NextChars("="))
> failure: unexpected value for name
>
> return true;
> }
>
> bool MoreParams()
> {
> return NextSpace();
> // the line end will be checked for and absorbed
> // by the directive parser, not directive parameters parser
> }
>
> // called when all supported switches and parameter names
> // have been exhausted; we do not expect a name=value thing next
> // kills Squid on errors; otherwise returns false
> bool UnsupportedParamName()
> {
> ParseProgress progress;
> String name;
> if (NextToken(name) && NextChars("="))
> failure: unknown directive parameter name or switch: name
> return progress.failure(); // undo name extraction
> }
>
>
>
> Here is how a simple directive parameters parser would look like:
>
> do {
> String v;
> if (ParamWithValue("opt1", v, true)) ...
> else if (ParamWithValue("opt2", v, false)) ...
> else if (ParamWoutValue("opt3")) ...
> else if (!UnsupportedParamName()) {
> NextValue(v, ...) // anonymous parameter
> ...
> }
> } while (MoreParams());
>
>
> The above can be rearranged so that the directive parameters parsing
> code first extracts the parameter name and then compares it to supported
> parameters (like most of the current code does). However, I suspect it
> would create more problems than it would solve in some cases: The
> extracted thing may not be a parameter name but an [anonymous] value,
> and when you extract a value, you may want to tell the parser whether
> using relaxed syntax is allowed, but you cannot tell the parser that
> until you are sure that the next thing is not a parameter name (because
> the relaxed value parser is going to eat the '=' sign)! Thus, you should
> not extract a value until you know it is not a parameter name.

Agreed +1. This is no less efficient than the current code doing its
series of if-conditions on a strtok() output string. But gains us the
undo functionality a lot more cheaply.

> I am omitting some details, of course, but do you think this approach
> with more explicit whitespace and relaxed value handling will cover
> nearly all current use cases(*), allowing the new parsing mode to be
> enabled by default soon? Or am I missing something again?

Comment handling (see blow for the fix), and the detail that StringNG
will allow us to implement the undo fuctionality far better than String
- which implies that the large amount of conversion work will have an
almost as larhe amount of conversion to SBuf later in 3.5.

Since this design is again prioritizing backward compatibility I think
we can schedule this for 3.5+ and work on some bridging polish to avoid
3.4 parser having any user noticible difference from the final strict
parser. For starters ignore the code internals being vastly different
maintenance nightmare - that is not a user-visible aspect and we will
cope with that pain ourselves.

What I am thinking is that if we agree on this grammar being suitable
(you and I mostly agree, lets see what the others think) then we can
massage the 3.4 parser to cope with it. No matter how nasty the hacks
may have to be in that version we drop them in 3.5 going forward. I do
not see re-implementing the entire parser starting today as a quick
enough project to fix things before 3.4 needs to be stable.

>
> (*) except quoted strings with unsupported %macros in old configs, of
> course, but we are willing to sacrifice those, I guess. The syntax also
> allows comments where spaces are allowed, but we can disable that if
> there are too many clashes.

If we take a token of "#" surrounded by whitespace as comment start, or
a # in the position EOL is expected this should not be an issue for most.

Since comments go to EOL it will collide with the preprocessor
un-wrapping wrapped lines if a comment is added anywhere space is
possible. I am inclined to make comment a case in itself with the
following alterations to the grammar:

config = *( OWS directive / comment <new line> )

directive = name *( RWS parameter ) comment <new line>

whitespace = *ascii_whitespace_char

comment = OWS ( <#> RWS <followed by all characters until the end of the
line (excluding that EOL)>)

Line wrapping grammar is this:

fold = <\> <new line>

whitespace = *( ascii_whitespace_char / fold )

with no need or possibility to fold inside a comment.

> Thank you,
>
> Alex.
>
>
> Squid configuration syntax:
>
> config = *( OWS directive OWS )
>
> directive = name *( RWS parameter ) OWS <new line>
>
> name = token
> parameter = anonymous_parameter / named_parameter
>
> anonymous_parameter = value
> named_parameter = name <=> value ; No spaces around = allowed!
>
> ; This is where the grammar is ambiguous!
> ; The caller must ask for either strict_value, relaxed_value, or
> ; "strict_value if possible and relaxed_value otherwise"
> value = strict_value / relaxed_value
>
> strict_value = token / percent / RE / fcall /
> single_quoted_string / double_quoted_string
>
> ; any non-whitespace sequence that does not start with one of
> ; the two reserved prefixes: "squid::" and "regex::"
> relaxed_value = xchar *xchar ; with prefix restrictions
>
> ; The "squid::" prefix avoids fcall clashes with relaxed_values
> ; and allows us to add more functions later.
> ; We may require that fcall is the only or last directive parameter.
> fcall = "squid::" name <(> OWS farguments OWS <)>
>
> ; we can support just one required argument for now
> farguments = fargument *( OWS <,> OWS <fargument> )
> fargument = strict_value
>
> ; TBD, but reserve "regex::" now to avoid clashes with relaxed_values
> RE = seven characters "regex::" followed by a self-delimiting
> sequence of characters
>
> token = tchar *tchar
> percent = alphanumeric *alphanumeric <%> ; add <.>?
> single_quoted_string := <'> *[sqchar / escaped-pair] <'>
> double_quoted_string := <"> *[dqchar / escaped-pair] <">
>
>
> tchar = alphanumeric / <_> / <-> / <.>
> xchar = any char except ascii_whitespace_char and new line
> sqchar = any char except single quote and backslash
> dqchar = any char except double quote and backslash
> escaped-pair = backslash followed by any char except new line
>
> OWS = *[whitespace] ; optional whitespace
> RWS = whitespace OWS ; required whitespace
>
> whitespace = *ascii_whitespace_char / comment
> comment = <#> followed by all characters until the end of the line

comments should be a # _after_ some whitespace. Not instead of,
otherwise we cannot tell foo#bar from "foo #bar"

>
> Please note that the above grammar is applied after the preprocessing
> stage which handles lines continuations using backslashes, conditionals
> and ${macros}. No changes there. Unfortunately, our preprocessor cannot
> tokenize because tokenization should remain context-dependent for
> backward compatibility reasons.

Thank you. It is good to have this written down finally.

Amos
Received on Fri Sep 27 2013 - 06:20:18 MDT

This archive was generated by hypermail 2.2.0 : Fri Sep 27 2013 - 12:00:11 MDT