Re: Revised approach to fixing configuration syntax

From: Tsantilas Christos <chtsanti_at_users.sourceforge.net>
Date: Fri, 27 Sep 2013 17:13:55 +0300

On 09/26/2013 09:14 PM, Alex Rousskov wrote:
> On 08/30/2013 11:39 AM, Alex Rousskov wrote:
>
>> One of the key ideas behind "strict syntax" is that the token "type" can
>> be and is determined by the parser (simple token, quoted string, regular
>> expression, number, etc.) and not the caller. This is not the only way
>> to support better syntax rules, but it is overall simpler and more
>> flexible than the caller-determines-token-type alternatives (because the
>> code can manipulate tokens with respect to comments, macros, line
>> continuation, whitespace, and such without relying on callers to parse
>> each character).
>
>
> Hello,
>
> While I still believe that the above is correct, after many email
> exchanges with Christos and several patches, I suspect we should not
> implement the above. My concern is that if we do, the fixed syntax will
> never become the default because there will be just too much work for
> admins to rewrite their configurations, parts of which they may not even
> control. Initially, I thought we can remain mostly backwards compatible
> with the fixed syntax, but I think I lost that important goal along the
> way to the strict syntax.
>

My sense is that it is not possible to have backward compatibility. We
had discuss in "[PATCH] Unknown cfg function" mail thread many cases
where the quoted tokens can not be backward compatible. The problem is
not only in regex expresions.

If the default quoted tokens mode enabled by default a lot of
administrator will not have desired behaviour in their systems.
For example in legacy mode the token "AToken" includes the quotes but in
quoted tokens mode the quotes are not included.
This is not the only problem. In legacy mode, quotes may have different
meaning in configuration parameters, somewhere supported, in other cases
are not supported and in acls used to load external files.
The problem with these cases is that they are not easy to detect them.
This is always will prohibit as to turn on by default the quote tokens mode.

Maybe we can do the transition in steps. For example:

- 3.4 release: Add quoted tokens, legacy is enabled by default

- 3.5 release: legacy does not allow quotes in tokens, force users to
specify the mode they want to use.

If the users uses:
cache_mgr "Christos"
acl theSrcIPs src "/usr/local/squid3/etc/ip-addresses.txt"

force him to use:

configuration_includes_quoted_values on
cache_mgr "\"Christos\""
acl theSrcIPs src
squid::parameters("/usr/local/squid3/etc/ip-addresses.txt")
configuration_includes_quoted_values default

OR

configuration_includes_quoted_values off
cache_mgr "Christos"
acl theSrcIPs src "/usr/local/squid3/etc/ip-addresses.txt"
configuration_includes_quoted_values default

Maybe there are other similar cases where we need to force users to
specify the mode they want to use.

- 3.6 release: quoted tokens enabled by default

I believe the transition to new mode will not be easy...

I read the grammar carefully and my sense is that it is OK, it is what
we need. Maybe we need to make some improvements but in any case it is
in right direction.
Also I like the method proposed in (B).

> To void any misunderstanding, these are my thoughts. Christos has
> supplied valuable insight and very useful code that brought me here, but
> he has not reviewed and may not share my conclusions.
>
> Hopefully, this discussion will help us find the right balance between
> the needed syntax improvements and the need to remain backwards
> compatible in the vast majority of cases.
>
>
> I think the key problem here is our (or my) desire to use a simple
> grammar definition for the strict syntax. I think we have to relax that
> by using Squid context-aware parsers more aggressively. Also, we have to
> make a couple of new syntax constructs less beautiful to minimize
> collision chances. Initial deployment of v3.4 code exposed a few
> unexpected (but unfortunately common) collisions.
>
>
> Specifically, I propose the following:
>
> A) Define Squid syntax as described in the grammar quoted at the end of
> this email. Some adjustments would be necessary, I am sure, but I think
> it is important to have a grammar definition as a starting point. They
> key in that grammar is the ambiguous rule after "The caller must ask"
> comment. If it works, it will allow us to support most of the old
> configurations, especially when dealing with old options.
>
> B) Adjust the low-level parsing code to provide the following methods,
> each of which extract the requested thing and return true (on success)
> or false (on failures). If they fail to extract what was requested, they
> leave the parsing stream unchanged:
>
> NextChars(const char *); // the exact sequence of characters given
> NextToken(String &t); // next name or token as defined in the grammar
> NextValue(String &v, kind); // next value as defined in the grammar
> NextSpace(String *v); // Optional White Space (OWS) as in grammar
> NextLineEnd(); // end of line; may be \0 if preprocessor strips \n
>
> Mark CurrentPosition(); // current parsing position
> void RevertTo(Mark mark); // reverts low-level parsing state to mark
>
> Also add the following class to properly manage "undo". Note that we
> cannot just put parsed elements on stack any more because the next
> caller may use a different syntax to parse the same sequence of characters.
>
> // remembers CurrentPosition() upon construction and undoes damages
> // using RevertTo upon destruction unless .success() has been called
> class ParseProgress {
> public:
> ParseProgress(): start(CurrentPosition()), succeeded(false) {}
> ~ParseProgress() { if (!succeeded) RevertTo(start); }
> bool success() { return succeeded = true; }
> bool failure() const { return false; } // for clarity
>
> private:
> Mark start; ///< where the parsing started
> bool succeeded; ///< whether our owner successfully parsed
> };
>
>
>
> C) Add parameter parsing helpers sketched below, using the above API:
>
> bool ParamWithValue(const char *name, String &value, bool required)
> ParseProgress progress;
>
> if (!NextChars(name))
> return progress.success();
>
> if (!NextChars("=")) {
> if (required) ... failure: missing value for name
> return false;
> }
>
> // we may need to make strictOrLegacy a fourth ParamWithValue()
> // parameter so that callers can control what kind of values
> // are allowed
> if (NextValue(value, strictOrLegacy))
> return progress.success();
>
> if (required)
> failure: missing value for name
>
> return false;
> }
>
> // directive parameter without a value
> bool ParamWoutValue(const char *name)
> {
> if (!NextChars(name))
> return false;
>
> if (NextChars("="))
> failure: unexpected value for name
>
> return true;
> }
>
> bool MoreParams()
> {
> return NextSpace();
> // the line end will be checked for and absorbed
> // by the directive parser, not directive parameters parser
> }
>
> // called when all supported switches and parameter names
> // have been exhausted; we do not expect a name=value thing next
> // kills Squid on errors; otherwise returns false
> bool UnsupportedParamName()
> {
> ParseProgress progress;
> String name;
> if (NextToken(name) && NextChars("="))
> failure: unknown directive parameter name or switch: name
> return progress.failure(); // undo name extraction
> }
>
>
>
> Here is how a simple directive parameters parser would look like:
>
> do {
> String v;
> if (ParamWithValue("opt1", v, true)) ...
> else if (ParamWithValue("opt2", v, false)) ...
> else if (ParamWoutValue("opt3")) ...
> else if (!UnsupportedParamName()) {
> NextValue(v, ...) // anonymous parameter
> ...
> }
> } while (MoreParams());
>
>
> The above can be rearranged so that the directive parameters parsing
> code first extracts the parameter name and then compares it to supported
> parameters (like most of the current code does). However, I suspect it
> would create more problems than it would solve in some cases: The
> extracted thing may not be a parameter name but an [anonymous] value,
> and when you extract a value, you may want to tell the parser whether
> using relaxed syntax is allowed, but you cannot tell the parser that
> until you are sure that the next thing is not a parameter name (because
> the relaxed value parser is going to eat the '=' sign)! Thus, you should
> not extract a value until you know it is not a parameter name.
>
>
> I am omitting some details, of course, but do you think this approach
> with more explicit whitespace and relaxed value handling will cover
> nearly all current use cases(*), allowing the new parsing mode to be
> enabled by default soon? Or am I missing something again?
>
> (*) except quoted strings with unsupported %macros in old configs, of
> course, but we are willing to sacrifice those, I guess. The syntax also
> allows comments where spaces are allowed, but we can disable that if
> there are too many clashes.
>
>
> Thank you,
>
> Alex.
>
>
> Squid configuration syntax:
>
> config = *( OWS directive OWS )
>
> directive = name *( RWS parameter ) OWS <new line>
>
> name = token
> parameter = anonymous_parameter / named_parameter
>
> anonymous_parameter = value
> named_parameter = name <=> value ; No spaces around = allowed!
>
> ; This is where the grammar is ambiguous!
> ; The caller must ask for either strict_value, relaxed_value, or
> ; "strict_value if possible and relaxed_value otherwise"
> value = strict_value / relaxed_value
>
> strict_value = token / percent / RE / fcall /
> single_quoted_string / double_quoted_string
>
> ; any non-whitespace sequence that does not start with one of
> ; the two reserved prefixes: "squid::" and "regex::"
> relaxed_value = xchar *xchar ; with prefix restrictions
>
> ; The "squid::" prefix avoids fcall clashes with relaxed_values
> ; and allows us to add more functions later.
> ; We may require that fcall is the only or last directive parameter.
> fcall = "squid::" name <(> OWS farguments OWS <)>
>
> ; we can support just one required argument for now
> farguments = fargument *( OWS <,> OWS <fargument> )
> fargument = strict_value
>
> ; TBD, but reserve "regex::" now to avoid clashes with relaxed_values
> RE = seven characters "regex::" followed by a self-delimiting
> sequence of characters
>
> token = tchar *tchar
> percent = alphanumeric *alphanumeric <%> ; add <.>?
> single_quoted_string := <'> *[sqchar / escaped-pair] <'>
> double_quoted_string := <"> *[dqchar / escaped-pair] <">
>
>
> tchar = alphanumeric / <_> / <-> / <.>
> xchar = any char except ascii_whitespace_char and new line
> sqchar = any char except single quote and backslash
> dqchar = any char except double quote and backslash
> escaped-pair = backslash followed by any char except new line
>
> OWS = *[whitespace] ; optional whitespace
> RWS = whitespace OWS ; required whitespace
>
> whitespace = *ascii_whitespace_char / comment
> comment = <#> followed by all characters until the end of the line
>
>
> Please note that the above grammar is applied after the preprocessing
> stage which handles lines continuations using backslashes, conditionals
> and ${macros}. No changes there. Unfortunately, our preprocessor cannot
> tokenize because tokenization should remain context-dependent for
> backward compatibility reasons.
>
>
>
Received on Fri Sep 27 2013 - 14:14:22 MDT

This archive was generated by hypermail 2.2.0 : Sat Sep 28 2013 - 12:00:11 MDT