How to handle 'line-continuation' using parser combinators
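I am trying to write a small parser using the Sprache parser combinator library. The parser should treat a line ending in a single \ as continued on the next line, so that the \ and the line break count as insignificant white space.
Question
How can I create a parser for the value after the = sign, where the value may contain the line-continuation character \?
For example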
a = b\e,\
c,\
d
should be parsed as (KeyValuePair (Key, 'a'), (Value, 'b\e, c, d')).
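To pin down the expected result before writing any combinators, the transformation can be sketched in plain Python with a regular expression. This is neither Sprache nor a parser combinator. The function names are made up for illustration. The sketch assumes a continuation is a backslash immediately followed by a line break, and that it is joined to the next line with a single space.

```python
import re
from typing import Tuple


def join_continued_lines(text: str) -> str:
    """Replace each backslash + line break (and any indentation after it) with one space."""
    # `\\` matches a literal backslash; a backslash followed by anything
    # other than a line break (such as `\e`) is left untouched.
    return re.sub(r"\\\r?\n[ \t]*", " ", text)


def parse_key_value(text: str) -> Tuple[str, str]:
    """Split a (possibly continued) `key = value` entry at the first '='."""
    key, _, value = join_continued_lines(text).partition("=")
    return key.strip(), value.strip()


# The example from the question: three physical lines, one logical value.
example = "a = b\\e,\\\nc,\\\nd"
print(parse_key_value(example))
```

For the example input this yields the key a and the value b\e, c, d, which is the pair the parser is expected to produce.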
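I am new to this library and to parser combinators in general, so any pointers in the right direction would be much appreciated.
What I have tried
Test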
public class ConfigurationFileGrammerTest
{
    [Theory]
    [InlineData("x\\n y", @"x y")]
    public void ValueIsAnyStringMayContinuedAccrossLinesWithLineContinuation(
        string input,
        string expectedKey)
    {
        var key = ConfigurationFileGrammer.Value.Parse(input);
        Assert.Equal(expectedKey, key);
    }
}
Production code
Attempt one
public static readonly Parser<string> Value =
    from leading in Parse.WhiteSpace.Many()
    from rest in Parse.AnyChar.Except(Parse.Char('\\')).Many()
        .Or(Parse.String("\\n")
            .Then(chs => Parse.Return(chs))).Or(Parse.AnyChar.Except(Parse.LineEnd).Many())
    select new string(rest.ToArray()).TrimEnd();
Test output
Xunit.Sdk.EqualException: Assert.Equal() Failure
↓ (pos 1)
Expected: x y
Actual: x\
↑ (pos 1)
Attempt two
public static readonly Parser<string> SingleLineValue =
    from leading in Parse.WhiteSpace.Many()
    from rest in Parse.AnyChar.Many().Where(chs =>
        chs.Count() < 2 || !(string.Join(string.Empty, chs.Reverse().Take(2)).Equals("\\n")))
    select new string(rest.ToArray()).TrimEnd();

public static readonly Parser<string> ContinuedValueLines =
    from firsts in ContinuedValueLine.AtLeastOnce()
    from last in SingleLineValue
    select string.Join(" ", firsts) + " " + last;

public static readonly Parser<string> Value =
    SingleLineValue.Once().XOr(ContinuedValueLines.Once()).Select(s => string.Join(" ", s));
Test output
Xunit.Sdk.EqualException: Assert.Equal() Failure
↓ (pos 1)
Expected: x y
Actual: x\n y
↑ (pos 1)
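The continuation must not appear in the output. That is the only problem with the last unit test. When you parse the continuation \\n, you have to drop it from the result and return an empty string instead. Sorry, I do not know how to do this with C# Sprache. Maybe something like this:
Parse.String("\\n").Then(chs => Parse.Return(""))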
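I solved this with the combinatorix Python library, which is also a parser combinator library. Its API uses functions instead of chained methods, but the idea is the same.
Here is the full code with comments: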
# `apply` returns a parser that does not consume the input stream
# itself. It applies a function (or lambda) to the result of another
# parser. The following parser removes whitespace from the beginning
# and the end of what is parsed.
strip = apply(lambda x: x.strip())
# parse a single equal character
equal = char('=')
# Parse the key part of a configuration line. Since the API is
# functional, it reads "inside-out". Note the use of the special
# `unless(predicate, parser)` parser, which is sometimes missing from
# parser combinator libraries. It runs `parser` on the input stream
# only if the `predicate` parser fails, which allows conditional
# parsing. It is similar in spirit to negation in Prolog.
# Here it parses *anything until an equals sign*, joins the characters
# into a string and strips any leading or trailing whitespace.
key = strip(join(one_or_more(unless(equal, anything))))
# Parse a single newline (line feed) character.
eol = char('\n')
# Returns a parser that always produces the empty string, i.e. a
# constant parser whose output ignores what was parsed.
return_empty_space = apply(lambda x: '')
# This parses a full continuation, including the spaces that start
# the new line: *the continuation string followed by zero or more
# spaces*. It returns the empty string.
continuation = return_empty_space(sequence(string('\\n'), zero_or_more(char(' '))))
# `value` is the parser for the value part. Unless the current char
# is an `eol` (i.e. \n), it first tries to parse a continuation and
# otherwise parses any character. It does that at least once, i.e. the
# value cannot be empty. Then it joins all the chars into a single
# string and strips any leading or trailing whitespace.
value = strip(join(one_or_more(unless(eol, either(continuation, anything)))))
# This drops the element at index 1 and keeps only the elements at
# indices 0 and 2 of the result. See below.
kv_apply = apply(lambda x: (x[0], x[2]))
# This is the final parser for a given kv pair. A kv pair is:
#
# - a key part (see key parser)
# - an equal part (see equal parser)
# - a value part (see value parser)
#
# These are applied to the input stream in sequence (one after the
# other) and return three values: the key, a '=' char and the value.
# `kv_apply` keeps only the key and value parts.
kv = kv_apply(sequence(key, equal, value))
# This is a convenience wrapper that turns the string into a stream
# of chars and runs the `kv` parser on it.
parser = lambda string: combinatorix(string, kv)
input = 'a = b\\e,\\n c,\\n d'
assert parser(input) == ('a', 'b\\e,c,d')
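For readers without combinatorix installed, the same structure can be sketched with a few hand-written combinators in plain Python. This is not the combinatorix API. Every function below is a hypothetical stand-in that only mirrors the names used above. Two assumptions differ from the code above: a continuation is taken to be a backslash followed by a real newline character, and it is replaced by a single space rather than the empty string, so the result matches the value expected in the question.

```python
from typing import Any, Callable, List, Optional, Tuple

# A parser takes (text, position) and returns (value, new_position), or None on failure.
Parser = Callable[[str, int], Optional[Tuple[Any, int]]]


def string(s: str) -> Parser:
    """Match exactly the string `s` at the current position."""
    return lambda text, pos: (s, pos + len(s)) if text.startswith(s, pos) else None


char = string  # a single character is just a one-character string


def anything(text: str, pos: int) -> Optional[Tuple[Any, int]]:
    """Match any single character; fail only at end of input."""
    return (text[pos], pos + 1) if pos < len(text) else None


def either(*parsers: Parser) -> Parser:
    """Try each parser in order and return the first success."""
    def parse(text, pos):
        for p in parsers:
            result = p(text, pos)
            if result is not None:
                return result
        return None
    return parse


def unless(predicate: Parser, parser: Parser) -> Parser:
    """Run `parser` only if `predicate` fails at the current position."""
    return lambda text, pos: None if predicate(text, pos) is not None else parser(text, pos)


def many(parser: Parser) -> Parser:
    """Apply `parser` zero or more times, collecting the values in a list."""
    def parse(text, pos):
        values: List[Any] = []
        while (result := parser(text, pos)) is not None:
            value, pos = result
            values.append(value)
        return values, pos
    return parse


def sequence(*parsers: Parser) -> Parser:
    """Apply parsers one after another; fail if any of them fails."""
    def parse(text, pos):
        values: List[Any] = []
        for p in parsers:
            result = p(text, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos
    return parse


def apply(fn: Callable[[Any], Any], parser: Parser) -> Parser:
    """Transform the value produced by `parser` with `fn`."""
    def parse(text, pos):
        result = parser(text, pos)
        return None if result is None else (fn(result[0]), result[1])
    return parse


def stripped(parser: Parser) -> Parser:
    """Join a list of characters into a string and trim surrounding whitespace."""
    return apply(lambda chars: "".join(chars).strip(), parser)


# Assumption: backslash + real newline + optional indentation, replaced by one space.
continuation = apply(lambda _: " ", sequence(string("\\\n"), many(either(char(" "), char("\t")))))
key = stripped(many(unless(char("="), anything)))
value = stripped(many(unless(char("\n"), either(continuation, anything))))
kv = apply(lambda parts: (parts[0], parts[2]), sequence(key, char("="), value))


def parse_kv(text: str) -> Tuple[str, str]:
    """Parse one `key = value` entry, raising ValueError if it does not match."""
    result = kv(text, 0)
    if result is None:
        raise ValueError("not a key/value line")
    return result[0]


print(parse_kv("a = b\\e,\\\n c,\\\n d"))
```

The ordering inside `either(continuation, anything)` is what matters: the continuation is tried first, so a backslash is only kept as an ordinary character (as in `b\e`) when it is not followed by a newline.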