Superpower parser for balanced nested parentheses
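I'm struggling to come up with a Superpower parser for the set of partial inputs below (nested, balanced parentheses with '|' separators).
The parentheses can contain arbitrary text, including spaces, other tokens, and "()". Only '|', '(' and ')' should have special meaning here (a newline would also end the sequence). To be valid, each balanced, parenthesized group must contain a '|' and at least one character that is not '(' or ')'.
Ideally, the parser would split each input into a list whose elements are either (terminal) strings or arrays of strings, as follows:
Valid: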
(a|) -> { "a", "" }
(a | b) -> { "a", "b" }
(a | b.c()) -> { "a", "b.c()" }
(aa | bb cc ) -> { "aa", "bb cc" }
(a | b | c #dd) -> { "a", "b", "c #dd"}
((a | b) | $c) -> { { "a", "b" }, "$c" }
((a | b) | (c | d)) -> { { "a", "b" }, { "c", "d" } }
(((a | b) | c) | d) -> { { { "a", "b" }, "c" }, "d" }
...
Invalid/ignored:
()
())
(()
(|)
(|())
(.)
(())
(()|())
(abc)
(a bc)
(a.bc())
...
My tokens (for the purposes here) are as follows:
public enum Tokens
{
    [Token(Example = "(")]
    LParen,
    [Token(Example = ")")]
    RParen,
    [Token(Example = "|")]
    Pipe,
    [Token(Description = "everything-else")]
    String
}
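This was tricky, mostly because you need to preserve whitespace, but I was able to come up with a parser that should meet your needs. First, I had to alter your Tokens enum slightly: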
public enum Tokens
{
    None,
    String,
    Number,
    [Token(Example = "()")]
    OpenCloseParen,
    [Token(Example = "(")]
    LParen,
    [Token(Example = ")")]
    RParen,
    [Token(Example = "#")]
    Hash,
    [Token(Example = "$")]
    Dollar,
    [Token(Example = "|")]
    Pipe,
    [Token(Example = ".")]
    Dot,
    [Token(Example = " ")]
    Whitespace,
}
Next, we can build a Tokenizer as follows:
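// Requires: using Superpower.Parsers; using Superpower.Tokenizers;
var tokenizer = new TokenizerBuilder<Tokens>()
    // "()" has to come before the single-paren rules so it is kept as one token
    .Match(Span.EqualTo("()"), Tokens.OpenCloseParen)
    .Match(Character.EqualTo('('), Tokens.LParen)
    .Match(Character.EqualTo(')'), Tokens.RParen)
    .Match(Character.EqualTo('#'), Tokens.Hash)
    .Match(Character.EqualTo('$'), Tokens.Dollar)
    .Match(Character.EqualTo('.'), Tokens.Dot)
    .Match(Character.EqualTo('|'), Tokens.Pipe)
    .Match(Character.EqualTo(' '), Tokens.Whitespace)
    // Rules are tried in order, so numbers must come before the catch-all;
    // otherwise AnyChar consumes the digits and Number is never produced
    .Match(Numerics.Natural, Tokens.Number)
    // Catch-all: any other single character becomes a String token
    .Match(Span.MatchedBy(Character.AnyChar), Tokens.String)
    .Build();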
Next, create model classes to hold the output (you can probably think of better names, since I'm not sure what the data you are parsing actually is):
public abstract class Node
{
}
public class TextNode : Node
{
    public string Value { get; set; }
}
public class Expression : Node
{
    public Node[] Nodes { get; set; }
}
Then we create the parsers:
using System.Linq;
using Superpower;
using Superpower.Parsers;

public static class MyParsers
{
    /// <summary>
    /// Parses any whitespace (if any) and returns a resulting string
    /// </summary>
    public readonly static TokenListParser<Tokens, string> OptionalWhitespace =
        from chars in Token.EqualTo(Tokens.Whitespace).Many().OptionalOrDefault()
        select chars == null ? "" : new string(' ', chars.Length);

    /// <summary>
    /// Parses a valid text expression
    /// e.g. "abc", "a.c()", "$c", etc.
    /// </summary>
    public readonly static TokenListParser<Tokens, Node> TextExpression =
        from tokens in
            Token.EqualTo(Tokens.OpenCloseParen)
                .Or(Token.EqualTo(Tokens.Hash))
                .Or(Token.EqualTo(Tokens.Dollar))
                .Or(Token.EqualTo(Tokens.Dot))
                .Or(Token.EqualTo(Tokens.Number))
                .Or(Token.EqualTo(Tokens.String))
                .Or(Token.EqualTo(Tokens.Whitespace))
                .Many()
        // if this side of the pipe is all whitespace, return null
        select (Node) (
            tokens.All(x => x.ToStringValue() == " ")
                ? null
                : new TextNode {
                    Value = string.Join("", tokens.Select(t => t.ToStringValue())).Trim()
                }
        );

    /// <summary>
    /// Parses a full expression that may contain text expressions or nested sub-expressions
    /// e.g. "(a | b)", "( (a.c() | b) | (123 | c) )", etc.
    /// </summary>
    public readonly static TokenListParser<Tokens, Node> Expression =
        from leadWs in OptionalWhitespace
        from lp in Token.EqualTo(Tokens.LParen)
        from nodes in TextExpression
            .Or(Parse.Ref(() => Expression))
            .ManyDelimitedBy(Token.EqualTo(Tokens.Pipe))
            .OptionalOrDefault()
        from rp in Token.EqualTo(Tokens.RParen)
        from trailWs in OptionalWhitespace
        where nodes.Length > 1 && nodes.Any(node => node != null) // has to have at least two sides and one has to be non-null
        select (Node)new Expression {
            Nodes = nodes.Select(node => node ?? new TextNode { Value = "" }).ToArray()
        };
}
Finally, we can use the tokenizer and the parser to parse your input:
string input = "(a b | c.())";
var tokens = tokenizer.Tokenize(input);
var result = MyParsers.Expression.TryParse(tokens);
if (result.HasValue)
{
    // input is valid
    var expression = (Expression)result.Value;
    // do what you need with it here, e.g. loop through the nodes, output the text, etc.
}
else
{
    // not valid
}
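To make the "do what you need with it here" step concrete, here is a minimal sketch that renders a parsed tree back into the `{ "a", "b" }` notation used in the question. The `NodeFormatter` class and its `Format` method are hypothetical names introduced only for illustration; they are not part of Superpower or of the answer above. The model classes are repeated so that the sketch compiles on its own.

```csharp
using System;
using System.Linq;

// Same shape as the model classes above, repeated so this sketch is self-contained.
public abstract class Node { }

public class TextNode : Node
{
    public string Value { get; set; }
}

public class Expression : Node
{
    public Node[] Nodes { get; set; }
}

// Hypothetical helper (an assumption for illustration, not a Superpower API).
public static class NodeFormatter
{
    // Recursively renders a node: text becomes a quoted string,
    // an expression becomes a brace-delimited, comma-separated list.
    public static string Format(Node node)
    {
        switch (node)
        {
            case TextNode text:
                return "\"" + text.Value + "\"";
            case Expression expr:
                return "{ " + string.Join(", ", expr.Nodes.Select(n => Format(n))) + " }";
            default:
                throw new ArgumentException("Unknown or null node", nameof(node));
        }
    }
}
```

With this in place, something like `Console.WriteLine(NodeFormatter.Format(expression))` inside the `HasValue` branch should let you compare the parser's output directly against the expected results listed in the question.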
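This works for almost all of your test cases, except ones like (()|()), where an open/close paren pair is the value on both sides of the pipe. There may also be better ways to do some of the parsing, since I am only just getting used to Superpower myself, but I think this is a good base to start from, so you can optimize it and/or integrate all of your edge cases into it.
Edit
It was the whitespace that was messing everything up. I had to add more whitespace checks to the Expression parser, and also add a condition that checks for a non-empty TextExpression first and then falls back to one that may be empty. This handles the case where one side of the pipe is blank. Here is the working parser: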
public readonly static TokenListParser<Tokens, Node> Expression =
    from _1 in OptionalWhitespace
    from lp in Token.EqualTo(Tokens.LParen)
    from _2 in OptionalWhitespace
    from nodes in
        TextExpression.Where(node => node != null) // check for actual text node first
            // Parse.Ref defers the lookup; the field is still null while its own initializer runs
            .Or(Parse.Ref(() => Expression))
            .Or(TextExpression) // then check to see if it's empty
            .ManyDelimitedBy(Token.EqualTo(Tokens.Pipe))
    from _3 in OptionalWhitespace
    from rp in Token.EqualTo(Tokens.RParen)
    from _4 in OptionalWhitespace
    where nodes.Length > 1 && nodes.Any(node => node != null) // has to have at least two sides and one has to be non-null
    select (Node)new Expression {
        Nodes = nodes.Select(node => node ?? new TextNode { Value = "" }).ToArray()
    };