Proper way to parse multiple items

I have an input file with multiple lines, each containing fields separated by spaces. My definition files are:

scanner.xrl:

Definitions.

DIGIT = [0-9]
ALPHANUM = [0-9a-zA-Z_]

Rules.

(\s|\t)+ : skip_token.
\n : {end_token, {new_line, TokenLine}}.
{ALPHANUM}+ : {token, {string, TokenLine, TokenChars}}.

Erlang code.

parser.yrl:

Nonterminals line.

Terminals string.

Rootsymbol line.

Endsymbol new_line.

line -> string : ['$1'].
line -> string line : ['$1'|'$2'].

Erlang code.

Run as is, this parses the first line and then stops:

1> A = <<"a b c\nd e\nf\n">>.

2> {ok, T, _} = scanner:string(binary_to_list(A)).
{ok,[{string,1,"a"},
     {string,1,"b"},
     {string,1,"c"},
     {new_line,1},
     {string,2,"d"},
     {string,2,"e"},
     {new_line,2},
     {string,3,"f"},
     {new_line,3}],
    4}
3> parser:parse(T).
{ok,[{string,1,"a"},{string,1,"b"},{string,1,"c"}]}

If I remove the Endsymbol line from parser.yrl and change the scanner.xrl file as follows:

Definitions.

DIGIT = [0-9]
ALPHANUM = [0-9a-zA-Z_]

Rules.

(\s|\t|\n)+ : skip_token.
{ALPHANUM}+ : {token, {string, TokenLine, TokenChars}}.

Erlang code.

all of my lines are parsed as a single item:

1> A = <<"a b c\nd e\nf\n">>.
<<"a b c\nd e\nf\n">>
2> {ok, T, _} = scanner:string(binary_to_list(A)).
{ok,[{string,1,"a"},
     {string,1,"b"},
     {string,1,"c"},
     {string,2,"d"},
     {string,2,"e"},
     {string,3,"f"}],
    4}
3> parser:parse(T).
{ok,[{string,1,"a"},
     {string,1,"b"},
     {string,1,"c"},
     {string,2,"d"},
     {string,2,"e"},
     {string,3,"f"}]}

What is the proper way to signal to the parser that each line should be treated as a separate item? I would like my result to look like:

{ok,[[{string,1,"a"},
     {string,1,"b"},
     {string,1,"c"}],
     [{string,2,"d"},
     {string,2,"e"}],
     [{string,3,"f"}]]}

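Here is one working lexer/parser pair. It produces one shift/reduce conflict, but it gets the job done and I think it will solve your problem; you just need to clean up the tokens to your liking.

I'm pretty sure there are simpler and faster ways to do this, but during my own "lexer fighting time" it was hard to find any information at all, so I hope this gives you an idea of how to proceed with parsing in Erlang.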

scanner.xrl

Definitions.

DIGIT = [0-9]
ALPHANUM = [0-9a-zA-Z_]

Rules.

(\s|\t)+ : skip_token.
\n : {token, {line, TokenLine}}.
{ALPHANUM}+ : {token, {string, TokenLine, TokenChars}}.

Erlang code.

parser.yrl

Nonterminals 
    Lines
    Line
    Strings.

Terminals string line.

Rootsymbol Lines.

Lines -> Line Lines : lists:flatten(['$1', '$2']).
Lines -> Line : lists:flatten(['$1']).

Line -> Strings line : {line, lists:flatten(['$1'])}.
Line -> Strings : {line, lists:flatten(['$1'])}.

Strings -> string Strings : lists:append(['$1'], '$2').
Strings -> string : lists:flatten(['$1']).

Erlang code.

Output

{ok,[{line,[{string,1,"a"},{string,1,"b"},{string,1,"c"}]},
     {line,[{string,2,"d"},{string,2,"e"}]},
     {line,[{string,3,"f"}]}]}
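
To reproduce that output, and to get from it to the plain nested list asked for in the question, something along these lines should work. This is only a sketch: the module name run_parser is made up for illustration, and it assumes scanner.xrl and parser.yrl from above sit in the current directory.

```erlang
-module(run_parser).
-export([run/0]).

%% Generate, compile and run the scanner/parser pair, then strip the
%% {line, ...} wrappers so that each line becomes a plain list of tokens.
run() ->
    %% leex and yecc write scanner.erl and parser.erl next to the inputs.
    {ok, _} = leex:file("scanner.xrl"),
    {ok, _} = yecc:file("parser.yrl"),
    %% Compile and load the generated modules.
    {ok, scanner} = c:c(scanner),
    {ok, parser} = c:c(parser),
    {ok, Tokens, _EndLine} = scanner:string("a b c\nd e\nf\n"),
    {ok, Lines} = parser:parse(Tokens),
    %% Keep only the token list of every {line, Tokens} tuple.
    [Strings || {line, Strings} <- Lines].
```

Calling run_parser:run() from the Erlang shell should then give one inner list per input line. yecc will still report the shift/reduce conflict mentioned above while generating the parser.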

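The parser flow is as follows:

  • The root is defined as the abstract "Lines"
  • "Lines" consists of either "Line + Lines" or simply "Line", which gives the loop
  • "Line" consists of either "Strings + line" or simply "Strings" when at the end of the file
  • "Strings" consists of either 'string' or "'string' + Strings" when several strings are provided
  • 'line' is the '\n' symbol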
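
If every line in the input is guaranteed to end with '\n', the same flow can be written more strictly. The variant below is a sketch of my own rather than the grammar above: it drops the "Strings without line" alternative, which I believe is what causes the shift/reduce conflict, and it builds the nested list directly instead of wrapping each line in a {line, ...} tuple. It assumes the same scanner.xrl as above.

```erlang
Nonterminals Lines Line Strings.

Terminals string line.

Rootsymbol Lines.

%% A file is one or more lines; each line becomes one element of the result.
Lines -> Line : ['$1'].
Lines -> Line Lines : ['$1' | '$2'].

%% A line is its strings followed by the mandatory '\n' token, which is dropped.
Line -> Strings line : '$1'.

%% One or more string tokens, collected into a list.
Strings -> string : ['$1'].
Strings -> string Strings : ['$1' | '$2'].
```

The trade-off is that an empty line, or a last line without a trailing '\n', becomes a syntax error with this grammar, so it only fits input where neither can occur.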

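Please allow me to comment on a few issues I found in the original code.

  • You should treat the whole file as a nested array rather than parsing it line by line, which is why the Lines/Line abstractions are provided
  • "Terminals" are tokens that are not analyzed any further for containing other tokens; "Nonterminals" are evaluated further, since they are composite data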