Proper way to parse multiple items
I have an input file with multiple lines, each containing fields separated by spaces. My definition files are:
scanner.xrl:
Definitions.
DIGIT = [0-9]
ALPHANUM = [0-9a-zA-Z_]
Rules.
(\s|\t)+ : skip_token.
\n : {end_token, {new_line, TokenLine}}.
{ALPHANUM}+ : {token, {string, TokenLine, TokenChars}}.
Erlang code.
parser.yrl:
Nonterminals line.
Terminals string.
Rootsymbol line.
Endsymbol new_line.
line -> string : ['$1'].
line -> string line : ['$1'|'$2'].
Erlang code.
When run as is, the first line is parsed and then parsing stops:
1> A = <<"a b c\nd e\nf\n">>.
2> {ok, T, _} = scanner:string(binary_to_list(A)).
{ok,[{string,1,"a"},
{string,1,"b"},
{string,1,"c"},
{new_line,1},
{string,2,"d"},
{string,2,"e"},
{new_line,2},
{string,3,"f"},
{new_line,3}],
4}
3> parser:parse(T).
{ok,[{string,1,"a"},{string,1,"b"},{string,1,"c"}]}
If I remove the Endsymbol line from parser.yrl and change scanner.xrl as follows:
Definitions.
DIGIT = [0-9]
ALPHANUM = [0-9a-zA-Z_]
Rules.
(\s|\t|\n)+ : skip_token.
{ALPHANUM}+ : {token, {string, TokenLine, TokenChars}}.
Erlang code.
then all of my lines are parsed as a single item:
1> A = <<"a b c\nd e\nf\n">>.
<<"a b c\nd e\nf\n">>
2> {ok, T, _} = scanner:string(binary_to_list(A)).
{ok,[{string,1,"a"},
{string,1,"b"},
{string,1,"c"},
{string,2,"d"},
{string,2,"e"},
{string,3,"f"}],
4}
3> parser:parse(T).
{ok,[{string,1,"a"},
{string,1,"b"},
{string,1,"c"},
{string,2,"d"},
{string,2,"e"},
{string,3,"f"}]}
What is the proper way to signal to the parser that each line should be treated as a separate item? I would like my result to look like:
{ok,[[{string,1,"a"},
{string,1,"b"},
{string,1,"c"}],
[{string,2,"d"},
{string,2,"e"}],
[{string,3,"f"}]]}
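Here is one working lexer/parser pair. It does the job with only 1 shift/reduce conflict, but I think it will solve your problem; you just need to clean up the tokens to your liking.
I am fairly sure there are simpler and faster ways to do this, but during my own "lexer fighting time" it was hard to find any information at all, so I hope this gives you an idea of how to proceed with parsing in Erlang.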
scanner.xrl
Definitions.
DIGIT = [0-9]
ALPHANUM = [0-9a-zA-Z_]
Rules.
(\s|\t)+ : skip_token.
\n : {token, {line, TokenLine}}.
{ALPHANUM}+ : {token, {string, TokenLine, TokenChars}}.
Erlang code.
parser.yrl
Nonterminals
Lines
Line
Strings.
Terminals string line.
Rootsymbol Lines.
Lines -> Line Lines : lists:flatten(['$1', '$2']).
Lines -> Line : lists:flatten(['$1']).
Line -> Strings line : {line, lists:flatten(['$1'])}.
Line -> Strings : {line, lists:flatten(['$1'])}.
Strings -> string Strings : lists:append(['$1'], '$2').
Strings -> string : lists:flatten(['$1']).
Erlang code.
Output
{ok,[{line,[{string,1,"a"},{string,1,"b"},{string,1,"c"}]},
{line,[{string,2,"d"},{string,2,"e"}]},
{line,[{string,3,"f"}]}]}
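The parser above tags each line as {line, Strings}, while the question asks for bare nested lists. A minimal sketch of stripping those tags after parsing, assuming a hypothetical helper module named line_util (it is not part of the original files):

```erlang
-module(line_util).
-export([strip_line_tags/1]).

%% Hypothetical helper: drop the {line, ...} wrapper that the parser puts
%% around each line, leaving a list of lists of string tokens.
%% Elements that are not {line, _} tuples are skipped by the generator pattern.
strip_line_tags(Lines) when is_list(Lines) ->
    [Strings || {line, Strings} <- Lines].
```

With the token list T from the shell session, {ok, Lines} = parser:parse(T) followed by line_util:strip_line_tags(Lines) should give the shape asked for in the question.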
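The parser flow is as follows:
- The root is defined as the abstract "Lines"
- "Lines" consists of "Line + Lines" or simply "Line", which gives the loop
- "Line" consists of "Strings + line", or simply "Strings" when at the end of the file
- "Strings" consists of 'string', or "'string' + Strings" when several strings are provided
- 'line' is the '\n' symbol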
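Please allow me to comment on the issues I found in your original code.
- You should treat the whole file as one nested array rather than parsing it line by line; that is why Lines/Line provide the abstraction
- "Terminals" are tokens that will not be analyzed for containing any other tokens; "Nonterminals" will be evaluated further, since they are compound data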
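Following the nested-array idea, the semantic actions can also build the nested lists directly. This is only a sketch under two assumptions: it is paired with the scanner.xrl from this answer (which emits a line token for every \n), and every input line, including the last one, ends with a newline. Blank lines are not handled. Because a Line must always end with a line token here, yecc should no longer report the shift/reduce conflict.

```erlang
%% parser.yrl (alternative sketch, not the grammar shown above)
Nonterminals Lines Line Strings.
Terminals string line.
Rootsymbol Lines.

%% A file is one or more lines, collected into a list.
Lines -> Line : ['$1'].
Lines -> Line Lines : ['$1' | '$2'].

%% A line is its list of string tokens; the newline token is discarded.
Line -> Strings line : '$1'.

%% One or more string tokens, collected into a list.
Strings -> string : ['$1'].
Strings -> string Strings : ['$1' | '$2'].

Erlang code.
```

Each rule returns a plain list, so no lists:flatten or lists:append call is needed, and the result for the sample input should be the list of lists shown in the question.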