How to parse a CSV file that has a multivalued field with ANTLR?
My task is to parse a CSV file that has one multivalued field among otherwise ordinary fields. The file looks like this:
AEIO;AEIO;Some random text - lots of possible characters;"Property A: Yes
Property B: XXXXX
Property C: Some value
Property D: 2
Property E: Some text again"
BBBZ;ANOTHERONE;AGAIN - Many possible characters/Like this;"Property A: Yes
Property B: some text
Property AB: more text yet
Property Z: http://someurl.com"
0123;TEXT;More - long text here;"Property A: Yep
Property M: this value is pretty long!
Property B: Yes
This property has a weird name: and the value has numbers too 2.0
Property Z: ahostname, comma-separated
Property K: anything"
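Field values are separated by semicolons. The multivalued field contains property-value pairs, which are separated from each other by newlines (sometimes with a carriage return). Within the multivalued field, a property name is separated from its value by a colon. All fields are always present, and the multivalued field always contains at least one property.
I decided to try writing an ANTLR 4 grammar to parse this file. My attempt is posted below.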
file : row+ ;
row : identifier FIELD_SEP code FIELD_SEP name FIELD_SEP properties;
identifier: TEXT;
code : TEXT;
name : TEXT;
properties : PROP_DELIM property_and_value (NEWLINE property_and_value)* PROP_DELIM NEWLINE;
property_and_value: TEXT (PROP_VAL_DELIM PROP_VALUE)?;
TEXT : ~[\r\n";:]+;
PROP_VALUE : ~[\r\n";]+;
NEWLINE : [\r\n]+;
PROP_DELIM : '"';
FIELD_SEP : ';';
PROP_VAL_DELIM: ':';
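I have had partial success parsing the file, but I cannot read the property name and value pairs from the multivalued field correctly. For example, when I try to read the sample above, I get the following errors: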
line 1:58 mismatched input 'Property A: Yes' expecting TEXT
line 2:0 mismatched input 'Property B: XXXXX' expecting TEXT
line 3:0 mismatched input 'Property C: Some value' expecting TEXT
line 4:0 mismatched input 'Property D: 2' expecting TEXT
line 5:0 mismatched input 'Property E: Some text again' expecting TEXT
line 6:60 mismatched input 'Property A: Yes' expecting TEXT
line 7:0 mismatched input 'Property B: some text' expecting TEXT
line 8:0 mismatched input 'Property AB: more text yet' expecting TEXT
line 9:0 mismatched input 'Property Z: http://someurl.com' expecting TEXT
line 10:34 mismatched input 'Property A: Yep' expecting TEXT
line 11:0 mismatched input 'Property M: this value is pretty long!' expecting TEXT
line 12:0 mismatched input 'Property B: Yes' expecting TEXT
line 13:0 mismatched input 'This property has a weird name: and the value has numbers too 2.0' expecting TEXT
line 14:0 mismatched input 'Property Z: ahostname, comma-separated' expecting TEXT
line 15:0 mismatched input 'Property K: anything' expecting TEXT
I am not sure what I am doing wrong, so I am asking for your help. How can I read this CSV file correctly, without errors?
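The lexer rules TEXT and PROP_VALUE conflict.
ANTLR 4 prefers the longest match, and PROP_VALUE usually produces the longest match because it accepts everything TEXT accepts plus the colon. For the leading fields of your example, AEIO;AEIO;Some random text - lots of possible characters;, it does not: TEXT and PROP_VALUE match the same length there, and in that case the rule defined first (TEXT) determines the emitted token.
Inside the quoted field, however, a line such as Property A: Yes contains a colon, so PROP_VALUE matches more characters than TEXT and the whole line becomes a single PROP_VALUE token. The parser expects TEXT at that position, which is what the mismatched input errors report.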
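To make the tie-breaking concrete, here is a rough Python sketch that imitates the two lexer policies described above (longest match wins, the earlier rule wins on a tie). It is only an approximation of what the ANTLR lexer does, not ANTLR itself; the tokenize helper and the shortened sample line are made up for illustration, while the patterns are transcribed from the grammar in the question.

```python
import re

# Lexer rules in grammar order, transcribed from the question's grammar.
RULES = [
    ("TEXT", re.compile(r'[^\r\n";:]+')),
    ("PROP_VALUE", re.compile(r'[^\r\n";]+')),
    ("NEWLINE", re.compile(r'[\r\n]+')),
    ("PROP_DELIM", re.compile(r'"')),
    ("FIELD_SEP", re.compile(r';')),
    ("PROP_VAL_DELIM", re.compile(r':')),
]


def tokenize(text, rules=RULES):
    """Hypothetical helper: longest match wins, first rule wins on a tie."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when two lengths are equal.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


# Shortened sample row (an assumption, not the full file from the question).
tokens = tokenize('AEIO;Some text;"Property A: Yes\nProperty B: XXXXX"\n')
for token in tokens:
    print(token)
```

In this sketch the leading fields come out as TEXT, but each property line comes out as one PROP_VALUE token, which is the token the parser then rejects.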
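To solve this problem:
- make sure the lexer rules are disjoint (at least for the critical patterns)
- that is, remove the definition of PROP_VALUE and replace its occurrences with (TEXT | PROP_VAL_DELIM)+ (or an equivalent parser subrule)
For example: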
property_and_value: TEXT (PROP_VAL_DELIM (TEXT | PROP_VAL_DELIM)+)?;
TEXT : ~[\r\n";:]+;
NEWLINE : [\r\n]+;
PROP_DELIM : '"';
FIELD_SEP : ';';
PROP_VAL_DELIM: ':';
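As a quick sanity check of the idea, the following Python sketch tokenizes one property line with the disjoint rules and then applies the shape of the property_and_value rule by hand. It is an illustration under assumptions rather than generated ANTLR code: the tokenize and property_and_value helpers are hypothetical, and trimming whitespace around the name and value is my own choice, since the grammar itself keeps it.

```python
import re

# Disjoint lexer rules from the corrected grammar (PROP_VALUE removed).
RULES = [
    ("TEXT", re.compile(r'[^\r\n";:]+')),
    ("NEWLINE", re.compile(r'[\r\n]+')),
    ("PROP_DELIM", re.compile(r'"')),
    ("FIELD_SEP", re.compile(r';')),
    ("PROP_VAL_DELIM", re.compile(r':')),
]


def tokenize(text):
    """Hypothetical helper: the rules are disjoint, so the first match is the only match."""
    pos, tokens = 0, []
    while pos < len(text):
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m:
                tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            raise ValueError(f"no rule matches at offset {pos}")
    return tokens


def property_and_value(tokens):
    """Mirror TEXT (PROP_VAL_DELIM (TEXT | PROP_VAL_DELIM)+)? for one line."""
    if not tokens or tokens[0][0] != "TEXT":
        raise ValueError("expecting TEXT")
    name, rest = tokens[0][1], tokens[1:]
    if not rest:
        return name.strip(), None  # property without a value
    if rest[0][0] != "PROP_VAL_DELIM" or len(rest) < 2:
        raise ValueError("expecting ':' followed by a value")
    if any(kind not in ("TEXT", "PROP_VAL_DELIM") for kind, _ in rest[1:]):
        raise ValueError("unexpected token in value")
    # Everything after the first colon is the value, later colons included.
    value = "".join(text for _, text in rest[1:])
    return name.strip(), value.strip()  # stripping is an assumption


print(property_and_value(tokenize("Property Z: http://someurl.com")))
```

The value is reassembled from several tokens, so a colon inside a value (as in the URL) no longer needs a dedicated lexer rule.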