Parse emails using ANTLR
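I have spent a whole week trying to build an ANTLR grammar that lets me parse emails.
My goal is not to parse the entire email exhaustively into tokens, but to parse out the relevant sections.
Here is the document format I have to deal with. Text after // is an inline comment that is not part of the message: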
Subject : [SUBJECT_MARKER] + lorem ipsum...
// marks a message that needs to be parsed.
// Subject marker can be something like "help needed", "action required"
Body:
// irrelevant text we can ignore, discard or skip
Hi George,
Hope you had a good weekend. Another fluff sentence ...
// end of irrelevant text
// beginning of the SECTION_TYPE_1. SECTION_TYPE_1 marker is "answers below:"
[SECTION_TYPE_1]
Meaningful text block that needs capturing, made of many sentences: Lorem ipsum ...
[SENTENCE_MARKER] - Sentence that needs to be captured.
[SENTENCE_MARKER] - Sentence that needs to be captured.
[SENTENCE_MARKER] - Sentence that needs to be captured.
[SECTION_END_MARKER] // this is "\n\n"
// SENTENCE_MARKER can be "a)", "b)" or anything of the form "[a-zA-Z]')'"
// one important requirement is that SENTENCE_MARKER matches only inside a section, either SECTION_TYPE_1 or SECTION_TYPE_2
// alternatively, instead of [SECTION_TYPE_1] we can have [SECTION_TYPE_2].
// if we have SECTION_TYPE_1 then try to parse SECTION_TYPE_1, else try to parse SECTION_TYPE_2.
[SECTION_TYPE_2] // beginning of the section type 2
Meaningful text block that needs capturing. Many sentences: Lorem ipsum ...
[SENTENCE_MARKER] - Sentence that needs to be captured.
[SENTENCE_MARKER] - Sentence that needs to be captured.
[SENTENCE_MARKER] - Sentence that needs to be captured.
[SECTION_END_MARKER] // same as above
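The problems I am facing are the following:
- I have not come up with a good way to skip the text at the beginning of the message and start applying the parsing rules only after a marker such as SECTION_TYPE_1 has been found.
- Capturing all the text inside a section between the section start and the sentence markers.
- After the SECTION_END marker, ignoring all the text that follows.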
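ANTLR is a parser for structured text, ideally unambiguously structured text. Unless your source messages have relatively well-defined features that reliably mark the parts of the message you are interested in, ANTLR is unlikely to work.
A better approach would be to use a natural language processing (NLP) package to identify the form and object of each sentence or phrase, and from that identify the ones of interest. The Stanford NLP package is quite well known (available on GitHub).
Update
The grammar would need to take a form along these lines: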
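message : subject ( sec1 | sec2 | fluff )* EOF ;
subject : fluff* SUBJECT_MARKER subText EOL ;
subText : ( word | HWS )+ ;
sec1 : ( SECTION_TYPE_1 content )+ SECTION_END_MARKER ;
sec2 : ( SECTION_TYPE_2 content )+ SECTION_END_MARKER ;
// SENTENCE_MARKER is only meaningful here; outside a section it is consumed as fluff
content : ( word | ws | SENTENCE_MARKER )+ ;
word : CHAR+ ;
ws : ( EOL | HWS )+ ;
fluff : . ; // any single token outside a section
SUBJECT_MARKER : 'marker' ; // placeholder, e.g. "help needed"
SECTION_TYPE_1 : 'text1' ; // placeholder, e.g. "answers below:"
SECTION_TYPE_2 : 'text2' ; // placeholder
SENTENCE_MARKER : [a-zA-Z0-9] ')' ;
// assumed definition: the question describes the section end as "\n\n"
SECTION_END_MARKER : '\r'? '\n' '\r'? '\n' ;
EOL : '\r'? '\n';
HWS : [ \t] ;
CHAR : . ;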
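Success will depend on how unambiguous the various markers are, and there will certainly be ambiguities. Either modify the grammar to handle the ambiguities explicitly, or defer resolving them to the tree-walk/analysis phase.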
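To make the "defer to an analysis phase" option a bit more concrete, here is a rough sketch in plain Python rather than ANTLR. It does not walk an ANTLR parse tree; it only illustrates the intended semantics: everything before a section marker is skipped, sentence markers count only inside a section, and text after the blank line that ends a section is ignored. The function name extract_sections is hypothetical, the marker strings 'text1' and 'text2' are the placeholders from the grammar, and requiring a sentence marker to sit at the start of a line is an extra assumption made here to reduce ambiguity.

```python
import re

# Placeholder marker strings taken from the grammar sketch; real ones would be
# something like "answers below:".
SECTION_MARKERS = {"text1": "SECTION_TYPE_1", "text2": "SECTION_TYPE_2"}
SECTION_END = "\n\n"  # per the question, a blank line ends a section

# Assumption of this sketch: a sentence marker such as "a)" starts a line.
SENTENCE_MARKER = re.compile(r"^[a-zA-Z0-9]\)", re.MULTILINE)


def extract_sections(message):
    """Return a list of (section_type, intro_text, sentences) tuples.

    sentences is a list of (marker, text) pairs found inside the section.
    """
    results = []
    pos = 0
    while True:
        # Locate the earliest section marker at or after pos.
        # Everything before it is fluff and is simply skipped.
        hits = [(message.find(m, pos), m) for m in SECTION_MARKERS]
        hits = [(i, m) for i, m in hits if i != -1]
        if not hits:
            break
        start, marker = min(hits)
        body_start = start + len(marker)

        # The section runs up to the next blank line, or to the end of input.
        end = message.find(SECTION_END, body_start)
        if end == -1:
            end = len(message)
        body = message[body_start:end]

        # Sentence markers are searched for only inside the section body.
        marks = list(SENTENCE_MARKER.finditer(body))
        intro_end = marks[0].start() if marks else len(body)
        intro = body[:intro_end].strip()

        sentences = []
        for k, m in enumerate(marks):
            stop = marks[k + 1].start() if k + 1 < len(marks) else len(body)
            text = body[m.end():stop].strip().lstrip("- ")
            sentences.append((m.group(), text))

        results.append((SECTION_MARKERS[marker], intro, sentences))
        pos = end + len(SECTION_END)
    return results


sample = (
    "Subject: marker lorem ipsum\n"
    "Hi George,\n"
    "a) this is fluff, not a sentence marker.\n"
    "\n"
    "text1\n"
    "Meaningful intro.\n"
    "a) - First answer.\n"
    "b) - Second answer.\n"
    "\n"
    "Trailing text to ignore.\n"
)
sections = extract_sections(sample)
print(sections)
```

Note that the "a)" in the greeting is never treated as a sentence marker, because the search for markers is confined to the text between a section marker and the next blank line. The same filtering could be done in a listener or visitor over the ANTLR parse tree instead.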