How to write grammar for an expression when it can have many possible forms
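I have some sentences that I need to convert to regex code, and I am trying to use Pyparsing for it. The sentences are basically search rules that tell us what to search for.
Examples of sentences -
LINE_CONTAINS this is a phrase
- this is an example search rule saying that the line being searched should contain the phrase this is a phrase
LINE_STARTSWITH However we
- this is an example search rule saying that the line being searched should start with the phrase However we
Rules can also be combined, for example - LINE_CONTAINS phrase one BEFORE {phrase2 AND phrase3} AND LINE_STARTSWITH However we
A list of all the actual sentences (if needed) can be found here.
All lines start with one of the two symbols mentioned above (call them line_directives). Now, I am trying to parse these sentences and then convert them to regex code. I started by writing a BNF for my grammar, and this is what I came up with -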
lpar ::= '{'
rpar ::= '}'
line_directive ::= LINE_CONTAINS | LINE_STARTSWITH
phrase ::= lpar(?) + (word+) + rpar(?) # meaning if a phrase is braced, it's still the same
upto_N_words ::= lpar + 'UPTO' + num + 'WORDS' + rpar
N_words ::= lpar + num + 'WORDS' + rpar
upto_N_characters ::= lpar + 'UPTO' + num + 'CHARACTERS' + rpar
N_characters ::= lpar + num + 'CHARACTERS' + rpar
JOIN_phrase ::= phrase + JOIN + phrase
AND_phrase ::= phrase (+ AND + phrase)+
OR_phrase ::= phrase (+ OR + phrase)+
BEFORE_phrase ::= phrase (+ BEFORE + phrase)+
AFTER_phrase ::= phrase (+ AFTER + phrase)+
braced_OR_phrase ::= lpar + OR_phrase + rpar
braced_AND_phrase ::= lpar + AND_phrase + rpar
braced_BEFORE_phrase ::= lpar + BEFORE_phrase + rpar
braced_AFTER_phrase ::= lpar + AFTER_phrase + rpar
braced_JOIN_phrase ::= lpar + JOIN_phrase + rpar
rule ::= line_directive + subrule
final_expr ::= rule (+ AND/OR + rule)+
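The problem is the subrule, for which (based on the empirical data I have) I have been able to come up with all of the following expressions -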
subrule ::= phrase
        ::= OR_phrase
        ::= JOIN_phrase
        ::= BEFORE_phrase
        ::= AFTER_phrase
        ::= AND_phrase
        ::= phrase + upto_N_words + phrase
        ::= braced_OR_phrase + phrase
        ::= phrase + braced_OR_phrase
        ::= phrase + braced_OR_phrase + phrase
        ::= phrase + upto_N_words + braced_OR_phrase
        ::= phrase + upto_N_characters + phrase
        ::= braced_OR_phrase + phrase + upto_N_words + phrase
        ::= phrase + braced_OR_phrase + upto_N_words + phrase
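To give an example, one sentence I have is LINE_CONTAINS the objective of this study was {to identify OR identifying} genes upregulated. For this, the subrule as listed above is phrase + braced_OR_phrase + phrase.
So my question is: how do I write a simple BNF grammar expression for the subrule, so that I can easily code a grammar for it using Pyparsing? Also, any input regarding my present technique is absolutely welcome.
EDIT: After applying the principles elucidated by @Paul in his answer, here is the MCVE version of the code. It takes a list of sentences to be parsed, hrrsents, parses each sentence, converts it to its corresponding regex, and returns a list of regex strings -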
from pyparsing import *
import re


def parse_hrr(hrrsents):
    # Keyword, unlike Literal, only matches whole words, so an ordinary word
    # that merely begins with "OR" or "AND" is not rejected by ~keyword below.
    UPTO, AND, OR, WORDS, CHARACTERS = map(Keyword, "UPTO AND OR WORDS CHARACTERS".split())
    LBRACE, RBRACE = map(Suppress, "{}")
    integer = pyparsing_common.integer()

    LINE_CONTAINS, PARA_STARTSWITH, LINE_ENDSWITH = map(Keyword,
        """LINE_CONTAINS PARA_STARTSWITH LINE_ENDSWITH""".split())  # put option for LINE_ENDSWITH. Users may use, I don't presently
    BEFORE, AFTER, JOIN = map(Keyword, "BEFORE AFTER JOIN".split())
    keyword = UPTO | WORDS | AND | OR | BEFORE | AFTER | JOIN | LINE_CONTAINS | PARA_STARTSWITH

    class Node(object):
        def __init__(self, tokens):
            self.tokens = tokens

        def generate(self):
            pass

    class LiteralNode(Node):
        def generate(self):
            # tokens[0] is the already-joined phrase, so re.escape sees a single string
            return "(%s)" % (re.escape(''.join(self.tokens[0])))

    class ConsecutivePhrases(Node):
        def generate(self):
            tokens = self.tokens[0]
            join_these = [t.generate() for t in tokens]
            seq = []
            for word in join_these[:-1]:
                # an UPTO term already consumes its own trailing whitespace
                if (r"(([\w]+\s*)" in word) or (r"((\w){0," in word):
                    seq.append(word)
                else:
                    seq.append(word + r"\s+")
            seq.append(join_these[-1])
            return "".join(seq)

    class AndNode(Node):
        def generate(self):
            tokens = self.tokens[0]
            join_these = []
            for t in tokens[::2]:
                tg = t.generate()
                # turn "(phrase)" into the lookahead "(?=.*\bphrase\b)"
                tg_mod = tg[0] + r'?=.*\b' + tg[1:-1] + r'\b)'
                join_these.append(tg_mod)
            return '(' + ''.join(join_these) + ')'

    class OrNode(Node):
        def generate(self):
            tokens = self.tokens[0]
            joined = '|'.join(t.generate() for t in tokens[::2])
            return '(' + joined + ')'

    class LineTermNode(Node):
        def generate(self):
            tokens = self.tokens[0]
            ret = ''
            # NOTE: the trailing 456/457/458 are literal text in the generated
            # pattern. A digit cannot follow the (?=[\W_]|$) lookahead, so the
            # pattern cannot match until these suffixes are stripped.
            dir_phr_map = {
                'LINE_CONTAINS': lambda a: r"((?:(?<=^)|(?<=[\W_]))" + a + r"(?=[\W_]|$))456",
                'PARA_STARTSWITH':
                    lambda a: (r"(^" + a + r"(?=[\W_]|$))457") if 'gene' in repr(a)
                    else (r"(^" + a + r"(?=[\W_]|$))458")}
            for line_dir, phr_term in zip(tokens[0::2], tokens[1::2]):
                ret = dir_phr_map[line_dir](phr_term.generate())
            return ret

    class LineAndNode(Node):
        def generate(self):
            tokens = self.tokens[0]
            return '&&&'.join(t.generate() for t in tokens[::2])

    class LineOrNode(Node):
        def generate(self):
            tokens = self.tokens[0]
            return '@@@'.join(t.generate() for t in tokens[::2])

    class UpToWordsNode(Node):
        def generate(self):
            tokens = self.tokens[0]
            ret = ''
            word_re = r"([\w]+\s*)"
            for op, operand in zip(tokens[1::2], tokens[2::2]):
                # op is the parsed integer limit; operand is the WORDS keyword
                ret += "(%s{0,%d})" % (word_re, op)
            return ret

    class UpToCharactersNode(Node):
        def generate(self):
            tokens = self.tokens[0]
            ret = ''
            char_re = r"\w"
            for op, operand in zip(tokens[1::2], tokens[2::2]):
                # op is the parsed integer limit; operand is the CHARACTERS keyword
                ret += "((%s){0,%d})" % (char_re, op)
            return ret

    class BeforeAfterJoinNode(Node):
        def generate(self):
            tokens = self.tokens[0]
            operator_opn_map = {'BEFORE': lambda a, b: a + '.*?' + b,
                                'AFTER': lambda a, b: b + '.*?' + a,
                                'JOIN': lambda a, b: a + '[- ]?' + b}
            ret = tokens[0].generate()
            for operator, operand in zip(tokens[1::2], tokens[2::2]):
                # each map entry takes two regex strings (a, b): the result so far and the next operand
                ret = operator_opn_map[operator](ret, operand.generate())
            return ret

    ## THE GRAMMAR
    word = ~keyword + Word(alphas, alphanums + '-_+/()')
    uptowords_expr = Group(LBRACE + UPTO + integer("numberofwords") + WORDS + RBRACE).setParseAction(UpToWordsNode)
    uptochars_expr = Group(LBRACE + UPTO + integer("numberofchars") + CHARACTERS + RBRACE).setParseAction(UpToCharactersNode)
    some_words = OneOrMore(word).setParseAction(' '.join, LiteralNode)
    phrase_item = some_words | uptowords_expr | uptochars_expr

    phrase_expr = infixNotation(phrase_item,
        [
            ((BEFORE | AFTER | JOIN), 2, opAssoc.LEFT, BeforeAfterJoinNode),  # was not working earlier, because BEFORE etc. were not keywords, and hence parsed as words
            (None, 2, opAssoc.LEFT, ConsecutivePhrases),
            (AND, 2, opAssoc.LEFT, AndNode),
            (OR, 2, opAssoc.LEFT, OrNode),
        ],
        lpar=Suppress('{'), rpar=Suppress('}')
    )  # structure of a single phrase with its operators

    line_term = Group((LINE_CONTAINS | PARA_STARTSWITH)("line_directive") +
                      (phrase_expr)("phrases"))  # basically giving structure to a single sub-rule having line-term and phrase

    line_contents_expr = infixNotation(line_term.setParseAction(LineTermNode),
        [(AND, 2, opAssoc.LEFT, LineAndNode),
         (OR, 2, opAssoc.LEFT, LineOrNode),
        ]
    )  # grammar for the entire rule/sentence
    ######################################
    mrrlist = []
    for t in hrrsents:
        t = t.strip()
        if not t:
            continue
        try:
            parsed = line_contents_expr.parseString(t)
        except ParseException as pe:
            print(' ' * pe.loc + '^')
            print(pe)
            continue
        temp_regex = parsed[0].generate()
        final_regexes3 = re.sub(r'gene', '%s', temp_regex)  # this can be made more precise by putting a condition of [non-word/^/$] around the 'gene'
        mrrlist.append(final_regexes3)
    return mrrlist
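You have a two-tiered grammar here, so you would do best to focus on one tier at a time, which we have covered in some of your other questions. The lower tier is that of the phrase_expr, which will later serve as the argument to the line_directive_expr. So define examples of phrase expressions first - extract them from your list of complete statement samples. Your finished BNF for phrase_expr will have its lowest level of recursion look like: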
phrase_atom ::= <one or more types of terminal items, like words of characters
                 or quoted strings, or *possibly* expressions of numbers of
                 words or characters> | brace + phrase_expr + brace
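(Some other questions: is it possible to have multiple phrase_items one after another with no operator? What does that indicate? How should it be parsed? Interpreted? Should this implied operation be its own level of precedence?)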
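As a rough illustration of that recursion, here is a minimal pyparsing sketch. It is not the finished grammar: the names `words` and `phrase_atom`, the reduced keyword set, and the placeholder body of `phrase_expr` (which ignores operator precedence entirely) are all assumptions made for the example. The only point is that a single brace alternative in the atom loops back to the full expression through a `Forward`.

```python
from pyparsing import (Forward, Group, Keyword, OneOrMore, Suppress, Word,
                       alphas, alphanums)

# Operators are reserved so that they are never swallowed as ordinary words.
AND, OR, JOIN = map(Keyword, "AND OR JOIN".split())
keyword = AND | OR | JOIN
word = ~keyword + Word(alphas, alphanums + "-_")

LBRACE, RBRACE = map(Suppress, "{}")

# Declared up front so that phrase_atom can refer to it before it is defined.
phrase_expr = Forward()

# phrase_atom ::= word+ | '{' phrase_expr '}'
words = OneOrMore(word).setParseAction(lambda t: " ".join(t))
phrase_atom = words | Group(LBRACE + phrase_expr + RBRACE)

# Placeholder body (an assumption for this sketch): atoms and operators in any
# order, with no precedence. Only the brace recursion is being demonstrated.
phrase_expr <<= OneOrMore(phrase_atom | keyword)

result = phrase_expr.parseString("was {to identify OR identifying} genes",
                                 parseAll=True)
print(result.asList())
```

The braced part comes back as a nested list, with no dedicated braced_OR_phrase rule involved.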
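That one brace alternative in phrase_atom is sufficient to loop back for the recursion in your phrase expression - you shouldn't need any other braced_xxx element in your BNF. AND, OR, and JOIN are clearly binary operators - in normal operator precedence, ANDs are evaluated before ORs, and you can decide for yourself where JOIN should fall. Write some sample phrases with no braces, using AND and JOIN, and OR and JOIN, and think through which order of evaluation makes sense in your domain.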
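One way to run that experiment is to let pyparsing's infixNotation show how unbraced phrases group under a candidate ordering. In the sketch below, placing JOIN tightest is purely an assumption to be tested, not a recommendation, and `grouping` is a hypothetical helper name.

```python
from pyparsing import (Keyword, OneOrMore, Suppress, Word, alphas, alphanums,
                       infixNotation, opAssoc)

AND, OR, JOIN = map(Keyword, "AND OR JOIN".split())
keyword = AND | OR | JOIN
word = ~keyword + Word(alphas, alphanums + "-_")
words = OneOrMore(word).setParseAction(lambda t: " ".join(t))

# Operators are listed from tightest to loosest binding.
phrase_expr = infixNotation(
    words,
    [
        (JOIN, 2, opAssoc.LEFT),  # assumption: JOIN binds tighter than AND
        (AND, 2, opAssoc.LEFT),   # AND is evaluated before OR
        (OR, 2, opAssoc.LEFT),
    ],
    lpar=Suppress("{"),
    rpar=Suppress("}"),
)


def grouping(text):
    """Return the nested-list structure pyparsing assigns to a phrase."""
    return phrase_expr.parseString(text, parseAll=True).asList()


print(grouping("a OR b AND c"))    # AND groups first
print(grouping("a JOIN b AND c"))  # JOIN groups first
print(grouping("{a OR b} AND c"))  # braces override precedence
```

Reordering the three tuples and re-running the same phrases makes it easy to compare alternatives before committing to one.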
Once that is done, line_directive_expr should be simple, since it is just:
line_directive_item ::= line_directive phrase_expr | brace line_directive_expr brace
line_directive_and ::= line_directive_item (AND line_directive_item)*
line_directive_or ::= line_directive_and (OR line_directive_and)*
line_directive_expr ::= line_directive_or
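The two tiers can be sketched in pyparsing roughly as follows. This is a simplified illustration, not the full grammar: the phrase tier only knows AND and OR, the names mirror the BNF above, and `braces` is just a convenience dict introduced for the example.

```python
from pyparsing import (Group, Keyword, OneOrMore, Suppress, Word, alphas,
                       alphanums, infixNotation, opAssoc)

AND, OR = map(Keyword, "AND OR".split())
LINE_CONTAINS, LINE_STARTSWITH = map(Keyword, "LINE_CONTAINS LINE_STARTSWITH".split())
line_directive = LINE_CONTAINS | LINE_STARTSWITH
keyword = AND | OR | line_directive

word = ~keyword + Word(alphas, alphanums + "-_")
words = OneOrMore(word).setParseAction(lambda t: " ".join(t))
braces = dict(lpar=Suppress("{"), rpar=Suppress("}"))
and_or = [(AND, 2, opAssoc.LEFT), (OR, 2, opAssoc.LEFT)]

# Lower tier: phrases combined with AND / OR.
phrase_expr = infixNotation(words, and_or, **braces)

# Upper tier: line_directive_item ::= line_directive phrase_expr
line_directive_item = Group(line_directive + phrase_expr)
line_directive_expr = infixNotation(line_directive_item, and_or, **braces)

rule = "LINE_CONTAINS phrase one AND LINE_STARTSWITH However we"
parsed = line_directive_expr.parseString(rule, parseAll=True).asList()
print(parsed)
```

Because the line directives are reserved words, the phrase tier cannot consume the AND that is followed by LINE_STARTSWITH, so that AND is left for the upper tier.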
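Then, when you translate to pyparsing, add Groups and results names a little at a time! Don't immediately Group everything or name everything. Ordinarily I recommend using results names liberally, but in infix notation grammars, lots of results names just clutter up the results. Let the Groups (and ultimately the node classes) do the structuring, and the behavior in the node classes will guide you to where you want results names. For that matter, the results classes usually have such a simple structure that it is often easier just to do list unpacking in the class init or evaluate methods.
Start with simple expressions and work up to complicated ones. (Look at "LINE_STARTSWITH gene" - it is one of your simplest test cases, but you have it as #97?) If you just sort this list by length, that would be a good rough cut. Or sort by increasing number of operators. But if you tackle the complex cases before you have the simple ones working, you will have too many options for where a tweak or refinement should go, and (speaking from personal experience) you are just as likely to get it wrong as to get it right - except that when you get it wrong, it just makes fixing the next issue more difficult.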
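A small sketch of that ordering, using only the standard library. The operator list and the `complexity` helper are assumptions made for illustration; the sample sentences are the ones quoted in this thread.

```python
# Keywords counted as operators (an assumed list, adjust to your language).
OPERATORS = {"AND", "OR", "JOIN", "BEFORE", "AFTER", "UPTO"}


def complexity(sentence):
    """Sort key: number of operator keywords first, then total length."""
    tokens = sentence.replace("{", " ").replace("}", " ").split()
    n_ops = sum(tok in OPERATORS for tok in tokens)
    return (n_ops, len(sentence))


sentences = [
    "LINE_CONTAINS phrase one BEFORE {phrase2 AND phrase3} AND LINE_STARTSWITH However we",
    "LINE_CONTAINS this is a phrase",
    "LINE_STARTSWITH gene",
    "LINE_CONTAINS the objective of this study was {to identify OR identifying} genes upregulated",
]

# Simplest cases first, so each new failure points at one new feature.
ordered = sorted(sentences, key=complexity)
for s in ordered:
    print(complexity(s), s)
```

Working through the list in this order means the operator-free rules are parsing before any rule with operators is attempted.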
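And, as we have discussed elsewhere, the devil in this second tier is in the actual interpretation of the various line directive items, since there is an implied order for evaluating LINE_STARTSWITH vs. LINE_CONTAINS that overrides the order in which they may be found in the initial string. That ball is entirely in your court, since you are the language designer for this particular domain.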