How to match integers in NLTK CFG?

If I want to define a grammar in which one of the tokens matches an integer, how can I do that with NLTK's string-based CFG?

For example:

S -> SK SO FK
SK -> 'SELECT'
SO -> '\d+'
FK -> 'FROM'

You can create a number phrase by listing the numbers as terminals:

import nltk

groucho_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I' | NUM N
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas' | 'elephants'
V -> 'shot'
P -> 'in'
NUM -> '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' | '10'
""")

sent = 'I shot 3 elephants'.split()
parser = nltk.ChartParser(groucho_grammar)
for tree in parser.parse(sent):
    print(tree)

[out]:

(S (NP I) (VP (V shot) (NP (NUM 3) (N elephants))))
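For context on why the quoted '\d+' in the question does not behave like a regular expression: terminals in an NLTK CFG are compared with the input tokens as literal strings. The sketch below checks this on the question's grammar using `CFG.check_coverage`, which raises `ValueError` when a token is not a terminal of the grammar. The helper name `covers` is made up for this illustration.

```python
import nltk

# Raw string so the backslash reaches NLTK unchanged.
select_grammar = nltk.CFG.fromstring(r"""
S -> SK SO FK
SK -> 'SELECT'
SO -> '\d+'
FK -> 'FROM'
""")

def covers(grammar, tokens):
    """Hypothetical helper: True if every token is a terminal of the grammar."""
    try:
        grammar.check_coverage(tokens)
        return True
    except ValueError:
        return False

# The terminal is the literal three-character string \d+, not a pattern.
print(covers(select_grammar, ['SELECT', r'\d+', 'FROM']))
print(covers(select_grammar, ['SELECT', '42', 'FROM']))
```

The first call reports that the tokens are covered and the second that they are not, which is why the integers have to be either listed in the grammar or rewritten before parsing.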

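Note that this approach only handles the numbers listed explicitly in the grammar (here 0 through 10). So let's compress all integers into a single token type, e.g. '#NUM#':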

import nltk

groucho_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I' | NUM N
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas' | 'elephants'
V -> 'shot'
P -> 'in'
NUM -> '#NUM#'
""")

sent = 'I shot 333 elephants'.split()
sent = ['#NUM#' if i.isdigit() else i for i in sent]

parser = nltk.ChartParser(groucho_grammar)
for tree in parser.parse(sent):
    print(tree)

[out]:

(S (NP I) (VP (V shot) (NP (NUM #NUM#) (N elephants))))
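One caveat with `str.isdigit()` as the test for "is this token an integer": it rejects signed numbers such as '-12', and it accepts some non-ASCII digit characters such as '²' that `int()` cannot convert. A minimal sketch of a stricter masking step, assuming that an optional sign followed by ASCII digits is what should count as an integer (the names `INT_RE` and `mask_integers` are made up for this illustration):

```python
import re

# Assumption: an integer token is an optional sign followed by ASCII digits.
INT_RE = re.compile(r'[+-]?[0-9]+')

def mask_integers(tokens, placeholder='#NUM#'):
    """Replace integer tokens with a placeholder; also return the originals in order."""
    masked, numbers = [], []
    for tok in tokens:
        if INT_RE.fullmatch(tok):
            masked.append(placeholder)
            numbers.append(tok)
        else:
            masked.append(tok)
    return masked, numbers

masked, numbers = mask_integers('I shot -12 elephants'.split())
print(masked)
print(numbers)
```

The returned `numbers` list keeps the original tokens in sentence order, so they can be substituted back after parsing.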

To put the numbers back, try:

import nltk

groucho_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I' | NUM N
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas' | 'elephants'
V -> 'shot'
P -> 'in'
NUM -> '#NUM#'
""")

original_sent = 'I shot 333 elephants'.split()
sent = ['#NUM#' if i.isdigit() else i for i in original_sent]
numbers = [i for i in original_sent if i.isdigit()]

parser = nltk.ChartParser(groucho_grammar)
for tree in parser.parse(sent):
    treestr = str(tree)
    for n in numbers:
        treestr = treestr.replace('#NUM#', n, 1)
    print(treestr)

[out]:

(S (NP I) (VP (V shot) (NP (NUM 333) (N elephants))))
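The string replacement above yields text rather than a tree. If a `Tree` object is needed afterwards, the same substitution can be done on the leaves instead. This is a sketch, not part of the original answer; `restore_numbers` is a made-up name, and it assumes the numbers are supplied in the same left-to-right order as the placeholders.

```python
from nltk import Tree

def restore_numbers(tree, numbers, placeholder='#NUM#'):
    """Return a copy of `tree` with each placeholder leaf replaced by the next number."""
    restored = tree.copy(deep=True)   # leave the parser's tree untouched
    pending = iter(numbers)
    # treepositions('leaves') lists leaf positions in left-to-right order.
    for pos in restored.treepositions('leaves'):
        if restored[pos] == placeholder:
            restored[pos] = next(pending)
    return restored

# Stand-in for a tree produced by the parser above.
parsed = Tree.fromstring('(S (NP I) (VP (V shot) (NP (NUM #NUM#) (N elephants))))')
print(restore_numbers(parsed, ['333']))
```

Because only leaves equal to the placeholder are touched, labels and other tokens that happen to contain the same characters are left alone.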

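Another simple solution is to define a function that builds the parser from a given sentence and grammar. On each call it extends the grammar with productions for the integers that occur in the sentence, which solves the integer problem. The grammar G is expected to use the nonterminal INT wherever an integer may appear. Here is an example function: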

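import nltk
from nltk.grammar import CFG, Nonterminal, Production

def name_parser(G, sent):
    # Collect the integer tokens that appear in this sentence.
    ints = [i for i in sent if i.isdigit()]
    # Copy the grammar's productions and add one INT -> '<n>' rule per integer.
    lproductions = list(G.productions())
    lproductions.extend([Production(Nonterminal('INT'), [i]) for i in ints])
    # Build a sentence-specific grammar and parse with it.
    lgrammar = CFG(G.start(), lproductions)
    parser = nltk.ChartParser(lgrammar)
    for tree in parser.parse(sent):
        print(tree)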
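As a usage sketch, the same idea can be applied to the grammar from the question. The variant below is an illustration rather than the answer's exact function: `parse_with_ints` is a made-up name, it returns the trees instead of printing them, and it removes duplicate integers before adding productions. It assumes the base grammar refers to INT without defining it.

```python
import nltk
from nltk.grammar import CFG, Nonterminal, Production

def parse_with_ints(grammar, tokens):
    """Parse with a copy of `grammar` extended by INT -> '<n>' for each integer token."""
    ints = sorted({tok for tok in tokens if tok.isdigit()})
    productions = list(grammar.productions())
    productions.extend(Production(Nonterminal('INT'), [n]) for n in ints)
    extended = CFG(grammar.start(), productions)
    return list(nltk.ChartParser(extended).parse(tokens))

# The question's grammar, with SO rewritten to the INT nonterminal.
select_grammar = nltk.CFG.fromstring("""
S -> SK SO FK
SK -> 'SELECT'
SO -> INT
FK -> 'FROM'
""")

trees = parse_with_ints(select_grammar, 'SELECT 42 FROM'.split())
for tree in trees:
    print(tree)
```

Since the grammar is rebuilt per sentence, the cost grows with the number of calls. For large inputs the placeholder approach above may be cheaper, because it builds the grammar and parser once.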