NLTK CFG ValueError: Grammar does not cover some of the input words

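I am using nltk.ChartParser(grammar) to parse text and get the error message shown in the title.

I don't understand why, since every word in my sentence is covered by the grammar, as you can see in my code:

Step 1: Preprocessing (no error)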

message = "The burglar robbed the bank"

import nltk
    
def preprocess(text):
    sentences = nltk.sent_tokenize(text)                     # sentence segmentation
    sentences = [nltk.word_tokenize(s) for s in sentences]   # word tokenization
    sentences = [nltk.pos_tag(s) for s in sentences]         # part-of-speech tagger
    return sentences

preprocessed = preprocess(message)

print(preprocessed) # >>>> [[('The', 'DT'), ('burglar', 'NN'), ('robbed', 'VBD'), ('the', 'DT'), ('bank', 'NN')]]

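At this point the sentence is preprocessed and I can define my grammar. It covers every word in the example sentence, as you can see:

Step 2: Define the grammar (no error)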

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DT NN
VP -> VBD NP
DT -> 'the' | 'The'
NN -> 'burglar' | 'bank'
VBD -> 'robbed'
""")

But running the actual parse raises the error:

Step 3: Parsing

parser = nltk.ChartParser(grammar)

for sentence in preprocessed:
    for tree in parser.parse(sentence):
        print(tree)

# >>>> ValueError: Grammar does not cover some of the input words: "('The', 'DT'), ('burglar', 'NN'), ('robbed', 'VBD'), ('the', 'DT'), ('bank', 'NN')".

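I don't understand why this error occurs. The grammar clearly lists every word.

The problem appears to be passing the tokens through nltk.pos_tag. It turns each token into a (word, tag) tuple such as ('The', 'DT'), while the grammar's terminals are plain strings such as 'The', so none of the input tokens match a terminal. Comment out that line and the script works: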

import nltk

message = "The burglar robbed the bank"

#----------------------------------------------------------
# Preprocessing
#----------------------------------------------------------
def preprocess(text):
    sentences = nltk.sent_tokenize(text)                     # sentence segmentation
    sentences = [nltk.word_tokenize(s) for s in sentences]   # word tokenization
    # THIS LINE SEEMS TO BE THE ISSUE
    # sentences = [nltk.pos_tag(s) for s in sentences]         # part-of-speech tagger
    return sentences

preprocessed = preprocess(message)

#----------------------------------------------------------
# Define grammar
#----------------------------------------------------------
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DT NN
VP -> VBD NP
DT -> 'the' | 'The'
NN -> 'burglar' | 'bank'
VBD -> 'robbed'
""")

#----------------------------------------------------------
# Parsing
#----------------------------------------------------------
parser = nltk.ChartParser(grammar)

for sentence in preprocessed:
    for tree in parser.parse(sentence):
        print(tree)

Output:

(S
  (NP (DT The) (NN burglar))
  (VP (VBD robbed) (NP (DT the) (NN bank))))
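
If the POS tags are still needed elsewhere, one option is to keep the tagging step and strip the tags just before parsing. The following is a minimal sketch rather than a definitive fix. The tagged list is hard-coded from the printed output in the question so it runs without the tokenizer and tagger models, and coverage_ok is a hypothetical helper name introduced only for illustration.

```python
import nltk

# Tagged tokens copied from the printed output in the question; hard-coded
# here (an assumption for this sketch) so no NLTK data download is needed.
tagged = [('The', 'DT'), ('burglar', 'NN'), ('robbed', 'VBD'),
          ('the', 'DT'), ('bank', 'NN')]

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DT NN
VP -> VBD NP
DT -> 'the' | 'The'
NN -> 'burglar' | 'bank'
VBD -> 'robbed'
""")
parser = nltk.ChartParser(grammar)


def coverage_ok(tokens):
    """Hypothetical helper: True if every token is a terminal of the grammar."""
    try:
        grammar.check_coverage(tokens)  # raises ValueError on uncovered tokens
        return True
    except ValueError:
        return False


# The (word, tag) tuples are not terminals of the grammar; the bare words are.
print(coverage_ok(tagged))

# Drop the tags, keeping only the word from each (word, tag) pair.
words = [word for word, _tag in tagged]
print(coverage_ok(words))

trees = list(parser.parse(words))
for tree in trees:
    print(tree)
```

The tagged list stays available for any later step that needs the tags, while the parser only ever sees plain strings that can match the quoted terminals in the grammar.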