NLTK CFG ValueError: Grammar does not cover some of the input words

Question

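I am using nltk.ChartParser(grammar) to process text and I get the error message given in the title.

I don't understand why, because every word in my sentence is covered by the grammar, as you can see in my code:

Step 1: Preprocessing (no error)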

message = "The burglar robbed the bank"

import nltk
    
def preprocess(text):
    sentences = nltk.sent_tokenize(text)                     # sentence segmentation
    sentences = [nltk.word_tokenize(s) for s in sentences]   # word tokenization
    sentences = [nltk.pos_tag(s) for s in sentences]         # part-of-speech tagger
    return sentences

preprocessed = preprocess(message)

print(preprocessed) # >>>> [[('The', 'DT'), ('burglar', 'NN'), ('robbed', 'VBD'), ('the', 'DT'), ('bank', 'NN')]]

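At this point the sentence is preprocessed and I can define my grammar. It covers every word in the example sentence, as you can see:

Step 2: Define the grammar (no error)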

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DT NN
VP -> VBD NP
DT -> 'the' | 'The'
NN -> 'burglar' | 'bank'
VBD -> 'robbed'
""")

But running the actual parse raises an error:

Step 3: Parsing

parser = nltk.ChartParser(grammar)

for sentence in preprocessed:
    for tree in parser.parse(sentence):
        print(tree)

# >>>> ValueError: Grammar does not cover some of the input words: "('The', 'DT'), ('burglar', 'NN'), ('robbed', 'VBD'), ('the', 'DT'), ('bank', 'NN')".

I don't understand why this error occurs. The grammar spells out every word.

Answer 1

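It looks like passing the tokens through nltk.pos_tag is the problem: as the printed output in the question shows, it turns each token into a (word, tag) tuple, while the terminals in the grammar are plain strings. Comment out that line and the script works: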

import nltk

message = "The burglar robbed the bank"

#----------------------------------------------------------
# Preprocessing
#----------------------------------------------------------
def preprocess(text):
    sentences = nltk.sent_tokenize(text)                     # sentence segmentation
    sentences = [nltk.word_tokenize(s) for s in sentences]   # word tokenization
    # THIS LINE SEEMS TO BE THE ISSUE
    # sentences = [nltk.pos_tag(s) for s in sentences]         # part-of-speech tagger
    return sentences

preprocessed = preprocess(message)

#----------------------------------------------------------
# Define grammar
#----------------------------------------------------------
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DT NN
VP -> VBD NP
DT -> 'the' | 'The'
NN -> 'burglar' | 'bank'
VBD -> 'robbed'
""")

#----------------------------------------------------------
# Parsing
#----------------------------------------------------------
parser = nltk.ChartParser(grammar)

for sentence in preprocessed:
    for tree in parser.parse(sentence):
        print(tree)

Output:

(S
  (NP (DT The) (NN burglar))
  (VP (VBD robbed) (NP (DT the) (NN bank))))
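
If you still want the POS tags for a later step, you don't have to drop the tagger entirely. Here is a minimal sketch, assuming the tagger output is exactly what the question printed (it is hard-coded below, so no NLTK data download is needed). It first reproduces the cause with grammar.check_coverage, then parses only the words while leaving the tagged list intact. The variable names tagged, covered, and words are mine, not from the original post.

```python
import nltk

# Tagger output copied from the question (hard-coded for illustration).
tagged = [('The', 'DT'), ('burglar', 'NN'), ('robbed', 'VBD'),
          ('the', 'DT'), ('bank', 'NN')]

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DT NN
VP -> VBD NP
DT -> 'the' | 'The'
NN -> 'burglar' | 'bank'
VBD -> 'robbed'
""")
parser = nltk.ChartParser(grammar)

# Cause: the grammar's terminals are strings such as 'The', but each
# token in `tagged` is a tuple such as ('The', 'DT'), so none is covered.
try:
    grammar.check_coverage(tagged)
    covered = True
except ValueError:
    covered = False

# Fix: keep `tagged` for later use, but give the parser only the words.
words = [word for word, _tag in tagged]
trees = list(parser.parse(words))
for tree in trees:
    print(tree)
```

The parser compares each input token against the quoted terminals in the grammar, so whatever you pass to parser.parse has to be a list of the same kind of objects (here, plain strings).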
