NLTK CFG ValueError: Grammar does not cover some of the input words
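I'm using nltk.ChartParser(grammar) to process text and get the error message quoted in the title.
I don't understand why, because every word in my sentence is covered by the grammar, as you can see in my code:
Step 1: Preprocessing (no error)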
message = "The burglar robbed the bank"
import nltk
def preprocess(text):
    sentences = nltk.sent_tokenize(text) # sentence segmentation
    sentences = [nltk.word_tokenize(s) for s in sentences] # word tokenization
    sentences = [nltk.pos_tag(s) for s in sentences] # part-of-speech tagger
    return sentences
preprocessed = preprocess(message)
print(preprocessed) # >>>> [[('The', 'DT'), ('burglar', 'NN'), ('robbed', 'VBD'), ('the', 'DT'), ('bank', 'NN')]]
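At this point the sentence is preprocessed and I can define my grammar. It covers every word in the example sentence, as you can see:
Step 2: Define the grammar (no error)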
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DT NN
VP -> VBD NP
DT -> 'the' | 'The'
NN -> 'burglar' | 'bank'
VBD -> 'robbed'
""")
But running the actual parse raises the error:
Step 3: Parsing
parser = nltk.ChartParser(grammar)
for sentence in preprocessed:
    for tree in parser.parse(sentence):
        print(tree)
# >>>> ValueError: Grammar does not cover some of the input words: "('The', 'DT'), ('burglar', 'NN'), ('robbed', 'VBD'), ('the', 'DT'), ('bank', 'NN')".
I don't understand why this error occurs. The grammar clearly lists every word.
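It looks like running the tokens through nltk.pos_tag is the problem. pos_tag turns each token into a (word, tag) tuple, while the terminals in the grammar are plain strings, so none of the tokens match.
Comment out that line and the script works: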
import nltk
message = "The burglar robbed the bank"
#----------------------------------------------------------
# Preprocessing
#----------------------------------------------------------
def preprocess(text):
    sentences = nltk.sent_tokenize(text) # sentence segmentation
    sentences = [nltk.word_tokenize(s) for s in sentences] # word tokenization
    # THIS LINE SEEMS TO BE THE ISSUE
    # sentences = [nltk.pos_tag(s) for s in sentences] # part-of-speech tagger
    return sentences
preprocessed = preprocess(message)
#----------------------------------------------------------
# Define grammar
#----------------------------------------------------------
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DT NN
VP -> VBD NP
DT -> 'the' | 'The'
NN -> 'burglar' | 'bank'
VBD -> 'robbed'
""")
#----------------------------------------------------------
# Parsing
#----------------------------------------------------------
parser = nltk.ChartParser(grammar)
for sentence in preprocessed:
    for tree in parser.parse(sentence):
        print(tree)
Output:
(S
  (NP (DT The) (NN burglar))
  (VP (VBD robbed) (NP (DT the) (NN bank))))
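To see why the tagged tokens fail, here is a minimal pure-Python sketch of the mismatch. It is only a simplified stand-in for the coverage check the parser runs before parsing, not NLTK's actual code; the `uncovered` helper and the hand-written `terminals` set are assumptions made for illustration.

```python
# Terminals copied by hand from the grammar: every terminal is a plain string.
terminals = {"the", "The", "burglar", "bank", "robbed"}


def uncovered(tokens, terminals):
    """Return the tokens that match no terminal.

    Hypothetical helper: a simplified stand-in for a coverage check,
    not the library's implementation.
    """
    return [tok for tok in tokens if tok not in terminals]


# Shape of one sentence after nltk.pos_tag has run: (word, tag) tuples.
tagged = [("The", "DT"), ("burglar", "NN"), ("robbed", "VBD"),
          ("the", "DT"), ("bank", "NN")]

# A tuple never equals a string, so every tagged token is reported as uncovered.
print(uncovered(tagged, terminals))

# Dropping the tags recovers plain word tokens, which the grammar does cover.
words = [word for word, _tag in tagged]
print(uncovered(words, terminals))  # prints []
```

If the tags are still needed elsewhere, one option is to keep the tagged list and hand only the words to the parser, for example parser.parse(words). NLTK also ships nltk.tag.untag, which should do the same stripping.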