Naive Bayes classifier - empty vocabulary
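I am trying to use Naive Bayes to detect humor in text. I took this code from here, but I get an error that I don't know how to solve, because I am new to machine learning and these algorithms. My training data consists of one-liners. I know others have asked the same question, but I have not found an answer yet.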
import os
import io
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def readFiles(path):
    # Walk the directory tree and yield (file path, message body) pairs.
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            path = os.path.join(root, filename)

            # Everything before the first blank line is treated as a header
            # and skipped; only the lines after it are kept.
            inBody = False
            lines = []
            f = io.open(path, 'r', encoding='latin1')
            for line in f:
                if inBody:
                    lines.append(line)
                elif line == '\n':
                    inBody = True
            f.close()
            message = '\n'.join(lines)
            yield path, message


def dataFrameFromDirectory(path, classification):
    # Build one row per file, labelled with the given class.
    rows = []
    index = []
    for filename, message in readFiles(path):
        rows.append({'message': message, 'class': classification})
        index.append(filename)

    return DataFrame(rows, index=index)


data = DataFrame({'message': [], 'class': []})

data = data.append(dataFrameFromDirectory('G:/PyCharmProjects/naive_bayes_classifier/train_jokes', 'funny'))
data = data.append(dataFrameFromDirectory('G:/PyCharmProjects/naive_bayes_classifier/train_non_jokes', 'notfunny'))

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(data['message'].values)

classifier = MultinomialNB()
targets = data['class'].values
classifier.fit(counts, targets)

examples = ['Where do steers go to dance? The Meat Ball', 'tomorrow I press this button']
examples_counts = vectorizer.transform(examples)
predictions = classifier.predict(examples_counts)
print(predictions)
Error:
Traceback (most recent call last):
  File "G:/PyCharmProjects/naive_bayes_classifier/NaiveBayesClassifier.py", line 55, in <module>
    counts = vectorizer.fit_transform(data['message'].values)
  File "C:\Users\mr_wizard\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\feature_extraction\text.py", line 869, in fit_transform
    self.fixed_vocabulary_)
  File "C:\Users\mr_wizard\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\feature_extraction\text.py", line 811, in _count_vocab
    raise ValueError("empty vocabulary; perhaps the documents only"
ValueError: empty vocabulary; perhaps the documents only contain stop words
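The exception means that no document produced a single token. As a rough illustration only, here is a standard-library sketch of that situation. It assumes CountVectorizer's default token pattern (runs of two or more word characters) and ignores its other options such as stop words; `build_vocabulary` is a hypothetical helper, not part of scikit-learn.

```python
import re

# Assumption: this mirrors CountVectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def build_vocabulary(documents):
    """Collect the set of lowercase tokens found across all documents."""
    vocabulary = set()
    for doc in documents:
        vocabulary.update(TOKEN_PATTERN.findall(doc.lower()))
    return vocabulary


# What a reader that keeps no lines hands to the vectorizer: blank messages.
empty_messages = ["", "", ""]
# What it would see if the text had actually been read.
real_messages = ["Two guys walk into a bar. The third guy ducks."]

empty_vocab = build_vocabulary(empty_messages)
real_vocab = build_vocabulary(real_messages)

print(len(empty_vocab))    # no tokens at all, hence "empty vocabulary"
print(sorted(real_vocab))  # single-character words such as "a" are dropped
```

Printing `data['message'].values` just before `fit_transform` is a quick way to check whether the messages are blank, or whether there are any rows at all.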
Here are some inputs from train_jokes:
"[me narrating a documentary about narrators] ""I can't hear what they're saying cuz I'm talking"""
"Telling my daughter garlic is good for you. Good immune system and keeps pests away.Ticks, mosquitos, vampires... men."
I've been going through a really rough period at work this week It's my own fault for swapping my tampax for sand paper.
"If I could have dinner with anyone, dead or alive... ...I would choose alive. -B.J. Novak-"
Two guys walk into a bar. The third guy ducks.
Why can't Barbie get pregnant? Because Ken comes in a different box. Heyooooooo
Why was the musician arrested? He got in treble.
Did you hear about the guy who blew his entire lottery winnings on a limousine? He had nothing left to chauffeur it.
What do you do if a bird shits on your car? Don't ask her out again.
He was a real gentlemen and always opened the fridge door for me
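train_jokes contains about 250,000 one-liners or tweets, and train_non_jokes contains simple sentences that are not funny. At the moment I have not prepared the non-funny file properly; it only has some sentences from Twitter.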
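The problem is not in the code but in the training data. First, G:/PyCharmProjects/naive_bayes_classifier/train_jokes and G:/PyCharmProjects/naive_bayes_classifier/train_non_jokes must be paths to directories that contain the training-data files (so train_jokes and train_non_jokes are directories). Second, my files did not contain a blank line, so the variable inBody was always False and no line was ever added to the message. For the program to work as written, the training data needs to look like this:

text here and then blank line

another text
and this is it

(I simply removed the references to inBody, which solved the blank-line problem.) These are details I missed when watching that video, because he did not mention them. Thanks everyone for the answers; they helped a lot.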
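The two behaviours described above can be compared with a small standard-library sketch. It is not the exact code from the original tutorial: the function names, the temporary directory and the file name `jokes.txt` are hypothetical stand-ins for the real train_jokes directory. The first reader keeps the header-skipping logic, the second drops inBody and takes the whole file as the message.

```python
import io
import os
import tempfile


def read_files_skipping_header(path):
    """Original behaviour: keep only the lines after the first blank line."""
    for root, _dirnames, filenames in os.walk(path):
        for filename in filenames:
            file_path = os.path.join(root, filename)
            in_body = False
            lines = []
            with io.open(file_path, 'r', encoding='latin1') as f:
                for line in f:
                    if in_body:
                        lines.append(line)
                    elif line == '\n':
                        in_body = True
            yield file_path, '\n'.join(lines)


def read_files_whole(path):
    """Variant without inBody: every line of the file is part of the message."""
    for root, _dirnames, filenames in os.walk(path):
        for filename in filenames:
            file_path = os.path.join(root, filename)
            with io.open(file_path, 'r', encoding='latin1') as f:
                yield file_path, f.read()


# Hypothetical stand-in for train_jokes: a directory holding one file
# that contains no blank line.
with tempfile.TemporaryDirectory() as tmp:
    with io.open(os.path.join(tmp, 'jokes.txt'), 'w', encoding='latin1') as f:
        f.write('Two guys walk into a bar. The third guy ducks.\n'
                'Why was the musician arrested? He got in treble.\n')
    skipped = [message for _path, message in read_files_skipping_header(tmp)]
    whole = [message for _path, message in read_files_whole(tmp)]

print(skipped)        # the only message is an empty string
print(len(whole[0]))  # the full text was read
```

Note that `os.walk` yields nothing when the path is a file instead of a directory, so in that case the DataFrame ends up with no rows and the vectorizer again has nothing to build a vocabulary from.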