How can I make nltk.NaiveBayesClassifier.train() work with my dictionary
I'm currently building a simple spam/ham email filter using Naive Bayes.
To help you understand my algorithm's logic: I have a folder containing many files, each of which is an example of a spam or ham email. In that folder I also have two other files, one containing the names of all my spam example files and the other containing the names of all my ham example files. I organized it this way so I can open and read the emails properly.
I put all the words I consider important into a dictionary structure, labeling each one "spam" or "ham" depending on which kind of file I extracted it from.
Then I use nltk.NaiveBayesClassifier.train() to train my classifier, but I get the error:
for featureset, label in labeled_featuresets:
ValueError: too many values to unpack
I don't know why this happens. When I looked for a solution, I found that strings are not hashable, and I was using a list for this; I then turned it into a dictionary, which as far as I know is hashable, but the error keeps appearing.
Does anyone know how to solve this? Thanks!
All my code is listed below:
import nltk
import re
import random

stopwords = nltk.corpus.stopwords.words('english')  # words I should avoid, since they have weak value for classification

my_file = open("spam_files.txt", "r")  # my_file now has the name of each file that contains a spam email example
word = {}  # a dictionary where I will store all the words and which label they have (spam or ham)

for lines in my_file:  # for each file name (represented by LINES) in my_file
    with open(lines.rsplit('\n')[0]) as email:  # open the file pointed to by LINES, then read the email example inside it
        for phrase in email:  # take every phrase of the email example I just opened
            try:  # and try to tokenize it
                tokens = nltk.word_tokenize(phrase)
            except:
                continue  # ignore non-ascii elements
            for c in tokens:  # for each token
                regex = re.compile('[^a-zA-Z]')  # also exclude numbers
                c = regex.sub('', c)
                if (c):  # if there is any element left
                    if (c not in stopwords):  # and if this element is not a stopword
                        c.lower()
                        word.update({c: 'spam'})  # put this element in my dictionary; since I'm analysing spam examples, C is labeled "spam"
my_file.close()
email.close()

# The same logic is used for the ham emails. Since my ham emails contain only ascii elements, I don't wrap them in TRY
my_file = open("ham_files.txt", "r")
for lines in my_file:
    with open(lines.rsplit('\n')[0]) as email:
        for phrase in email:
            tokens = nltk.word_tokenize(phrase)
            for c in tokens:
                regex = re.compile('[^a-zA-Z]')
                c = regex.sub('', c)
                if (c):
                    if (c not in stopwords):
                        c.lower()
                        word.update({c: 'ham'})
my_file.close()
email.close()

# And here I train my classifier
classifier = nltk.NaiveBayesClassifier.train(word)
classifier.show_most_informative_features(5)
nltk.NaiveBayesClassifier.train() expects a "list of tuples (featureset, label)" (see the documentation of the train() method).
What is not mentioned there is that featureset should be a dictionary mapping feature names to feature values.
That is also where your error comes from: you pass a plain dict, so train() iterates over its keys and tries to unpack each key (a string) into (featureset, label), which raises "too many values to unpack".
So, in typical spam/ham classification with a bag-of-words model, the labels are 'spam'/'ham' (or 1/0, or True/False), the feature names are the occurring words, and the feature values are the number of times each word occurs.
For example, the argument to the train() method might look like this:
[({'greetings': 1, 'loan': 2, 'offer': 1}, 'spam'),
 ({'money': 3}, 'spam'),
 ...
 ({'dear': 1, 'meeting': 2}, 'ham'),
 ...
]
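As a rough sketch of how your two loops could be reshaped to build such a list, one (featureset, label) tuple per email, reusing your spam_files.txt / ham_files.txt index files (the helper name make_featureset is just an illustrative choice, not part of NLTK):

import nltk
import re
from collections import Counter

stopwords = set(nltk.corpus.stopwords.words('english'))
regex = re.compile('[^a-zA-Z]')

def make_featureset(email):
    # Build one bag-of-words featureset (word -> count) for a single open email file.
    counts = Counter()
    for phrase in email:
        try:
            tokens = nltk.word_tokenize(phrase)
        except Exception:
            continue  # skip lines that fail to tokenize
        for token in tokens:
            token = regex.sub('', token).lower()  # keep letters only, lowercased
            if token and token not in stopwords:
                counts[token] += 1
    return dict(counts)

labeled_featuresets = []
for list_file, label in (("spam_files.txt", 'spam'), ("ham_files.txt", 'ham')):
    with open(list_file) as my_file:
        for line in my_file:
            with open(line.rstrip('\n')) as email:
                labeled_featuresets.append((make_featureset(email), label))

classifier = nltk.NaiveBayesClassifier.train(labeled_featuresets)
classifier.show_most_informative_features(5)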
If your dataset is rather small, you might want to replace the actual word counts with 1, to reduce data sparsity.
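With a sketch like the one above, that is a one-line change, e.g. via a (hypothetical) wrapper:

def make_binary_featureset(email):
    # record presence (always 1) instead of frequency, to reduce sparsity on small data
    return {w: 1 for w in make_featureset(email)}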