Naive Bayes Classifier and training data
I'm doing sentiment analysis on some tweets using nltk's Naive Bayes classifier. I'm training on the corpus file found here, https://towardsdatascience.com/creating-the-twitter-sentiment-analysis-program-in-python-with-naive-bayes-classification-672e5589a7ed, and using the method described there.
When building the training set I used all ~4000 tweets in the dataset, but I also thought I'd test with a very small number, 30.
When training on the full set and running the classifier on a new batch of tweets, it only ever returns 'neutral' as the label; when training on 30, it only returns positive. Does this mean my training data is incomplete, or too heavily 'weighted' with neutral entries, and is that why my classifier only returns neutral when I use all ~4000 tweets in my training set?
I've included my full code below.
import csv
import time

import twitter

twitter_api = twitter.Api(consumer_key=consumer_key,
                          consumer_secret=consumer_secret,
                          access_token_key=access_token,
                          access_token_secret=access_token_secret)
# Test set builder
def buildtestset(keyword):
    try:
        min_id = None
        tweets = []
        ids = []
        for i in range(0, 50):
            tweetsdata = twitter_api.GetSearch(keyword, count=100, max_id=min_id)
            for t in tweetsdata:
                tweets.append(t)
                ids.append(t.id)
            # Page backwards: next call fetches tweets older than the oldest seen
            min_id = min(ids) - 1
        print(str(len(tweets)) + ' tweets found for keyword: ' + keyword)
        return [{"text": status.text, "label": None} for status in tweets]
    except Exception as e:
        print('Search failed: ' + str(e))
        return None
# Quick test
keyword = 'bicycle'
testdataset = buildtestset(keyword)
# Training set builder
def buildtrainingset(corpusfile, tweetdata):
    # corpusfile = path to the corpus data
    # tweetdata = path to the file we are going to save all the tweets to
    corpus = []
    with open(corpusfile, 'r') as csvfile:
        linereader = csv.reader(csvfile, delimiter=',', quotechar='"')
        for row in linereader:
            # Append every tweet from corpusfile to our corpus list
            corpus.append({'tweet_id': row[2], 'label': row[1], 'topic': row[0]})
    # These are set up so we call the API slowly enough to stay within
    # Twitter's guidelines (180 calls per 900-second window)
    rate_limit = 180
    sleep_time = 900 / rate_limit
    # The rest calls the API for every tweet to get the status object and its
    # text, then puts it in our data set - trainingdata
    trainingdata = []
    count = 0
    for tweet in corpus:
        if count < 30:  # cap for the small-sample test; raise/remove to fetch the full corpus
            try:
                status = twitter_api.GetStatus(tweet['tweet_id'])
                print('Tweet fetched ' + status.text)
                tweet['text'] = status.text
                trainingdata.append(tweet)
                time.sleep(sleep_time)
                count += 1
            except Exception:
                count += 1
                continue
    # Write the fetched tweets to an empty csv
    with open(tweetdata, 'w', encoding='utf-8', newline='') as csvfile:
        linewriter = csv.writer(csvfile, delimiter=',', quotechar='"')
        for tweet in trainingdata:
            try:
                linewriter.writerow([tweet['tweet_id'], tweet['text'], tweet['label'], tweet['topic']])
            except Exception as e:
                print(e)
    return trainingdata
corpusfile = r'C:\Users\zacda\OneDrive\Desktop\DATA2901\Assignment\corpusmaster.csv'
tweetdata = r'C:\Users\zacda\OneDrive\Desktop\DATA2901\Assignment\tweetdata.csv'
TrainingData = buildtrainingset(corpusfile,tweetdata)
import re  # regular expression library
from string import punctuation

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

class preprocesstweets:
    def __init__(self):
        self._stopwords = set(stopwords.words('english') + list(punctuation) + ['AT_USER', 'URL'])

    def processtweets(self, list_of_tweets):
        processedtweets = []
        for tweet in list_of_tweets:
            processedtweets.append((self._processtweet(tweet["text"]), tweet["label"]))
        return processedtweets

    def _processtweet(self, tweet):
        tweet = tweet.lower()  # convert text to lower-case
        tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)  # replace URLs with a placeholder
        tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)  # replace usernames with a placeholder
        tweet = re.sub(r'#([^\s]+)', r'\1', tweet)  # remove the # in #hashtag, keep the word
        tweet = word_tokenize(tweet)  # split the tweet into word tokens
        return [word for word in tweet if word not in self._stopwords]

tweetprocessor = preprocesstweets()
processedtrainingdata = tweetprocessor.processtweets(TrainingData)
processedtestdata = tweetprocessor.processtweets(testdataset)
# Build the vocabulary: word_features is the list of distinct words in the
# training set, taken from a frequency distribution over all tokens
import nltk

def buildvocab(processedtrainingdata):
    all_words = []
    for (words, sentiment) in processedtrainingdata:
        all_words.extend(words)
    wordlist = nltk.FreqDist(all_words)
    word_features = wordlist.keys()
    return word_features

def extract_features(tweet):
    tweet_words = set(tweet)
    features = {}
    for word in word_features:
        # One boolean feature per vocabulary word:
        # True if the tweet contains it, False if not
        features['contains(%s)' % word] = (word in tweet_words)
    return features

# Building the feature vectors
word_features = buildvocab(processedtrainingdata)
training_features = nltk.classify.apply_features(extract_features, processedtrainingdata)
# apply_features does the actual extraction
# Naive Bayes Classifier
Nbayes = nltk.NaiveBayesClassifier.train(training_features)
Nbayes_result_labels = [Nbayes.classify(extract_features(tweet[0])) for tweet in processedtestdata]

# Get the majority vote over the predicted labels
if Nbayes_result_labels.count('positive') > Nbayes_result_labels.count('negative'):
    print('Positive sentiment')
    print(str(100 * Nbayes_result_labels.count('positive') / len(Nbayes_result_labels)))
elif Nbayes_result_labels.count('negative') > Nbayes_result_labels.count('positive'):
    print('Negative sentiment')
    print(str(100 * Nbayes_result_labels.count('negative') / len(Nbayes_result_labels)))
else:
    print('Neutral')
When doing machine learning, we want to learn an algorithm that performs well on new (unseen) data. This is called generalization.
One purpose of the test set is to verify the generalization behavior of your classifier. If your model predicts the same label for every test instance, we cannot confirm that hypothesis. The test set should be representative of the conditions in which you will apply the classifier later.
As a rule of thumb, I'd say you keep 25-50% of your data as a test set. This depends on the situation, of course. 30/4000 is less than one percent.
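A holdout split along those lines can be sketched in plain Python; the `train_test_split` helper, the 25% fraction, and the toy data below are illustrative, not part of the original code:

```python
import random

def train_test_split(examples, test_fraction=0.25, seed=42):
    """Shuffle (features, label) pairs and split off a held-out test set."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)  # fixed seed for a reproducible split
    cut = int(len(examples) * (1 - test_fraction))
    return examples[:cut], examples[cut:]

# Toy data standing in for processedtrainingdata
data = [(['word%d' % i], 'positive' if i % 2 else 'negative') for i in range(100)]
train, test = train_test_split(data)
print(len(train), len(test))  # 75 25
```

Training on `train` and evaluating on `test` then gives an estimate that is not contaminated by the data the classifier has already seen.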
The second thing that comes to mind: when your classifier is biased towards one class, make sure each class is represented nearly equally in the training and validation sets. This prevents the classifier from 'just' learning the distribution of the whole set instead of learning which features are relevant.
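A quick way to check for that bias is to count the labels before training; a minimal sketch (the `label_distribution` helper and sample data are mine, not from the original code):

```python
from collections import Counter

def label_distribution(labeled_tweets):
    """Count how often each sentiment label occurs in (tokens, label) pairs."""
    return Counter(label for _tokens, label in labeled_tweets)

sample = [(['good', 'ride'], 'positive'),
          (['flat', 'tyre'], 'negative'),
          (['new', 'bike'], 'neutral'),
          (['bike', 'lane'], 'neutral')]
print(label_distribution(sample))
```

If one label dominates the counts, the classifier can reach a low error rate by always predicting it, which matches the "only neutral" symptom.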
As a final note, we usually report metrics such as precision, recall and Fβ=1 to evaluate our classifier. The code in your example seems to report something based on the global sentiment over all tweets; are you sure that is what you want? Are the tweets representative?
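Those per-class metrics are straightforward to compute from parallel lists of gold and predicted labels; a sketch (the function name and the toy labels are illustrative):

```python
def precision_recall_f1(gold, predicted, cls):
    """Per-class precision, recall and F1 from parallel label lists."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gold, predicted) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gold, predicted) if g == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = ['positive', 'negative', 'neutral', 'positive']
pred = ['positive', 'neutral', 'neutral', 'negative']
print(precision_recall_f1(gold, pred, 'positive'))  # (1.0, 0.5, 0.6666666666666666)
```

Reporting these per class (rather than one global vote) makes it immediately visible when the classifier collapses onto a single label.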