TextBlob NaiveBayesAnalyzer extremely slow (compared to Pattern)

Question

I am using TextBlob for Python to run sentiment analysis on tweets. The default analyzer in TextBlob is PatternAnalyzer, which works reasonably well and is reasonably fast.

sent = TextBlob(tweet.decode('utf-8')).sentiment

I have now tried switching to NaiveBayesAnalyzer and found the runtime impractical for my needs (close to 5 seconds per tweet).

sent = TextBlob(tweet.decode('utf-8'), analyzer=NaiveBayesAnalyzer()).sentiment

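I have used the scikit-learn implementation of the Naive Bayes classifier before and did not find it this slow, so I am wondering whether I am using it correctly here.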

I assumed the analyzer is pretrained; at least the documentation states "Naive Bayes analyzer that is trained on a dataset of movie reviews." But it also has a train() function, described as "Train the Naive Bayes classifier on the movie review corpus." Does it train the analyzer internally before every run? I hope not.

Does anyone know a way to speed this up?

Answer 1

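Yes, TextBlob trains the analyzer before every run. You can use the following code to avoid training the analyzer each time: it creates the analyzer once and reuses it for every text.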

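from textblob import Blobber
from textblob.sentiments import NaiveBayesAnalyzer

# Create the analyzer once and share it across all blobs
tb = Blobber(analyzer=NaiveBayesAnalyzer())

# Each call reuses the same analyzer instance
print(tb("sentence you want to test").sentiment)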
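To illustrate why reusing one analyzer helps, here is a minimal sketch that uses a hypothetical stand-in class rather than TextBlob itself. `LazyTrainedAnalyzer`, its toy scoring rule, and the `train_calls` counter are all assumptions made for illustration; the only behaviour borrowed from the answer is that an analyzer trains itself the first time it is used.

```python
class LazyTrainedAnalyzer:
    """Hypothetical stand-in for an analyzer that trains itself on first use."""

    train_calls = 0  # class-wide counter, for illustration only

    def __init__(self):
        self._trained = False

    def train(self):
        # In a real analyzer this would be the expensive step
        # (loading a corpus and fitting a classifier).
        type(self).train_calls += 1
        self._trained = True

    def analyze(self, text):
        if not self._trained:
            self.train()
        # Toy scoring rule, not a real classifier.
        return "pos" if "good" in text.lower() else "neg"


tweets = ["Good game", "bad service", "really good coffee"]

# Pattern from the question: a fresh analyzer per tweet, so training repeats.
LazyTrainedAnalyzer.train_calls = 0
per_tweet = [LazyTrainedAnalyzer().analyze(t) for t in tweets]
trainings_per_tweet = LazyTrainedAnalyzer.train_calls

# Pattern from the answer: one shared analyzer, so training happens once.
LazyTrainedAnalyzer.train_calls = 0
shared = LazyTrainedAnalyzer()
reused = [shared.analyze(t) for t in tweets]
trainings_shared = LazyTrainedAnalyzer.train_calls

print(trainings_per_tweet, trainings_shared)
```

Both patterns give the same labels; only the number of training runs differs, which is where the per-tweet cost in the question would come from.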

Answer 2

Adding to Alan's very useful answer: if you have tabular data in a DataFrame and want to use TextBlob's NaiveBayesAnalyzer, this works. Just replace word_list with the name of your own string column.

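import pandas as pd
from textblob import Blobber
from textblob.sentiments import NaiveBayesAnalyzer

# Build the analyzer once; df is your existing DataFrame
tb = Blobber(analyzer=NaiveBayesAnalyzer())
for index, row in df.iterrows():
    sent = tb(row['word_list']).sentiment
    # sent is a tuple: (classification, p_pos, p_neg)
    df.loc[index, 'classification'] = sent[0]
    df.loc[index, 'p_pos'] = sent[1]
    df.loc[index, 'p_neg'] = sent[2]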

The above splits the tuple that sentiment returns into three separate columns.

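This works if the column contains only strings. If it has mixed data types, which can happen with the object dtype in pandas, you may want to wrap the call in a try/except block to catch exceptions.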
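The try/except idea can be sketched without pandas or TextBlob. Everything named here is hypothetical: `toy_classify` stands in for `tb(text).sentiment` and is assumed to raise `TypeError` on non-string input, and `classify_rows` is an illustrative helper, not part of either library.

```python
def toy_classify(text):
    # Hypothetical stand-in for tb(text).sentiment; rejects non-strings.
    if not isinstance(text, str):
        raise TypeError("text must be a string")
    label = "pos" if "good" in text.lower() else "neg"
    return (label, 0.5, 0.5)  # placeholder probabilities, not real output


def classify_rows(values, classify):
    """Classify each value, recording a row of Nones for unusable entries."""
    results = []
    for value in values:
        try:
            results.append(classify(value))
        except TypeError:
            # Non-text entry: keep the row but mark it as unclassified.
            results.append((None, None, None))
    return results


mixed = ["good movie", 42, None, "terrible plot"]
rows = classify_rows(mixed, toy_classify)
print(rows)
```

Catching only `TypeError` keeps unrelated failures visible; widen the except clause if your data raises something else.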

In my testing, it processed 1000 rows in about 4.7 seconds.

Hope this helps.
