After training my own classifier with nltk, how do I load it in textblob?

Question

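The built-in classifier in textblob is pretty dumb. It is trained on movie reviews, so I created a large set of examples from my own context (57,000 stories, each categorized as positive or negative) and trained on it using nltk. I first tried to train it with textblob, but that always failed: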

from textblob.classifiers import NaiveBayesClassifier

with open('train.json', 'r') as fp:
    cl = NaiveBayesClassifier(fp, format="json")

That would run for hours and end in a memory error.

I looked at the source and found it was just using nltk and wrapping it, so I used nltk directly instead, and that worked.

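The nltk training set needs to be a list of tuples. The first element of each tuple is a Counter of the words in the text and how often each appears. The second element is 'pos' or 'neg' for the sentiment.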

>>> train_set = [(Counter(i["text"].split()),i["label"]) for i in data[200:]]
>>> test_set = [(Counter(i["text"].split()),i["label"]) for i in data[:200]] # withholding 200 examples for testing later

>>> cl = nltk.NaiveBayesClassifier.train(train_set) # <-- this is the same thing textblob was using

>>> print("Classifier accuracy percent:",(nltk.classify.accuracy(cl, test_set))*100)
('Classifier accuracy percent:', 66.5)
>>> cl.show_most_informative_features(75)
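To make the shape of those tuples concrete, here is a minimal sketch of how one entry is built. The two records are hypothetical stand-ins for the real data, which is not shown, and `to_featureset` is only an illustrative helper name.

```python
from collections import Counter

# Hypothetical stand-in for the `data` list loaded from the JSON file.
data = [
    {"text": "great great story", "label": "pos"},
    {"text": "boring plot", "label": "neg"},
]


def to_featureset(record):
    """Turn one record into the (features, label) pair described above."""
    return Counter(record["text"].split()), record["label"]


train_set = [to_featureset(record) for record in data]

# Each entry is a (word-count mapping, label) tuple.
features, label = train_set[0]
print(dict(features), label)
```

Whatever feature extraction is used here has to be repeated, unchanged, on any text that is classified later.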

Then I pickled it:

with open('storybayes.pickle','wb') as f:
    pickle.dump(cl,f)
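Reading the classifier back is the mirror image of the dump. The following is a minimal sketch, assuming a throwaway classifier and a temporary directory so that it runs on its own; only the file name is taken from the snippet above.

```python
import os
import pickle
import tempfile
from collections import Counter

import nltk

# Throwaway classifier standing in for the real one trained above.
cl = nltk.NaiveBayesClassifier.train([
    (Counter("good story".split()), "pos"),
    (Counter("bad story".split()), "neg"),
])

# Temporary location, used here only to keep the sketch self-contained.
path = os.path.join(tempfile.mkdtemp(), "storybayes.pickle")
with open(path, "wb") as f:
    pickle.dump(cl, f)

# Pickle files must be opened in binary mode for reading as well.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(type(restored))
```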

Now... I took this pickled file, reopened it to get the nltk classifier (<class 'nltk.classify.naivebayes.NaiveBayesClassifier'>), and tried to feed it into textblob. Instead of

from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer
blob = TextBlob("I love this library", analyzer=NaiveBayesAnalyzer())

I tried:

blob = TextBlob("I love this library", analyzer=myclassifier)
Traceback (most recent call last):
  File "<pyshell#116>", line 1, in <module>
    blob = TextBlob("I love this library", analyzer=cl4)
  File "C:\python\lib\site-packages\textblob\blob.py", line 369, in __init__
    parser, classifier)
  File "C:\python\lib\site-packages\textblob\blob.py", line 323, in _initialize_models
    BaseSentimentAnalyzer, BaseBlob.analyzer)
  File "C:\python\lib\site-packages\textblob\blob.py", line 305, in _validated_param
    .format(name=name, cls=base_class_name))
ValueError: analyzer must be an instance of BaseSentimentAnalyzer

Now what? I looked at the source; both are classes, but they are not exactly the same.

Answer 1

Judging by the error message, the analyzer must inherit from the abstract class BaseSentimentAnalyzer. As the TextBlob docs state, this class must implement the analyze(text) function. However, while checking the docs of NLTK's implementation, I could not find this method in its main documentation or in its parent class ClassifierI. I therefore believe the two implementations cannot be combined unless you add an analyze function on top of NLTK's implementation to make it compatible with TextBlob's.

Answer 2

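I can't say for certain that an nltk corpus cannot be used with textblob, and it would surprise me if so, since textblob imports all of the nltk functions in its source code and is basically a wrapper.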

But after many hours of testing, I concluded that nltk offers a better built-in sentiment corpus called "vader", which outperformed all of my trained models.

import nltk
nltk.download('vader_lexicon') # do this once: grab the trained model from the web
from nltk.sentiment.vader import SentimentIntensityAnalyzer
Analyzer = SentimentIntensityAnalyzer()
Analyzer.polarity_scores("I find your lack of faith disturbing.")
{'neg': 0.491, 'neu': 0.263, 'pos': 0.246, 'compound': -0.4215}
CONCLUSION: NEGATIVE
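The conclusion line is read off the compound score. A minimal sketch of that step, assuming the commonly cited cutoff of plus or minus 0.05 (not something stated in this answer) and a hypothetical helper name `label_from_compound`:

```python
def label_from_compound(scores, threshold=0.05):
    """Map a polarity_scores() style dict to a coarse label.

    The 0.05 threshold is an assumed convention, not part of the answer.
    """
    compound = scores["compound"]
    if compound >= threshold:
        return "POSITIVE"
    if compound <= -threshold:
        return "NEGATIVE"
    return "NEUTRAL"


# Scores copied from the output shown above.
scores = {'neg': 0.491, 'neu': 0.263, 'pos': 0.246, 'compound': -0.4215}
print(label_from_compound(scores))
```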

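The vader_lexicon and the nltk code do much more parsing of negation in a sentence, so that positive words can be negated. When Darth Vader says "lack of faith", for example, the sentiment flips to its opposite.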

I explained this here, with examples of the better results: https://chewychunks.wordpress.com/2018/06/19/sentiment-analysis-discovering-the-best-way-to-sort-positive-and-negative-feedback/

That replaces this textblob implementation:

from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer
TextBlob("I find your lack of faith disturbing.", analyzer=NaiveBayesAnalyzer())
{'neg': 0.182, 'pos': 0.817, 'combined': 0.635}
CONCLUSION: POSITIVE

The vader nltk classifier also has additional documentation on using it for sentiment analysis: http://www.nltk.org/howto/sentiment.html

textBlob always crashed my computer, even with only 5,000 examples.

Answer 3

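Another, more forward-looking solution is to use spaCy to build the model instead of textblob or nltk. This is new to me, but it seems much easier to use and more powerful: https://spacy.io/usage/spacy-101#section-lightning-tour

"spaCy is the Ruby on Rails of natural language processing."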

import spacy
import random

nlp = spacy.load('en') # loads the trained starter model here
train_data = [("Uber blew through $1 million", {'entities': [(0, 4, 'ORG')]})] # better model stuff

with nlp.disable_pipes(*[pipe for pipe in nlp.pipe_names if pipe != 'ner']):
    optimizer = nlp.begin_training()
    for i in range(10):
        random.shuffle(train_data)
        for text, annotations in train_data:
            nlp.update([text], [annotations], sgd=optimizer)
nlp.to_disk('/model')

python

nltk

textblob

naivebayes