After training my own classifier with nltk, how do I load it in textblob?
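The built-in classifier in textblob is pretty dumb. It is trained on movie reviews, so I created a large set of examples from my own context (57,000 stories, labeled positive or negative) and then trained on it using nltk.
I first tried to train it with textblob, but it always failed: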
from textblob.classifiers import NaiveBayesClassifier

with open('train.json', 'r') as fp:
    cl = NaiveBayesClassifier(fp, format="json")
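That would run for hours and end in a memory error.
I looked at the source and found it was just using nltk and wrapping it, so I used nltk directly instead, and it worked.
The nltk training set needs to be a list of tuples. The first element of each tuple is a Counter of the words in the text and their frequencies. The second element is 'pos' or 'neg', indicating the sentiment.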
>>> train_set = [(Counter(i["text"].split()),i["label"]) for i in data[200:]]
>>> test_set = [(Counter(i["text"].split()),i["label"]) for i in data[:200]] # withholding 200 examples for testing later
>>> cl = nltk.NaiveBayesClassifier.train(train_set) # <-- this is the same thing textblob was using
>>> print("Classifier accuracy percent:",(nltk.classify.accuracy(cl, test_set))*100)
('Classifier accuracy percent:', 66.5)
>>> cl.show_most_informative_features(75)
Then I pickled it:
import pickle

with open('storybayes.pickle', 'wb') as f:
    pickle.dump(cl, f)
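Reading the classifier back is the mirror image of the dump. The following is a minimal sketch, not the original session: the four-example training set is a made-up stand-in for the real 57,000 stories, and the pickle is written to a temporary directory so the example is self-contained.

```python
import os
import pickle
import tempfile
from collections import Counter

from nltk.classify import NaiveBayesClassifier

# Hypothetical toy data standing in for the real training set.
toy_train = [
    (Counter("I love this wonderful story".split()), "pos"),
    (Counter("what a great story".split()), "pos"),
    (Counter("I hate this awful story".split()), "neg"),
    (Counter("what a terrible story".split()), "neg"),
]
cl = NaiveBayesClassifier.train(toy_train)

# Dump to disk, then load it back ('rb' for reading, 'wb' for writing).
path = os.path.join(tempfile.mkdtemp(), "storybayes.pickle")
with open(path, "wb") as f:
    pickle.dump(cl, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

# New text must be turned into the same kind of features used for training.
features = Counter("what a great library".split())
label = restored.classify(features)
print(type(restored), label)
```

The restored object is a plain nltk classifier, so it only understands feature dictionaries, not raw strings.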
Now... I took this pickled file and reopened it to get the nltk classifier (<class 'nltk.classify.naivebayes.NaiveBayesClassifier'>), and tried to feed it into textblob. Instead of
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer
blob = TextBlob("I love this library", analyzer=NaiveBayesAnalyzer())
I tried:
blob = TextBlob("I love this library", analyzer=myclassifier)
Traceback (most recent call last):
  File "<pyshell#116>", line 1, in <module>
    blob = TextBlob("I love this library", analyzer=cl4)
  File "C:\python\lib\site-packages\textblob\blob.py", line 369, in __init__
    parser, classifier)
  File "C:\python\lib\site-packages\textblob\blob.py", line 323, in _initialize_models
    BaseSentimentAnalyzer, BaseBlob.analyzer)
  File "C:\python\lib\site-packages\textblob\blob.py", line 305, in _validated_param
    .format(name=name, cls=base_class_name))
ValueError: analyzer must be an instance of BaseSentimentAnalyzer
Now what? I looked at the source. Both are classes, but they are not exactly the same.
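Looking at the error message, it seems the analyzer must inherit from the abstract class BaseSentimentAnalyzer. According to the TextBlob docs, this class must implement the analyze(text) function. However, while checking the docs of NLTK's implementation, I could not find this method in its main documentation or in that of its parent class, ClassifierI. Therefore, I believe the two implementations cannot be combined unless you implement a new analyze function on top of NLTK's implementation to make it compatible with TextBlob's.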
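I can't say for certain that an nltk corpus cannot be used with textblob, and it would surprise me if so, since textblob imports all of the nltk functions in its source code and is basically a wrapper.
But after many hours of testing, I concluded that nltk offers a better built-in sentiment corpus called "vader", which outperformed all of my trained models.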
import nltk
nltk.download('vader_lexicon') # do this once: grab the trained model from the web
from nltk.sentiment.vader import SentimentIntensityAnalyzer
Analyzer = SentimentIntensityAnalyzer()
Analyzer.polarity_scores("I find your lack of faith disturbing.")
{'neg': 0.491, 'neu': 0.263, 'pos': 0.246, 'compound': -0.4215}
CONCLUSION: NEGATIVE
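The vader_lexicon and the nltk code do much more parsing of negation language in a sentence in order to negate positive words. When Darth Vader says "lack of faith", for example, the sentiment flips to the opposite.
I explain it here, with examples of the better results: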
https://chewychunks.wordpress.com/2018/06/19/sentiment-analysis-discovering-the-best-way-to-sort-positive-and-negative-feedback/
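The "CONCLUSION" label is not part of what polarity_scores returns. It has to be derived from the compound score. A minimal sketch of one way to do that, assuming the cutoff of ±0.05 that VADER's own documentation commonly suggests (treat it as a value to tune); label_from_compound is a hypothetical helper name:

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse label.

    The 0.05 threshold is an assumed, conventional cutoff, not a fixed rule.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Scores copied from the polarity_scores output shown above.
scores = {'neg': 0.491, 'neu': 0.263, 'pos': 0.246, 'compound': -0.4215}
print(label_from_compound(scores['compound']))  # negative
```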
It replaces this textblob implementation:
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer
TextBlob("I find your lack of faith disturbing.", analyzer=NaiveBayesAnalyzer())
{'neg': 0.182, 'pos': 0.817, 'combined': 0.635}
CONCLUSION: POSITIVE
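The vader nltk classifier also has additional documentation on using it for sentiment analysis: http://www.nltk.org/howto/sentiment.html
textBlob always crashed my computer, even with as few as 5,000 examples.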
Another, more forward-looking solution is to use spaCy to build the model instead of textblob or nltk. It is new to me, but it seems easier to use and more powerful:
https://spacy.io/usage/spacy-101#section-lightning-tour
"spaCy is the Ruby on Rails of natural language processing."
import spacy
import random

# Note: this is the spaCy 2.x API ('en' shortcut, nlp.update(texts, annotations)).
nlp = spacy.load('en')  # loads the trained starter model here
train_data = [("Uber blew through million", {'entities': [(0, 4, 'ORG')]})]  # better model stuff

# Train only the 'ner' component; all other pipes are disabled inside the block.
with nlp.disable_pipes(*[pipe for pipe in nlp.pipe_names if pipe != 'ner']):
    optimizer = nlp.begin_training()
    for i in range(10):
        random.shuffle(train_data)
        for text, annotations in train_data:
            nlp.update([text], [annotations], sgd=optimizer)
nlp.to_disk('/model')