Python; Using NGram sentiment analysis - cannot get top 5 words

I set up my CountVectorizer as follows:

cv = CountVectorizer(binary=True)
X = cv.fit_transform(train_text)
X_test = cv.transform(test_text)

When I use an SVM I can print out the top 5 words from my sentiment analysis:

final_svm  = LinearSVC(C=best_c)
final_svm.fit(X, target)
final_accuracy = final_svm.predict(X_test)
final_accuracy_score = accuracy_score(target_test, final_accuracy)
print ("Final SVM Accuracy: %s" % final_accuracy_score)
Report_Matricies.accuracy(target_test, final_accuracy)
feature_names = zip(cv.get_feature_names(), final_svm.coef_[0])
feature_to_coef = {
    word: coef for word, coef in feature_names
}
itemz = feature_to_coef.items()
list_positive = sorted(
    itemz, 
    key=lambda x: x[1], 
    reverse=True)[:number_we_are_interested_in]

That works fine. But when I try similar code for NGrams, I get random words:

ngram_vectorizer = CountVectorizer(binary=True, ngram_range=(1, no_of_words))
X = ngram_vectorizer.fit_transform(train_text)
X_test = ngram_vectorizer.transform(test_text)
best_c = Logistic_Regression.get_best_hyperparameter(X_train, y_train, y_val, X_val)
final_ngram = LogisticRegression(C=best_c)
final_ngram.fit(X, target)
final_accuracy = final_ngram.predict(X_test)
final_accuracy_score = accuracy_score(target_test, final_accuracy)
print ("Final NGram Accuracy: %s" % final_accuracy_score)
Report_Matricies.accuracy(target_test, final_accuracy)
feature_names = zip(cv.get_feature_names(), final_ngram.coef_[0])
feature_to_coef = {
    word: coef for word, coef in feature_names
}
itemz = feature_to_coef.items()
list_positive = sorted(
    itemz, 
    key=lambda x: x[1], 
    reverse=True)

The accuracy of my NGram analysis is similar to the SVM's, so the code I'm using for the NGrams doesn't seem to extract the kind of words I want, i.e. they are random words rather than positive ones. What code should I use instead? Similar code can be found in this reference, but the example in Part 2 does not print the top 5 words for NGrams: https://towardsdatascience.com/sentiment-analysis-with-python-part-1-5ce197074184

It looks like you did a little too much copy/paste when implementing the logistic regression model trained on ngrams. When you grab the feature_names from this model you are using the binary CountVectorizer cv instead of ngram_vectorizer. I think you need to change the line

feature_names = zip(cv.get_feature_names(), final_ngram.coef_[0])

to

feature_names = zip(ngram_vectorizer.get_feature_names(), final_ngram.coef_[0])
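
For completeness, here is a minimal sketch of what the corrected tail of that block would look like (assuming, as in your SVM version, that you want the top 5):

feature_names = zip(ngram_vectorizer.get_feature_names(), final_ngram.coef_[0])
feature_to_coef = {
    word: coef for word, coef in feature_names
}
#Top 5 n-grams with the most positive coefficients
list_positive = sorted(
    feature_to_coef.items(),
    key=lambda x: x[1],
    reverse=True)[:5]
print(list_positive)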

As aberger has already answered, perhaps you should replace:

  • "feature_names = zip(cv.get_feature_names(), final_ngram.coef_[0])" with
  • "feature_names = zip(ngram_vectorizer.get_feature_names(), final_ngram.coef_[0])"
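
(The reason this matters: the coefficients in final_ngram.coef_[0] are ordered by the feature indices of the vectorizer the model was trained on, so zipping them with the feature names of a different vectorizer pairs words with the wrong coefficients.)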

Some additional remarks

In NLP, n-grams refer to treating N consecutive words as a single token. They are used to "tokenize" your text corpus so that it becomes usable by a machine learning algorithm, but they have nothing to do with the algorithm itself.
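
A minimal sketch of this, with a made-up sentence (note that in scikit-learn >= 1.2, get_feature_names() has been replaced by get_feature_names_out()):

from sklearn.feature_extraction.text import CountVectorizer

text = ["the five boxing wizards jump quickly"]

#Unigrams only: every single word becomes a token
print(CountVectorizer().fit(text).get_feature_names())
#['boxing', 'five', 'jump', 'quickly', 'the', 'wizards']

#Unigrams and bigrams: every pair of consecutive words becomes a token too
print(CountVectorizer(ngram_range=(1, 2)).fit(text).get_feature_names())
#['boxing', 'boxing wizards', 'five', 'five boxing', 'jump', 'jump quickly', ...]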

SVM and logistic regression are two different algorithms, mostly used for classification (logistic regression is a kind of regression used for separation; it is the way we use it that turns this regression into a classification algorithm).
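
A minimal sketch of that last point, with made-up 1-D data: LogisticRegression fits a probability, and it is the thresholding of that probability at 0.5 that turns the regression into a classifier:

from sklearn.linear_model import LogisticRegression

X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]

clf = LogisticRegression(C=1.0).fit(X, y)
print(clf.predict_proba([[1.5]]))  #the regression part: class probabilities
print(clf.predict([[1.5]]))        #the classification part: probability thresholded at 0.5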

I have tried to illustrate this with meaningless data (which you can replace with your own), so that you can run this code directly and observe the results.

As you can see, using NGrams gives almost the same top words, apart from one bigram and one trigram, on my own run:

  • Logistic regression without NGrams: [('the', 0.22492305532420143), ('boxing', 0.22366726197682427), ('jump', 0.22366726197682427), ('wizards', 0.22366726197682427), ('five', 0.21116962061694416)]
  • Logistic regression with NGrams: [('the', 0.1549468448457053), ('five', 0.15263348614045338), ('boxing', 0.12657434061922093), ('boxing wizards', 0.12657434061922093), ('boxing wizards jump', 0.12657434061922093)]
  • Logistic regression with NGrams, but sorting only the unigrams: [('the', 0.1549468448457053), ('five', 0.15263348614045338), ('boxing', 0.12657434061922093), ('jump', 0.12657434061922093), ('wizards', 0.12657434061922093)] <- gives almost the same thing as "Logistic regression without NGrams" (not exactly the same, since the model learned with different tokens, i.e. the extra NGrams here)
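
Here is the full runnable example:
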
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

text_train = ["The quick brown fox jumps over a lazy dog",
        "Pack my box with five dozen liquor jugs",
        "How quickly daft jumping zebras vex",
        "The five boxing wizards jump quickly",
        "the fox of my friend it the most lazy one I have seen in the past five years"]

text_test = ["just for a test"]

target_train = [1, 1, 0, 1, 0]

target_test = [1]

#######################################################################
##       OBSERVING TOKENIZATION OF DATA WITH AND WITHOUT NGRAMS      ##
#######################################################################

## WITHOUT NGRAMS

cv = CountVectorizer()
count_vector = cv.fit_transform(text_train)
#Display the dictionary pairing each single word and its position in the
#"vectorized" version of our text corpus, without any count.
print("")
print(cv.vocabulary_)
print("")
print("")
print(dict(zip(cv.get_feature_names(), count_vector.toarray().sum(axis=0))))

##  WITH NGRAMS

#Now let's also add as meaningful entities all pairs and all trios of words
#using NGrams
cv = CountVectorizer(ngram_range=(1,3))
count_vector = cv.fit_transform(text_train)
#Observe that now, "jump quickly" and "boxing wizards jump" for instance are 
#considered as sort of meaningful unique "words" composed of several single
#words.
print("")
print("")
print(cv.vocabulary_)
print("")
print("")
#List all words and count their occurrences
print(dict(zip(cv.get_feature_names(), count_vector.toarray().sum(axis=0))))

#######################################################################
##                    YOUR ATTEMPT WITH LINEARSVC                    ##
#######################################################################
cv1 = CountVectorizer(binary=True)
count_vector_train = cv1.fit_transform(text_train)
count_vector_test = cv1.transform(text_test)

final_svm  = LinearSVC(C=1.0)
final_svm.fit(count_vector_train, target_train)
final_accuracy = final_svm.predict(count_vector_test)
final_accuracy_score = accuracy_score(target_test, final_accuracy)
print("")
print("")
print ("Final SVM without NGrams Accuracy: %s" % final_accuracy_score)
feature_names = zip(cv1.get_feature_names(), final_svm.coef_[0])
feature_to_coef = {
    word: coef for word, coef in feature_names
}
itemz = feature_to_coef.items()
list_positive = sorted(
    itemz, 
    key=lambda x: x[1], 
    reverse=True)[:5] #Here you can choose the top 5
print("")
print("SVM without NGrams")
print(list_positive)

#######################################################################
##              YOUR ATTEMPT WITH LOGISTIC REGRESSION                ##
#######################################################################
cv2 = CountVectorizer(binary=True)
count_vector_train = cv2.fit_transform(text_train)
count_vector_test = cv2.transform(text_test)

final_lr  = LogisticRegression(C=1.0)
final_lr.fit(count_vector_train, target_train)
final_accuracy = final_lr.predict(count_vector_test)
final_accuracy_score = accuracy_score(target_test, final_accuracy)
print("")
print("")
print ("Final Logistic regression without NGrams Accuracy: %s" % final_accuracy_score)
feature_names = zip(cv2.get_feature_names(), final_lr.coef_[0])
feature_to_coef = {
    word: coef for word, coef in feature_names
}
itemz = feature_to_coef.items()
list_positive = sorted(
    itemz, 
    key=lambda x: x[1], 
    reverse=True)[:5] #Here you can choose the top 5
print("")
print("Logistic regression without NGrams")
print(list_positive)

#######################################################################
##         YOUR ATTEMPT WITH LOGISTIC REGRESSION AND NGRAMS          ##
#######################################################################
cv3 = CountVectorizer(binary=True, ngram_range=(1,3))
count_vector_train = cv3.fit_transform(text_train)
count_vector_test = cv3.transform(text_test)

final_lr  = LogisticRegression(C=1.0)
final_lr.fit(count_vector_train, target_train)
final_accuracy = final_lr.predict(count_vector_test)
final_accuracy_score = accuracy_score(target_test, final_accuracy)
print("")
print("")
print ("Final Logistic regression with NGrams Accuracy: %s" % final_accuracy_score)
feature_names = zip(cv3.get_feature_names(), final_lr.coef_[0])
feature_to_coef = {
    word: coef for word, coef in feature_names
}
itemz = feature_to_coef.items()
list_positive = sorted(
    itemz, 
    key=lambda x: x[1], 
    reverse=True)[:5] #Here you can choose the top 5
print("")
print("Logistic regression with NGrams")
print(list_positive)

#######################################################################
##         YOUR ATTEMPT WITH LOGISTIC REGRESSION AND NGRAMS          ##
##                BUT EXTRACTS ONLY REAL UNIQUE WORDS                ##
#######################################################################
cv4 = CountVectorizer(binary=True, ngram_range=(1,3))
count_vector_train = cv4.fit_transform(text_train)
count_vector_test = cv4.transform(text_test)

final_lr  = LogisticRegression(C=1.0)
final_lr.fit(count_vector_train, target_train)
final_accuracy = final_lr.predict(count_vector_test)
final_accuracy_score = accuracy_score(target_test, final_accuracy)
print("")
print("")
print ("Final Logistic regression with NGrams Accuracy: %s" % final_accuracy_score)
feature_names = zip(cv4.get_feature_names(), final_lr.coef_[0])
feature_names_unigrams = [(a, b) for a, b in feature_names if len(a.split()) < 2]
feature_to_coef = {
    word: coef for word, coef in feature_names_unigrams
}
itemz = feature_to_coef.items()

list_positive = sorted(
    itemz,
    key=lambda x: x[1], 
    reverse=True)[:5] #Here you can choose the top 5
print("")
print("Logistic regression with NGrams but only getting unigrams")
print(list_positive)
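
And if you ever want the opposite selection, the same filter can be inverted to keep only the multi-word n-grams (a sketch reusing cv4 and final_lr from the last block):

#Keep only features made of two or more words (bigrams and trigrams)
feature_names = zip(cv4.get_feature_names(), final_lr.coef_[0])
feature_names_multigrams = [(a, b) for a, b in feature_names if len(a.split()) >= 2]

list_positive_multigrams = sorted(
    feature_names_multigrams,
    key=lambda x: x[1],
    reverse=True)[:5] #Here you can choose the top 5
print("")
print("Logistic regression with NGrams but only getting bigrams and trigrams")
print(list_positive_multigrams)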