Informative Features Code not Working

I want to implement a most informative features function for binary NB in SciKit Learn. I am using Python 3.

First off, I know questions have been asked about implementing some kind of 'informative features' function for SciKit's multinomial NB. However, I have tried those answers with no luck, so I think either SciKit has been updated or I am doing something wrong. I am using the function from tobigue's answer here.

from nltk.corpus import stopwords
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split



#Array contains a list of (headline, source) tuples where there are two sources. 
#I want to classify each headline as belonging to a given source. 
array = [('toyota showcases humanoid that mirrors user', 'drudge'), ('virginia again delays vote certification after error in ballot distribution', 'npr'), ("do doctors need to use computers? one physician's case highlights the quandary", 'npr'), ('office sex summons', 'drudge'), ('launch calibrated to avoid military response?', 'drudge'), ('snl skewers alum al franken, trump sons', 'npr'), ('mulvaney shows up for work at consumer watchdog group, as leadership feud deepens', 'npr'), ('indonesia tries to evacuate 100,000 people away from erupting volcano on bali', 'npr'), ('downing street blasts', 'drudge'), ('stocks soar more; records smashed', 'drudge'), ('aid begins to filter back into yemen, as saudi-led blockade eases', 'npr'), ('just look at these fancy port-a-potties', 'npr'), ('nyt turns to twitter activism to thwart', 'drudge'), ('uncertainty reigns in battle for virginia house of delegates', 'npr'), ('u.s. reverses its decision to close palestinian office in d.c.', 'npr'), ("'i don't believe in science,' says flat-earther set to launch himself in own rocket", 'npr'), ("bosnian war chief 'dies' after being filmed 'drinking poison' at the hague", 'drudge'), ('federal judge blocks new texas anti-abortion law', 'npr'), ('gm unveils driverless cars, aiming to lead pack', 'drudge'), ('in japan, a growing scandal over companies faking product-quality data', 'npr')]


#I want to classify each headline as belonging to a given source. 
def scikit_naivebayes(data_array):
    headlines = [element[0] for element in data_array]
    sources = [element[1] for element in data_array]
    text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')), ('tfidf', TfidfTransformer()),('clf', MultinomialNB())])
    cf1 = text_clf.fit(headlines, sources)
    train(cf1,headlines,sources)

    #Call most_informative_features function on CountVectorizer and classifier
    show_most_informative_features(CountVectorizer, cf1)


def train(classifier, X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=33)
    classifier.fit(X_train, y_train)
    print ("Accuracy: {}".format(classifier.score(X_test, y_test)))


#tobigue's code: 
def show_most_informative_features(vectorizer, clf, n=20):
    feature_names = vectorizer.get_feature_names()
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print ("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))


def main():
    scikit_naivebayes(array)


main()

#ERROR: 
# File "file_path_here", line 34, in program_name
# feature_names = vectorizer.get_feature_names()
# TypeError: get_feature_names() missing 1 required positional argument: 'self'

You need to fit the CountVectorizer before calling vectorizer.get_feature_names(). In your code, you only call the other function with the class CountVectorizer, which won't lead anywhere.

You should try to create a vectorizer with CountVectorizer independently of your pipeline, call fit on your text, and then use the function already provided, although you should adapt it further to your own problem.

It should be easy to see that the function you are using expects an instantiated object, not the class. Tell me if you don't.
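As a minimal sketch of that class-vs-instance point (the two-headline docs list here is just illustrative data, not your full array):

from sklearn.feature_extraction.text import CountVectorizer

docs = ['toyota showcases humanoid', 'virginia again delays vote']  #illustrative sample

#Calling the method on the class itself reproduces the error from your traceback:
#CountVectorizer.get_feature_names()
#  -> TypeError: get_feature_names() missing 1 required positional argument: 'self'

#Instantiate and fit first, then the learned vocabulary is available:
vectorizer = CountVectorizer(stop_words='english')
vectorizer.fit(docs)
print(vectorizer.get_feature_names())
#['delays', 'humanoid', 'showcases', 'toyota', 'virginia', 'vote']
#(newer sklearn releases rename this to get_feature_names_out())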

EDIT

coef_ is an attribute only accessible from an estimator, i.e. a classifier (and not all of them). Pipeline is a sklearn object used to combine different steps in order to feed a classifier. Typically, a bag-of-words pipeline is made of a feature extractor and a classifier (here, logistic regression):

pipeline = Pipeline([
    ('vectorizer', CountVectorizer(args)),
    ('classifier', LogisticRegression())
])

So in your case, you should either avoid using a pipeline (which is what I recommend you start with), or use the pipeline's get_params() method to access the classifier.
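If you do want to keep the pipeline, a sketch of that second option could look like this (it reuses the imports, headlines, sources and show_most_informative_features already defined in your question, plus the step names 'vect' and 'clf' from your pipeline):

text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
text_clf.fit(headlines, sources)

#Pull the fitted steps back out of the pipeline instead of passing the raw classes
fitted_vectorizer = text_clf.named_steps['vect']    #or: text_clf.get_params()['vect']
fitted_classifier = text_clf.named_steps['clf']
show_most_informative_features(fitted_vectorizer, fitted_classifier)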

I suggest you fit_transform the text, then feed the transformed result to a logistic regression or naive Bayes classifier, and then call your function:

vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(headlines)
naive_bayes = MultinomialNB()
naive_bayes.fit(X, sources)
show_most_informative_features(vectorizer, naive_bayes)

Try this first; if it works, you will understand better how to use a pipeline. Note that your pipeline should not work the way you combine the feature extractors, since the last step should be an estimator. If you want to stack feature extractors, you need to look at FeatureUnion.
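For completeness, a minimal sketch of stacking two text feature extractors with FeatureUnion (the choice of extractors here is illustrative, not something from your code; headlines and sources are the lists from your question):

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

#FeatureUnion runs both extractors and concatenates their outputs,
#while a single estimator still sits in the last step of the pipeline
stacked = Pipeline([
    ('features', FeatureUnion([
        ('counts', CountVectorizer(stop_words='english')),
        ('tfidf', TfidfVectorizer(stop_words='english')),
    ])),
    ('clf', MultinomialNB()),
])
stacked.fit(headlines, sources)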