Error with TfidfVectorizer and SelectKBest
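I'm trying to follow this tutorial to do some sentiment analysis, and I'm fairly sure my code is identical to it so far. However, my BOW values differ significantly from the tutorial's.
Here is my code so far.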
import nltk
import pandas as pd
import string
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
def openFile(path):
    #param path: path/to/file.ext (str)
    #Returns contents of file (str)
    with open(path) as file:
        data = file.read()
    return data
imdb_data = openFile('C:/Users/Flengo/Desktop/sentiment/data/imdb_labelled.txt')
amzn_data = openFile('C:/Users/Flengo/Desktop/sentiment/data/amazon_cells_labelled.txt')
yelp_data = openFile('C:/Users/Flengo/Desktop/sentiment/data/yelp_labelled.txt')
datasets = [imdb_data, amzn_data, yelp_data]
combined_dataset = []
# separate samples from each other
for dataset in datasets:
    combined_dataset.extend(dataset.split('\n'))
# separate each label from each sample
dataset = [sample.split('\t') for sample in combined_dataset]
df = pd.DataFrame(data=dataset, columns=['Reviews', 'Labels'])
df = df[df["Labels"].notnull()]
df = df.sample(frac=1)
labels = df['Labels']
vectorizer = TfidfVectorizer(min_df=15)
bow = vectorizer.fit_transform(df['Reviews'])
len(vectorizer.get_feature_names())
selected_features = SelectKBest(chi2, k=200).fit(bow, labels).get_support(indices=True)
vectorizer = TfidfVectorizer(min_df=15, vocabulary=selected_features)
bow = vectorizer.fit_transform(df['Reviews'])
bow
Here is my result.
Here is the tutorial's result.
I've been trying to work out what the problem might be, but I haven't made any progress.
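The problem is that you are supplying indices: get_support(indices=True) returns the column positions of the selected features, not the terms themselves. Supply the actual vocabulary instead.
Try this: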
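import numpy as np  # needed for the index lookup below

# column indices of the 200 best features in the original bow matrix
selected_features = SelectKBest(chi2, k=200).fit(bow, labels).get_support(indices=True)
# map the indices back to terms (on scikit-learn >= 1.0, get_feature_names_out() replaces get_feature_names())
vocabulary = np.array(vectorizer.get_feature_names())[selected_features]
vectorizer = TfidfVectorizer(min_df=15, vocabulary=vocabulary) # you need to supply a real vocab here
bow = vectorizer.fit_transform(df['Reviews'])
bow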
<3000x200 sparse matrix of type '<class 'numpy.float64'>'
with 12916 stored elements in Compressed Sparse Row format>
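To make the index-versus-term distinction concrete, here is a minimal, self-contained sketch on a toy corpus. The reviews, labels, and k=3 are invented for illustration and are not the asker's data. It also shows that selector.transform(bow) keeps the same columns without refitting a second vectorizer, although the tf-idf values then keep their original normalization, so they need not equal the refitted ones.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Toy corpus and labels, invented for illustration only.
reviews = [
    "great phone great battery",
    "terrible phone bad battery",
    "great movie loved it",
    "bad movie terrible plot",
    "loved the food great service",
    "bad food terrible service",
]
labels = ["1", "0", "1", "0", "1", "0"]

# Step 1: vectorize with the full vocabulary.
vectorizer = TfidfVectorizer()
bow = vectorizer.fit_transform(reviews)

# Step 2: score the features and get the column indices of the best k.
selector = SelectKBest(chi2, k=3).fit(bow, labels)
selected_idx = selector.get_support(indices=True)  # integers, not words

# Step 3: map the indices back to terms.
# get_feature_names_out() is the scikit-learn >= 1.0 name; fall back for older versions.
if hasattr(vectorizer, "get_feature_names_out"):
    terms = np.asarray(vectorizer.get_feature_names_out())
else:
    terms = np.asarray(vectorizer.get_feature_names())
selected_terms = terms[selected_idx]

# Step 4: refit a vectorizer restricted to the selected terms.
restricted = TfidfVectorizer(vocabulary=selected_terms.tolist())
bow_restricted = restricted.fit_transform(reviews)

# Alternative: keep the original matrix and just drop the unselected columns.
bow_selected = selector.transform(bow)

print(selected_idx, selected_terms)
print(bow_restricted.shape, bow_selected.shape)
```

Both matrices have one column per selected term and the same sparsity pattern. Passing selected_idx itself as the vocabulary would instead make the vectorizer look for tokens equal to those integers, which no review contains.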