Sentiment Analysis with Logistic Regression using sklearn

I am trying to do sentiment analysis using sklearn. Here is my code:

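import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

products = pd.read_csv("amazon_baby.csv")
products.head(3)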

   name                                 review                                             rating
0  Planetwise Flannel Wipes             These flannel wipes are OK, but in my opinion ...  3
1  Planetwise Wipe Pouch                it came early and was not disappointed. i love...  5
2  Annas Dream Full Quilt with 2 Shams  Very soft and comfortable and warmer than it l...  5

import string

def remove_punctuation(text):
    # str() guards against non-string entries such as NaN (a float) in the review column
    translator = str.maketrans('', '', string.punctuation)
    return str(text).translate(translator)

review_without_punctuation = products['review'].apply(remove_punctuation)

# regex=True is required in recent pandas versions, where str.replace defaults to literal matching
products["review_without_punctuation"] = products['review'].str.replace(r'[^\w\s]', '', regex=True)
products.dropna(inplace=True)
products['sentiment'] = products['rating'].apply(lambda rating: +1 if rating > 3 else -1)
y = products['sentiment']
X = products['review_without_punctuation']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

count_Vect = CountVectorizer(stop_words='english')
train_data_vectorizer = count_Vect.fit_transform(X_train)
logisticRegr = LogisticRegression()
sentiment_model=logisticRegr.fit(train_data_vectorizer,y_train)

sample_test_data = X_test[10:13]
sample_test_data
87850    My daughter loves this The quality is great an...
8765     I have a home daycare and I LOVE this monitor ...
66497    Unique reasons why you should consider this be...
Name: review_without_punctuation, dtype: object

Then I try to predict with the trained model:

sample_test_data=count_Vect.fit_transform(sample_test_data)
sentiment_model.predict(sample_test_data)

I get the following error:

ValueError                                Traceback (most recent call last)
ValueError: X has 73 features per sample; expecting 131364

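I understand that the problem comes from how I run the vectorizer on X_test and then try to predict with the result. How should I vectorize the data and train the model instead?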

It works with a small change to the code:

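# Fit the vectorizer once, on the training text only
count_Vect = CountVectorizer().fit(X_train)
X_train_vectorized = count_Vect.transform(X_train)
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)
# Reuse the fitted vectorizer (transform only) on the held-out text
predictions = model.predict(count_Vect.transform(X_test))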
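One way to make the fit-once, transform-later pattern harder to get wrong is to bundle the vectorizer and the classifier into a single sklearn Pipeline. The snippet below is only a minimal sketch: the toy reviews and labels are invented for illustration and are not taken from amazon_baby.csv, and the variable names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, standing in for X_train / y_train / X_test
train_texts = [
    "love this great product",
    "great quality love it",
    "terrible waste of money",
    "broke quickly terrible quality",
]
train_labels = [1, 1, -1, -1]
test_texts = ["love the quality", "terrible product"]

# fit() calls fit_transform on the vectorizer, then fits the classifier
pipeline = make_pipeline(CountVectorizer(), LogisticRegression())
pipeline.fit(train_texts, train_labels)

# predict() only calls transform on the vectorizer, so the vocabulary is never refitted
predictions = pipeline.predict(test_texts)
print(predictions)
```

With a pipeline, raw text goes straight into fit and predict, so there is no separate vectorizer call on the test data to accidentally turn into fit_transform.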

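I found that the error comes from the test data not being vectorized consistently with the training data. The flow should be: fit and transform the vectorizer on the training data, then only transform the test data with that same fitted vectorizer (do not fit it again). Calling fit_transform on the test sample rebuilds the vocabulary from those three reviews alone, which is why the model receives 73 features instead of the 131364 it was trained on. Everything else looks fine. Refer to the code below and let me know if this is what you are looking for.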

sample_test_data = count_Vect.transform(sample_test_data)  # just transform
sentiment_model.predict(sample_test_data)
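To see why refitting changes the number of features, here is a small self-contained sketch of the same situation. The toy reviews and labels are made up for illustration (they are not from amazon_baby.csv), and the variable names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data, invented for this example
train_texts = [
    "love this great product",
    "great quality love it",
    "terrible waste of money",
    "broke quickly terrible quality",
]
train_labels = [1, 1, -1, -1]
test_texts = ["love the quality", "terrible product"]

vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(train_texts)  # vocabulary learned from training text only
model = LogisticRegression().fit(X_train_vec, train_labels)

# Correct: transform reuses the training vocabulary, so the column count matches the model
X_test_vec = vectorizer.transform(test_texts)
predictions = model.predict(X_test_vec)

# Incorrect: a freshly fitted vectorizer learns a different vocabulary from the test text
refit_vec = CountVectorizer().fit_transform(test_texts)

mismatch_raised = False
try:
    model.predict(refit_vec)
except ValueError as err:
    mismatch_raised = True
    print("Feature mismatch:", err)

print(X_train_vec.shape[1], X_test_vec.shape[1], refit_vec.shape[1])
```

Words in the test text that never appeared in the training text are simply ignored by transform, which is the expected behaviour: the model has no coefficient for them anyway.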