Sentiment Analysis with Logistic Regression using sklearn
I am trying to do sentiment analysis using sklearn.
Here is my code:
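import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

products = pd.read_csv("amazon_baby.csv")
products.head(3)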
                                  name                                             review  rating
0             Planetwise Flannel Wipes  These flannel wipes are OK, but in my opinion ...       3
1                Planetwise Wipe Pouch  it came early and was not disappointed. i love...       5
2  Annas Dream Full Quilt with 2 Shams  Very soft and comfortable and warmer than it l...       5
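import string

def remove_punctuation(text):
    try:
        # Python 2 style str.translate call
        text = text.translate(None, string.punctuation)
    except (TypeError, AttributeError):
        # Python 3, or a non-string value such as NaN: build a deletion table
        translator = str.maketrans('', '', string.punctuation)
        text = str(text).translate(translator)
    return text

review_without_punctuation = products['review'].apply(remove_punctuation)
# Recent pandas versions treat the pattern literally unless regex=True is passed
products["review_without_punctuation"] = products['review'].str.replace(r'[^\w\s]', '', regex=True)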
products.dropna(inplace=True)
products['sentiment'] = products['rating'].apply(lambda rating : +1 if rating > 3 else -1)
y=products['sentiment']
X=products['review_without_punctuation']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
count_Vect = CountVectorizer(stop_words='english')
train_data_vectorizer = count_Vect.fit_transform(X_train)
logisticRegr = LogisticRegression()
sentiment_model=logisticRegr.fit(train_data_vectorizer,y_train)
sample_test_data = X_test[10:13]
sample_test_data
87850 My daughter loves this The quality is great an...
8765 I have a home daycare and I LOVE this monitor ...
66497 Unique reasons why you should consider this be...
Name: review_without_punctuation, dtype: object
Trying to predict using the trained model:
sample_test_data=count_Vect.fit_transform(sample_test_data)
sentiment_model.predict(sample_test_data)
I get the following error:
ValueError Traceback (most recent call last)
ValueError: X has 73 features per sample; expecting 131364
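I understand the problem has to do with how I vectorize the X_test sample before predicting with it.
How should I vectorize the data and train the model instead?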
It works with a small modification to the code:
count_Vect = CountVectorizer().fit(X_train)
X_train_vectorized=count_Vect.transform(X_train)
model = LogisticRegression()
model.fit(X_train_vectorized,y_train)
predictions=model.predict(count_Vect.transform(X_train))
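One way to make the fit-once, transform-afterwards pattern hard to get wrong is to bundle the vectorizer and the classifier into a single estimator. The following is a minimal sketch using `make_pipeline`; the toy texts and labels are hypothetical stand-ins for the review data, not taken from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the review text and sentiment labels
train_texts = ["love this great product", "terrible waste of money",
               "works great love it", "broke quickly terrible quality"]
train_labels = [1, -1, 1, -1]
new_texts = ["great quality love it", "terrible product"]

# fit() fits the vectorizer and the classifier on the training text only
pipeline = make_pipeline(CountVectorizer(), LogisticRegression())
pipeline.fit(train_texts, train_labels)

# predict() only calls transform() on the fitted vectorizer,
# so new text is mapped onto the training vocabulary
predictions = pipeline.predict(new_texts)
print(predictions)
```

With this arrangement there is no separate vectorizer object to refit by accident on the test text.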
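The error comes from not vectorizing the test data consistently with the training data. The flow should be: fit and transform the vectorizer on the training data, then only transform (do not fit) the test data with that same fitted vectorizer. Everything else in your code looks fine apart from the vectorization of the test data. See the code below and let me know if this is what you are looking for.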
sample_test_data = count_Vect.transform(sample_test_data) # just transform
sentiment_model.predict(sample_test_data)
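The mismatch in feature counts can be reproduced on a small scale. This is only an illustrative sketch with hypothetical toy sentences (not the Amazon data): a vectorizer fitted on the test text learns its own, smaller vocabulary, while `transform` keeps the column count learned from the training text.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpora (not the Amazon data)
train_texts = ["soft and comfortable quilt", "the monitor works great", "wipes are ok"]
test_texts = ["great quilt"]

vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(train_texts)  # learns the vocabulary
n_train_features = X_train_counts.shape[1]

# transform() reuses the training vocabulary: same number of columns
X_test_ok = vectorizer.transform(test_texts)

# A vectorizer fitted on the test text learns a different, smaller vocabulary
X_test_bad = CountVectorizer().fit_transform(test_texts)

print(n_train_features, X_test_ok.shape[1], X_test_bad.shape[1])
```

Note that calling `fit_transform` again on the same vectorizer object, as in the question, also overwrites the vocabulary learned from the training data, so the model and the vectorizer no longer agree on what each column means.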