Unpack Dictionaries for Logistic Regression in Python
I'm trying to run some sentiment analysis on product reviews, but I'm stuck on getting my model to read the word-count dictionaries:
import pandas as pd
import numpy as np
from sklearn import linear_model, model_selection, metrics

products = pd.read_csv('data.csv')

def count_words(s):
    d = {}
    wl = str(s).split()
    for w in wl:
        d[w] = wl.count(w)
    return d

products['word_count'] = products['review'].apply(count_words)
products = products[products['rating'] != 3]
products['sentiment'] = (products['rating'] >= 4) * 1

train_data, test_data = model_selection.train_test_split(products, test_size=0.2, random_state=0)

sentiment_model = linear_model.LogisticRegression()
sentiment_model.fit(X = train_data['word_count'], y = train_data['sentiment'])
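For reference (an illustration, not part of the original script), count_words builds one plain Python dict per review:

count_words('great value great')
# {'great': 2, 'value': 1}

so every row of products['word_count'] holds a dict rather than numbers.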
When I run the last line, I get the following error:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-51-0c3f47af3a6e> in <module>()
----> 1 sentiment_model.fit(X = train_data['word_count'], y = train_data['sentiment'])

C:\ProgramData\anaconda_3\lib\site-packages\sklearn\linear_model\logistic.py in fit(self, X, y, sample_weight)
   1171
   1172         X, y = check_X_y(X, y, accept_sparse='csr', dtype=np.float64,
-> 1173                          order="C")
   1174         check_classification_targets(y)
   1175         self.classes_ = np.unique(y)

C:\ProgramData\anaconda_3\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
    519     X = check_array(X, accept_sparse, dtype, order, copy, force_all_finite,
    520                     ensure_2d, allow_nd, ensure_min_samples,
--> 521                     ensure_min_features, warn_on_dtype, estimator)
    522     if multi_output:
    523         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

C:\ProgramData\anaconda_3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    380                                       force_all_finite)
    381     else:
--> 382         array = np.array(array, dtype=dtype, order=order, copy=copy)
    383
    384     if ensure_2d:

TypeError: float() argument must be a string or a number, not 'dict'
It looks like the model is taking each dictionary itself as the x variable, rather than the entries inside it. I think I need to unpack the dictionaries into an array(?), but I haven't had any luck.
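The error can be reproduced in isolation, which shows what check_X_y is attempting under the hood (a minimal demonstration, not from the original post):

import numpy as np
np.array([{'great': 2, 'value': 1}], dtype=np.float64)
# TypeError: float() argument must be a string or a number, not 'dict'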
Update:

Here is what products looks like after running word_count and defining sentiment:

products.head()
Try

X = train_data['word_count'].values

if what you are looking for is the array of word-count dicts, one per item in train_data['word_count'] (note that on a pandas Series, values is an attribute, not a method, so calling .values() would also fail).
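If you instead want just the count numbers per review, a minimal sketch (my illustration, not part of the original answer):

train_counts = train_data['word_count'].apply(lambda d: list(d.values()))

These lists have a different length for every review, though, so they still cannot be fed to LogisticRegression directly; that is the problem DictVectorizer solves below.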
If you just want to fix the error, first run DictVectorizer on train_data['word_count'] to convert it into an acceptable format, i.e. an array of shape [n_samples, n_features].
Add the following to your code before sentiment_model.fit():
from sklearn.feature_extraction import DictVectorizer

# Learns one feature column per distinct word and returns a sparse
# [n_samples, n_features] count matrix.
dictVectorizer = DictVectorizer()
train_data_dict = dictVectorizer.fit_transform(train_data['word_count'])
Then call sentiment_model.fit() like this:
sentiment_model.fit(X=train_data_dict, y=train_data['sentiment'])
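When you later score on test_data, transform it with the vectorizer fitted on the training set (transform, not fit_transform, so the feature columns line up). A hedged sketch, using the metrics module already imported above:

test_data_dict = dictVectorizer.transform(test_data['word_count'])
predictions = sentiment_model.predict(test_data_dict)
print(metrics.accuracy_score(test_data['sentiment'], predictions))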
Note: rather than implementing your own word-counting method, I would suggest using CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer

# Tokenizes the raw review text and builds the count matrix in one step,
# replacing the hand-written count_words function entirely.
countVec = CountVectorizer()
train_data_vectorizer = countVec.fit_transform(train_data['review'])
sentiment_model.fit(X=train_data_vectorizer, y=train_data['sentiment'])
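One caveat (an assumption, mirroring the str(s) cast in count_words): CountVectorizer expects strings, so if the review column can contain NaN, fill it first with products['review'] = products['review'].fillna(''). Evaluation then reuses the fitted vectorizer on the held-out reviews:

test_data_vectorizer = countVec.transform(test_data['review'])
print(sentiment_model.score(test_data_vectorizer, test_data['sentiment']))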