ValueError: setting an array element with a sequence - after making TF_IDF vectorization
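I'm new to data science and NLP. I want to perform TF-IDF vectorization on some text documents and then use the result to train different machine learning models. But when I try to train an SVC model, I get ValueError: setting an array element with a sequence. Here is my code.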
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=1, stop_words='english')
df['vect_message'] = vectorizer.fit_transform(df['message_encoding'])
X = df['vect_message']
y = df['severity']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
from sklearn import svm
model = svm.SVC()
model.fit(X_train, y_train)
prediction = model.predict(X_test)
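I get the error on the line model.fit(X_train, y_train).
I have searched for similar questions and found one where they suggest using the .toarray() method to convert the sparse matrix to an np.array. But that did not help me.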
When you execute the following line:
df['vect_message'] = vectorizer.fit_transform(df['message_encoding'])
Pandas treats the result of vectorizer.fit_transform() as a scalar object.
As a result, you will have the same sparse matrix in every row of the vect_message column:
In [74]: df.loc[0, 'vect_message']
Out[74]:
<3x4 sparse matrix of type '<class 'numpy.float64'>'
with 4 stored elements in Compressed Sparse Row format>
In [75]: df.loc[0, 'vect_message'].A
Out[75]:
array([[ 0. , 1. , 0. , 0. ],
[ 0. , 0. , 0.70710678, 0.70710678],
[ 1. , 0. , 0. , 0. ]])
In [76]: df.loc[1, 'vect_message'].A
Out[76]:
array([[ 0. , 1. , 0. , 0. ],
[ 0. , 0. , 0.70710678, 0.70710678],
[ 1. , 0. , 0. , 0. ]])
In [77]: df.loc[2, 'vect_message'].A
Out[77]:
array([[ 0. , 1. , 0. , 0. ],
[ 0. , 0. , 0.70710678, 0.70710678],
[ 1. , 0. , 0. , 0. ]])
Basically the same thing happens when we do df['new_col'] = 0 - we get a column of zeros.
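To make the scalar analogy concrete, here is a minimal sketch of the df['new_col'] = 0 case. The three-row DataFrame is made up for illustration and is not the asker's data.

```python
import pandas as pd

# Hypothetical three-row frame, only for illustration.
df = pd.DataFrame({"message_encoding": ["first text", "second text", "third text"]})

# A scalar on the right-hand side is broadcast to every row of the new column.
df["new_col"] = 0

print(df["new_col"].tolist())  # [0, 0, 0]
```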
Solution:
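# Keep the TF-IDF result as a sparse matrix instead of storing it in a DataFrame column
X = vectorizer.fit_transform(df['message_encoding'])
y = df['severity']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)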
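For reference, here is a self-contained sketch of the whole pipeline with that fix applied. The column names follow the question. The ten messages and their labels are invented stand-ins for the real data, so treat this as an illustration rather than the asker's exact setup.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical toy data standing in for the asker's DataFrame.
df = pd.DataFrame({
    "message_encoding": [
        "disk failure detected on node",
        "user logged in successfully",
        "kernel panic process halted",
        "scheduled backup completed",
        "memory failure detected on node",
        "user logged out successfully",
        "fatal error process halted",
        "scheduled cleanup completed",
        "disk error detected on node",
        "user session started successfully",
    ],
    "severity": ["high", "low"] * 5,
})

vectorizer = TfidfVectorizer(min_df=1, stop_words="english")

# X is a 2D sparse matrix with one row per document; it is NOT put into df.
X = vectorizer.fit_transform(df["message_encoding"])
y = df["severity"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# SVC accepts scipy sparse input directly, so no .toarray() call is needed.
model = SVC()
model.fit(X_train, y_train)
prediction = model.predict(X_test)
print(prediction)
```

train_test_split splits X and y by row position, so the feature rows and the labels stay aligned even though X is no longer a DataFrame column.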
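PS IMO it doesn't make much sense to store (or try to store) a 2D sparse matrix (the result of the vectorizer.fit_transform() call) in a Pandas column (Series), which is a 1D structure.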