ValueError: setting an array element with a sequence - after making TF_IDF vectorization

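I'm new to data science and NLP. I want to apply TF-IDF vectorization to some text documents and then use the result to train different machine learning models. But when I try to train an SVC model, I get ValueError: setting an array element with a sequence. Here is my code.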

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=1, stop_words='english')
df['vect_message'] = vectorizer.fit_transform(df['message_encoding'])
X = df['vect_message']
y = df['severity']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

from sklearn import svm
model = svm.SVC() 
model.fit(X_train, y_train) 
prediction = model.predict(X_test)

I get the error on the line model.fit(X_train, y_train).

I have searched for similar questions and found one that suggested converting the sparse matrix to an np.array with the .toarray() method. But that did not help me.

When you execute the following line:

df['vect_message'] = vectorizer.fit_transform(df['message_encoding'])

Pandas treats the result of vectorizer.fit_transform() as a scalar object. As a result, you end up with the same sparse matrix in every row of the vect_message column:

In [74]: df.loc[0, 'vect_message']
Out[74]:
<3x4 sparse matrix of type '<class 'numpy.float64'>'
        with 4 stored elements in Compressed Sparse Row format>

In [75]: df.loc[0, 'vect_message'].A
Out[75]:
array([[ 0.        ,  1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.70710678,  0.70710678],
       [ 1.        ,  0.        ,  0.        ,  0.        ]])

In [76]: df.loc[1, 'vect_message'].A
Out[76]:
array([[ 0.        ,  1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.70710678,  0.70710678],
       [ 1.        ,  0.        ,  0.        ,  0.        ]])

In [77]: df.loc[2, 'vect_message'].A
Out[77]:
array([[ 0.        ,  1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.70710678,  0.70710678],
       [ 1.        ,  0.        ,  0.        ,  0.        ]])

Essentially the same thing happens when we do df['new_col'] = 0: we get a column of zeros.
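As a minimal sketch of that scalar broadcasting (the toy DataFrame and its column names are made up for illustration, they are not from the question):

```python
import pandas as pd

# Hypothetical three-row frame, just to show the broadcasting behaviour.
df = pd.DataFrame({"message": ["first", "second", "third"]})

# A single scalar on the right-hand side is repeated for every row.
df["new_col"] = 0

print(df["new_col"].tolist())  # [0, 0, 0]
```

Every row gets the same value, which is what happened with the sparse matrix in vect_message.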

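Solution: pass the matrix returned by fit_transform() directly as X instead of storing it in a DataFrame column: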

X = vectorizer.fit_transform(df['message_encoding'])
y = df['severity']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
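
For completeness, here is a rough end-to-end sketch of the fixed flow. The messages and severity labels are invented toy data standing in for your DataFrame, and stratify=y is my own addition so that both classes are sure to appear in such a tiny training split; it is not part of the original code.

```python
import pandas as pd
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy data; replace with the real DataFrame.
df = pd.DataFrame({
    "message_encoding": [
        "disk failure on node one",
        "disk almost full warning",
        "user logged in successfully",
        "user logged out normally",
        "kernel panic disk failure",
        "scheduled backup completed successfully",
        "memory failure detected on node two",
        "user changed password successfully",
        "critical disk failure on node three",
        "routine health check completed",
    ],
    "severity": ["high", "high", "low", "low", "high",
                 "low", "high", "low", "high", "low"],
})

vectorizer = TfidfVectorizer(min_df=1, stop_words="english")

# Keep the 2D sparse matrix in its own variable, not in a DataFrame column.
X = vectorizer.fit_transform(df["message_encoding"])
y = df["severity"]

# stratify=y is an assumption added for this small example only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# SVC accepts scipy sparse input, so no .toarray() call is needed.
model = svm.SVC()
model.fit(X_train, y_train)
prediction = model.predict(X_test)

print(X.shape, prediction.shape)
```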

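PS IMO it does not make much sense to store (or try to store) a 2D sparse matrix (the result of the vectorizer.fit_transform() call) in a Pandas column (a Series), which is a one-dimensional structure.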