CountVectorizer Error: ValueError: setting an array element with a sequence

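I have a dataset of 144 student feedback entries, 72 positive and 72 negative. The dataset has two attributes, data and target, which hold the sentence and its sentiment (positive or negative), respectively. Consider the following code: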

import pandas as pd
feedback_data = pd.read_csv('output.csv')
print(feedback_data)  


    data    target
0      facilitates good student teacher communication.  positive
1                           lectures are very lengthy.  negative
2             the teacher is very good at interaction.  positive
3                       good at clearing the concepts.  positive
4                       good at clearing the concepts.  positive
5                                    good at teaching.  positive
6                          does not shows test copies.  negative
7                           good subjective knowledge.  positive
8                           good communication skills.  positive
9                               good teaching methods.  positive
10   posseses very good and thorough knowledge of t...  positive
11   posseses superb ability to provide a lots of i...  positive
12   good conceptual skills and knowledge for subject.  positive
13                      no commuication outside class.  negative
14                                     rude behaviour.  negative
15            very negetive attitude towards students.  negative
16   good communication skills, lacks time punctual...  positive
17   explains in a better way by giving practical e...  positive
18                               hardly comes on time.  negative
19                          good communication skills.  positive
20   to make students comfortable with the subject,...  negative
21                       associated to original world.  positive
22                             lacks time punctuality.  negative

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(binary = True)
cv.fit(feedback_data['data'].values)
X = feedback_data['data'].apply(lambda X : cv.transform([X])).values
X_test = cv.transform(feedback_data_test)

from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

target = [1 if i<72 else 0 for i in range(144)]
print(target)

X_train, X_val, y_train, y_val = train_test_split(X, target, train_size = 0.50)

clf = svm.SVC(kernel = 'linear', gamma = 0.001, C = 0.05)
#The below line gives the error
clf.fit(X , target)

I can't figure out what is going wrong. Please help.

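The error comes from the way X is built. You cannot pass X directly to the fit method; you first need to transform it a bit further. (I can't comment on the other issues, because I don't have the relevant information.)

Right now you have the following: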

array([<1x23 sparse matrix of type '<class 'numpy.int64'>'
with 5 stored elements in Compressed Sparse Row format>,
   ...
   <1x23 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in Compressed Sparse Row format>], dtype=object)

That is enough to do a split. We just need to convert it into a form that you can read, and that the fit method can read as well:

X = list([list(x.toarray()[0]) for x in X])

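What we do here is convert each sparse matrix to a NumPy array, take its first element (it has only one), and then convert that to a list to make sure it has the right dimensions.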
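To make the shape problem concrete, here is a minimal sketch on a made-up three-sentence corpus (the sentences and the names `per_row`, `X_obj`, and `X_dense` are illustrative assumptions, not part of the original data). It builds one sparse matrix per sentence, as the question's code does, and then applies the conversion described above:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for feedback_data['data']
sentences = [
    "good at teaching",
    "lectures are very lengthy",
    "good communication skills",
]

cv = CountVectorizer(binary=True)
cv.fit(sentences)

# One 1xN sparse matrix per sentence, mimicking the .apply(...) pattern
per_row = [cv.transform([s]) for s in sentences]

# Store them in an object array: 1-D, each cell holds a whole matrix
X_obj = np.empty(len(per_row), dtype=object)
for i, m in enumerate(per_row):
    X_obj[i] = m
print(X_obj.dtype, X_obj.shape)

# Densify each cell, take its single row, and collect plain lists
X_dense = np.array([list(x.toarray()[0]) for x in X_obj])
print(X_dense.shape)
```

The object array is one-dimensional, with a whole matrix in every cell, which is the kind of input that triggers "setting an array element with a sequence". After the conversion the data is an ordinary two-dimensional numeric array with one row per sentence.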

Now, why do we do this:

X looks like this:

>>>X[0]
   <1x23 sparse matrix of type '<class 'numpy.int64'>'
   with 5 stored elements in Compressed Sparse Row format>

So we convert it to see what it really is:

>>>X[0].toarray()
   array([[0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
         0]], dtype=int64)

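As you can see, there is a small problem with the dimensions: the result is a two-dimensional 1x23 array, so we take its first element to get a flat row of 23 values.

Converting back to a list does not change anything; it is only there to help you see what you are working with. (You can drop it for speed.)

Your code now looks like this: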
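As a rough illustration of dropping the list conversion, the sketch below stacks the dense 1xN rows with `np.vstack` and checks that the result matches the list-based version. The two sample sentences and the variable names are assumptions made for the example:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample rows
sentences = ["rude behaviour", "good teaching methods"]
cv = CountVectorizer(binary=True).fit(sentences)
per_row = [cv.transform([s]) for s in sentences]

# Version with nested Python lists
X_lists = [list(x.toarray()[0]) for x in per_row]

# Without the list() calls: stack the 1xN dense rows into one (n, N) array
X_array = np.vstack([x.toarray() for x in per_row])

print(X_array.shape)
print(np.array_equal(np.array(X_lists), X_array))
```

Both forms describe the same two-dimensional data; the stacked array simply skips the intermediate Python lists.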

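from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(binary = True)
cv.fit(feedback_data['data'].values)
# Each row of X is still a separate 1xN sparse matrix at this point
X = feedback_data['data'].apply(lambda X : cv.transform([X])).values
# Flatten every sparse row into a plain list so X is a regular 2-D structure
X = list([list(x.toarray()[0]) for x in X])
clf = svm.SVC(kernel = 'linear', gamma = 0.001, C = 0.05)
clf.fit(X, target)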
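An alternative that is not part of the answer above, offered only as a sketch: calling the vectorizer once on the whole column returns a single two-dimensional sparse matrix, which `SVC.fit` accepts directly, so no per-row conversion is needed. The toy sentences, labels, `stratify`, and `random_state` below are assumptions for illustration; with the real data, `sentences` would be `feedback_data['data']`.

```python
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for feedback_data['data'] and the labels
sentences = [
    "good at teaching", "good communication skills",
    "good teaching methods", "good subjective knowledge",
    "lectures are very lengthy", "hardly comes on time",
    "rude behaviour", "lacks time punctuality",
]
target = [1, 1, 1, 1, 0, 0, 0, 0]

cv = CountVectorizer(binary=True)
# One call on the whole collection: a single (n_samples, n_features) sparse matrix
X = cv.fit_transform(sentences)

# Split once, then train on the training half only
X_train, X_val, y_train, y_val = train_test_split(
    X, target, train_size=0.5, stratify=target, random_state=0
)

clf = svm.SVC(kernel='linear', C=0.05)
clf.fit(X_train, y_train)
val_accuracy = accuracy_score(y_val, clf.predict(X_val))
print(X.shape, val_accuracy)
```

Unlike the question's code, this fits on `X_train` rather than on all of `X`, so the validation half is actually held out. `gamma` is left out because it has no effect with a linear kernel.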