SVC text classification - TypeError: unhashable type: 'csr_matrix'

Question

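I'm fairly new to machine learning. I'm trying to build an SVC text classifier. However, when I try to make a single prediction, I get the error unhashable type: 'csr_matrix'. I'm not sure why this happens.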

The objective is binary classification on a dataset with [text, label] columns, where the first is a sentence and the second is 0 or 1.

I can predict on X_test, but I can't get the same thing to work for a single prediction.

Code

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import numpy as np

tfid = TfidfVectorizer(encoding='utf-8', lowercase=True, analyzer='word')
X = tfid.fit_transform(df['text'])
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

## Training the SVM model on the Training set
from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state=42)
classifier.fit(X_train, y_train)

## Predicting the Test set results
y_pred = classifier.predict(X_test)
y_pred
array([0, 1, 1, ..., 0, 0, 1])

## Making the Confusion Matrix
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)
[[3762   61]
 [  43 3919]]
0.9866409762363519

Here comes the problem

# Loading tfid with model.feature_names as vocabulary
tfid = TfidfVectorizer(encoding='utf-8', lowercase=True, analyzer='word', vocabulary=X_train)

## Predicting a new result
to_pred = tfid.fit_transform([df['text'].iloc[0]])

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-12-9be72cc31a52> in <module>()
      1 ## Predicting a new result
----> 2 to_pred = tfid.fit_transform([df['text'].iloc[0]])

2 frames
/usr/local/lib/python3.7/dist-packages/sklearn/feature_extraction/text.py in _validate_vocabulary(self)
    469                 vocab = {}
    470                 for i, t in enumerate(vocabulary):
--> 471                     if vocab.setdefault(t, i) != i:
    472                         msg = "Duplicate term in vocabulary: %r" % t
    473                         raise ValueError(msg)

TypeError: unhashable type: 'csr_matrix'

This is what df['text'].iloc[0] looks like:

df['text'].iloc[0]
'coming up with a baby name is hard being lazy is much easier'

Answer 1

There are a few problems with this code.

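First, you are fitting your tf-idf on both the training and the test data. This is not good practice: in real life you don't have access to the test set. Split into train and test first, then fit_transform your tfidf on the training set and only transform the test set (pretend you don't know what is in the test set, just as in real life).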
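As a rough sketch of that order of operations (split the raw text first, fit the vectorizer on the training part only, then only transform the test part), something like the following should work. The small DataFrame is a made-up stand-in for the asker's df, and the names X_train_text and X_test_text are introduced here for illustration only; everything else follows the code in the question.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical toy data standing in for the asker's df (columns: text, label).
df = pd.DataFrame({
    "text": [
        "coming up with a baby name is hard being lazy is much easier",
        "being lazy is so much easier than working",
        "i would rather be lazy all day",
        "lazy sunday is the best day",
        "the quarterly report is due on friday",
        "please review the attached report today",
        "the meeting about the report starts at noon",
        "send the report before the meeting",
    ],
    "label": [1, 1, 1, 1, 0, 0, 0, 0],
})

# 1. Split the raw text BEFORE any vectorizing.
X_train_text, X_test_text, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.25, random_state=42, stratify=df["label"]
)

# 2. Learn the vocabulary and IDF weights from the training text only.
tfid = TfidfVectorizer(encoding="utf-8", lowercase=True, analyzer="word")
X_train = tfid.fit_transform(X_train_text)

# 3. Only transform the test text; it must not influence the fitted vectorizer.
X_test = tfid.transform(X_test_text)

# 4. Train and evaluate as in the question.
classifier = SVC(kernel="linear", random_state=42)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(y_pred)
```

Because the same fitted tfid produces both matrices, X_train and X_test have the same number of columns, which is what the classifier relies on.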

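Another problem is that you create a new tfidf instance to transform the sentence you want to predict. You should reuse the tfidf instance you already fitted: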

#imagine that you put this after the code above (so the tfidf here is fitted on train data)
to_pred = tfid.transform(['that thing you said about being lazy'])
#then predict
print(classifier.predict(to_pred))

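The reason you get this error is that the vocabulary parameter does not expect a csr matrix (that is, text data transformed by tfidf, which returns a sparse matrix object for efficiency). When it is given something that is not a mapping, the vectorizer iterates over it and uses each item as a dictionary key; the rows of a csr matrix are not hashable, hence "unhashable type". It expects a dictionary like this: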

{'love': 5, 'apples': 1, 'are': 2, 'healthy': 4, 'and': 0, 'fun': 3, 'red': 6}

But that is beside the point, because creating a new vectorizer here is the wrong approach anyway.
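For illustration only, here is a small sketch of where such a dictionary comes from. The three-sentence corpus is invented to reproduce the example mapping above, and the names fitted and rebuilt are hypothetical. It also shows that even with a valid vocabulary, a fresh TfidfVectorizer still has to be fitted to learn its IDF weights, which is one more reason to keep using the instance that was fitted on the training data.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus chosen so that the learned terms match the example dictionary.
corpus = ["love apples", "apples are healthy and fun", "red apples"]

fitted = TfidfVectorizer()
matrix = fitted.fit_transform(corpus)

# The transformed data is a sparse matrix (this is what X_train was) ...
print(issparse(matrix), matrix.shape)

# ... while the learned vocabulary is a plain dict: term -> column index.
print(fitted.vocabulary_)

# A dict like this is a valid value for `vocabulary`; a sparse matrix is not.
rebuilt = TfidfVectorizer(vocabulary=fitted.vocabulary_)
rebuilt.fit(corpus)  # still needed: the IDF weights are not part of the vocabulary
print(rebuilt.vocabulary_ == fitted.vocabulary_)
```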
