NLP text classification CountVectorizer Shape Error

Question

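I have a text dataset with one column for reviews and another for labels. I want to build a decision tree model on this dataset. I used a vectorizer, but it raises ValueError: Number of labels=37500 does not match number of samples=1. vect.vocabulary_ returns {'review': 0}, where review is the column name, so I think it is not being fitted on all of the data. The code is below; any help is appreciated.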

from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(data.iloc[:,:-1],data.iloc[:,-1:],
test_size = 0.25, random_state = 42)

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

from sklearn.tree import DecisionTreeClassifier 
DTC = DecisionTreeClassifier()
DTC.fit(X_train_dtm, y_train)
y1_pred_class = DTC.predict(X_test_dtm)

Also, X_train_dtm.shape is <bound method spmatrix.get_shape of <1x1 sparse matrix of type '<class 'numpy.int64'>' with 1 stored elements in Compressed Sparse Row format>>

Answer 1

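CountVectorizer expects one-dimensional input (an iterable of strings), and the error indicates that your X_train is two-dimensional. Iterating over a DataFrame yields its column names rather than its rows, which is why the vocabulary contains only 'review' and the matrix has a single sample. If it is a DataFrame, reduce it to a Series; if it is a numpy array, use reshape or ravel.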
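A minimal sketch of the difference, assuming a small made-up DataFrame (the toy rows and the 'label' column name are hypothetical; only 'review' comes from the question):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data; 'review' mirrors the column name in the question.
df = pd.DataFrame({
    "review": ["good movie", "bad movie", "great plot"],
    "label": [1, 0, 1],
})

X_2d = df.iloc[:, :-1]   # DataFrame with one column, shape (3, 1)
X_1d = df["review"]      # Series, shape (3,)

# Iterating a DataFrame yields its column names, not its rows.
docs_seen = list(X_2d)   # ['review']

bad = CountVectorizer().fit(X_2d)    # sees a single "document": 'review'
good = CountVectorizer().fit(X_1d)   # sees the three review strings

bad_shape = bad.transform(X_2d).shape     # (1, 1)
good_shape = good.transform(X_1d).shape   # (3, 5)

print(bad.vocabulary_, bad_shape, good_shape)
```

With the Series, the number of rows in the document-term matrix matches the number of labels, so the classifier's fit no longer complains.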

Answer 2

It worked when I changed this part:

X_train, X_test, y_train, y_test = train_test_split(data['text'], data['tag'], test_size=0.25, random_state=42)
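For completeness, here is a rough end-to-end sketch with that change applied. The toy rows are invented for illustration, the 'text' and 'tag' column names follow the line above, and random_state on the tree is my own addition for reproducibility:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for the real dataset.
data = pd.DataFrame({
    "text": ["good movie", "bad movie", "great plot", "awful plot",
             "good acting", "bad acting", "great film", "awful film"],
    "tag": [1, 0, 1, 0, 1, 0, 1, 0],
})

# Pass Series (1-D), not single-column DataFrames (2-D).
X_train, X_test, y_train, y_test = train_test_split(
    data["text"], data["tag"], test_size=0.25, random_state=42)

vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)  # learn the vocabulary from training text only
X_test_dtm = vect.transform(X_test)        # reuse that vocabulary for the test text

DTC = DecisionTreeClassifier(random_state=42)
DTC.fit(X_train_dtm, y_train)
y_pred_class = DTC.predict(X_test_dtm)

print(X_train_dtm.shape, X_test_dtm.shape, y_pred_class)
```

Note that a single fit_transform on the training text is enough; the separate fit and transform calls in the question do the same work twice.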

Tags: python, nlp, decision-tree, scikit-learn, text-classification