NLP text classification: CountVectorizer shape error
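I have a text dataset with one column of reviews and another of labels. I want to build a decision tree model on it. I used a vectorizer, but it raises ValueError: Number of labels=37500 does not match number of samples=1
vect.vocabulary_ returns {'review': 0}
review is the column name, so I think the vectorizer was not fitted on all of the data. The code is below; any help is appreciated.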
from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(data.iloc[:,:-1],data.iloc[:,-1:],
test_size = 0.25, random_state = 42)
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)
from sklearn.tree import DecisionTreeClassifier
DTC = DecisionTreeClassifier()
DTC.fit(X_train_dtm, y_train)
y1_pred_class = DTC.predict(X_test_dtm)
Also, X_train_dtm.shape is <bound method spmatrix.get_shape of <1x1 sparse matrix of type '<class 'numpy.int64'>' with 1 stored elements in Compressed Sparse Row format>>
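CountVectorizer expects one-dimensional input (an iterable of strings), and the error indicates that your X_train is two-dimensional. Iterating over a DataFrame yields its column names rather than its rows, which is why the vocabulary contains only 'review' and the matrix is 1x1. If X_train is a DataFrame, reduce it to a Series; if it is a numpy array, use reshape or ravel.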
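The difference between the two input shapes can be reproduced on a tiny example. This is only a sketch: the toy DataFrame and its column names ("review", "label") are assumptions made for illustration, not the asker's real data.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data; the column names are assumptions for illustration.
data = pd.DataFrame({
    "review": ["good movie", "bad movie", "great plot"],
    "label": [1, 0, 1],
})

X_2d = data.iloc[:, :-1]   # DataFrame, shape (3, 1)
X_1d = data["review"]      # Series, shape (3,)

# Iterating over a DataFrame yields its column names, not its rows,
# so the vectorizer sees a single "document": the string "review".
print(list(X_2d))          # ['review']

bad = CountVectorizer().fit(X_2d)
print(bad.vocabulary_)     # {'review': 0}

# A Series is iterated row by row, so every review becomes a document.
good = CountVectorizer()
X_dtm = good.fit_transform(X_1d)
print(X_dtm.shape)         # (3, 5)

# A 2-D NumPy array of shape (n, 1) can be flattened with ravel() first.
X_arr = X_2d.to_numpy().ravel()
print(CountVectorizer().fit_transform(X_arr).shape)  # (3, 5)
```

Note that shape is an attribute, so it should be read as X_dtm.shape without parentheses or a get_shape lookup.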
It worked when I changed this part:
X_train, X_test, y_train, y_test = train_test_split(data['text'],
data['tag'], test_size = 0.25, random_state = 42)
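For reference, the whole corrected flow might look like the sketch below. The toy DataFrame is hypothetical and only stands in for the real dataset; the column names 'text' and 'tag' are taken from the snippet above, and random_state on the classifier is added here only to make the run repeatable.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for the real dataset.
data = pd.DataFrame({
    "text": ["good movie", "bad movie", "great plot", "awful plot",
             "good acting", "bad acting", "great film", "awful film"],
    "tag": [1, 0, 1, 0, 1, 0, 1, 0],
})

# Selecting single columns gives 1-D Series for both features and labels.
X_train, X_test, y_train, y_test = train_test_split(
    data["text"], data["tag"], test_size=0.25, random_state=42)

vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)  # learn the vocabulary on training text only
X_test_dtm = vect.transform(X_test)        # reuse that vocabulary for the test text

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train_dtm, y_train)
y_pred = clf.predict(X_test_dtm)

# The number of rows in the matrix now matches the number of labels.
print(X_train_dtm.shape[0], len(y_train))  # 6 6
print(len(y_pred))                         # 2
```

A single fit_transform call on the training data is enough; a separate fit followed by transform produces the same matrix.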