TfidfVectorizer shrinks my dataframe from 799 to 3

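I have a dataframe with a text column and multi-label values:

RepID  RepText                                          Code
1      This is a test. Thank you for buying...          Fruit, Meat
2      Bought milk, also bananas, and I also bought...  Dairy, Fruit, Other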

Here is my code:

######## df has 1000 records

multilabel_binarizer = MultiLabelBinarizer()
multilabel_binarizer.fit(df['Code'])
y = multilabel_binarizer.transform(df['Code'])
X = df[df.columns.difference(["Code"])]

######## df split into X (RepID, RepText)
######## and y (Code)


xtrain, xval, ytrain, yval = train_test_split(X, y, test_size=0.2, random_state=9)

##### xtrain.shape = (800,3)
##### xval.shape = (200,3)
##### ytrain.shape = (800,1725)
##### yval.shape = (200,1725)


tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=10000)
xtrain_tfidf = tfidf_vectorizer.fit_transform(xtrain)
xval_tfidf = tfidf_vectorizer.transform(xval)

##### But after the code above
##### xtrain_tfidf.shape = (3,3)
##### xval_tfidf.shape = (3,3)
##### ytrain.shape = (800,1725)
##### yval.shape = (200,1725)

##### which means that when I run the next lines

xval_tfidf.shape

#mdl = LinearRegression()
mdl = LogisticRegression()
#mdl = SVC(gamma='auto', probability=True)
clf = OneVsRestClassifier(mdl)
clf.fit(xtrain_tfidf, ytrain)

I get this error:

ValueError: Found input variables with inconsistent numbers of samples: [3, 799]

Why? Why do I get only 3 records after the TfidfVectorizer lines instead of 800?

When I try to look at what is in xtrain_tfidf, I get this:

xtrain_tfidf
Out[56]: 
<3x3 sparse matrix of type '<class 'numpy.float64'>'
    with 3 stored elements in Compressed Sparse Row format>

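Found the reason.

I forgot to select only the text column when splitting the records. Because xtrain was a whole DataFrame, TfidfVectorizer iterated over its column names rather than its rows, so it saw 3 "documents" (one per column) instead of 800.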

xtrain, xval, ytrain, yval = train_test_split(X["RepText"], y, test_size=0.2, random_state=9)
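
For anyone hitting the same thing, here is a minimal sketch of why the row count collapses. It assumes a small made-up DataFrame: the column names RepID and RepText mirror the question, while the "Other" column and all row values are hypothetical stand-ins. Iterating a pandas DataFrame yields its column labels, so the vectorizer treats each label as one document.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data: column names mirror the question, rows are made up.
df = pd.DataFrame({
    "RepID": [1, 2, 3, 4],
    "RepText": [
        "bought milk and bananas",
        "thank you for buying apples",
        "the meat was fresh",
        "milk and meat delivered late",
    ],
    "Other": ["a", "b", "c", "d"],  # assumed extra column, not from the question
})

# Iterating a DataFrame yields column labels, not rows.
column_labels = list(df)
print(column_labels)

# Passing the whole DataFrame: one "document" per column label.
wrong = TfidfVectorizer().fit_transform(df)
print(wrong.shape)

# Passing only the text column: one document per row.
right = TfidfVectorizer().fit_transform(df["RepText"])
print(right.shape)
```

With three columns, the first matrix has three rows no matter how many records the DataFrame holds, which matches the (3,3) shape reported above. Selecting the text column before (or after) the split restores one row per record.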