Text Categorization Python with pre-trained data

Question

How do I associate my tf-idf matrix with a category? For example, I have the following dataset:

**ID**        **Text**                                     **Category**
   1     jake loves me more than john loves me               Romance
   2     july likes me more than robert loves me             Friendship
   3     He likes videogames more than baseball              Interest

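Once I compute the tf-idf of each sentence using the "Text" column as input, how do I train the system to associate each row of the matrix with its category above, so that I can reuse the trained model on my test data?

Using the training dataset above, when I pass in a new sentence such as 'julie is a lovely person', I would like that sentence to be classified into one or more of the predefined categories shown above.

I used this link as a starting point for this problem, but I could not understand how to map the tf-idf matrix of a sentence to a category.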

Answer 1

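It looks like you have already vectorized the text, i.e. converted the text into numbers so that you can use scikit-learn's classifiers. The next step is to train a classifier. You can follow this link. It looks like this: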

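Vectorization

from sklearn.feature_extraction.text import CountVectorizer

# your_text: an iterable of raw training strings (e.g. the "Text" column)
count_vect = CountVectorizer()
X_train = count_vect.fit_transform(your_text)  # sparse matrix, one row per document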

Training the classifier

from sklearn.naive_bayes import MultinomialNB

# y_train: one category label per row of X_train, in the same order
clf = MultinomialNB().fit(X_train, y_train)
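Since the question asks about tf-idf rather than raw counts, here is a minimal sketch of the same two steps using `TfidfVectorizer` on the three sentences from the question. The variable names `texts` and `labels` are illustrative assumptions; the point it tries to show is that a row is tied to its category purely by position: row i of the matrix corresponds to entry i of the label list.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Training data copied from the question; the names are illustrative.
texts = [
    "jake loves me more than john loves me",
    "july likes me more than robert loves me",
    "He likes videogames more than baseball",
]
labels = ["Romance", "Friendship", "Interest"]

# Turn each sentence into a tf-idf weighted row vector.
tfidf_vect = TfidfVectorizer()
X_train = tfidf_vect.fit_transform(texts)  # sparse matrix, one row per sentence

# Row i of X_train belongs to labels[i]; the classifier only needs
# the two to have the same length and the same order.
assert X_train.shape[0] == len(labels)

clf = MultinomialNB().fit(X_train, labels)
```

There is no separate "mapping" step: passing the matrix and the label list together to `fit` is what associates each row with its category.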

Predicting on new documents:

docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new = count_vect.transform(docs_new)
predicted = clf.predict(X_new)
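To cover the "one or more categories" part of the question, one possible approach (an assumption here, not something the linked tutorial prescribes) is to look at the per-class probabilities from `predict_proba` and keep every category above a cutoff. The sketch below wraps the vectorizer and classifier in a pipeline so the new sentence is transformed with the same vocabulary; the `threshold` value is hypothetical and would need tuning on real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Training data copied from the question.
texts = [
    "jake loves me more than john loves me",
    "july likes me more than robert loves me",
    "He likes videogames more than baseball",
]
labels = ["Romance", "Friendship", "Interest"]

# The pipeline applies tf-idf and then the classifier, both at fit and predict time.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

new_docs = ["julie is a lovely person"]

# Single best category per document.
predicted = model.predict(new_docs)

# Probabilities for every category, in the order given by model.classes_.
proba = model.predict_proba(new_docs)[0]

# Hypothetical cutoff: keep every category whose probability reaches it.
threshold = 0.3
candidates = [c for c, p in zip(model.classes_, proba) if p >= threshold]

print(predicted[0], dict(zip(model.classes_, proba)), candidates)
```

With only three training sentences, the new sentence shares no words with the training vocabulary, so the probabilities should come out essentially uniform and the prediction is not meaningful; a realistically sized labelled set is needed before the scores say anything.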

tf-idf

python-3.x

scikit-learn

text-classification