TF-IDF and text chunks

Question

I am a beginner in NLP and I am applying ML models using the TF-IDF method. Suppose I have a dataset like this:

dataset = ['I have three cars', 'and one motorbike']

Which is the correct way to apply TF-IDF (Option 1 or Option 2), and why?

Option 1

Tfidf_vect = TfidfVectorizer(max_features=100000, ngram_range = (1,2))
Tfidf_vect.fit(dataset)

Option 2

for d in dataset:
  Tfidf_vect2 = TfidfVectorizer(max_features=100000, ngram_range = (1,2))
  Tfidf_vect2.fit(d)

Also, Option 2 does not work and I don't understand why. Please help me.

Answer 1

TL;DR: the correct approach is Option 1.

The correct way to apply TfidfVectorizer is to fit it on a corpus of text, that is:

An iterable which yields either str, unicode or file objects.

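According to the docs, you have to pass your array of examples. In Option 2, each d is a single string rather than an iterable of documents, which is why fit(d) fails.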

And here is an example from the scikit-learn documentation:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = TfidfVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> print(vectorizer.get_feature_names_out().tolist())  # get_feature_names() was removed in scikit-learn 1.2
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
>>> print(X.shape)
(4, 9)
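
Applied to the dataset from the question, Option 1 might look like the sketch below. It assumes scikit-learn's default tokenizer, which drops single-character tokens such as "I"; the variable names are only illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

dataset = ['I have three cars', 'and one motorbike']

# One vectorizer, fitted once on the whole corpus (a list of strings).
tfidf_vect = TfidfVectorizer(max_features=100000, ngram_range=(1, 2))
X = tfidf_vect.fit_transform(dataset)

# One row per document, one column per unigram/bigram in the shared vocabulary.
print(X.shape)
print(sorted(tfidf_vect.vocabulary_))
```

Fitting on the whole corpus matters because the IDF part is computed from document frequencies across all documents, so every document has to share one vocabulary.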

Answer 2

Option 1 is correct. The fit method expects a list of examples. See: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.fit
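
To see why Option 2 fails, here is a small sketch, assuming a reasonably recent scikit-learn. Passing a bare string to fit raises a ValueError. Wrapping each string in a list does run, but it builds a separate vocabulary (and separate IDF weights) per document, which is usually not what you want.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

dataset = ['I have three cars', 'and one motorbike']

# Option 2 as written: each d is a single str, not an iterable of documents.
errors = []
for d in dataset:
    vect = TfidfVectorizer(max_features=100000, ngram_range=(1, 2))
    try:
        vect.fit(d)
    except ValueError as exc:
        errors.append(str(exc))
print(errors)

# Wrapping d in a list avoids the error, but each vectorizer only ever
# sees one document, so the vocabularies are unrelated to each other.
per_doc_vocab_sizes = [
    len(TfidfVectorizer(ngram_range=(1, 2)).fit([d]).vocabulary_)
    for d in dataset
]
print(per_doc_vocab_sizes)
```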

python

nlp

tf-idf