TF-IDF vectorizer to extract ngrams

Question

How can I use the TF-IDF vectorizer from the scikit-learn library to extract unigrams and bigrams from tweets? I want to train a classifier with the output.

This is the code from scikit-learn:

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

Answer 1

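TfidfVectorizer has an ngram_range parameter that determines the range of n-grams to include as features in the final matrix. In your case you want (1, 2), which covers everything from unigrams to bigrams: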

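import pandas as pd

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus).toarray()  # dense array; fine for a small corpus

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(X, columns=vectorizer.get_feature_names_out())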

        and  and this  document  document is     first  first document  \
0  0.000000  0.000000  0.314532     0.000000  0.388510        0.388510   
1  0.000000  0.000000  0.455513     0.356824  0.000000        0.000000   
2  0.357007  0.357007  0.000000     0.000000  0.000000        0.000000   
3  0.000000  0.000000  0.282940     0.000000  0.349487        0.349487   

         is    is the   is this       one  ...       the  the first  \
0  0.257151  0.314532  0.000000  0.000000  ...  0.257151   0.388510   
1  0.186206  0.227756  0.000000  0.000000  ...  0.186206   0.000000   
2  0.186301  0.227873  0.000000  0.357007  ...  0.186301   0.000000   
3  0.231322  0.000000  0.443279  0.000000  ...  0.231322   0.349487   
...

Answer 2

According to the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

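You specify the n-grams when initializing TfidfVectorizer: TfidfVectorizer(ngram_range=(min_n, max_n)). The tuple gives the lower and upper boundary of the range of n-values for the n-grams to be extracted. An ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.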
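To make the meaning of (min_n, max_n) concrete, here is a minimal pure-Python sketch of word n-gram extraction. The helper extract_ngrams is hypothetical and written only for illustration; it is not part of scikit-learn. It lowercases the text and keeps tokens of two or more word characters, which approximates the vectorizer's default tokenization, but it computes no TF-IDF weights.

```python
import re

def extract_ngrams(text, ngram_range=(1, 2)):
    """Return the word n-grams of `text` for every n in [min_n, max_n].

    Hypothetical helper for illustration; not a scikit-learn function.
    """
    min_n, max_n = ngram_range
    # Lowercase and keep tokens of 2+ word characters, approximating
    # the vectorizer's default tokenization.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the token list.
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams

doc = "This is the first document."
unigrams = extract_ngrams(doc, (1, 1))  # only unigrams
both = extract_ngrams(doc, (1, 2))      # unigrams followed by bigrams
bigrams = extract_ngrams(doc, (2, 2))   # only bigrams

print(unigrams)
print(both)
print(bigrams)
```

With (1, 2) the result is simply the unigram list followed by the bigram list, which is why the feature matrix gains extra columns such as "first document" next to "first" and "document".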

So the answer is vectorizer = TfidfVectorizer(ngram_range=(1,2))

python

n-gram

scikit-learn

tfidfvectorizer