How to obtain TF using only TfidfVectorizer?

I have this code:

    import sklearn.feature_extraction.text as skln

    CENTRAL_TERMS = 5  # example value: number of top scores to keep

    corpus = [
        'This is the first document.',
        'This document is the second document.',
        'And this is the third one.',
        'This document is the fourth document.',
        'And this is the fifth one.',
        'This document is the sixth.',
        'And this is the seventh one document.',
        'This document is the eighth.',
        'And this is the nineth one document.',
        'This document is the second.',
        'And this is the tenth one document.',
    ]

    vectorizer = skln.TfidfVectorizer()
    X = vectorizer.fit_transform(corpus)
    tfidf_matrix = X.toarray()
    features = vectorizer.get_feature_names_out()
    accumulated = [0] * len(features)

    for i in range(tfidf_matrix.shape[0]):
        for j in range(len(features)):
            accumulated[j] += tfidf_matrix[i][j]

    accumulated = sorted(accumulated)[-CENTRAL_TERMS:]
    print(accumulated)

With this I print the CENTRAL_TERMS words that obtain the highest tf-idf scores across all documents of the corpus.

However, I also want to obtain the MOST_REPEATED_TERMS words across all documents of the corpus, i.e. the words with the highest tf scores. I know I could get them simply by using CountVectorizer, but I want to use only TfidfVectorizer (so as not to run fit_transform(corpus) twice, first for TfidfVectorizer and then again for CountVectorizer). I also know that I could first use CountVectorizer (to get the tf scores) and then TfidfTransformer (to get the tf-idf scores). However, I believe there must be a way to do it using only TfidfVectorizer.

If there is a way to do this, please let me know (any information is welcome).

You can do this as follows:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'This document is the fourth document.',
    'And this is the fifth one.',
    'This document is the sixth.',
    'And this is the seventh one document.',
    'This document is the eighth.',
    'And this is the nineth one document.',
    'This document is the second.',
    'And this is the tenth one document.',
]

# define the vectorization model
vectorize = TfidfVectorizer(max_features=2500, min_df=0.1, max_df=0.8)

# pass the corpus into the defined vectorizer
vector_texts = vectorize.fit_transform(corpus).toarray()
vector_texts
  • You have to change the max_features, min_df and max_df values to best fit your model. In my case:

Out[1]:
array([[0.        , 0.        , 0.        ],
       [0.        , 0.        , 1.        ],
       [0.70710678, 0.70710678, 0.        ],
       [0.        , 0.        , 0.        ],
       [0.70710678, 0.70710678, 0.        ],
       [0.        , 0.        , 0.        ],
       [0.70710678, 0.70710678, 0.        ],
       [0.        , 0.        , 0.        ],
       [0.70710678, 0.70710678, 0.        ],
       [0.        , 0.        , 1.        ],
       [0.70710678, 0.70710678, 0.        ]])

By default, TfidfVectorizer applies l2 normalization after multiplying tf and idf. So as long as norm='l2', the term frequencies cannot be recovered. Refer here and here
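You can verify this yourself: with the default norm='l2', every document vector is rescaled to unit Euclidean length, so the absolute tf magnitudes are gone. A quick sanity check (assuming sklearn defaults):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    'This is the first document.',
    'This document is the second document.',
]

# norm='l2' is the default, so each row is scaled to unit length
X = TfidfVectorizer().fit_transform(docs).toarray()
print(np.linalg.norm(X, axis=1))  # every row norm is 1.0
```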

If you can work without the norm, there is a solution:

import numpy as np
import scipy.sparse as sp
import pandas as pd

vectorizer = TfidfVectorizer(norm=None)
X = vectorizer.fit_transform(corpus)
features = vectorizer.get_feature_names_out()
n = len(features)

# build a diagonal matrix of 1/idf; multiplying by it undoes the idf weighting
inverse_idf = sp.diags(1 / vectorizer.idf_,
                       offsets=0,
                       shape=(n, n),
                       format='csr',
                       dtype=np.float64).toarray()

pd.DataFrame(X * inverse_idf,
             columns=features)