Texthero TF-IDF Calculation

Question

What is the difference between TF-IDF computed with Texthero:

import pandas as pd
import texthero as hero

s = pd.Series(["Sentence one", "Sentence two"])
hero.tfidf(s, return_feature_names=True)
0    [0.5797386715376657, 0.8148024746671689, 0.0]
1    [0.5797386715376657, 0.0, 0.8148024746671689]
['Sentence', 'one', 'two'])

and TF-IDF from sklearn? Given these example sentences, I would expect the sklearn result.

from sklearn.feature_extraction.text import TfidfVectorizer
...
   Sentence       one       two
0       0.0  0.346574  0.000000
1       0.0  0.000000  0.346574

Answer 1

Short answer

tfidf does not preprocess the input text and only applies the TF-IDF algorithm, whereas TfidfVectorizer preprocesses the input by default.

Function signatures

The difference lies in how you are expected to work with the two frameworks.

Look at the function signatures:

scikit-learn TfidfVectorizer:

sklearn.feature_extraction.text.TfidfVectorizer(
    *, 
    input='content', 
    encoding='utf-8', 
    decode_error='strict', 
    strip_accents=None, 
    lowercase=True, 
    preprocessor=None, 
    tokenizer=None, 
    analyzer='word', 
    stop_words=None, 
    token_pattern=r'(?u)\b\w\w+\b',
    ngram_range=(1, 1), 
    max_df=1.0, 
    min_df=1, 
    max_features=None, 
    vocabulary=None, 
    binary=False, 
    dtype=<class 'numpy.float64'>, 
    norm='l2', 
    use_idf=True, 
    smooth_idf=True, 
    sublinear_tf=False
)

Texthero tfidf:

tfidf(
    s: pandas.core.series.Series, 
    max_features=None, 
    min_df=1, 
    return_feature_names=False
)

With scikit-learn, several text preprocessing steps are built into TfidfVectorizer. Texthero's tfidf, by contrast, performs no text preprocessing.
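To make the preprocessing concrete, here is a minimal standard-library sketch that imitates two of the defaults visible in the TfidfVectorizer signature above: lowercase=True and the default token_pattern. The helper name default_like_analyzer is hypothetical and not part of either library; it only approximates what the vectorizer does before counting terms.

```python
import re

# Default token pattern from the TfidfVectorizer signature:
# a token is a run of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def default_like_analyzer(text, lowercase=True):
    """Hypothetical helper: lowercase (optionally), then tokenize by regex."""
    if lowercase:
        text = text.lower()
    return TOKEN_PATTERN.findall(text)


# Lowercased, punctuation dropped, single-character token "a" skipped.
print(default_like_analyzer("Sentence one, a test!"))
# ['sentence', 'one', 'test']

# With lowercasing turned off, the original casing survives.
print(default_like_analyzer("Sentence one", lowercase=False))
# ['Sentence', 'one']
```

This is why feature names coming out of TfidfVectorizer with default settings are lowercase, while an approach that skips preprocessing keeps 'Sentence' as written.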

Your example

In your example, the tf-idf values differ between the two cases because, for instance, TfidfVectorizer converts all characters to lowercase by default.
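As a rough check on the numbers in the question, the sketch below computes two common TF-IDF variants by hand for the two example sentences. The first uses the smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which is what scikit-learn documents for smooth_idf=True and norm='l2'. The second is the textbook form, (count / document length) * ln(n / df). That the second variant is what produced the 0.346574 values is only an assumption, since the sklearn code in the question is elided. The function names are hypothetical.

```python
import math

# The two example sentences, already tokenized with casing kept.
docs = [["Sentence", "one"], ["Sentence", "two"]]
vocab = sorted({t for d in docs for t in d})  # ['Sentence', 'one', 'two']
n_docs = len(docs)
# Document frequency: in how many documents each term occurs.
df = {t: sum(t in d for d in docs) for t in vocab}


def smooth_l2(doc):
    """Raw count * (ln((1 + n) / (1 + df)) + 1), then L2-normalize the row."""
    w = [doc.count(t) * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in vocab]
    norm = math.sqrt(sum(x * x for x in w))
    return [x / norm for x in w]


def textbook(doc):
    """(count / document length) * ln(n / df), without normalization."""
    return [doc.count(t) / len(doc) * math.log(n_docs / df[t]) for t in vocab]


print([round(x, 6) for x in smooth_l2(docs[0])])
# [0.579739, 0.814802, 0.0]
print([round(x, 6) for x in textbook(docs[0])])
# [0.0, 0.346574, 0.0]
```

Under these assumptions, the first variant reproduces the values shown for hero.tfidf and the second reproduces the table the question labels as sklearn output, so the weighting variant matters for the magnitudes in addition to any preprocessing.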

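Which one is better?

Depending on your task, one of the two solutions may be more convenient.

If you are working with a Pandas DataFrame/Series on natural language preprocessing tasks and want fine-grained control over your code, then tfidf is probably more convenient.

On the other hand, if you are working on a more general ML task where you also need to handle some text and only want a quick representation of it, you can go with TfidfVectorizer and its default settings.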
