TF-IDF for data filtering

Question

I have a list of raw documents that have already been filtered, with English stop words removed:

rawDocument = ['sport british english sports american english includes forms competitive physical activity games casual organised ...', 'disaster serious disruption occurring relatively short time functioning community society involving ...', 'government system group people governing organized community often state case broad associative definition ...', 'technology science craft greek τέχνη techne art skill cunning hand λογία logia collection techniques ...']

and I used

from sklearn.feature_extraction.text import TfidfVectorizer
sklearn_tfidf = TfidfVectorizer(norm='l2', min_df=0, use_idf=True, smooth_idf=False, sublinear_tf=False)
sklearn_representation = sklearn_tfidf.fit_transform(rawDocument)

but I get a

<4x50 sparse matrix of type '<class 'numpy.float64'>'
    with 51 stored elements in Compressed Sparse Row format>

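I cannot interpret the result. Am I using the right tool, or do I need to change my approach?

My goal is to get the relevant words in each document so that I can compute cosine similarity against the words in a query document.

Thanks in advance.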

Answer 1

The Pandas module can often be used to visualize your data better:

Demo:

import pandas as pd

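# pd.SparseDataFrame was removed in pandas 1.0 and get_feature_names() in
# scikit-learn 1.2; the calls below are the current equivalents.
# Rows are documents, columns are vocabulary terms, values are TF-IDF weights.
df = pd.DataFrame.sparse.from_spmatrix(sklearn_tfidf.fit_transform(rawDocument),
                                       columns=sklearn_tfidf.get_feature_names_out())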

Result:

In [85]: df
Out[85]:
   activity  american       art  associative  british    ...       system    techne  techniques  technology      time
0      0.25      0.25  0.000000     0.000000     0.25    ...     0.000000  0.000000    0.000000    0.000000  0.000000
1      0.00      0.00  0.000000     0.000000     0.00    ...     0.000000  0.000000    0.000000    0.000000  0.308556
2      0.00      0.00  0.000000     0.282804     0.00    ...     0.282804  0.000000    0.000000    0.000000  0.000000
3      0.00      0.00  0.288675     0.000000     0.00    ...     0.000000  0.288675    0.288675    0.288675  0.000000

[4 rows x 48 columns]
