
Using ScikitLearn TfidfVectorizer in a search engine

I am thinking of creating a search engine where I can retrieve sentences (each representing a document) from preprocessed PDF files using keywords.

I would like to know whether there is a built-in function in scikit-learn that can present the data like a bag-of-words output, meaning I would get all the words as columns (in pandas), all the documents as rows, and the tf-idf values as the cell values.

What you are looking for in scikit-learn are TfidfVectorizer and TfidfTransformer; both give the result you need, and the difference between them is the following:

TfidfVectorizer takes raw documents as input.

TfidfTransformer takes a matrix of per-document word counts as input (e.g. the output of CountVectorizer); see the sketch below.
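
To make the distinction concrete, here is a minimal sketch (the corpus and variable names are made up for illustration) showing that, with default parameters, feeding raw documents to TfidfVectorizer and feeding word counts to TfidfTransformer produce the same matrix:

from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer
)

docs = ['the cat sat', 'the dog sat', 'the cat ran']  # toy corpus

# Route 1: raw documents straight into TfidfVectorizer
X1 = TfidfVectorizer().fit_transform(docs)

# Route 2: raw documents -> word counts -> tf-idf weights
counts = CountVectorizer().fit_transform(docs)
X2 = TfidfTransformer().fit_transform(counts)

# With default settings the two routes agree entry for entry
assert (X1 != X2).nnz == 0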

As for putting everything in a pandas dataframe: you can certainly do that for toy problems and for educational purposes only, but it is completely impractical and highly inadvisable for anything beyond that.

The reason is that such term-document matrices are sparse (i.e. most of their entries are actually 0), and this sparsity is exploited to store them efficiently in appropriate data structures. Converting them to non-sparse structures (i.e. pandas dataframes) will most likely exhaust the memory of your machine; quoting the relevant scikit-learn docs:

As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have many feature values that are zeros (typically more than 99% of them).

For instance a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.

In order to be able to store such a matrix in memory but also to speed up algebraic operations matrix / vector, implementations will typically use a sparse representation such as the implementations available in the scipy.sparse package.
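
For a feel of the numbers, here is a minimal sketch using the toy corpus from the docs example further below. On four short documents the two layouts are comparable in size, but at the scale quoted above (10,000 documents by ~100,000 terms) a dense float64 matrix would need roughly 8 GB, while the sparse one stays proportional to the non-zero entries only:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

X = TfidfVectorizer().fit_transform(corpus)  # scipy.sparse CSR matrix

# fraction of entries a dense dataframe would store as explicit zeros
sparsity = 1 - X.nnz / (X.shape[0] * X.shape[1])
print(f'{X.shape[0]} docs x {X.shape[1]} terms, {sparsity:.0%} zeros')

# CSR keeps only the non-zero values plus their indices
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
print(sparse_bytes, 'bytes sparse vs', X.toarray().nbytes, 'bytes dense')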

That said, you can do it for educational purposes; here is how to adapt the example from the TfidfVectorizer docs:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # scipy.sparse CSR matrix

# keep the sparse representation inside the dataframe;
# get_feature_names() was removed in scikit-learn 1.2, use get_feature_names_out()
df = pd.DataFrame.sparse.from_spmatrix(X, columns=vectorizer.get_feature_names_out())
df
# result:


    and         document    first       is          one         second      the         third       this
0   0.000000    0.469791    0.580286    0.384085    0.000000    0.000000    0.384085    0.000000    0.384085
1   0.000000    0.687624    0.000000    0.281089    0.000000    0.538648    0.281089    0.000000    0.281089
2   0.511849    0.000000    0.000000    0.267104    0.511849    0.000000    0.267104    0.511849    0.267104
3   0.000000    0.469791    0.580286    0.384085    0.000000    0.000000    0.384085    0.000000    0.384085
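
Finally, since the stated goal is keyword retrieval, here is a sketch of how the fitted vectorizer could rank documents against a query. Cosine-similarity ranking is an assumption on my part, not something specified in the question; the snippet reuses vectorizer, X and corpus from the example above:

from sklearn.metrics.pairwise import linear_kernel

# tf-idf rows are L2-normalised by default, so the plain dot product
# against a transformed query equals cosine similarity
query_vec = vectorizer.transform(['first document'])
scores = linear_kernel(query_vec, X).ravel()

# documents ranked from most to least similar to the query
for i in scores.argsort()[::-1]:
    print(f'{scores[i]:.3f}  {corpus[i]}')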