Using ScikitLearn TfidfVectorizer in a search engine
I am thinking of building a search engine where I can use keywords to retrieve sentences (representing documents) from preprocessed PDF files.
I would like to know whether scikit-learn has a built-in function that can display the data similarly to a bag-of-words output, i.e. with all words as columns (in pandas), all documents as rows, and the tf-idf values as the cell values.
You are looking for TfidfVectorizer and TfidfTransformer in scikit-learn.
Their results are what you need; the difference between them is the following:
TfidfVectorizer takes raw documents as input.
TfidfTransformer takes as input a matrix containing the word counts of each document.
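To make that difference concrete, here is a minimal sketch (not part of the original answer; the toy documents are illustrative) showing that the two routes produce the same tf-idf matrix:

```python
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer,
)
import numpy as np

docs = ["the cat sat", "the dog sat on the mat"]

# Route 1: raw documents straight into TfidfVectorizer.
X1 = TfidfVectorizer().fit_transform(docs)

# Route 2: raw documents -> word-count matrix -> tf-idf weights.
counts = CountVectorizer().fit_transform(docs)
X2 = TfidfTransformer().fit_transform(counts)

# Both routes yield the same matrix (same sorted vocabulary, same weighting).
assert np.allclose(X1.toarray(), X2.toarray())
```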
You can certainly do this for toy problems and for educational purposes only, but it is completely impractical and highly inadvisable otherwise.
The reason is that such term-document matrices are sparse (i.e. most of their entries are actually 0), and this sparsity is exploited for their efficient storage in an appropriate data structure. Converting them to a non-sparse structure (i.e. a pandas DataFrame) will most likely exhaust your machine's memory; quoting the relevant scikit-learn docs:
As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have many feature values that are zeros (typically more than 99% of them).
For instance a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.
In order to be able to store such a matrix in memory but also to speed up algebraic operations matrix / vector, implementations will typically use a sparse representation such as the implementations available in the scipy.sparse package.
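A scaled-down sketch of the memory argument above (the matrix sizes and 1% density here are assumptions for illustration, much smaller than the 10,000 × 100,000 case quoted from the docs):

```python
import numpy as np
from scipy import sparse

# A toy 1,000-document x 5,000-term matrix with ~1% non-zero entries.
rng = np.random.default_rng(0)
rows = rng.integers(0, 1_000, size=50_000)
cols = rng.integers(0, 5_000, size=50_000)
vals = rng.random(50_000)
X = sparse.csr_matrix((vals, (rows, cols)), shape=(1_000, 5_000))

# CSR storage = non-zero values + column indices + row pointers.
sparse_mb = (X.data.nbytes + X.indices.nbytes + X.indptr.nbytes) / 1e6
# A dense float64 array must store every entry, zeros included.
dense_mb = X.shape[0] * X.shape[1] * 8 / 1e6
print(f"sparse: ~{sparse_mb:.1f} MB, dense: ~{dense_mb:.1f} MB")
```

The gap grows with corpus size: at the 10,000 × 100,000 scale from the quote, the dense version alone needs roughly 8 GB.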
That said, you can do it for educational purposes; here is how to adapt the example from the TfidfVectorizer docs:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # X is a sparse scipy matrix

# get_feature_names() was removed in scikit-learn 1.2;
# use get_feature_names_out() instead
df = pd.DataFrame.sparse.from_spmatrix(X, columns=vectorizer.get_feature_names_out())
df
# result:
and document first is one second the third this
0 0.000000 0.469791 0.580286 0.384085 0.000000 0.000000 0.384085 0.000000 0.384085
1 0.000000 0.687624 0.000000 0.281089 0.000000 0.538648 0.281089 0.000000 0.281089
2 0.511849 0.000000 0.000000 0.267104 0.511849 0.000000 0.267104 0.511849 0.267104
3 0.000000 0.469791 0.580286 0.384085 0.000000 0.000000 0.384085 0.000000 0.384085
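For the search-engine use case in the question, the DataFrame is not actually needed: a keyword query can be transformed with the same fitted vectorizer and documents ranked by cosine similarity against the sparse matrix directly. A sketch (the query string and ranking code are illustrative, not from the original answer):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Project the query into the same tf-idf space (transform, not fit_transform,
# so the fitted vocabulary is reused) and score every document against it.
query_vec = vectorizer.transform(["second document"])
scores = cosine_similarity(query_vec, X).ravel()

ranking = scores.argsort()[::-1]  # best-matching document indices first
print(ranking)
```

Here document 1 ("This document is the second document.") ranks first, since it is the only one containing "second".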