How to obtain TF using only TfidfVectorizer?
I have the following code:
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'This document is the fourth document.',
'And this is the fifth one.',
'This document is the sixth.',
'And this is the seventh one document.',
'This document is the eighth.',
'And this is the nineth one document.',
'This document is the second.',
'And this is the tenth one document.',
]
import sklearn.feature_extraction.text as skln  # assumed import behind the skln alias

vectorizer = skln.TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
tfidf_matrix = X.toarray()
accumulated = [0] * len(vectorizer.get_feature_names())
for i in range(tfidf_matrix.shape[0]):
    for j in range(len(vectorizer.get_feature_names())):
        accumulated[j] += tfidf_matrix[i][j]
# CENTRAL_TERMS is a constant defined elsewhere in my script
accumulated = sorted(accumulated)[-CENTRAL_TERMS:]
print(accumulated)
where I print the CENTRAL_TERMS words that get the highest tf-idf scores across all documents of the corpus.
However, I also want to get the MOST_REPEATED_TERMS words of all documents of the corpus, i.e. the words with the highest tf scores. I know I could get them simply by using CountVectorizer, but I want to use only TfidfVectorizer (to avoid running vectorizer.fit_transform(corpus) first for the TfidfVectorizer and then again for the CountVectorizer). I also know I could first use CountVectorizer (to get the tf scores) and then TfidfTransformer (to get the tf-idf scores). Still, I believe there must be a way to do it using only TfidfVectorizer.
If there is a way to do this, please let me know (any information is welcome).
You can do it like this:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'This document is the fourth document.',
'And this is the fifth one.',
'This document is the sixth.',
'And this is the seventh one document.',
'This document is the eighth.',
'And this is the nineth one document.',
'This document is the second.',
'And this is the tenth one document.',
]
# Define the vectorizer; tune these thresholds for your corpus
vectorize = TfidfVectorizer(max_features=2500, min_df=0.1, max_df=0.8)
# Fit on the corpus and convert the sparse result to a dense array
vector_texts = vectorize.fit_transform(corpus).toarray()
vector_texts
- You will have to tune the max_features, min_df and max_df values to best fit your model. In my case:
Out[1]:
array([[0. , 0. , 0. ],
[0. , 0. , 1. ],
[0.70710678, 0.70710678, 0. ],
[0. , 0. , 0. ],
[0.70710678, 0.70710678, 0. ],
[0. , 0. , 0. ],
[0.70710678, 0.70710678, 0. ],
[0. , 0. , 0. ],
[0.70710678, 0.70710678, 0. ],
[0. , 0. , 1. ],
[0.70710678, 0.70710678, 0. ]])
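To see which terms the thresholds actually kept, and therefore which words the columns of the array above refer to, you can print the fitted vocabulary. A minimal check; get_feature_names_out is the current method name in scikit-learn, older versions use get_feature_names:
# Print the vocabulary that survived the min_df/max_df/max_features filtering,
# i.e. the terms the columns of vector_texts correspond to.
print(vectorize.get_feature_names_out())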
By default, TfidfVectorizer applies l2 normalization after multiplying tf and idf, so with norm='l2' the term frequencies cannot be recovered from the output. See here and here.
If you can work without the norm, there is a workaround: with norm=None each stored value is simply tf * idf, so dividing every column by its idf recovers the raw term counts.
import numpy as np
import scipy.sparse as sp
import pandas as pd

vectorizer = TfidfVectorizer(norm=None)
X = vectorizer.fit_transform(corpus)
features = vectorizer.get_feature_names()  # get_feature_names_out() in scikit-learn >= 1.0
n = len(features)
# Diagonal matrix of 1/idf; multiplying by it undoes the idf weighting
inverse_idf = sp.diags(1 / vectorizer.idf_,
                       offsets=0,
                       shape=(n, n),
                       format='csr',
                       dtype=np.float64).toarray()
pd.DataFrame(X * inverse_idf, columns=features)
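From there you can pull out the MOST_REPEATED_TERMS words the question asks for by summing the recovered counts over all documents and sorting. A minimal sketch building on the variables above; the value 3 for MOST_REPEATED_TERMS is just illustrative:
import numpy as np

# Recover the raw term-frequency matrix by undoing the idf weighting
tf = np.asarray(X * inverse_idf)
# Total count of each term across the whole corpus
totals = tf.sum(axis=0)
MOST_REPEATED_TERMS = 3  # placeholder value, use your own
top_idx = np.argsort(totals)[-MOST_REPEATED_TERMS:][::-1]
print([(features[i], totals[i]) for i in top_idx])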