How to get the tfidf vector for a given document
I have the following file:
id review
1 "Human machine interface for lab abc computer applications."
2 "A survey of user opinion of computer system response time."
3 "The EPS user interface management system."
4 "System and human system engineering testing of EPS."
5 "Relation of user perceived response time to error measurement."
6 "The generation of random binary unordered trees."
7 "The intersection graph of paths in trees."
8 "Graph minors IV Widths of trees and well quasi ordering."
9 "Graph minors A survey."
10 "survey is a state of art."
Each line corresponds to one document.
I turn these documents into a corpus and compute the TFIDF of every word:
import csv
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = defaultdict(list)
with open("C:/Users/user/workspacePython/Tutorial/data/unlabeledTrainData.tsv", "r") as sentences_file:
    reader = csv.reader(sentences_file, delimiter='\t')
    next(reader)  # skip the header row (reader.next() is Python 2 only)
    for row in reader:
        reviews[row[0]].append(row[1])  # key by id (column 0), not by the review text

# dicts keep insertion order (Python 3.7+), so corpus[i] is the i-th document in the file
corpus = [" ".join(texts) for texts in reviews.values()]

tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 1), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(corpus)
My question is: how do I get the vector (row) in tfidf_matrix that corresponds to a given document from the file above?
Thanks
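You have a list of documents numbered 1 to 10. In array-index terms that is 0 to 9.
The variable tfidf_matrix holds a matrix in sparse row form. Each row represents a document, and each entry is that document's normalized weight for one term of the corpus-wide vocabulary (minus English stop words).
So to convert the sparse matrix into a more conventional matrix, you could try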
npm_tfidf = tfidf_matrix.todense()
document_1_vector = npm_tfidf[0]
document_2_vector = npm_tfidf[1]
document_3_vector = npm_tfidf[2]
...
document_10_vector = npm_tfidf[9]
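If the file ids are not simply 1 to 10 in order, it may help to keep an explicit id-to-row mapping and slice the sparse matrix directly instead of densifying all of it. The following is only a sketch under assumptions: the three documents are a subset of the question's data, and the names docs, row_of and document_vector are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative subset of the question's documents, keyed by id.
docs = {
    1: "Human machine interface for lab abc computer applications.",
    2: "A survey of user opinion of computer system response time.",
    9: "Graph minors A survey.",
}

ids = sorted(docs)                     # fixes the row order: [1, 2, 9]
corpus = [docs[i] for i in ids]
row_of = {doc_id: row for row, doc_id in enumerate(ids)}  # id -> row index

tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 1), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(corpus)

def document_vector(doc_id):
    """Return the TF-IDF row for doc_id as a 1-D NumPy array (hypothetical helper)."""
    sparse_row = tfidf_matrix[row_of[doc_id]]  # a single sparse row
    return sparse_row.toarray().ravel()        # densify only this one row

vec = document_vector(9)
print(vec.shape)
```

Only the requested row is converted to a dense array here, which matters once the corpus is larger than ten documents.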
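There are simpler and better ways to pull out the contents, but I suspect what is tripping you up is the conversion from the sparse matrix representation, which can be hard to pick apart, to the more conventional dense matrix representation.
Also note that interpreting the vectors requires the vocabulary that was extracted during this process. It comes in ordered form (an alphabetical list of tokens, one per matrix column) and can be extracted with: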
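# Call this on the fitted vectorizer (tf), not on the matrix.
# Older scikit-learn releases (before 1.0) only offer tf.get_feature_names().
vocabulary = tf.get_feature_names_out()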
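To read a row, one option is to pair each stored weight with its term. This is a rough sketch, not the only way to do it: the three-document corpus and the helper name top_terms are assumptions for illustration, and it uses the fitted vectorizer's vocabulary_ attribute (a term-to-column dict) so it does not depend on which feature-name method your scikit-learn version provides.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative subset of the question's documents.
corpus = [
    "Human machine interface for lab abc computer applications.",
    "The intersection graph of paths in trees.",
    "Graph minors A survey.",
]

tf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tf.fit_transform(corpus)

# vocabulary_ maps term -> column index; invert it to look terms up by column.
term_of = {col: term for term, col in tf.vocabulary_.items()}

def top_terms(row_index, n=5):
    """Return up to n (term, weight) pairs for one document, heaviest first."""
    row = tfidf_matrix[row_index]  # sparse row: only the stored entries are visited
    pairs = [(term_of[int(c)], float(w)) for c, w in zip(row.indices, row.data)]
    return sorted(pairs, key=lambda p: p[1], reverse=True)[:n]

terms = top_terms(2)
for term, weight in terms:
    print(term, round(weight, 3))
```

Terms that appear in fewer documents of the corpus receive a higher weight than terms shared across documents, so the ordering gives a quick sanity check on the row you extracted.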