Python: how to turn list of word counts into format suitable for CountVectorizer
I have about 100,000 lists of strings of the following form:
['the: 652', 'of: 216', 'in: 168', 'to: 159', 'is: 145']
and so on. Together these essentially make up my corpus: each list contains the words of one document along with their counts.
How can I turn this corpus into a form I can feed into CountVectorizer?
Is there a faster way than expanding each list into a string that contains 'the' 652 times, 'of' 216 times, and so on?
Assuming what you want is a vectorized corpus in sparse-matrix format, plus a fitted vectorizer, you can simulate the vectorization process without duplicating the data:
from scipy.sparse import lil_matrix
from sklearn.feature_extraction.text import CountVectorizer

corpus = [['the: 652', 'of: 216', 'in: 168', 'to: 159', 'is: 145'],
          ['king: 20', 'of: 16', 'the: 400', 'jungle: 110']]

# Prepare a vocabulary for the vectorizer
vocabulary = {item.split(':')[0] for document in corpus for item in document}
indexed_vocabulary = {term: index for index, term in enumerate(vocabulary)}
vectorizer = CountVectorizer(vocabulary=indexed_vocabulary)

# Vectorize the corpus by writing the counts directly at the
# coordinates known to the vectorizer
X = lil_matrix((len(corpus), len(vocabulary)))
X.data = [[int(item.split(':')[1]) for item in document] for document in corpus]
X.rows = [[vectorizer.vocabulary[item.split(':')[0]] for item in document]
          for document in corpus]

# Convert the matrix to CSR format to match the output of vectorizer.transform
X = X.tocsr()
For this example, the output will be (the column order depends on the iteration order of the vocabulary set):
[[ 168. 216. 0. 159. 652. 145. 0.]
[ 0. 16. 110. 0. 400. 0. 20.]]
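Since the columns come from enumerating a Python set, their order is not stable across runs. To interpret the columns of a given run, invert the vocabulary mapping. As a sketch, the concrete ordering hard-coded below is the one implied by the sample output above, used purely for illustration:

```python
# Assumed column order, reconstructed from the sample output above
indexed_vocabulary = {'in': 0, 'of': 1, 'jungle': 2, 'to': 3,
                      'the': 4, 'is': 5, 'king': 6}

# Invert term -> index into index -> term to label the columns
index_to_term = {index: term for term, index in indexed_vocabulary.items()}
columns = [index_to_term[i] for i in range(len(index_to_term))]
print(columns)  # ['in', 'of', 'jungle', 'to', 'the', 'is', 'king']
```

With this labelling, the first row reads 'the': 652, 'of': 216, 'in': 168, 'to': 159, 'is': 145, matching the first document.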
This then allows further vectorization of new documents:
vectorizer.transform(['jungle kid is programming', 'the jungle machine learning jungle'])
which yields:
[[0 0 1 0 0 1 0]
[0 0 2 0 1 0 0]]