Python: how to create a document matrix whose (i, j) entry is a term index
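I've run into the following text data matrix manipulation problem.
I have the raw text documents stored in a list. Below is an example element from that list.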
text_data[1]
u"\n The Bechtel Group Inc. offered in 1985 to sell oil to Israel at a
discount of at least 650 million for 10 years if it promised not to
bomb a proposed Iraqi pipeline, a Foreign Ministry official said
Wednesday. But then-Prime Minister Shimon Peres said the offer from
Bruce Rappaport, a partner in the San Francisco-based construction and
engineering company, was ``unimportant,'' the senior official told The
Associated Press. Peres, now foreign minister, never discussed the
offer with other government ministers, said the official, who spoke on
condition of anonymity.
I would like to get a matrix in which x_{ij} is the term index of the word at position j in document i. For example:
import numpy as np

# Words
W = np.array([0, 1, 2, 3, 4])  # word indices for a dictionary of words

# D := document words
X = np.array([
    [0, 0, 1, 2, 2],  # e.g., in this row the 1st and 2nd positions hold the first term in the dictionary, etc.
    [0, 0, 1, 1, 1],
    [0, 1, 2, 2, 2],
    [4, 4, 4, 4, 4],
    [3, 3, 4, 4, 4],
    [3, 4, 4, 4, 4]
])
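The only approach I can think of is to first build a dictionary that maps each term in the corpus to an index, then loop over every document and every position within it, storing the term index of the word found at document i, position j. But that seems verbose and inefficient.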
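The approach described above can be sketched roughly as follows. This is only a minimal illustration under some assumptions: the documents are already tokenized into lists of words, term indices are assigned in order of first appearance, and shorter documents are right-padded so that every row has the same length. The names `build_vocab_index` and `documents_to_index_matrix`, the `pad_value` of -1, and the sample tokens are all hypothetical and do not come from the question.

```python
def build_vocab_index(documents):
    """Map each distinct word to an integer index, in order of first appearance."""
    vocab_index = {}
    for document in documents:
        for word in document:
            if word not in vocab_index:
                vocab_index[word] = len(vocab_index)
    return vocab_index


def documents_to_index_matrix(documents, vocab_index, pad_value=-1):
    """Return rows where entry (i, j) is the term index of word j in document i.

    Shorter documents are right-padded with pad_value (an assumed convention)
    so that all rows have equal length.
    """
    max_len = max((len(document) for document in documents), default=0)
    matrix = []
    for document in documents:
        row = [vocab_index[word] for word in document]
        row.extend([pad_value] * (max_len - len(row)))
        matrix.append(row)
    return matrix


# Made-up, pre-tokenized sample documents.
documents = [
    ['oil', 'oil', 'offer', 'official', 'official'],
    ['oil', 'offer', 'official'],
]
vocab_index = build_vocab_index(documents)
X = documents_to_index_matrix(documents, vocab_index)
print(vocab_index)  # {'oil': 0, 'offer': 1, 'official': 2}
print(X)            # [[0, 0, 1, 2, 2], [0, 1, 2, -1, -1]]
```

Each word costs one dictionary lookup, so the whole pass is linear in the total number of tokens. If a NumPy array is needed, passing the padded rows to `np.array` should give the rectangular matrix.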
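I ran into a similar challenge a few months ago. I'm fairly sure there is a way to do this with Python NLTK. Googling "corpus to term count vectors" should give you a good start.
However, as you suggested in your question, I ended up just implementing my own approach.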
def document_to_term_counts(document, vocab):
    # One counter per vocabulary term, in vocab order.
    term_count = [0] * len(vocab)
    for word in document:
        if word in vocab:
            term_count[vocab.index(word)] += 1
    return term_count


def count_words_in_documents(documents):
    # For each word, track its total occurrences and the number of documents containing it.
    word_counts = {}
    for document in documents:
        words_found_in_document = set()
        for word in document:
            if word not in word_counts:
                word_counts[word] = {'all_appearances': 1, 'document_appearances': 1}
            else:
                word_counts[word]['all_appearances'] += 1
                # Count each document at most once per word.
                if word not in words_found_in_document:
                    word_counts[word]['document_appearances'] += 1
            words_found_in_document.add(word)
    return word_counts


def word_counts_to_vocab(word_counts, min_document_appearances, max_document_appearances):
    # Keep only words whose document frequency lies within the given bounds.
    vocab = []
    for word in word_counts:
        document_appearances = word_counts[word]['document_appearances']
        if min_document_appearances <= document_appearances <= max_document_appearances:
            vocab.append(word)
    return vocab


def documents_to_vocab(documents, min_document_appearances, max_document_appearances):
    word_counts = count_words_in_documents(documents)
    vocab = word_counts_to_vocab(word_counts, min_document_appearances, max_document_appearances)
    return vocab


documents = [
    ['the', 'quick', 'brown', 'fox', 'jumped'],
    ['foxes', 'are', 'quick']
]
vocab = documents_to_vocab(documents, 1, 100)
print('vocabulary:')
print(vocab)
for document in documents:
    term_counts = document_to_term_counts(document, vocab)
    print('-' * 50)
    print(document)
    print(term_counts)
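One note on the code above: `word in vocab` and `vocab.index(word)` both scan the list, so the cost per word grows with the vocabulary size. A possible variation, sketched here under the assumption that the vocabulary contains no duplicate words, is to build a word-to-position dictionary once and reuse it for every document. The names `build_position_lookup` and `document_to_term_counts_fast` are hypothetical and were not part of my original solution.

```python
def build_position_lookup(vocab):
    """Map each vocabulary word to its position in the vocab list."""
    return {word: i for i, word in enumerate(vocab)}


def document_to_term_counts_fast(document, position):
    """Count vocabulary terms in a document using dictionary lookups."""
    term_count = [0] * len(position)
    for word in document:
        i = position.get(word)
        if i is not None:  # words outside the vocabulary are skipped
            term_count[i] += 1
    return term_count


vocab = ['the', 'quick', 'brown', 'fox']
position = build_position_lookup(vocab)
counts = document_to_term_counts_fast(['the', 'quick', 'quick', 'cat'], position)
print(counts)  # [1, 2, 0, 0]
```

The lookup is built once per vocabulary, so each document is then processed in time proportional to its own length.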