Combine bag of words for multiple documents
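I have multiple documents; for this example, say there are 3.
Each contains several different words separated by spaces. I want to count the words in each document and put the counts into a matrix or data frame, with one row per document, one column per word, and the number of occurrences as the cell values. See the example below: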
Doc1 = "a b c d"
Doc2 = "a c e f"
Doc3 = "a e f f"
import pandas as pd

# Desired result: one row per document, one column per word
data = {'a': [1, 1, 1],
        'b': [1, 0, 0],
        'c': [1, 1, 0],
        'd': [1, 0, 0],
        'e': [0, 1, 1],
        'f': [0, 1, 2],
        }
df = pd.DataFrame(data)
doc1 = "a b c d"
doc2 = "a c e f"
doc3 = "a e f f"
docs = [doc1, doc2, doc3]

# Map each word to a list of per-document counts
data = {}
for i, doc in enumerate(docs):
    for word in doc.split():
        # Start from an all-zero list the first time a word is seen
        val = data.get(word, [0] * len(docs))
        val[i] += 1
        data[word] = val
print(data)
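
# Alternative: update in place and create the list on the first occurrence
data = {}
for i, doc in enumerate(docs):
    for word in doc.split():
        try:
            data[word][i] += 1
        except KeyError:
            # First occurrence: zeros everywhere except the current document
            data[word] = [0 if i != j else 1 for j in range(len(docs))]
print(data)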
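
For comparison, here is a minimal sketch of the same counting done with `collections.Counter` from the standard library. It is not part of the original answer. The names `counts` and `vocab` are made up for illustration, and sorting the vocabulary is an assumption made only to get a predictable column order.

```python
from collections import Counter

docs = ["a b c d", "a c e f", "a e f f"]

# One Counter per document: word -> occurrences in that document
counts = [Counter(doc.split()) for doc in docs]

# Sorted vocabulary across all documents (assumed ordering, for stable columns)
vocab = sorted(set().union(*counts))

# A Counter returns 0 for a missing word, so no KeyError handling is needed
data = {word: [c[word] for c in counts] for word in vocab}
print(data)
```

The resulting `data` has the same word-to-list-of-counts shape as the dictionary in the question, so it can be passed to `pd.DataFrame(data)` in the same way.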