Searching for every combination of inner dict keys in every combination of other inner dict keys, and in every combination of outer dict keys
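I'm not sure the title describes my problem well; if something is off, I'll edit it later. I have checked many related questions, but because the code is so nested, I'm not very experienced in programming, and I need to use combinations, I couldn't handle it.
I have a nested dict, something like this: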
example_dictionary = {'I want to eat peach and egg.':{'apple':3, 'orange':2, 'banana':5},\
'Peach juice is so delicious.':{'apple':3, 'orange':5, 'banana':2}, \
'Goddamn monkey ate my banana.':{'rice':4, 'apple':6, 'monkey':2}, \
'They say apple is good for health.':{'grape':10, 'monkey':5, 'peach':5, 'egg':8}}
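What I want to do is build an adjacency matrix according to some rules.
The rules are:
1) If a word from any inner dict appears in any of the sentences (the outer dict keys), add the word's value as a weight between the related sentences.
2) If any two sentences share an inner dict key (word) with different values, multiply the word's values and add the product as a weight between the related sentences.
Extra note: the inner dicts can have different lengths, and the same inner dict key (word) may have different values. I want them multiplied only in that case; if they have the same value, I don't want to take them into account.
Example: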
Sentence1(0): I want to eat peach and egg. {'apple':3, 'orange':2, 'banana':5}
Sentence2(1): Peach juice is so delicious. {'apple':3, 'orange':5, 'banana':2}
Sentence3(2): Goddamn monkey ate my banana.{'rice':4, 'apple':6, 'monkey':2}
Sentence4(3): They say apple is good for health. {'grape':10, 'monkey':5, 'peach':5, 'egg':8}
Between 0 and 1: 5*2+5*2=20 (their 'apple' entries have the same value, so only the values for 'orange' and 'banana' are multiplied; none of the words exists in either sentence.)
Between 2 and 3: 2*5=10 ('monkey' is the same key with a different value) +
6 (the key 'apple' of sentence3 exists in sentence4) +
5 (the key 'monkey' of sentence4 exists in sentence3) = 21
Between 0 and 3: 3+5+8=16 (the key 'apple' of sentence1 exists in sentence4, and the keys 'egg' and 'peach' of sentence4 exist in sentence1.)
I hope these examples make it clear.
What I tried (it got quite confusing for me because of the nested structure and the combinations):
from itertools import combinations, zip_longest
import networkx as nx

def compare_inner_dicts(d1,d2):
    #this is for comparing the inner dict keys and multiplying them
    #if they have the same key but different value
    values = []
    inner_values = 0
    for common_key in d1.keys() & d2.keys():
        if d1[common_key]!= d2[common_key]:
            _value = d1[common_key]*d2[common_key]
            values.append(_value)
            inner_values = sum([p for p in values])
    inner_dict_values = inner_values
    del inner_values
    return inner_dict_values

def build_adj_mat(a_dict):
    gr = nx.Graph()
    for sentence, words in a_dict.items():
        sentences = list(a_dict.keys())
        gr.add_nodes_from(sentences)
        sentence_pairs = combinations(gr.nodes, 2)
        dict_pairs = combinations(a_dict.values(), 2)
        for pair, _pair in zip_longest(sentence_pairs, dict_pairs):
            numbers = []
            x_numbers = []
            #y_numbers = []
            sentence1 = pair[0]
            sentence2 = pair[1]
            dict1 = _pair[0]
            dict2 = _pair[1]
            inner_dict_numbers = compare_inner_dicts(dict1, dict2)
            numbers.append(inner_dict_numbers)
            for word, num in words.items():
                if sentence2.find(word)>-1:
                    x = words[word]
                    x_numbers.append(x)
                    numbers.extend(x_numbers)
                    # if sentence1.find(word)>-1: #reverse case
                    #     y = words[word]
                    #     y_numbers.append(y)
                    #     numbers.extend(y_numbers)
                    total = sum([p for p in numbers if len(numbers)>0])
                    if total>0:
                        gr.add_edge(sentence1, sentence2, weight=total)
                        del total
                    else: del total
                else:
                    continue
                numbers.clear()
                x_numbers.clear()
                #y_numbers.clear()
    return gr

G = build_adj_mat(example_dictionary)
print(nx.adjacency_matrix(G))
Expected result:
(0, 1) 5*2+5*2=20
(0, 2) 3*6=18+5=23
(0, 3) 3+5+8=16
(1, 0) 20
(1, 2) 3*6=18+2=20
(1, 3) 3+5=8
(2, 0) 23
(2, 1) 20
(2, 3) 2*5=10+5+6=21
(3, 0) 16
(3, 1) 8
(3, 2) 21
Actual output:
(0, 2) 23
(0, 3) 6
(1, 2) 23
(1, 3) 6
(2, 0) 23
(2, 1) 23
(2, 3) 16
(3, 0) 6
(3, 1) 6
(3, 2) 16
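Comparing the expected output with the actual output, I can see one of the problems: my code only checks whether the words of sentence1 exist in sentence2, but it doesn't check the reverse. I tried to solve that with the commented-out part, but it returned even more nonsensical results. I'm also not sure whether there are other problems. I don't know how to get the right result; the combinations together with the nested structure have me completely lost. Sorry for the long question; I described everything to make it clear. Any help would be greatly appreciated, thanks in advance.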
You can use the following functions:
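from collections import defaultdict
import itertools as it
import re

def compute_scores(sentence_dict):
    scores = defaultdict(int)
    # Visit each unordered pair of (index, (sentence, inner dict)) exactly once.
    for (j, (s1, d1)), (k, (s2, d2)) in it.combinations(enumerate(sentence_dict.items()), 2):
        shared_keys = d1.keys() & d2.keys()
        # Rule 2: shared keys with different values contribute their product.
        scores[j, k] += sum(d1[key]*d2[key] for key in shared_keys if d1[key] != d2[key])
        # Rule 1: keys of one inner dict that occur as words in the other sentence.
        scores[j, k] += sum(d1[key] for key in d1.keys() & get_words(s2))
        scores[j, k] += sum(d2[key] for key in d2.keys() & get_words(s1))
    return scores

def get_words(sentence):
    # Lowercased set of the words in the sentence.
    return set(map(str.lower, re.findall(r'(?<=\b)\w+(?=\b)', sentence)))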
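The result of course depends on how you define a word, so you need to fill in your own definition in the function get_words. The default implementation seems to fit your example data. Since by your definition the score of a sentence pair is symmetric, there is no need to consider the reversed pair (it has the same score); i.e. (0, 1) has the same score as (1, 0). That is why the code uses itertools.combinations.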
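To make the pairing concrete, here is a minimal sketch of what iterating over itertools.combinations of the enumerated items produces, and of what a word-splitting helper returns. The three-sentence dict is made up for illustration, and the regex is written as \b\w+\b, which I assume to be equivalent to the lookaround form for this purpose.

```python
import itertools as it
import re

def get_words(sentence):
    # Lowercased set of words; \b\w+\b is assumed equivalent to the
    # lookaround pattern for plain sentences like these.
    return set(map(str.lower, re.findall(r'\b\w+\b', sentence)))

# Hypothetical data, only used to show which index pairs are visited.
sentences = {'Peach juice.': {}, 'Monkey ate banana.': {}, 'Apple is good.': {}}

# Each unordered pair appears once; (1, 0) is never produced after (0, 1).
pairs = [(j, k) for (j, _), (k, _) in it.combinations(enumerate(sentences.items()), 2)]
print(pairs)  # [(0, 1), (0, 2), (1, 2)]

# Lowercasing is what lets the key 'peach' match the word 'Peach'.
print(sorted(get_words('Peach juice is so delicious.')))
# ['delicious', 'is', 'juice', 'peach', 'so']
```

Note that str.find, as used in the question, is case-sensitive, so 'peach' would not be found in 'Peach juice is so delicious.' without lowercasing first.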
Running it on the example data:
from pprint import pprint
example_dictionary = {
'I want to eat peach and egg.': {'apple':3, 'orange':2, 'banana':5},
'Peach juice is so delicious.': {'apple':3, 'orange':5, 'banana':2},
'Goddamn monkey ate my banana.': {'rice':4, 'apple':6, 'monkey':2},
'They say apple is good for health.': {'grape':10, 'monkey':5, 'peach':5, 'egg':8}}
pprint(compute_scores(example_dictionary))
gives the following scores:
defaultdict(<class 'int'>,
{(0, 1): 20,
(0, 2): 23,
(0, 3): 16,
(1, 2): 20,
(1, 3): 8,
(2, 3): 21})
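Since the question asks for an adjacency matrix, the pair scores can be mirrored into a symmetric matrix. The following is only a sketch: scores_to_matrix is a hypothetical helper name, it uses plain nested lists rather than networkx, and the scores are copied from the output above.

```python
def scores_to_matrix(scores, n):
    # Build an n x n matrix of zeros, then write each pair score into
    # both (j, k) and (k, j), because the score is symmetric.
    matrix = [[0] * n for _ in range(n)]
    for (j, k), weight in scores.items():
        matrix[j][k] = weight
        matrix[k][j] = weight
    return matrix

# Scores taken from the output shown above.
scores = {(0, 1): 20, (0, 2): 23, (0, 3): 16, (1, 2): 20, (1, 3): 8, (2, 3): 21}
matrix = scores_to_matrix(scores, 4)
for row in matrix:
    print(row)
# [0, 20, 23, 16]
# [20, 0, 20, 8]
# [23, 20, 0, 21]
# [16, 8, 21, 0]
```

The off-diagonal entries match the expected result listed in the question, including the mirrored pairs such as (1, 0) and (3, 2).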
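If the dicts can contain not only words but also phrases (i.e. multiple words), only a small modification of the original implementation is needed (it also works for single words):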
scores[j, k] += sum(weight for phrase, weight in d1.items() if phrase in s2.lower())
scores[j, k] += sum(weight for phrase, weight in d2.items() if phrase in s1.lower())
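One thing to keep in mind with the two lines above: the in operator on strings is a substring test, so a key such as 'egg' would also match inside 'eggplant'. If that is not wanted, a possible variation is to require word boundaries around the phrase. This is a sketch under that assumption; contains_phrase is a hypothetical helper, not part of the code above.

```python
import re

def contains_phrase(sentence, phrase):
    # Match the phrase only when it is delimited by word boundaries,
    # case-insensitively; re.escape guards against regex metacharacters.
    pattern = r'\b' + re.escape(phrase.lower()) + r'\b'
    return re.search(pattern, sentence.lower()) is not None

print(contains_phrase('I want to eat peach and egg.', 'egg'))          # True
print(contains_phrase('I like eggplant.', 'egg'))                      # False
print(contains_phrase('Peach juice is so delicious.', 'peach juice'))  # True
```

With such a helper, the condition `phrase in s2.lower()` could be replaced by `contains_phrase(s2, phrase)`, and likewise for s1.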