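Searching for every combination of inner dict keys in every combination of other inner dict keys, also in every combination of outer dict keys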

Question

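I'm not sure the title describes my problem well; if something is off, I'll edit it later. I have looked at many related questions, but I couldn't get it to work: the code is deeply nested, I'm not very experienced in programming, and I need to use combinations.

I have a nested dictionary like this: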

example_dictionary = {
    'I want to eat peach and egg.': {'apple': 3, 'orange': 2, 'banana': 5},
    'Peach juice is so delicious.': {'apple': 3, 'orange': 5, 'banana': 2},
    'Goddamn monkey ate my banana.': {'rice': 4, 'apple': 6, 'monkey': 2},
    'They say apple is good for health.': {'grape': 10, 'monkey': 5, 'peach': 5, 'egg': 8},
}

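What I want to do is build an adjacency matrix according to some rules. The rules are:

1) If a word (key) from one sentence's inner dict occurs in another sentence (outer dict key), add the word's value as a weight between those two sentences.

2) If two sentences share an inner dict key (word) but with different values, multiply the two values and add the product as a weight between those two sentences.

Extra note: the inner dicts can have different lengths, and the same inner dict key (word) may have different values. I want the values multiplied only in that case; if they are equal, I don't want to count them.

Examples: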

Sentence1(0): I want to eat peach and egg. {'apple':3, 'orange':2, 'banana':5}

Sentence2(1): Peach juice is so delicious. {'apple':3, 'orange':5, 'banana':2}

Sentence3(2): Goddamn monkey ate my banana. {'rice':4, 'apple':6, 'monkey':2}

Sentence4(3): They say apple is good for health. {'grape':10, 'monkey':5, 'peach':5, 'egg':8}

Between 0 and 1: 2*5 + 5*2 = 20 (their 'apple' values are equal, so only the values for 'orange' and 'banana' are multiplied; none of the words occurs in either sentence).

Between 2 and 3: 2*5 = 10 ('monkey' is a shared key with different values) + 6 (sentence3's key 'apple' occurs in sentence4) + 5 (sentence4's key 'monkey' occurs in sentence3) = 21

Between 0 and 3: 3 + 5 + 8 = 16 (sentence1's key 'apple' occurs in sentence4, and sentence4's keys 'peach' and 'egg' occur in sentence1).

I hope these examples make the problem clear.

What I tried (the nested structure and the combinations confuse me a lot):

from itertools import combinations, zip_longest
import networkx as nx

def compare_inner_dicts(d1,d2):
#this is for comparing the inner dict keys and multiplying them
#if they have the same key but different value
    values = []
    inner_values = 0
    for common_key in d1.keys() & d2.keys():
        if d1[common_key]!= d2[common_key]:
            _value = d1[common_key]*d2[common_key]
            values.append(_value)
            inner_values = sum([p for p in values])

    inner_dict_values = inner_values
    del inner_values  

    return inner_dict_values


def build_adj_mat(a_dict):
    gr = nx.Graph()
    for sentence, words in a_dict.items():

        sentences = list(a_dict.keys())
        gr.add_nodes_from(sentences)
        sentence_pairs = combinations(gr.nodes, 2)
        dict_pairs = combinations(a_dict.values(), 2)
        for pair, _pair in zip_longest(sentence_pairs, dict_pairs):
            numbers = []
            x_numbers = []
            #y_numbers = []
            sentence1 = pair[0]
            sentence2 = pair[1]
            dict1 = _pair[0]
            dict2 = _pair[1]

            inner_dict_numbers = compare_inner_dicts(dict1, dict2)
            numbers.append(inner_dict_numbers)

            for word, num in words.items():
                if sentence2.find(word)>-1:
                    x = words[word]
                    x_numbers.append(x)
                    numbers.extend(x_numbers)
#                if sentence1.find(word)>-1: #reverse case
#                    y = words[word]
#                    y_numbers.append(y)
#                    numbers.extend(y_numbers)

                    total = sum([p for p in numbers if len(numbers)>0])

                    if total>0:
                        gr.add_edge(sentence1, sentence2, weight=total)
                        del total
                    else: del total
                else: 
                    continue
                    numbers.clear()
                    x_numbers.clear()
                   #y_numbers.clear()

    return gr

G = build_adj_mat(example_dictionary)
print(nx.adjacency_matrix(G))

Expected result:

(0, 1) 2*5 + 5*2 = 20
(0, 2) 3*6 + 5 = 23
(0, 3) 3 + 5 + 8 = 16
(1, 0) 20
(1, 2) 3*6 + 2 = 20
(1, 3) 3 + 5 = 8
(2, 0) 23
(2, 1) 20
(2, 3) 2*5 + 5 + 6 = 21
(3, 0) 16
(3, 1) 8
(3, 2) 21

Actual output:

  (0, 2)        23
  (0, 3)        6
  (1, 2)        23
  (1, 3)        6
  (2, 0)        23
  (2, 1)        23
  (2, 3)        16
  (3, 0)        6
  (3, 1)        6
  (3, 2)        16

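Comparing the expected output with the actual output, I can see one of the problems: my code only checks whether the words of sentence1 occur in sentence2, but not the reverse. I tried to solve this with the commented-out part, but it returned even more nonsensical results. I'm also not sure whether there are other problems. I don't know how to get the correct result; the two combinations and the nested structure have me completely lost. Sorry for the long question, I described everything for the sake of clarity. Any help would be greatly appreciated, thanks in advance.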

Answer 1

You can use the following functions:

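from collections import defaultdict
import itertools as it
import re


def compute_scores(sentence_dict):
    # Maps an index pair (j, k) with j < k to the total weight between the two sentences.
    scores = defaultdict(int)
    for (j, (s1, d1)), (k, (s2, d2)) in it.combinations(enumerate(sentence_dict.items()), 2):
        shared_keys = d1.keys() & d2.keys()
        # Rule 2: shared keys with different values contribute the product of the two values.
        scores[j, k] += sum(d1[word]*d2[word] for word in shared_keys if d1[word] != d2[word])
        # Rule 1: keys of one inner dict that occur as words in the other sentence.
        scores[j, k] += sum(d1[word] for word in d1.keys() & get_words(s2))
        scores[j, k] += sum(d2[word] for word in d2.keys() & get_words(s1))
    return scores


def get_words(sentence):
    # Lowercased set of the words found in the sentence.
    return set(map(str.lower, re.findall(r'(?<=\b)\w+(?=\b)', sentence)))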

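The result of course depends on how you define a word, so you will need to fill in your own definition in the function get_words. The default implementation seems to fit your example data. Since, by your definition, the score of a sentence pair is symmetric, there is no need to consider the reversed pair (it has the same score); i.e. (0, 1) has the same score as (1, 0). That is why the code uses itertools.combinations.

Running it on the example data: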
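To illustrate the point about symmetry, here is a minimal sketch of which index pairs the loop visits and what a regex tokenizer extracts from one sentence. The `sentences` list and the simplified pattern `r'\w+'` are assumptions made for this illustration; they are not part of the functions above.

```python
import itertools as it
import re

sentences = [
    'I want to eat peach and egg.',
    'Peach juice is so delicious.',
    'Goddamn monkey ate my banana.',
    'They say apple is good for health.',
]

# Each unordered pair of indices appears exactly once, always with j < k.
pairs = [(j, k) for (j, _), (k, _) in it.combinations(enumerate(sentences), 2)]
print(pairs)
# [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]

# Lowercased word set of one sentence; punctuation is dropped.
words = set(map(str.lower, re.findall(r'\w+', sentences[1])))
print(sorted(words))
# ['delicious', 'is', 'juice', 'peach', 'so']
```

Four sentences give six pairs, which matches the six distinct scores in the expected result.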

from pprint import pprint

example_dictionary = {
    'I want to eat peach and egg.': {'apple':3, 'orange':2, 'banana':5},
    'Peach juice is so delicious.': {'apple':3, 'orange':5, 'banana':2},
    'Goddamn monkey ate my banana.': {'rice':4, 'apple':6, 'monkey':2},
    'They say apple is good for health.': {'grape':10, 'monkey':5, 'peach':5, 'egg':8}}

pprint(compute_scores(example_dictionary))

This gives the following scores:

defaultdict(<class 'int'>,
            {(0, 1): 20,
             (0, 2): 23,
             (0, 3): 16,
             (1, 2): 20,
             (1, 3): 8,
             (2, 3): 21})
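Since the question asks for an adjacency matrix, one possible way to expand these scores into a full symmetric matrix is sketched below. The `scores` dict is copied from the output above, and the plain nested-list representation is an assumption chosen to keep the example dependency-free.

```python
# Scores copied from the output above; keys are (j, k) with j < k.
scores = {(0, 1): 20, (0, 2): 23, (0, 3): 16, (1, 2): 20, (1, 3): 8, (2, 3): 21}

n = 4  # number of sentences in the example data
matrix = [[0] * n for _ in range(n)]
for (j, k), weight in scores.items():
    # The score is symmetric, so fill both (j, k) and (k, j).
    matrix[j][k] = weight
    matrix[k][j] = weight

for row in matrix:
    print(row)
# [0, 20, 23, 16]
# [20, 0, 20, 8]
# [23, 20, 0, 21]
# [16, 8, 21, 0]
```

If you prefer to stay with networkx as in the question, the same pairs could instead be passed to gr.add_edge(..., weight=...) one by one.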

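If the dictionaries can contain not only single words but also phrases (i.e. multiple words), a slight modification of the original implementation is enough (it also works for single words). Replace the two lines that use get_words with: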

scores[j, k] += sum(weight for phrase, weight in d1.items() if phrase in s2.lower())
scores[j, k] += sum(weight for phrase, weight in d2.items() if phrase in s1.lower())
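For completeness, here is a sketch of the whole function with this modification applied. The name `compute_phrase_scores` and the sample `data` are made up for illustration, and the sketch assumes that the inner dict keys are already lowercase.

```python
from collections import defaultdict
import itertools as it


def compute_phrase_scores(sentence_dict):
    scores = defaultdict(int)
    items = enumerate(sentence_dict.items())
    for (j, (s1, d1)), (k, (s2, d2)) in it.combinations(items, 2):
        shared = d1.keys() & d2.keys()
        # Shared keys with different values contribute the product of the values.
        scores[j, k] += sum(d1[p] * d2[p] for p in shared if d1[p] != d2[p])
        # Substring test: a phrase counts if it occurs anywhere in the other sentence.
        scores[j, k] += sum(w for p, w in d1.items() if p in s2.lower())
        scores[j, k] += sum(w for p, w in d2.items() if p in s1.lower())
    return scores


# Made-up data with two-word phrases as inner keys.
data = {
    'I like peach juice.': {'green apple': 4, 'egg': 1},
    'A green apple and an eggplant.': {'peach juice': 7},
}
print(dict(compute_phrase_scores(data)))
# {(0, 1): 12}
```

Note that the substring test also counts 'egg' inside 'eggplant' (4 + 1 + 7 = 12), which the word-set version would not do. Whether that is acceptable depends on your data.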

python

combinations

dictionary

graph

adjacency-matrix