在同一键下查找字典中值的交集 - python

Question

我有一个包含多行的文件，每行以“|”分隔，我想比较每行的第 5 个参数，如果相交，则继续。这是第二部分：

第一个参数1,2按字典比较，如果相同则AND
如果参数 5、6 重叠，然后将这些行连接起来。

如何比较同一键下值的交集？下面的代码可以跨键使用，但不能在同一个键内使用：

from functools import reduce 
reduce(set.intersection, (set(val) for val in query_dict.values()))

下面是行的例子：文本 1|文本 2|文本 3|文本 4 文本 5|文6|文本 7 文本 8| 文本 1|文本 2|文本 12|文本 4 文本 5|文6|正文 7| text9|text10|text3|text4 文本 5|文本 11|文本 12 文本 8|

输出应该是：文本 1|文本 2|文本 3；文本 12|文本 4；文本 5；文本 4；文本 5|文本 6；文本 7 文本 8；文本 6

换句话说，只有那些与第 1、2 个参数（单元格相等）匹配且第 5、6 个参数重叠（交集）的行被连接起来。

这是输入文件：

Angela Darvill|19036321|School of Nursing, University of Salford, Peel House Eccles, Manchester M30 0NN, UK.|['GB','US']|['Salford', 'Eccles', 'Manchester']
Helen Stanley|19036320|Senior Lecturer, Institute of Nursing and Midwifery, University of Brighton, Westlain House, Village Way, Falmer, BN1 9PH Brighton, UK.|['US']|['Brighton', 'Brighton']
Angela Darvill|190323121|School of Nursing, University of Salford, Peel House Eccles, Manchester M30 0NN, UK.|['US']|['Brighton', 'Eccles', 'Manchester']
Helen Stanley|19576876320|Senior Lecturer, Institute of Nursing and Midwifery, University of Brighton, Westlain House, Village Way, Falmer, BN1 9PH Brighton, UK.|['US']|['Brighton', 'Brighton']

输出应如下所示：

Angela Darvill|19036321;190323121|...
Helen Stanley|19036320;19576876320|...

Angela Darvill 被堆叠，因为两条记录共享相同的名称、相同的国家和相同的城市（-ies）。

Answer 1

from itertools import zip_longest


data = """\
text1|text2|text3|text4 text 5| text 6| text 7 text 8|
text1|text2|text12|text4 text 5| text 6| text 7| text9|text10|text3|text4 text 5| text 11| text 12 text 8|
"""

lines = tuple(line.split('|') for line in data.splitlines())
number_of_lines = len(lines)
print(f"number of lines : {number_of_lines}")
print(f"number of cells in line 1 : {len(lines[0])}")
print(f"number of cells in line 2 : {len(lines[1])}")
print(f"{lines[0]=}")
print(f"{lines[1]=}")

result = []

# we want to compare each line with each other :
for line_a_index, line_a in enumerate(lines):
    for line_b_index, line_b in enumerate(lines[line_a_index+1:]):
        assert len(line_a) >= 5, f"not enough cells ({len(line_a)}) in line {line_a_index}"
        assert len(line_b) >= 5, f"not enough cells ({len(line_b)}) in line {line_b_index}"
        assert all(isinstance(cell, str) for cell in line_a)
        assert all(isinstance(cell, str) for cell in line_b)

        if line_a[0] == line_b[0] and line_a[1] == line_b[1] and (
                line_a[5] in line_b[5] or line_a[6] in line_b[6]  # A in B
            or line_b[5] in line_a[5] or line_b[6] in line_a[6]  # B in A
        ):
            result.append(tuple(
                ((cell_a or "") + (";" if (cell_a or cell_b) else "") + (cell_b or "")) if cell_a != cell_b else cell_a
                for cell_a, cell_b in zip_longest(line_a[:5+1], line_b[:5+1])  # <-- here I truncated the lines
            ))

# I decided to have a fancy output, but I made some simplifying assumptions to make it simple
if len(result) > 1:
    raise NotImplementedError
widths = tuple(max(len(a) if a is not None else 0, len(b) if b is not None else 0, len(c) if c is not None else 0)
               for a, b, c in zip_longest(lines[0], lines[1], result[0]))
length = max(len(lines[0]), len(lines[1]), len(result[0]))
for line in (lines[0], lines[1], result[0]):
    for index, cell in zip_longest(range(length), line):
        if cell:
            print(cell.ljust(widths[index]), end='|')
    print("", end='\n')  # explicit newline

original_expected_output = "text1|text2|text3;text12|text4;text5;text4;text5|text6;text7 text8;text6"
print(f"{original_expected_output}         <-- expected")

lenormju_expected_output = "text1|text2|text3;text12|text4 text 5| text 6| text 7 text 8; text 7"
print(f"{lenormju_expected_output}             <-- fixed")

输出

number of lines : 2
number of cells in line 1 : 7
number of cells in line 2 : 13
lines[0]=['text1', 'text2', 'text3', 'text4 text 5', ' text 6', ' text 7 text 8', '']
lines[1]=['text1', 'text2', 'text12', 'text4 text 5', ' text 6', ' text 7', ' text9', 'text10', 'text3', 'text4 text 5', ' text 11', ' text 12 text 8', '']
text1|text2|text3       |text4 text 5| text 6| text 7 text 8        |
text1|text2|text12      |text4 text 5| text 6| text 7               | text9|text10|text3|text4 text 5| text 11| text 12 text 8|
text1|text2|text3;text12|text4 text 5| text 6| text 7 text 8; text 7|
text1|text2|text3;text12|text4;text5;text4;text5|text6;text7 text8;text6         <-- expected
text1|text2|text3;text12|text4 text 5| text 6| text 7 text 8; text 7             <-- fixed

编辑:

from dataclasses import dataclass
from itertools import zip_longest


data = """\
text1|text2|text3|text4 text 5| text 6| text 7 text 8|
text1|text2|text12|text4 text 5| text 6| text 7| text9|text10|text3|text4 text 5| text 11| text 12 text 8|
"""


@dataclass
class Match:  # a convenient way to store the solutions
    line_a_index: int
    line_b_index: int
    line_result: tuple


lines = tuple(line.split('|') for line in data.splitlines())

results = []
for line_a_index, line_a in enumerate(lines):
    for line_b_index, line_b in enumerate(lines[line_a_index+1:], line_a_index+1):
        assert len(line_a) >= 5, f"not enough cells ({len(line_a)}) in line {line_a_index}"
        assert len(line_b) >= 5, f"not enough cells ({len(line_b)}) in line {line_b_index}"
        assert all(isinstance(cell, str) for cell in line_a)
        assert all(isinstance(cell, str) for cell in line_b)

        if line_a[0] == line_b[0] and line_a[1] == line_b[1] and (
                line_a[5] in line_b[5] or line_a[6] in line_b[6]  # A in B
            or line_b[5] in line_a[5] or line_b[6] in line_a[6]  # B in A
        ):
            line_result = tuple(
                ((cell_a or "") + (";" if (cell_a or cell_b) else "") + (cell_b or "")) if cell_a != cell_b else cell_a
                for cell_a, cell_b in zip_longest(line_a[:5+1], line_b[:5+1])  # <-- here I truncated the lines
            )
            results.append(Match(line_a_index=line_a_index, line_b_index=line_b_index, line_result=line_result))

# simple output of the solution
for result in results:
    print(f"line n°{result.line_a_index} matches with n°{result.line_b_index} : {result.line_result}")

line n°0 matches with n°1 : ('text1', 'text2', 'text3;text12', 'text4 text 5', ' text 6', ' text 7 text 8; text 7')

Answer 2

根据您改进的问题：

import itertools


data = """\
Angela Darvill|19036321|School of Nursing, University of Salford, Peel House Eccles, Manchester M30 0NN, UK.|['GB','US']|['Salford', 'Eccles', 'Manchester']
Helen Stanley|19036320|Senior Lecturer, Institute of Nursing and Midwifery, University of Brighton, Westlain House, Village Way, Falmer, BN1 9PH Brighton, UK.|['US']|['Brighton', 'Brighton']
Angela Darvill|190323121|School of Nursing, University of Salford, Peel House Eccles, Manchester M30 0NN, UK.|['US']|['Brighton', 'Eccles', 'Manchester']
Helen Stanley|19576876320|Senior Lecturer, Institute of Nursing and Midwifery, University of Brighton, Westlain House, Village Way, Falmer, BN1 9PH Brighton, UK.|['US']|['Brighton', 'Brighton']
"""

lines = tuple(tuple(line.split('|')) for line in data.splitlines())

results = []
for line_a_index, line_a in enumerate(lines):
    # we want to compare each line with each other, so we start at index+1
    for line_b_index, line_b in enumerate(lines[line_a_index+1:], start=line_a_index+1):
        assert len(line_a) >= 5, f"not enough cells ({len(line_a)}) in line {line_a_index}"
        assert len(line_b) >= 5, f"not enough cells ({len(line_b)}) in line {line_b_index}"
        assert all(isinstance(cell, str) for cell in line_a)
        assert all(isinstance(cell, str) for cell in line_b)

        columns0_are_equal = line_a[0] == line_b[0]
        columns1_are_equal = line_a[1] == line_b[1]
        columns3_are_overlap = set(line_a[3]).issubset(set(line_b[3])) or set(line_b[3]).issubset(set(line_a[3]))
        columns4_are_overlap = set(line_a[4]).issubset(set(line_b[4])) or set(line_b[4]).issubset(set(line_a[4]))
        print(f"between lines index={line_a_index} and index={line_b_index}, {columns0_are_equal=} {columns1_are_equal=} {columns3_are_overlap=} {columns4_are_overlap=}")
        if (
            columns0_are_equal
            # and columns1_are_equal
            and (columns3_are_overlap or columns4_are_overlap)
        ):
            print("MATCH!")
            results.append(
                (line_a_index, line_b_index,) + tuple(
                    ((cell_a or "") + (";" if (cell_a or cell_b) else "") + (cell_b or "")) if cell_a != cell_b
                    else cell_a
                    for cell_a, cell_b in itertools.zip_longest(line_a, line_b)
                )
            )

print("Fancy output :")
lines_to_display = set(itertools.chain.from_iterable((lines[result[0]], lines[result[1]], result[2:]) for result in results))
columns_widths = (max(len(str(index)) for result in results for index in (result[0], result[1])),) + tuple(
    max(len(cell) for cell in column)
    for column in zip(*lines_to_display)
)

for width in columns_widths:
    print("-" * width, end="|")
print("")

for result in results:
    for line_index, original_line in zip((result[0], result[1]), (lines[result[0]], lines[result[1]])):
        for column_index, cell in zip(itertools.count(), (str(line_index),) + original_line):
            if cell:
                print(cell.ljust(columns_widths[column_index]), end='|')
        print("", end='\n')  # explicit newline
    for column_index, cell in zip(itertools.count(), ("=",) + result[2:]):
        if cell:
            print(cell.ljust(columns_widths[column_index]), end='|')
    print("", end='\n')  # explicit newline

for width in columns_widths:
    print("-" * width, end="|")
print("")

expected_outputs = """\
Angela Darvill|19036321;190323121|...
Helen Stanley|19036320;19576876320|...
""".splitlines()

for result, expected_output in itertools.zip_longest(results, expected_outputs):
    actual_output = "|".join(result[2:])
    assert actual_output.startswith(expected_output[:-3])  # minus the "..."

-|--------------|--------------------|---------------------------------------------------------------------------------------------------------------------------------------|------------------|------------------------------------------------------------------------|
0|Angela Darvill|19036321            |School of Nursing, University of Salford, Peel House Eccles, Manchester M30 0NN, UK.                                                   |['GB','US']       |['Salford', 'Eccles', 'Manchester']                                     |
2|Angela Darvill|190323121           |School of Nursing, University of Salford, Peel House Eccles, Manchester M30 0NN, UK.                                                   |['US']            |['Brighton', 'Eccles', 'Manchester']                                    |
=|Angela Darvill|19036321;190323121  |School of Nursing, University of Salford, Peel House Eccles, Manchester M30 0NN, UK.                                                   |['GB','US'];['US']|['Salford', 'Eccles', 'Manchester'];['Brighton', 'Eccles', 'Manchester']|
1|Helen Stanley |19036320            |Senior Lecturer, Institute of Nursing and Midwifery, University of Brighton, Westlain House, Village Way, Falmer, BN1 9PH Brighton, UK.|['US']            |['Brighton', 'Brighton']                                                |
3|Helen Stanley |19576876320         |Senior Lecturer, Institute of Nursing and Midwifery, University of Brighton, Westlain House, Village Way, Falmer, BN1 9PH Brighton, UK.|['US']            |['Brighton', 'Brighton']                                                |
=|Helen Stanley |19036320;19576876320|Senior Lecturer, Institute of Nursing and Midwifery, University of Brighton, Westlain House, Village Way, Falmer, BN1 9PH Brighton, UK.|['US']            |['Brighton', 'Brighton']                                                |
-|--------------|--------------------|---------------------------------------------------------------------------------------------------------------------------------------|------------------|------------------------------------------------------------------------|

您可以看到行索引 0 和 2 已经合并，行索引 1 和 3 也是如此。

在同一键下查找字典中值的交集 - python

Find intersection of values in dictionary under a same key - python

python

reduce

dictionary

key

functools