Generate non-duplicate list in Python

Question

I am trying to build a list of the anchor names found in w:hyperlink elements by iterating over all document elements with the python-docx library. The code is as follows:

def get_hyperlinks(docx__document):
    hyperlinks_in_document = list()
    for counter, element in enumerate(docx__document.elements):
        if isinstance(element, Paragraph) and not element.is_heading:
            hyperlinks_in_document.extend(element._element.xpath('//w:hyperlink/@w:anchor'))
    return list(set(hyperlinks_in_document))

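The code above returns a list of anchors. The problem I ran into is that when a piece of text is split into multiple runs, the list built in the loop over the elements can contain duplicate names, and the output looks like this: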

['American', 'Syrian', 'American', 'Syrian', 'American', 'Syrian', 'American', 'Syrian']

I tried the code from here, but either the duplicates remained or performance suffered. This code, however:

def get_hyperlinks(docx__document):
    hyperlinks_in_document = list()
    returned_links = list()
    for counter, element in enumerate(docx__document.elements):
        if isinstance(element, Paragraph) and not element.is_heading:
            hyperlinks_in_document.extend(element._element.xpath('//w:hyperlink/@w:anchor'))
            [returned_links.append(element_in_list) for element_in_list in hyperlinks_in_document
             if element_in_list not in returned_links]
    return returned_links

solves the duplicate problem, but performance suffers. Any useful ideas?

Answer 1

return list(dict.fromkeys(youdublicated_list))
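
As a quick illustration of the one-liner above, here is a minimal sketch using the sample output from the question. It relies on dict keys being unique and, in Python 3.7+, keeping insertion order. The names `anchors` and `unique_anchors` are made up for this example.

```python
# Sample data copied from the output shown in the question.
anchors = ['American', 'Syrian', 'American', 'Syrian',
           'American', 'Syrian', 'American', 'Syrian']

# dict.fromkeys keeps one key per distinct value, in first-seen order.
unique_anchors = list(dict.fromkeys(anchors))
print(unique_anchors)  # ['American', 'Syrian']
```

Unlike `list(set(...))`, this keeps the anchors in the order they first appear in the document.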

Answer 2

I changed my previous code and came up with switching the final list to a set, so I get the non-duplicate items in less time:

from docx.oxml.ns import qn

def get_hyperlinks(docx__document):
    hyperlinks, returned_links = list(), set()
    for counter, element in enumerate(docx__document.elements):
        if isinstance(element, Paragraph) and not element.is_heading:
            # getparent() is the paragraph's parent element, so this finds
            # every w:hyperlink under that parent, not just in this paragraph
            hyperlinks = element._p.getparent().xpath('.//w:hyperlink')
    hyperlinks = [str(hyperlink.get(qn("w:anchor"))) for hyperlink in hyperlinks]
    # the set union drops duplicates; the resulting order is arbitrary
    returned_links = list(set().union(hyperlinks))
    # [returned_links.append(element_in_list) for element_in_list in hyperlinks
    #          if element_in_list not in returned_links]
    return returned_links

The commented-out lines show what I did before; the whole answer is the final code.
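
If the order of the anchors matters, a set can still do the membership check while a list keeps the order. The sketch below is only an illustration, not part of the original answer; `unique_in_order` is a hypothetical helper name. It avoids the slow part of the commented-out approach, where `not in returned_links` scans the whole list for every item.

```python
def unique_in_order(items):
    """Return items without duplicates, keeping first-seen order."""
    seen = set()      # fast membership checks (average O(1))
    result = []       # preserves the order of first occurrence
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


anchors = ['American', 'Syrian', 'American', 'Syrian']
print(unique_in_order(anchors))  # ['American', 'Syrian']
```

This does the same job as `list(dict.fromkeys(...))`, just spelled out step by step.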

python

list

python-docx