How to prevent duplicate UUIDs on duplicate texts (in a list of dicts) in Python?

Question

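I have to filter the texts I process (texts) by checking whether people's names appear in them. If a name does appear, the text is appended as a nested dict to that person's entry in an existing list of dicts containing the people's names (people). However, because some texts mention more than one person, the child document containing such a text is repeated and appended again. As a result, the child documents do not have unique IDs, and a unique ID is essential regardless of whether the text is repeated.

Is there a smarter way to add a unique ID even when the text is duplicated?

My code: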

import uuid

people = [{'id': 1,
  'name': 'Bob',
  'type': 'person',
  '_childDocuments_': [{'text': 'text_replace'}]},
 {'id': 2,
  'name': 'Kate',
  'type': 'person',
  '_childDocuments_': [{'text': 'text_replace'}]},
 {'id': 3,
  'name': 'Joe',
  'type': 'person',
  '_childDocuments_': [{'text': 'text_replace'}]}]


texts = ['this text has the name Bob and Kate',
        'this text has the name Kate only ']


for text in texts: 
    childDoc={'id': str(uuid.uuid1()), #the id will duplicate when files are repeated
         'text': text}
    for person in people:
        if person['name'] in childDoc['text']:
            person['_childDocuments_'].append(childDoc)

Current output:

[{'id': 1,
  'name': 'Bob',
  'type': 'person',
  '_childDocuments_': [{'text': 'text_replace'},
   {'id': '7752597f-410f-11eb-9341-9cb6d0897972', #duplicate ID here
    'text': 'this text has the name Bob and Kate'}]},
 {'id': 2,
  'name': 'Kate',
  'type': 'person',
  '_childDocuments_': [{'text': 'text_replace'},
   {'id': '7752597f-410f-11eb-9341-9cb6d0897972', #duplicate ID here
    'text': 'this text has the name Bob and Kate'},
   {'id': '77525980-410f-11eb-b667-9cb6d0897972',
    'text': 'this text has the name Kate only '}]},
 {'id': 3,
  'name': 'Joe',
  'type': 'person',
  '_childDocuments_': [{'text': 'text_replace'}]}]

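As you can see in the current output, the text 'this text has the name Bob and Kate' carries the same identifier, '7752597f-410f-11eb-9341-9cb6d0897972', in both places, because the same child document was appended twice. I want each identifier to be different.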
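
The repeated ID can be reproduced in isolation. The sketch below is only an illustration (the list names `bob_docs` and `kate_docs` are made up here and are not part of the code above): the dict is created once, so both lists end up holding a reference to the very same object, ID included.

```python
import uuid

# One dict object, created once per text, as in the outer loop of the code above.
child_doc = {'id': str(uuid.uuid1()),
             'text': 'this text has the name Bob and Kate'}

bob_docs = []   # hypothetical stand-in for Bob's '_childDocuments_'
kate_docs = []  # hypothetical stand-in for Kate's '_childDocuments_'

bob_docs.append(child_doc)   # append stores a reference, not a copy
kate_docs.append(child_doc)  # the same object is stored a second time

same_object = bob_docs[0] is kate_docs[0]
same_id = bob_docs[0]['id'] == kate_docs[0]['id']
print(same_object, same_id)  # True True
```

So the UUID is not being generated twice with the same value; it is generated once and shared.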

Expected output:

The same as the current output, except that every appended text should get a different ID, even when the texts are identical/duplicates.

Answer 1

Move the UUID generation into the inner loop:

for text in texts:
    for person in people:
        if person['name'] in text:
            childDoc={'id': str(uuid.uuid1()),
                      'text': text}
            person['_childDocuments_'].append(childDoc)
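
For a quick check, here is a self-contained sketch of the same loop. It assumes a trimmed-down copy of the question's data (the placeholder child documents are left out), and the `all_ids` list is added here only to count the results:

```python
import uuid

# Trimmed-down stand-ins for the question's data (placeholder child docs omitted).
people = [
    {'id': 1, 'name': 'Bob', 'type': 'person', '_childDocuments_': []},
    {'id': 2, 'name': 'Kate', 'type': 'person', '_childDocuments_': []},
    {'id': 3, 'name': 'Joe', 'type': 'person', '_childDocuments_': []},
]
texts = ['this text has the name Bob and Kate',
         'this text has the name Kate only ']

for text in texts:
    for person in people:
        if person['name'] in text:
            # A fresh dict with a fresh UUID for every (text, person) match.
            child_doc = {'id': str(uuid.uuid1()), 'text': text}
            person['_childDocuments_'].append(child_doc)

# Collect every generated ID so total and distinct counts can be compared.
all_ids = [doc['id'] for person in people for doc in person['_childDocuments_']]
print(len(all_ids), len(set(all_ids)))
```

If the two printed numbers match, every appended child document received its own ID, including the two copies of the text that mentions both Bob and Kate.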

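This does not actually guarantee that the UUIDs are unique. For that, you need to keep a set of used UUIDs; whenever you generate a new one, check whether it has already been used and, if so, generate another. Then test that one as well, and repeat until you either exhaust the UUID space or find an unused UUID.

There is a 1 in 2**61 chance of generating a duplicate. I cannot accept collisions, because they lead to data loss. So when I use UUIDs, I wrap the generator in a loop like this: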

used = set()

while True:
   identifier = str(uuid.uuid1())
   if identifier not in used:
       used.add(identifier)
       break

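In my case the used set is actually stored persistently. I don't like this code, even though I have a program that uses it, because it goes into an infinite loop if it cannot find an unused UUID.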
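
One way to avoid the infinite loop is to cap the number of attempts and fail loudly. This is only a sketch: the helper name `new_unique_id` and its `max_attempts` parameter are hypothetical, and the set here lives in memory rather than being persisted.

```python
import uuid

def new_unique_id(used, max_attempts=10):
    """Return a UUID string not already in `used`, or raise after `max_attempts` tries.

    `new_unique_id` and `max_attempts` are illustrative names, not a library API.
    """
    for _ in range(max_attempts):
        identifier = str(uuid.uuid1())
        if identifier not in used:
            used.add(identifier)  # remember it so it is never handed out again
            return identifier
    raise RuntimeError(f'no unused UUID found after {max_attempts} attempts')

used = set()  # in-memory only; a real program would load and save this set
first = new_unique_id(used)
second = new_unique_id(used)
print(first != second, len(used))  # True 2
```

Raising an exception turns the "cannot find an unused UUID" case into a visible error instead of a hang.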

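Some document databases offer automatic UUID assignment and do this for you internally, ensuring that a given database instance never ends up with two documents that share a UUID.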

python

uuid

json

nested

duplicates