Generate unique IDs for a list of strings with duplicates
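I want to generate IDs for strings that I read from a text file. If a string has duplicates, I want the first instance of the string to have an ID containing 6 characters. For the duplicates of that string, I want the ID to be the same as the original one, but with an additional two characters. I'm having trouble with the logic. Here is what I have so far: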
from itertools import groupby
import uuid

f = open('test.txt', 'r')
addresses = f.readlines()
list_of_addresses = ['Address']
list_of_ids = ['ID']
for x in addresses:
    list_of_addresses.append(x)

def find_duplicates():
    for x, y in groupby(sorted(list_of_addresses)):
        id = str(uuid.uuid4().get_hex().upper()[0:6])
        j = len(list(y))
        if j > 1:
            print str(j) + " instances of " + x
            list_of_ids.append(id)
        print list_of_ids

find_duplicates()
How should I approach this?
Edit: here are the contents of test.txt:
123 Test
123 Test
123 Test
321 Test
567 Test
567 Test
And the output:
3 occurences of 123 Test
['ID', 'C10DD8']
['ID', 'C10DD8']
2 occurences of 567 Test
['ID', 'C10DD8', '595C5E']
['ID', 'C10DD8', '595C5E']
Try using collections.defaultdict.
Given
import ctypes
import collections as ct

filename = "test.txt"

def read_file(fname):
    """Read lines from a file."""
    with open(fname, "r") as f:
        for line in f:
            yield line.strip()
Code
dd = ct.defaultdict(list)
for x in read_file(filename):
    key = str(ctypes.c_size_t(hash(x)).value)  # make positive hashes
    if key[:6] not in dd:
        dd[key[:6]].append(x)
    else:
        dd[key[:8]].append(x)
dd
Output
defaultdict(list,
            {'133259': ['123 Test'],
             '13325942': ['123 Test', '123 Test'],
             '210763': ['567 Test'],
             '21076377': ['567 Test'],
             '240895': ['321 Test']})
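In the resulting dictionary, the first occurrence of each unique line gets a key of length 6. For each subsequent duplicate line, two additional characters are sliced off for the key (length 8).
You can implement the keys however you like. In this case, we use hash() to associate a key with each unique line, and then slice the desired sequence from the key. See also the post on making positive hash values from ctypes.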
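One caveat with hash(): in Python 3, string hashes are salted per process unless PYTHONHASHSEED is set, so the keys above will usually differ from run to run. If the IDs need to be reproducible, a digest from hashlib is one alternative. The following is only a minimal sketch under that assumption; the helper name stable_key and the choice of SHA-1 are illustrative and not part of the answer above, and a 6-character prefix can still collide for different lines.

```python
import hashlib

def stable_key(line, length=8):
    """Return the first `length` hex characters of the SHA-1 digest of `line`.

    Hypothetical helper: unlike hash(), the digest is the same on every run.
    """
    digest = hashlib.sha1(line.encode("utf-8")).hexdigest().upper()
    return digest[:length]

# Sample data matching test.txt from the question.
lines = ["123 Test", "123 Test", "123 Test", "321 Test", "567 Test", "567 Test"]

seen = set()
ids = []
for line in lines:
    key = stable_key(line)
    if line not in seen:
        seen.add(line)
        ids.append(key[:6])  # first occurrence: 6 characters
    else:
        ids.append(key[:8])  # later occurrences: same 6 characters plus 2 more

for line, id_ in zip(lines, ids):
    print(id_, line)
```

Here the membership test is on the line itself rather than on the sliced key, so two different lines that happen to share a 6-character prefix are not mistaken for duplicates of each other.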
To check your results, create appropriate lookup dictionaries from the defaultdict.
# Lookups
occurrences = ct.defaultdict(int)
ids = ct.defaultdict(list)
for k, v in dd.items():
    key = v[0]
    occurrences[key] += len(v)
    ids[key].append(k)

# View data
for k, v in occurrences.items():
    print("{} instances of {}".format(v, k))
    print("IDs:", ids[k])
    print()
Output
1 instances of 321 Test
IDs: ['240895']
2 instances of 567 Test
IDs: ['21076377', '210763']
3 instances of 123 Test
IDs: ['13325942', '133259']
Your question is a bit confusing, and I don't understand what the criteria for generating the IDs are. So here I am only showing the logic rather than an exact solution; you can adapt it to your needs.
track = {}
with open('file.txt') as f:
    for line_no, line in enumerate(f):
        key = line.split()[0]  # group lines by their first token
        if key not in track:
            track[key] = [['ID', 'your_unique_id']]
        else:
            # put your own logic here for what to append when the line is a duplicate
            track[key].append(['ID', 'duplicate_id' + str(line_no)])
print(track)
Output:
{'123': [['ID', 'your_unique_id'], ['ID', 'duplicate_id1'], ['ID', 'duplicate_id2']], '321': [['ID', 'your_unique_id']], '567': [['ID', 'your_unique_id'], ['ID', 'duplicate_id5']]}
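To make the placeholder logic above concrete, here is one possible reading of the requirement, offered as a sketch rather than a definitive solution: the first instance of a line gets a random 6-character ID (taken from uuid4, as in the question), and each later duplicate gets that same ID followed by a two-digit counter. The function name assign_ids and the counter suffix are assumptions; the question does not say what the two extra characters should be.

```python
import uuid

def assign_ids(lines):
    """Pair each line with an ID (hypothetical helper).

    First occurrence: random 6-character ID.
    Later duplicates: the same ID plus a two-digit running counter.
    """
    base_ids = {}  # line -> 6-character base ID
    counts = {}    # line -> number of duplicates seen so far
    result = []
    for line in lines:
        if line not in base_ids:
            base_ids[line] = uuid.uuid4().hex.upper()[:6]
            counts[line] = 0
            result.append((line, base_ids[line]))
        else:
            counts[line] += 1
            suffix = "{:02d}".format(counts[line])
            result.append((line, base_ids[line] + suffix))
    return result

# Sample data matching test.txt from the question.
pairs = assign_ids(
    ["123 Test", "123 Test", "123 Test", "321 Test", "567 Test", "567 Test"]
)
for line, id_ in pairs:
    print(id_, line)
```

The two-digit suffix only stays at two characters for up to 99 duplicates of a line, and a 6-character slice of a UUID is not guaranteed to be unique across different lines, so a real implementation may want to check new base IDs against those already issued.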