Converting an unstructured list of names and data to nested dictionary
I have an "unstructured" list that looks like this:
info = [
    'Joe Schmoe',
    'W / M / 64',
    'Richard Johnson',
    'OFFICER',
    'W / M /48',
    'Adrian Stevens',
    '? / ? / 27'
]
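It is unstructured because the list is made up of either:
- (name, officer status, demographic info) triplets, or
- (name, demographic info) pairs.
In the latter case Officer=False; in the former, Officer=True. The demographic string represents Race / Gender / Age, with missing values (NaN) represented by literal question marks. Here is where I'd like to get to: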
res = {
    'Joe Schmoe': {
        'race': 'W',
        'gender': 'M',
        'age': 64,
        'officer': False
    },
    'Richard Johnson': {
        'race': 'W',
        'gender': 'M',
        'age': 48,
        'officer': True
    },
    'Adrian Stevens': {
        'race': 'NaN',
        'gender': 'NaN',
        'age': 27,
        'officer': False
    }
}
I have built two functions to do this. The first, below, handles the demographic string. (I'm fine with this one; it is here only for reference.)
import re

def fix_demographic(info):
    # W / M / ?? --> W / M / NaN
    # ?/M/? --> NaN / M / NaN
    # Keep as str NaN rather than np.nan for now
    # Raw strings avoid invalid-escape warnings for \s and \?
    race, gender, age = re.split(r'\s*/\s*', re.sub(r'\?+', 'NaN', info))
    return race, gender, age
The second function deconstructs the list and places its values at different locations in the result dictionary:
from collections import defaultdict

demographic = re.compile(r'(\w+|\?+)\s*\/\s*(\w+|\?+)\s*\/\s*(\w+|\?+)')

def parse_victim_info(info: list):
    res = defaultdict(dict)
    for i in info:
        if not demographic.fullmatch(i) and i.lower() != 'officer':
            # We have a name
            previous = 'name'
            name = i
        if i.lower() == 'officer':
            res[name]['officer'] = True
            previous = 'officer'
        if demographic.fullmatch(i):
            # We have demographic info; did "OFFICER" come before it?
            if previous == 'name':
                res[name]['officer'] = False
            race, gender, age = fix_demographic(i)
            res[name]['race'] = race
            res[name]['gender'] = gender
            res[name]['age'] = int(age) if age.isnumeric() else age
            previous = None
    return res
>>> parse_victim_info(info)
defaultdict(dict,
            {'Adrian Stevens': {'age': 27,
                                'gender': 'NaN',
                                'officer': False,
                                'race': 'NaN'},
             'Richard Johnson': {'age': 48,
                                 'gender': 'M',
                                 'officer': True,
# ... ...
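The second function feels overly verbose and tedious.
Is there a better way to do this, one that is smarter about remembering the category of the last value seen during iteration?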
This sort of thing is a good fit for a generator:
Code:
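def find_triplets(data):
    data = iter(data)
    # Looping with `for` ends the generator cleanly when the data runs out;
    # a bare next() under `while True` raises RuntimeError on Python 3.7+ (PEP 479)
    for name in data:
        # The item after a name is either 'OFFICER' or the demographic string
        demo = next(data)
        officer = demo == 'OFFICER'
        if officer:
            demo = next(data)
        yield name, officer, demo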
Test code:
info = [
    'Joe Schmoe',
    'W / M / 64',
    'Lillian Schmoe',
    'W / F / 60',
    'Richard Johnson',
    'OFFICER',
    'W / M /48',
    'Adrian Stevens',
    '? / ? / 27'
]
for x in find_triplets(info):
    print(x)
Results:
('Joe Schmoe', False, 'W / M / 64')
('Lillian Schmoe', False, 'W / F / 60')
('Richard Johnson', True, 'W / M /48')
('Adrian Stevens', False, '? / ? / 27')
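One thing worth guarding in a generator like this: on Python 3.7 and later, a StopIteration that escapes a generator body is converted to RuntimeError (PEP 479), so a bare next() on exhausted data does not end the generator quietly. The following is only a sketch of one way to handle that, assuming a hypothetical variant named find_triplets_strict that reports truncated input with a ValueError; the name and the error message are illustrative and not part of the answer above.

```python
def find_triplets_strict(data):
    """Yield (name, officer, demo) tuples; hypothetical stricter variant."""
    items = iter(data)
    sentinel = object()  # marks "ran out of items" without raising StopIteration
    for name in items:
        demo = next(items, sentinel)
        officer = demo == 'OFFICER'
        if officer:
            demo = next(items, sentinel)
        if demo is sentinel:
            # The list ended right after a name or right after 'OFFICER'
            raise ValueError(f'no demographic info after {name!r}')
        yield name, officer, demo


info = ['Joe Schmoe', 'W / M / 64', 'Richard Johnson', 'OFFICER', 'W / M /48']
triplets = list(find_triplets_strict(info))
```

Whether truncated input should raise or be dropped silently depends on how much the data source is trusted; raising makes bad records visible early.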
Converting the tuple triplets to a dict:
import re

def fix_demographic(info):
    # W / M / ?? --> W / M / NaN
    # ?/M/? --> NaN / M / NaN
    # Keep as str NaN rather than np.nan for now
    race, gender, age = re.split(r'\s*/\s*', re.sub(r'\?+', 'NaN', info))
    return dict(race=race, gender=gender, age=age)

data_dict = {name: dict(officer=officer, **fix_demographic(demo))
             for name, officer, demo in find_triplets(info)}
print(data_dict)
Results:
{
    'Joe Schmoe': {'officer': False, 'race': 'W', 'gender': 'M', 'age': '64'},
    'Lillian Schmoe': {'officer': False, 'race': 'W', 'gender': 'F', 'age': '60'},
    'Richard Johnson': {'officer': True, 'race': 'W', 'gender': 'M', 'age': '48'},
    'Adrian Stevens': {'officer': False, 'race': 'NaN', 'gender': 'NaN', 'age': '27'}
}
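Note that the result above keeps age as a string ('64'), while the desired res in the question has integer ages. A minimal sketch of closing that gap, assuming a hypothetical helper named parse_demographic that does the same splitting as fix_demographic and converts the age when it is all digits (the helper name is illustrative, not from the answer):

```python
import re


def find_triplets(data):
    # Same idea as the generator above: yield (name, officer, demo) tuples
    items = iter(data)
    for name in items:
        demo = next(items)
        officer = demo == 'OFFICER'
        if officer:
            demo = next(items)
        yield name, officer, demo


def parse_demographic(demo):
    # Hypothetical helper: runs of '?' become 'NaN', numeric ages become int
    race, gender, age = re.split(r'\s*/\s*', re.sub(r'\?+', 'NaN', demo.strip()))
    return {'race': race, 'gender': gender,
            'age': int(age) if age.isdigit() else age}


info = [
    'Joe Schmoe', 'W / M / 64',
    'Richard Johnson', 'OFFICER', 'W / M /48',
    'Adrian Stevens', '? / ? / 27',
]
res = {name: {**parse_demographic(demo), 'officer': officer}
       for name, officer, demo in find_triplets(info)}
```

An unknown age stays as the string 'NaN', matching the question's choice to keep 'NaN' as a str for now.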
You can use itertools.groupby in Python 3:
import itertools
import re
info = [
    'Joe Schmoe',
    'W / M / 64',
    'Lillian Schmoe',
    'W / F / 60',
    'Richard Johnson',
    'OFFICER',
    'W / M /48',
    'Adrian Stevens',
    '? / ? / 27'
]
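# Group consecutive items: names get key False, 'OFFICER' and demographic strings get key True
data = [list(b) for a, b in itertools.groupby(info, key=lambda x: x.count('/') > 0 or x == 'OFFICER')]
# data alternates [name], [details...]; pair each name with the details group that follows it
final_data = {
    data[i][0]: {
        **{a: 'NaN' if b == '?' else (int(b) if b.isdigit() else b)
           for a, b in zip(['race', 'gender', 'age'],
                           filter(None, re.split(r'\s+|/', [h for h in data[i+1] if h.count('/') > 0][0])))},
        **{"officer": "OFFICER" in data[i+1]},
    }
    for i in range(0, len(data), 2)
}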
Output:
{'Joe Schmoe': {'race': 'W', 'gender': 'M', 'age': 64, 'officer': False}, 'Lillian Schmoe': {'race': 'W', 'gender': 'F', 'age': 60, 'officer': False}, 'Richard Johnson': {'race': 'W', 'gender': 'M', 'age': 48, 'officer': True}, 'Adrian Stevens': {'race': 'NaN', 'gender': 'NaN', 'age': 27, 'officer': False}}
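The same groupby idea can be spelled out in named steps, which may be easier to follow than a single comprehension. This is only a sketch under stated assumptions: the names is_detail and parse_groups are hypothetical, and it assumes a name never contains '/' and is never exactly 'OFFICER'.

```python
import itertools


def is_detail(item):
    # Assumption: anything containing '/' or equal to 'OFFICER' is not a name
    return '/' in item or item == 'OFFICER'


def parse_groups(info):
    result = {}
    name = None
    for detail, group in itertools.groupby(info, key=is_detail):
        items = list(group)
        if not detail:
            name = items[-1]  # remember the name for the details group that follows
            continue
        demo = next(i for i in items if '/' in i)
        fields = [f.strip() for f in demo.split('/')]
        # A field made up only of '?' characters counts as missing
        race, gender, age = ('NaN' if set(f) == {'?'} else f for f in fields)
        result[name] = {
            'race': race,
            'gender': gender,
            'age': int(age) if age.isdigit() else age,
            'officer': 'OFFICER' in items,
        }
    return result


info = [
    'Joe Schmoe', 'W / M / 64',
    'Lillian Schmoe', 'W / F / 60',
    'Richard Johnson', 'OFFICER', 'W / M /48',
    'Adrian Stevens', '? / ? / 27',
]
final_data = parse_groups(info)
```

Because groupby only merges adjacent items with the same key, this relies on every name being followed directly by its own details.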