Pandas DataFrame: look up keys in a dict and write each key into a new column named after its value
I have a dataframe:
df = pd.DataFrame([
    {'ID': 1, 'A': [{'name': 'lifestyle'}, {'name': 'economy'},
                    {'name': 'politics'}, {'name': 'climate & environment'}]},
    {'ID': 2, 'A': [{'name': 'sport'}]},
    {'ID': 3, 'A': [{'name': 'climate & environment'}]},
    {'ID': 4, 'A': [{'name': 'sport'}]},
    {'ID': 5, 'A': [{'name': 'politics'}, {'name': 'world'}]},
    {'ID': 6, 'A': [{'name': 'economy'}, {'name': 'politics'}]}
])
Each value in column A belongs to a category. The categories are in a hard-coded dictionary (categories.txt):
dict = {'lifestyle': 'cat1',
        'economy': 'cat1',
        'politics': 'cat2',
        'climate & environment': 'cat2',
        'sport': 'cat3',
        'world': 'cat4',
        'news': 'cat3'}
My goal is to look up each key and write it into a new column named after its value (cat1, cat2, ...).
This is what I have so far:
df['A'] = [','.join(map(str, l)) for l in df['A']]
# read in the dict
d = {}
with open("categories.txt", "r") as file:
    for line in file:
        key, value = line.strip().split(":")
        d[key] = value
di = {k: oldk for oldk, oldv in d.items() for k in oldv.split(',')}
for k, v in d.items():
    if v == 'cat1':
        df.loc[df['A'].str.contains(k), 'cat1'] = k
    elif v == 'cat2':
        df.loc[df['A'].str.contains(k), 'cat2'] = k
    elif v == 'cat3':
        df.loc[df['A'].str.contains(k), 'cat3'] = k
    else:
        df.loc[df['A'].str.contains(k), 'cat4'] = k
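Now, if a row contains more than one key, the key gets written as the next category, which is not right. How can I get every key into the correct column, the one named after its value, with one or more keys per cell?
Like this: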
df = pd.DataFrame({'ID':[1,2,3,4,5,6],
'cat1':['lifestyle, economy','nan','nan','nan','nan','economy, politics'],
'cat2':['politics, climate & environment','nan','climate & environment','nan','politics','nan'],
'cat3':['nan','sport','nan','sport','nan','nan'],
'cat4':['nan','nan','nan','nan','world','nan']})
Thanks in advance.
Use explode and apply to extract the values from your first dataframe, then map them with your dictionary before pivoting the table:
Update
Merging the ID column back with the categories does not work because the number of entries is different. The ID column is crucial.
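# One row per dict in A, expanded into a 'name' column; 'index' keeps the original row label
out = df['A'].explode().apply(pd.Series).reset_index()
# Look up the category of each name in the dictionary d
out['category'] = out['name'].map(d)
# One column per category; names that share a row and a category are joined with ', '
out = (out.pivot_table(index='index', columns='category',
                       values='name', aggfunc=', '.join)
          .rename_axis(index=None, columns=None))
# Reattach the ID column by row label
out = df[['ID']].join(out)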
Output:
>>> out
   ID                cat1                             cat2   cat3   cat4
0   1  lifestyle, economy  politics, climate & environment    NaN    NaN
1   2                 NaN                              NaN  sport    NaN
2   3                 NaN            climate & environment    NaN    NaN
3   4                 NaN                              NaN  sport    NaN
4   5                 NaN                         politics    NaN  world
5   6             economy                         politics    NaN    NaN
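For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same explode / map / pivot_table / join steps. It assumes column A still holds the original lists of dicts (not the comma-joined strings from the question) and that the dictionary is already loaded as d. The function name keys_by_category is made up for illustration, and it pulls out 'name' with a plain lookup instead of apply(pd.Series), which is a substitution on my part rather than part of the answer above.

```python
import pandas as pd

df = pd.DataFrame([
    {'ID': 1, 'A': [{'name': 'lifestyle'}, {'name': 'economy'},
                    {'name': 'politics'}, {'name': 'climate & environment'}]},
    {'ID': 2, 'A': [{'name': 'sport'}]},
    {'ID': 3, 'A': [{'name': 'climate & environment'}]},
    {'ID': 4, 'A': [{'name': 'sport'}]},
    {'ID': 5, 'A': [{'name': 'politics'}, {'name': 'world'}]},
    {'ID': 6, 'A': [{'name': 'economy'}, {'name': 'politics'}]},
])

# Assumed to be the already-parsed content of categories.txt
d = {'lifestyle': 'cat1', 'economy': 'cat1',
     'politics': 'cat2', 'climate & environment': 'cat2',
     'sport': 'cat3', 'world': 'cat4', 'news': 'cat3'}


def keys_by_category(frame, mapping):
    """Hypothetical helper: one column per category, keys joined with ', '."""
    # One row per dict in A; the original row label is repeated in the index
    exploded = frame['A'].explode()
    # Pull out the 'name' entry; anything that is not a dict becomes missing
    names = exploded.map(
        lambda item: item['name'] if isinstance(item, dict) else None
    ).rename('name')
    long = names.reset_index()  # columns: 'index' and 'name'
    long['category'] = long['name'].map(mapping)
    # Names sharing a row label and a category end up in one cell
    wide = long.pivot_table(index='index', columns='category',
                            values='name', aggfunc=', '.join)
    wide = wide.rename_axis(index=None, columns=None)
    # Left join on the row label, so every ID is kept
    return frame[['ID']].join(wide)


out = keys_by_category(df, d)
print(out)
```

Because the last step is a left join from df[['ID']], a row whose names match no dictionary key still appears in the result, just with NaN in every category column.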