Pandas DataFrame: look up keys in a dict and write each key into a new column named after its value

I have a DataFrame:

import pandas as pd

df = pd.DataFrame([
                   {'ID': 1,'A': [{'name': 'lifestyle'}, {'name': 'economy'}, 
                          {'name': 'politics'}, {'name': 'climate & environment'}]}, 
                   {'ID': 2,'A': [{'name': 'sport'}]}, 
                   {'ID': 3,'A': [{'name': 'climate & environment'}]},
                   {'ID': 4,'A': [{'name': 'sport'}]},
                   {'ID': 5,'A': [{'name': 'politics'}, {'name': 'world'}]},
                   {'ID': 6,'A': [{'name': 'economy'}, {'name': 'politics'}]}
                  ])

Every value in column A belongs to a category. The categories are stored in a hard-coded dictionary (categories.txt):

d = {'lifestyle':'cat1',
'economy':'cat1',
'politics':'cat2',
'climate & environment':'cat2',
'sport':'cat3',
'world':'cat4',
'news':'cat3'}

My goal is to look up each key and write it into a new column named after its value (cat1, cat2, ...).

This is what I have so far:

df['A'] = [','.join(map(str, l)) for l in df['A']]
# read in the dict
d = {}
with open("categories.txt", "r") as file:
    for line in file:
        key, value = line.strip().split(":")
        d[key] = value


di = {k: oldk for oldk, oldv in d.items() for k in oldv.split(',')}


for k, v in d.items():
    if v == 'cat1':
        df.loc[df['A'].str.contains(k), 'cat1'] = k
    elif v == 'cat2':
        df.loc[df['A'].str.contains(k), 'cat2'] = k
    elif v == 'cat3':
        df.loc[df['A'].str.contains(k), 'cat3'] = k 
    else:
        df.loc[df['A'].str.contains(k), 'cat4'] = k

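Now, if a row contains more than one key, each key ends up written as the next category, which is not the right approach. How can I get every key into the correct column named after its value (one or more keys per cell)? Like this: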

df = pd.DataFrame({'ID':[1,2,3,4,5,6],
                   'cat1':['lifestyle, economy','nan','nan','nan','nan','economy, politics'],
                   'cat2':['politics, climate & environment','nan','climate & environment','nan','politics','nan'],
                   'cat3':['nan','sport','nan','sport','nan','nan'],
                   'cat4':['nan','nan','nan','nan','world','nan']})

Thanks in advance.

Use explode and apply to extract the values from your first DataFrame, then map them with your dictionary before building the pivot table:

Update

Merging the ID column back with the categories does not work because the number of entries is different. The ID column is crucial.

out = df['A'].explode().apply(pd.Series).reset_index()
out['category'] = out['name'].map(d)
out = out.pivot_table(index='index', columns='category',
                      values='name', aggfunc=', '.join) \
         .rename_axis(index=None, columns=None)
out = df[['ID']].join(out)

Output:

>>> out
   ID                cat1                             cat2   cat3   cat4
0   1  lifestyle, economy  politics, climate & environment    NaN    NaN
1   2                 NaN                              NaN  sport    NaN
2   3                 NaN            climate & environment    NaN    NaN
3   4                 NaN                              NaN  sport    NaN
4   5                 NaN                         politics    NaN  world
5   6             economy                         politics    NaN    NaN
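
For reference, here is a self-contained sketch of the same idea that can be run end to end. It is a variation, not the code above: it pulls the name out of each dict with a plain map instead of apply(pd.Series), and it uses groupby plus unstack in place of pivot_table. The names `names`, `long_df` and `wide` are only illustrative, and the dictionary `d` is assumed to already be loaded as a normal Python dict.

```python
import pandas as pd

df = pd.DataFrame([
    {'ID': 1, 'A': [{'name': 'lifestyle'}, {'name': 'economy'},
                    {'name': 'politics'}, {'name': 'climate & environment'}]},
    {'ID': 2, 'A': [{'name': 'sport'}]},
    {'ID': 3, 'A': [{'name': 'climate & environment'}]},
    {'ID': 4, 'A': [{'name': 'sport'}]},
    {'ID': 5, 'A': [{'name': 'politics'}, {'name': 'world'}]},
    {'ID': 6, 'A': [{'name': 'economy'}, {'name': 'politics'}]},
])

# Assumption: the mapping is already available as a dict (name -> category).
d = {'lifestyle': 'cat1', 'economy': 'cat1', 'politics': 'cat2',
     'climate & environment': 'cat2', 'sport': 'cat3',
     'world': 'cat4', 'news': 'cat3'}

# One row per list element; the original row label is repeated in the index.
names = df['A'].explode().map(lambda item: item['name'])

# Long format: original row label, name, and the category looked up in d.
long_df = names.rename('name').reset_index()
long_df['category'] = long_df['name'].map(d)

# Join the names per (row, category) pair, then spread categories into columns.
# Names missing from d get a NaN category and are dropped by groupby.
wide = (long_df.groupby(['index', 'category'])['name']
               .agg(', '.join)
               .unstack()
               .rename_axis(index=None, columns=None))

# Attach the ID column by aligning on the original row index.
out = df[['ID']].join(wide)
print(out)
```

Because the join aligns on the original row index, every ID keeps its row even when it has no entry for a given category; those cells are simply NaN.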