Extract organization from a column with list of dictionaries using pd dataframe

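My dataframe has several columns, such as ID, organizations, date, location, etc. I am trying to extract the "organization" values from the "organizations" column. The output I want is a new column holding the organization names, separated by commas. For example: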

ID Organizations
1 [{organization=Glaxosmithkline, character_offset=10512}, {organization=Vulpes Fund, character_offset=13845}]
2 [{organization=Amazon, character_offset=14589}, {organization=Sinovac, character_offset=18923}]

I would like the output to look like this:

ID Organizations
1 Glaxosmithkline, Vulpes Fund
2 Amazon, Sinovac

I tried the following code (the output is NaN):

latin_combined['newOrg'] = latin_combined['organizations'].str[0].str['organization']

Edit: df.head(5)['organizations'].to_dict() gives me the following output:

{0: '[{organization= Vac, character_offset=14199}, {organization=Health, character_offset=1494}]',
 1: '[{organization=Store, character_offset=700}, {organization= Museum, character_offset=1711}]',
 2: '[{organization= Mart, character_offset=8232}, {organization= Rep, character_offset=5517}]',
 3: '[{organization= Lodge, character_offset=3881}, {organization= Hotel, character_offset=5947}]',
 4: '[{organization=Airport, character_offset=3881}, {organization=Landmark, character_offset=5947}]'}

Any suggestions would be helpful.

Is this what you are trying to do?

latin_combined['newOrg'] = latin_combined['organizations'].apply(lambda x : x.split(',')[0])

You can combine a list comprehension with apply:

import pandas as pd

df = pd.DataFrame([[[{'organization':'Glaxosmithkline', 'character_offset':10512}, {'organization':'Vulpes Life Sciences Fund', 'character_offset':13845}]]], columns=['newOrg'])
df['Organizations'] = df['newOrg'].apply(lambda x: [i['organization'] for i in x])

Output:

newOrg Organizations
0 [{'organization': 'Glaxosmithkline', 'character_offset': 10512}, {'organization': 'Vulpes Life Sciences Fund', 'character_offset': 13845}] ['Glaxosmithkline', 'Vulpes Life Sciences Fund']
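The list comprehension above returns a Python list per row, while the question asks for a comma-separated string. A minimal sketch of that last step, assuming the column really holds lists of dicts (the sample frame below is made up from the question's example, and the column names are only illustrative):

```python
import pandas as pd

# Hypothetical sample: each cell is a real Python list of dicts.
df = pd.DataFrame({
    'ID': [1, 2],
    'organizations': [
        [{'organization': 'Glaxosmithkline', 'character_offset': 10512},
         {'organization': 'Vulpes Fund', 'character_offset': 13845}],
        [{'organization': 'Amazon', 'character_offset': 14589},
         {'organization': 'Sinovac', 'character_offset': 18923}],
    ],
})

# Pull the 'organization' value out of every dict and join the names per row.
df['newOrg'] = df['organizations'].apply(
    lambda dicts: ', '.join(d['organization'] for d in dicts)
)
print(df[['ID', 'newOrg']])
```

If the cells are strings rather than lists, as the edited question suggests, this will not work as is and a string-based approach is needed instead.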

1. Updated based on your latest dataframe edit:

data = {'Organizations': ['[{organization= Vac, character_offset=14199}, {organization=Health, character_offset=1494}]',
        '[{organization=Store, character_offset=700}, {organization= Museum, character_offset=1711}]',
        '[{organization= Mart, character_offset=8232}, {organization= Rep, character_offset=5517}]',
        '[{organization= Lodge, character_offset=3881}, {organization= Hotel, character_offset=5947}]',
        '[{organization=Airport, character_offset=3881}, {organization=Landmark, character_offset=5947}]']}
df = pd.DataFrame(data) 
df
index Organizations
0 [{organization= Vac, character_offset=14199}, {organization=Health, character_offset=1494}]
1 [{organization=Store, character_offset=700}, {organization= Museum, character_offset=1711}]
2 [{organization= Mart, character_offset=8232}, {organization= Rep, character_offset=5517}]
3 [{organization= Lodge, character_offset=3881}, {organization= Hotel, character_offset=5947}]
4 [{organization=Airport, character_offset=3881}, {organization=Landmark, character_offset=5947}]

2. Use ', '.join() with a regex inside apply() on the column you want:

import re
df.Organizations = df.Organizations.apply(lambda x: ', '.join(re.findall(r'{[^=]+=\s*([^=,}]+)', x)))
df

3. Result:

index Organizations
0 Vac, Health
1 Store, Museum
2 Mart, Rep
3 Lodge, Hotel
4 Airport, Landmark
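To see what the regex in step 2 does on its own, here is a small sketch that applies an equivalent pattern to a single cell value, outside pandas. The helper name extract_names is hypothetical, and the brace is escaped only for readability:

```python
import re

# A literal "{", the first key up to "=", optional whitespace, then the value
# captured up to the next "=", "," or "}".
PATTERN = re.compile(r'\{[^=]+=\s*([^=,}]+)')

def extract_names(cell):
    """Hypothetical helper: join every captured value with ', '."""
    return ', '.join(name.strip() for name in PATTERN.findall(cell))

cell = ('[{organization= Vac, character_offset=14199}, '
        '{organization=Health, character_offset=1494}]')
multi_word = ('[{organization=Glaxosmithkline, character_offset=10512}, '
              '{organization=Vulpes Fund, character_offset=13845}]')

print(PATTERN.findall(cell))      # one captured value per {...} group
print(extract_names(multi_word))  # names containing spaces are kept whole
```

Because the pattern anchors on "{", it only captures the first key of each group, so it relies on organization coming first in every dictionary.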

Personally, I think you should try to scrape and/or clean the data better before putting it into a dataframe.

It looks like you have a string. You can use a regex to extract the key-value pairs separated by = and then pivot, as shown below:

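# The key group also excludes spaces, so the "character_offset" column name
# does not end up with a leading blank.
(df['organizations'].str.extractall(r'([^{=, ]+)= *([^=,}]+)')
  .rename({0:'key', 1:'value'}, axis = 1).reset_index()
  .groupby(['level_0', 'key'])['value'].agg(', '.join).unstack())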

key      character_offset       organization
level_0                                     
0             14199, 1494        Vac, Health
1               700, 1711      Store, Museum
2              8232, 5517          Mart, Rep
3              3881, 5947       Lodge, Hotel
4              3881, 5947  Airport, Landmark

Data

d = {0: '[{organization= Vac, character_offset=14199}, {organization=Health, character_offset=1494}]',
 1: '[{organization=Store, character_offset=700}, {organization= Museum, character_offset=1711}]',
 2: '[{organization= Mart, character_offset=8232}, {organization= Rep, character_offset=5517}]',
 3: '[{organization= Lodge, character_offset=3881}, {organization= Hotel, character_offset=5947}]',
 4: '[{organization=Airport, character_offset=3881}, {organization=Landmark, character_offset=5947}]'}

df = pd.Series(d).to_frame('organizations')

You can do it like this:

df['organizations'].str.extractall(r"organization= *(\w+)") \
    .groupby(level=0).agg(', '.join).rename(columns={0:'Organizations'})

       Organizations
0        Vac, Health
1      Store, Museum
2          Mart, Rep
3       Lodge, Hotel
4  Airport, Landmark
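One caveat: \w+ stops at the first non-word character, so a multi-word name such as "Vulpes Fund" from the original question would be cut to "Vulpes". A possible variant, sketched here on a made-up two-row Series and assuming organization names never contain a comma or a closing brace, captures everything up to the next "," or "}" instead:

```python
import pandas as pd

# Hypothetical sample built from the question's first example.
s = pd.Series([
    '[{organization=Glaxosmithkline, character_offset=10512}, '
    '{organization=Vulpes Fund, character_offset=13845}]',
    '[{organization=Amazon, character_offset=14589}, '
    '{organization=Sinovac, character_offset=18923}]',
], name='organizations')

names = (
    s.str.extractall(r'organization= *([^,}]+)')[0]  # one row per match
     .str.strip()                                    # drop stray whitespace
     .groupby(level=0)                               # regroup by original row
     .agg(', '.join)
)
print(names)
```

A name that itself contains a comma would still be truncated, so this is only a sketch under that assumption.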