Extract organization names from a column holding a list of dictionaries in a pandas DataFrame
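My dataframe has several columns such as ID, Organizations, Date, Location, etc. I am trying to extract the "organization" values from the "Organizations" column. The desired output is a new column holding the organization names, separated by commas. For example:
ID | Organizations |
---|---|
1 | [{organization=Glaxosmithkline, character_offset=10512}, {organization=Vulpes Fund, character_offset=13845}] |
2 | [{organization=Amazon, character_offset=14589}, {organization=Sinovac, character_offset=18923}] |
I would like the output to look like this:
ID | Organizations |
---|---|
1 | Glaxosmithkline, Vulpes Fund |
2 | Amazon, Sinovac |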
I tried the following code (the output is NaN):
latin_combined['newOrg'] = latin_combined['organizations'].str[0].str['organization']
Edit:
df.head(5)['organizations'].to_dict()
gives me the following output:
{0: '[{organization= Vac, character_offset=14199}, {organization=Health, character_offset=1494}]',
1: '[{organization=Store, character_offset=700}, {organization= Museum, character_offset=1711}]',
2: '[{organization= Mart, character_offset=8232}, {organization= Rep, character_offset=5517}]',
3: '[{organization= Lodge, character_offset=3881}, {organization= Hotel, character_offset=5947}]',
4: '[{organization=Airport, character_offset=3881}, {organization=Landmark, character_offset=5947}]'}
Any suggestions would be helpful.
Is this what you are trying to do?
latin_combined['newOrg'] = latin_combined['organizations'].apply(lambda x : x.split(',')[0])
You can combine a list comprehension with apply:
import pandas as pd
df = pd.DataFrame([[[{'organization':'Glaxosmithkline', 'character_offset':10512}, {'organization':'Vulpes Life Sciences Fund', 'character_offset':13845}]]], columns=['newOrg'])
df['Organizations'] = df['newOrg'].apply(lambda x: [i['organization'] for i in x])
Output:
index | newOrg | Organizations |
---|---|---|
0 | [{'organization': 'Glaxosmithkline', 'character_offset': 10512}, {'organization': 'Vulpes Life Sciences Fund', 'character_offset': 13845}] | ['Glaxosmithkline', 'Vulpes Life Sciences Fund'] |
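The list comprehension returns a Python list per row, while the question asks for a single comma-separated string. A minimal sketch of that last step, assuming each cell really holds a list of dicts rather than a string; the column name `org_names` is only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "newOrg": [[
        {"organization": "Glaxosmithkline", "character_offset": 10512},
        {"organization": "Vulpes Life Sciences Fund", "character_offset": 13845},
    ]]
})

# Join the names found in each row's dicts into one comma-separated string.
df["org_names"] = df["newOrg"].apply(
    lambda records: ", ".join(r["organization"] for r in records)
)
print(df["org_names"].iloc[0])
```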
1. Updated to match your latest dataframe edit:
data = {'Organizations': ['[{organization= Vac, character_offset=14199}, {organization=Health, character_offset=1494}]',
'[{organization=Store, character_offset=700}, {organization= Museum, character_offset=1711}]',
'[{organization= Mart, character_offset=8232}, {organization= Rep, character_offset=5517}]',
'[{organization= Lodge, character_offset=3881}, {organization= Hotel, character_offset=5947}]',
'[{organization=Airport, character_offset=3881}, {organization=Landmark, character_offset=5947}]']}
df = pd.DataFrame(data)
df
index | Organizations |
---|---|
0 | [{organization= Vac, character_offset=14199}, {organization=Health, character_offset=1494}] |
1 | [{organization=Store, character_offset=700}, {organization= Museum, character_offset=1711}] |
2 | [{organization= Mart, character_offset=8232}, {organization= Rep, character_offset=5517}] |
3 | [{organization= Lodge, character_offset=3881}, {organization= Hotel, character_offset=5947}] |
4 | [{organization=Airport, character_offset=3881}, {organization=Landmark, character_offset=5947}] |
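The `to_dict()` output in the question shows every cell wrapped in quotes, which suggests the cells are plain strings rather than lists of dicts. That would explain why the original `.str[0]` attempt never reached an `organization` key: on a string, `.str[0]` is just the first character. A quick check, sketched on one of the rows above:

```python
import pandas as pd

s = pd.Series(
    ["[{organization= Vac, character_offset=14199}, "
     "{organization=Health, character_offset=1494}]"],
    name="Organizations",
)

cell_type = type(s.iloc[0])    # the Python type stored in the cell
first_item = s.str[0].iloc[0]  # what .str[0] actually returns for a string
print(cell_type.__name__, repr(first_item))
```

If the first item printed is a bracket character instead of a dict, the column needs text parsing (regex or similar) before any key lookup can work.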
2. Use ', '.join() + regex with .apply() on the column you want:
import re
df.Organizations = df.Organizations.apply(lambda x: ', '.join(re.findall(r'{[^=]+=\s*([^=,}]+)', x)))
df
3. Result:
index | Organizations |
---|---|
0 | Vac, Health |
1 | Store, Museum |
2 | Mart, Rep |
3 | Lodge, Hotel |
4 | Airport, Landmark |
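To make the regex easier to follow, here is the same idea written with `re.VERBOSE`, with the brace escaped for clarity. This is only a sketch: it assumes names never contain `=`, `,` or `}`, and that `organization` is always the first key inside each pair of braces, because the pattern captures the value of whichever key comes first. The name `ORG_PATTERN` is illustrative.

```python
import re

# Same idea as the one-line pattern, spelled out piece by piece.
ORG_PATTERN = re.compile(
    r"""
    \{          # opening brace of one record
    [^=]+ =     # the first key in the record and its equals sign
    \s*         # optional spaces before the value
    ([^=,}]+)   # the value: everything up to the next comma or brace
    """,
    re.VERBOSE,
)

row = ("[{organization=Glaxosmithkline, character_offset=10512}, "
       "{organization=Vulpes Fund, character_offset=13845}]")
names = ORG_PATTERN.findall(row)
print(", ".join(names))
```

Because the value group allows spaces, a multi-word name such as "Vulpes Fund" from the question stays in one piece.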
Personally, I think you should try to scrape and/or clean the data better before putting it into the dataframe.
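One way to act on that advice is to parse each raw string into a real list of dicts before building the dataframe, so ordinary dict access works afterwards. A rough sketch, where `parse_records` is a hypothetical helper and the input is assumed to follow the `{key=value, key=value}` layout from the question, with no commas or braces inside values:

```python
import re
import pandas as pd

def parse_records(text):
    """Turn '[{k=v, k=v}, {...}]' into a list of dicts (values stay strings)."""
    records = []
    for body in re.findall(r"\{([^}]*)\}", text):  # text inside each {...}
        record = {}
        for pair in body.split(","):
            key, _, value = pair.partition("=")
            record[key.strip()] = value.strip()
        records.append(record)
    return records

raw = [
    "[{organization= Vac, character_offset=14199}, {organization=Health, character_offset=1494}]",
    "[{organization=Store, character_offset=700}, {organization= Museum, character_offset=1711}]",
]

# Clean first, then build the dataframe from real Python objects.
df = pd.DataFrame({"organizations": [parse_records(t) for t in raw]})
df["newOrg"] = df["organizations"].apply(
    lambda recs: ", ".join(r["organization"] for r in recs)
)
print(df["newOrg"].tolist())
```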
It looks like you have a string. You can use a regex to extract the key-value pairs separated by = and then pivot, as shown below:
(df['organizations'].str.extractall('([^{=,]+)= *([^=,}]+)')
.rename({0:'key', 1:'value'}, axis = 1).reset_index()
.groupby(['level_0', 'key'])['value'].agg(', '.join).unstack())
key character_offset organization
level_0
0 14199, 1494 Vac, Health
1 700, 1711 Store, Museum
2 8232, 5517 Mart, Rep
3 3881, 5947 Lodge, Hotel
4 3881, 5947 Airport, Landmark
Data:
d = {0: '[{organization= Vac, character_offset=14199}, {organization=Health, character_offset=1494}]',
1: '[{organization=Store, character_offset=700}, {organization= Museum, character_offset=1711}]',
2: '[{organization= Mart, character_offset=8232}, {organization= Rep, character_offset=5517}]',
3: '[{organization= Lodge, character_offset=3881}, {organization= Hotel, character_offset=5947}]',
4: '[{organization=Airport, character_offset=3881}, {organization=Landmark, character_offset=5947}]'}
df = pd.Series(d).to_frame('organizations')
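One detail to watch in the pivoted result: with the pattern as written, the key group appears to pick up the space that follows each comma, so the second column may be labelled `' character_offset'` (with a leading space) and a lookup by `'character_offset'` would fail. A variant that excludes whitespace from the key group, sketched on two of the rows above; the names `pairs` and `wide` are illustrative:

```python
import pandas as pd

d = {0: '[{organization= Vac, character_offset=14199}, {organization=Health, character_offset=1494}]',
     1: '[{organization=Store, character_offset=700}, {organization= Museum, character_offset=1711}]'}
df = pd.Series(d).to_frame('organizations')

# \s is excluded from the key group, so keys carry no leading space.
pairs = df['organizations'].str.extractall(r'([^{=,\s]+)=\s*([^=,}]+)')
pairs.columns = ['key', 'value']

# One row per original row, one column per key, values joined with ', '.
wide = (pairs.reset_index()
        .groupby(['level_0', 'key'])['value'].agg(', '.join).unstack())
print(wide.columns.tolist())
```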
You can do it this way:
df['organizations'].str.extractall(r"organization= *(\w+)") \
.groupby(level=0).agg(', '.join).rename(columns={0:'Organizations'})
Organizations
0 Vac, Health
1 Store, Museum
2 Mart, Rep
3 Lodge, Hotel
4 Airport, Landmark
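Note that `\w+` stops at the first non-word character, so a multi-word name such as "Vulpes Fund" from the question would be cut to "Vulpes". A possible alternative, assuming names contain no commas or closing braces, captures everything up to the next `,` or `}` and builds the string with `Series.str.findall` and `Series.str.join`:

```python
import pandas as pd

df = pd.DataFrame({"organizations": [
    "[{organization=Glaxosmithkline, character_offset=10512}, {organization=Vulpes Fund, character_offset=13845}]",
    "[{organization=Amazon, character_offset=14589}, {organization=Sinovac, character_offset=18923}]",
]})

# findall returns a list of captured names per row; str.join glues each list.
df["Organizations"] = (
    df["organizations"]
    .str.findall(r"organization=\s*([^,}]+)")
    .str.join(", ")
)
print(df["Organizations"].tolist())
```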