Pandas - Create dynamic column(s) from a single column's values
I have JSON data that I plan to convert into the desired dataframe and then join with another dataframe.
participants
**row 1** [{'roles': [{'type': 'director'}, {'type': 'founder'}, {'type': 'owner'}, {'type': 'real_owner'}], 'life': {'name': 'Lichun Du'}}]
**row 2** [{'roles': [{'type': 'board'}], 'life': {'name': 'Erik Mølgaard'}}, {'roles': [{'type': 'director'}, {'type': 'board'}, {'type': 'real_owner'}], 'life': {'name': 'Mikael Bodholdt Linde'}}, {'roles': [{'type': 'board'}, {'type': 'real_owner'}], 'life': {'name': 'Dorte Bøcker Linde'}}]
**row 3** [{'roles': [{'type': 'director'}, {'type': 'real_owner'}], 'life': {'name': 'Kristian Løth Hougaard'}}, {'roles': [{'type': 'owner'}], 'life': {'name': 'WORLD JET HOLDING ApS'}}]
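However, I want the columns to be generated dynamically, where the "N" in Participant#N Role or Participant#N Name runs up to the maximum number of participants found in any row of the dataframe.
What I have tried so far:
Attempt 01: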
participants = pd.concat([pd.DataFrame(pd.json_normalize(x)) for x in responses['participants']])
print(participants.transpose())
I am stuck here and cannot find any related post that gets me to the desired dataframe.
Attempt 02:
responses['Role of Participants'] = [x[0]['roles'] for x in participants['roles']]
responses['Participant Name'] = [x[0]['life'] for x in participants['participants']]
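But this only returns the roles and the name object of the first participant in each row, where there can be several.
Please help!
Are you looking for something like this?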
# Extract the value from the "type" property of each subobject of each row
df['roles'] = df['roles'].apply(lambda x: ', '.join(t['type'] for t in x))
# Extract the value of "name" from the subobject of each row
df['life'] = df['life'].str['name']
# Rename 'life' column to 'name' (optional)
df = df.rename({'life': 'name'}, axis=1)
Output:
>>> df
roles name
0 board Erik Mølgaard
1 director, board, real_owner Mikael Bodholdt Linde
2 board, real_owner Dorte Bøcker Linde
Try:
>>> df[df['name'] == 'Mikael Bodholdt Linde']
roles name
1 director, board, real_owner Mikael Bodholdt Linde
>>> df[df['name'] == 'Mikael Bodholdt Linde']['roles']
1 director, board, real_owner
Name: roles, dtype: object
>>> df[df['name'] == 'Mikael Bodholdt Linde']['roles'].iloc[0]
'director, board, real_owner'
>>> df[df['name'] == 'Mikael Bodholdt Linde']['roles'].iloc[0].split(', ')
['director', 'board', 'real_owner']
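The snippet above assumes that `df` already holds one participant per row, with `roles` and `life` columns. A minimal sketch of one way to get such a frame from a single cell of the nested column is shown here; the names `row` and `flat` are illustrative and not part of the original answer.

```python
import pandas as pd

# One cell of the 'participants' column (row 2 of the question's data).
row = [
    {'roles': [{'type': 'board'}], 'life': {'name': 'Erik Mølgaard'}},
    {'roles': [{'type': 'director'}, {'type': 'board'}, {'type': 'real_owner'}],
     'life': {'name': 'Mikael Bodholdt Linde'}},
    {'roles': [{'type': 'board'}, {'type': 'real_owner'}],
     'life': {'name': 'Dorte Bøcker Linde'}},
]

# A list of dicts becomes one row per participant,
# with the dict keys ('roles', 'life') as columns.
flat = pd.DataFrame(row)

# Collapse each list of {'type': ...} dicts into a comma-separated string.
flat['roles'] = flat['roles'].apply(lambda x: ', '.join(t['type'] for t in x))

# Pull the name out of each {'name': ...} dict and drop the nested column.
flat['name'] = flat['life'].map(lambda d: d['name'])
flat = flat.drop(columns=['life'])

print(flat)
```

This only flattens one row of the original data, so it does not yet produce the Participant #N columns asked for in the question.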
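You can run an apply() that uses a for loop to convert each list into a Series with headers; enumerate can put the correct number into each header.
Because some rows have fewer participants, the missing cells are filled with NaN, which you can later replace with empty strings.
Next, you can use join() to add all of the resulting columns as new columns. Because the headers are created in apply(), you don't have to create them in join().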
import pandas as pd
data = {'participants':
[
[{'roles': [{'type': 'director'}, {'type': 'founder'}, {'type': 'owner'}, {'type': 'real_owner'}], 'life': {'name': 'Lichun Du'}}],
[{'roles': [{'type': 'board'}], 'life': {'name': 'Erik Mølgaard'}}, {'roles': [{'type': 'director'}, {'type': 'board'}, {'type': 'real_owner'}], 'life': {'name': 'Mikael Bodholdt Linde'}}, {'roles': [{'type': 'board'}, {'type': 'real_owner'}], 'life': {'name': 'Dorte Bøcker Linde'}}],
[{'roles': [{'type': 'director'}, {'type': 'real_owner'}], 'life': {'name': 'Kristian Løth Hougaard'}}, {'roles': [{'type': 'owner'}], 'life': {'name': 'WORLD JET HOLDING ApS'}}],
]
}
df = pd.DataFrame(data)
def get_names(cell):
    all_names = pd.Series(dtype=object)
    for number, item in enumerate(cell, 1):
        name = item['life']['name']
        all_names[f'Participant #{number} Name'] = name
    return all_names
def get_roles(cell):
    all_roles = pd.Series(dtype=object)
    for number, item in enumerate(cell, 1):
        roles = [role['type'] for role in item['roles']]
        all_roles[f'Participant #{number} Role'] = ",".join(roles)
    return all_roles
roles = df['participants'].apply(get_roles)
roles = roles.fillna('') # put empty string in place of NaN
names = df['participants'].apply(get_names)
names = names.fillna('') # put empty string in place of NaN
df = df.join(roles)
df = df.join(names)
df = df.drop(columns=['participants']) # remove old column
pd.options.display.max_colwidth = 100
print(df.to_string())
Result:
                 Participant #1 Role        Participant #2 Role  Participant #3 Role     Participant #1 Name    Participant #2 Name  Participant #3 Name
0  director,founder,owner,real_owner                                                               Lichun Du
1                              board  director,board,real_owner     board,real_owner           Erik Mølgaard  Mikael Bodholdt Linde   Dorte Bøcker Linde
2                director,real_owner                      owner                       Kristian Løth Hougaard  WORLD JET HOLDING ApS
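As a quick sanity check on the requirement that "N" runs up to the largest participant count in any row, that count can be computed directly from the column. This is only a sketch on a shortened, hypothetical stand-in for the data (the names A, B and C are placeholders), and `max_participants` and `expected_role_columns` are illustrative names.

```python
import pandas as pd

# Shortened, hypothetical stand-in for the 'participants' column.
df = pd.DataFrame({'participants': [
    [{'roles': [{'type': 'director'}], 'life': {'name': 'A'}}],
    [{'roles': [{'type': 'board'}], 'life': {'name': 'B'}},
     {'roles': [{'type': 'owner'}], 'life': {'name': 'C'}}],
]})

# Largest number of participants found in any single row.
max_participants = int(df['participants'].map(len).max())

# The role headers that the dynamic approach should end up producing.
expected_role_columns = [
    f'Participant #{n} Role' for n in range(1, max_participants + 1)
]

print(max_participants)
print(expected_role_columns)
```

Comparing this list with the actual column names after the join is a cheap way to confirm that no participant was dropped.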
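I used two functions to get the role-only columns first and the name-only columns next, but if you need the columns interleaved per participant (name1, role1, name2, role2, name3, role3), it can be done with a single function.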
import pandas as pd
data = {'participants':
[
[{'roles': [{'type': 'director'}, {'type': 'founder'}, {'type': 'owner'}, {'type': 'real_owner'}], 'life': {'name': 'Lichun Du'}}],
[{'roles': [{'type': 'board'}], 'life': {'name': 'Erik Mølgaard'}}, {'roles': [{'type': 'director'}, {'type': 'board'}, {'type': 'real_owner'}], 'life': {'name': 'Mikael Bodholdt Linde'}}, {'roles': [{'type': 'board'}, {'type': 'real_owner'}], 'life': {'name': 'Dorte Bøcker Linde'}}],
[{'roles': [{'type': 'director'}, {'type': 'real_owner'}], 'life': {'name': 'Kristian Løth Hougaard'}}, {'roles': [{'type': 'owner'}], 'life': {'name': 'WORLD JET HOLDING ApS'}}],
]
}
df = pd.DataFrame(data)
def get_columns(cell):
    results = pd.Series(dtype=object)
    for number, item in enumerate(cell, 1):
        name = item['life']['name']
        results[f'Participant #{number} Name'] = name
        roles = [role['type'] for role in item['roles']]
        results[f'Participant #{number} Role'] = ",".join(roles)
    return results
columns = df['participants'].apply(get_columns)
columns = columns.fillna('') # put empty string in place of NaN
df = df.join(columns)
#print(df.columns)
df = df.drop(columns=['participants'])
pd.options.display.max_colwidth = 100
print(df.to_string())
Result:
      Participant #1 Name                Participant #1 Role    Participant #2 Name        Participant #2 Role  Participant #3 Name  Participant #3 Role
0               Lichun Du  director,founder,owner,real_owner
1           Erik Mølgaard                              board  Mikael Bodholdt Linde  director,board,real_owner   Dorte Bøcker Linde     board,real_owner
2  Kristian Løth Hougaard                director,real_owner  WORLD JET HOLDING ApS                      owner
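The same wide layout can also be built without growing a Series inside apply(): each cell is turned into a plain dict and the DataFrame constructor aligns the keys, leaving NaN where a row has fewer participants. This is a sketch of an alternative, not the approach used in the answer; `flatten_participants`, `wide` and `result` are illustrative names.

```python
import pandas as pd

data = {'participants': [
    [{'roles': [{'type': 'director'}, {'type': 'founder'}, {'type': 'owner'}, {'type': 'real_owner'}],
      'life': {'name': 'Lichun Du'}}],
    [{'roles': [{'type': 'board'}], 'life': {'name': 'Erik Mølgaard'}},
     {'roles': [{'type': 'director'}, {'type': 'board'}, {'type': 'real_owner'}],
      'life': {'name': 'Mikael Bodholdt Linde'}},
     {'roles': [{'type': 'board'}, {'type': 'real_owner'}],
      'life': {'name': 'Dorte Bøcker Linde'}}],
    [{'roles': [{'type': 'director'}, {'type': 'real_owner'}],
      'life': {'name': 'Kristian Løth Hougaard'}},
     {'roles': [{'type': 'owner'}], 'life': {'name': 'WORLD JET HOLDING ApS'}}],
]}
df = pd.DataFrame(data)


def flatten_participants(cell):
    """Map one list of participants to {'Participant #N Name'/'Role': value}."""
    record = {}
    for number, item in enumerate(cell, 1):
        record[f'Participant #{number} Name'] = item['life']['name']
        record[f'Participant #{number} Role'] = ','.join(
            role['type'] for role in item['roles']
        )
    return record


# One dict per row; missing keys become NaN, then are replaced by ''.
wide = pd.DataFrame(
    [flatten_participants(cell) for cell in df['participants']],
    index=df.index,
).fillna('')

# Replace the nested column with the flattened columns.
result = df.drop(columns=['participants']).join(wide)
print(result.to_string())
```

Passing `index=df.index` keeps the new frame aligned with the original rows, so the final join() matches on the same labels.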