Pandas

Question

I have JSON data that I plan to convert into a DataFrame of the desired shape and then join with another DataFrame.

participants

**row 1** [{'roles': [{'type': 'director'}, {'type': 'founder'}, {'type': 'owner'}, {'type': 'real_owner'}], 'life': {'name': 'Lichun Du'}}]

**row 2** [{'roles': [{'type': 'board'}], 'life': {'name': 'Erik Mølgaard'}}, {'roles': [{'type': 'director'}, {'type': 'board'}, {'type': 'real_owner'}], 'life': {'name': 'Mikael Bodholdt Linde'}}, {'roles': [{'type': 'board'}, {'type': 'real_owner'}], 'life': {'name': 'Dorte Bøcker Linde'}}]

**row 3** [{'roles': [{'type': 'director'}, {'type': 'real_owner'}], 'life': {'name': 'Kristian Løth Hougaard'}}, {'roles': [{'type': 'owner'}], 'life': {'name': 'WORLD JET HOLDING ApS'}}]

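However, I want the columns to be generated dynamically, named Participant#N Role and Participant#N Name, where N runs up to the largest number of participants found in any row of the DataFrame.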

What I have tried so far. Attempt 01:

participants = pd.concat([pd.DataFrame(pd.json_normalize(x)) for x in responses['participants']])
print(participants.transpose())

I am stuck here and cannot find any related post that gets me to the desired DataFrame.

Attempt 02:

responses['Role of Participants'] = [x[0]['roles'] for x in participants['roles']]
responses['Participant Name'] = [x[0]['life'] for x in participants['participants']]

But this only returns the first roles object and the first name object for each row, when there can be several of each.

Please help!

Answer 1

Is this what you are looking for?

# Extract the value from the "type" property of each subobject of each row
df['roles'] = df['roles'].apply(lambda x: ', '.join(t['type'] for t in x))

# Extract the value of "name" from the subobject of each row
df['life'] = df['life'].str['name']

# Rename 'life' column to 'name' (optional)
df = df.rename({'life': 'name'}, axis=1)

Output:

>>> df
                         roles                   name
0                        board          Erik Mølgaard
1  director, board, real_owner  Mikael Bodholdt Linde
2            board, real_owner     Dorte Bøcker Linde

Try:

>>> df[df['name'] == 'Mikael Bodholdt Linde']
                         roles                   name
1  director, board, real_owner  Mikael Bodholdt Linde


>>> df[df['name'] == 'Mikael Bodholdt Linde']['roles']
1    director, board, real_owner
Name: roles, dtype: object


>>> df[df['name'] == 'Mikael Bodholdt Linde']['roles'].iloc[0]
'director, board, real_owner'


>>> df[df['name'] == 'Mikael Bodholdt Linde']['roles'].iloc[0].split(', ')
['director', 'board', 'real_owner']
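The answer does not show how df was built; the output suggests it holds one participant per row, taken from the question's second row. A minimal sketch of one way to get there, assuming a hypothetical `responses` frame that stands in for the question's data (shortened here to two rows):

```python
import pandas as pd

# Hypothetical stand-in for the question's `responses` frame (shortened).
responses = pd.DataFrame({'participants': [
    [{'roles': [{'type': 'board'}], 'life': {'name': 'Erik Mølgaard'}},
     {'roles': [{'type': 'director'}, {'type': 'board'}, {'type': 'real_owner'}],
      'life': {'name': 'Mikael Bodholdt Linde'}}],
    [{'roles': [{'type': 'owner'}], 'life': {'name': 'WORLD JET HOLDING ApS'}}],
]})

# One cell is a list of dicts, so it converts directly into a frame
# with a 'roles' column (lists) and a 'life' column (dicts).
df = pd.DataFrame(responses['participants'].iloc[0])

# Same three steps as in the answer above.
df['roles'] = df['roles'].apply(lambda x: ', '.join(t['type'] for t in x))
df['life'] = df['life'].str['name']
df = df.rename({'life': 'name'}, axis=1)
```

This handles a single row of the original data at a time, so it gives a long layout (one participant per row) rather than the wide Participant#N columns the question asks for.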

Answer 2

You can run apply() with a function that uses a for loop to convert each list into a Series with headers; enumerate can put the right number into each header.

Because some rows have fewer participants, the missing cells are filled with NaN, which you can later replace with empty strings.

Next you can use join() to add all of them as new columns. Because the headers are created inside apply(), you don't have to create them in join().
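To see the mechanism in isolation, here is a small sketch on toy data (the Series `s`, the helper `to_row` and the `Item #N` headers are made up for illustration). When the function passed to apply() returns a Series, the result is a DataFrame whose columns are the union of all returned headers, with NaN where a row had no value:

```python
import pandas as pd

# Toy column: each cell is a list, and the lists differ in length.
s = pd.Series([['a'], ['b', 'c']])

def to_row(cell):
    # The keys become the Series index, which apply() turns into column headers.
    return pd.Series({f'Item #{n}': value for n, value in enumerate(cell, 1)})

wide = s.apply(to_row)    # DataFrame with columns 'Item #1' and 'Item #2'
filled = wide.fillna('')  # the short first row gets '' instead of NaN
```
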

import pandas as pd

data = {'participants': 
[
    [{'roles': [{'type': 'director'}, {'type': 'founder'}, {'type': 'owner'}, {'type': 'real_owner'}], 'life': {'name': 'Lichun Du'}}],
    [{'roles': [{'type': 'board'}], 'life': {'name': 'Erik Mølgaard'}}, {'roles': [{'type': 'director'}, {'type': 'board'}, {'type': 'real_owner'}], 'life': {'name': 'Mikael Bodholdt Linde'}}, {'roles': [{'type': 'board'}, {'type': 'real_owner'}], 'life': {'name': 'Dorte Bøcker Linde'}}],
    [{'roles': [{'type': 'director'}, {'type': 'real_owner'}], 'life': {'name': 'Kristian Løth Hougaard'}}, {'roles': [{'type': 'owner'}], 'life': {'name': 'WORLD JET HOLDING ApS'}}],
]
}

df = pd.DataFrame(data)

def get_names(cell):
    
    all_names = pd.Series(dtype=object)
    
    for number, item in enumerate(cell, 1):
        name = item['life']['name']
        all_names[f'Participant #{number} Name'] = name

    return all_names

def get_roles(cell):
    
    all_roles = pd.Series(dtype=object)
    
    for number, item in enumerate(cell, 1):
        roles = [role['type'] for role in item['roles']]
        all_roles[f'Participant #{number} Role'] = ",".join(roles)

    return all_roles

roles = df['participants'].apply(get_roles)
roles = roles.fillna('')  # put empty string in place of NaN

names = df['participants'].apply(get_names)
names = names.fillna('')  # put empty string in place of NaN

df = df.join(roles)
df = df.join(names)

df = df.drop(columns=['participants'])  # remove old column

pd.options.display.max_colwidth = 100
print(df.to_string())

Result:

                 Participant #1 Role        Participant #2 Role Participant #3 Role     Participant #1 Name    Participant #2 Name Participant #3 Name
0  director,founder,owner,real_owner                                                              Lichun Du                                           
1                              board  director,board,real_owner    board,real_owner           Erik Mølgaard  Mikael Bodholdt Linde  Dorte Bøcker Linde
2                director,real_owner                      owner                      Kristian Løth Hougaard  WORLD JET HOLDING ApS

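I used two functions so that all the role columns come first and all the name columns after them; but if you need the columns interleaved per participant (name1, role1, name2, role2, name3, role3), it can be done with a single function.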

import pandas as pd

data = {'participants': 
[
    [{'roles': [{'type': 'director'}, {'type': 'founder'}, {'type': 'owner'}, {'type': 'real_owner'}], 'life': {'name': 'Lichun Du'}}],
    [{'roles': [{'type': 'board'}], 'life': {'name': 'Erik Mølgaard'}}, {'roles': [{'type': 'director'}, {'type': 'board'}, {'type': 'real_owner'}], 'life': {'name': 'Mikael Bodholdt Linde'}}, {'roles': [{'type': 'board'}, {'type': 'real_owner'}], 'life': {'name': 'Dorte Bøcker Linde'}}],
    [{'roles': [{'type': 'director'}, {'type': 'real_owner'}], 'life': {'name': 'Kristian Løth Hougaard'}}, {'roles': [{'type': 'owner'}], 'life': {'name': 'WORLD JET HOLDING ApS'}}],
]
}

df = pd.DataFrame(data)

def get_columns(cell):
    
    results = pd.Series(dtype=object)
    
    for number, item in enumerate(cell, 1):
        name = item['life']['name']
        results[f'Participant #{number} Name'] = name

        roles = [role['type'] for role in item['roles']]
        results[f'Participant #{number} Role'] = ",".join(roles)

    return results

columns = df['participants'].apply(get_columns)
columns = columns.fillna('')  # put empty string in place of NaN

df = df.join(columns)

df = df.drop(columns=['participants'])  # remove old column

pd.options.display.max_colwidth = 100
print(df.to_string())

Result:

      Participant #1 Name                Participant #1 Role    Participant #2 Name        Participant #2 Role Participant #3 Name Participant #3 Role
0               Lichun Du  director,founder,owner,real_owner
1           Erik Mølgaard                              board  Mikael Bodholdt Linde  director,board,real_owner  Dorte Bøcker Linde    board,real_owner
2  Kristian Løth Hougaard                director,real_owner  WORLD JET HOLDING ApS                      owner
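As an alternative that is not part of the answer above, the same wide layout can be built without apply(): collect one plain dict per row and pass the list to the DataFrame constructor. This is only a sketch under the same data assumptions (the sample is shortened, and `flatten`, `wide` and `result` are illustrative names). Here the role column is written before the name column, so the headers come out as role1, name1, role2, name2.

```python
import pandas as pd

data = {'participants': [
    [{'roles': [{'type': 'director'}, {'type': 'owner'}], 'life': {'name': 'Lichun Du'}}],
    [{'roles': [{'type': 'board'}], 'life': {'name': 'Erik Mølgaard'}},
     {'roles': [{'type': 'board'}, {'type': 'real_owner'}], 'life': {'name': 'Dorte Bøcker Linde'}}],
]}
df = pd.DataFrame(data)

def flatten(cell):
    # One dict per row; insertion order decides the column order.
    record = {}
    for number, item in enumerate(cell, 1):
        record[f'Participant #{number} Role'] = ','.join(r['type'] for r in item['roles'])
        record[f'Participant #{number} Name'] = item['life']['name']
    return record

# Keys missing from a row's dict become NaN, then empty strings.
wide = pd.DataFrame([flatten(cell) for cell in df['participants']], index=df.index)
wide = wide.fillna('')

result = df.drop(columns=['participants']).join(wide)
```

Passing index=df.index keeps the new columns aligned with the original rows, so the join() works even when df does not use the default 0..n-1 index.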

Pandas - Create dynamic column(s) from a single column's values

python

dataframe

pandas-groupby