pandas json_normalize: why does it return NaN for repeated values?

I have a DataFrame in which two columns hold lists of dictionaries that I want to expand into separate columns. For example:

 id    text                   agg_inds                                           agg_tars
 1     some text    [{"f1": [15], "f2": "2263"}, {"f1": [16], "f2": "2171"}]    [{"f1": [5, 6, 12], "f2": "2984"}]

I want to create two columns from the nested column agg_inds, named ind_pos and ind_id, and two more columns from agg_tars, named tar_pos and tar_id.

The problem with json_normalize is that it returns NaN wherever a value should be repeated. For example, for the row above, I want this:

Expected output

id  text        ind_pos     ind_id         tar_pos        tar_id
1   some text   [15]        2263           [5, 6, 12]     2984
1   some text   [16]        2171           [5, 6, 12]     2984

But here is the current output:

id  text        ind_pos     ind_id        tar_pos        tar_id
1   some text   [15]        2263          [5, 6, 12]     2984
1   NaN         [16]        2171          NaN            NaN

Here is the code:

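import pandas as pd

# Expand every cell into its elements: one row per (id, element position).
s = (df.set_index('id')
          .apply(lambda x: x.apply(pd.Series).stack())
          .reset_index()
          .drop(columns='level_1'))

# Turn the dictionaries in agg_inds into ind_pos / ind_id columns.
s_ind = pd.json_normalize(s['agg_inds'])
columns_renaming = {"f1": "ind_pos", "f2": "ind_id"}
s_ind.rename(columns=columns_renaming, inplace=True)

# Turn the dictionaries in agg_tars into tar_pos / tar_id columns.
s_tar = pd.json_normalize(s['agg_tars'])
columns_renaming = {"f1": "tar_pos", "f2": "tar_id"}
s_tar.rename(columns=columns_renaming, inplace=True)

# Replace the nested columns with the flattened ones.
s = s.drop(columns=['agg_inds', 'agg_tars'])
df_1 = s.join(s_ind)
df_final = df_1.join(s_tar)
print(df_final)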

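The NaN values appear because stack() numbers the elements of each cell, and text and agg_tars only have an element at position 0, so they have nothing to align with in the row for position 1. You can use ffill to replace those NaN values with the value from the previous row. To do that, you only need to add one line: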

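import pandas as pd

# Expand every cell into its elements: one row per (id, element position).
s = (df.set_index('id')
          .apply(lambda x: x.apply(pd.Series).stack())
          .reset_index()
          .drop(columns='level_1'))

# The added line: fill each NaN with the value from the row above it.
s.ffill(inplace=True)

# Turn the dictionaries in agg_inds into ind_pos / ind_id columns.
s_ind = pd.json_normalize(s['agg_inds'])
columns_renaming = {"f1": "ind_pos", "f2": "ind_id"}
s_ind.rename(columns=columns_renaming, inplace=True)

# Turn the dictionaries in agg_tars into tar_pos / tar_id columns.
s_tar = pd.json_normalize(s['agg_tars'])
columns_renaming = {"f1": "tar_pos", "f2": "tar_id"}
s_tar.rename(columns=columns_renaming, inplace=True)

# Replace the nested columns with the flattened ones.
s = s.drop(columns=['agg_inds', 'agg_tars'])
df_1 = s.join(s_ind)
df_final = df_1.join(s_tar)
print(df_final)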

Output:

   id        text ind_pos ind_id     tar_pos tar_id
0   1   some text    [15]   2263  [5, 6, 12]   2984
1   1   some text    [16]   2171  [5, 6, 12]   2984
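
To see where the NaN values come from before ffill is applied, here is a minimal sketch that inspects the result of the stack step on its own. It assumes the example row from the question, with the two nested columns already parsed into Python lists; the names stacked and missing are only illustrative.

```python
import pandas as pd

# Example row from the question (assumed to be already parsed into lists).
df = pd.DataFrame({
    "id": [1],
    "text": ["some text"],
    "agg_inds": [[{"f1": [15], "f2": "2263"}, {"f1": [16], "f2": "2171"}]],
    "agg_tars": [[{"f1": [5, 6, 12], "f2": "2984"}]],
})

# Number the elements of every cell. A scalar or a one-element list only
# gets position 0; the two-element list in agg_inds gets positions 0 and 1.
stacked = df.set_index("id").apply(lambda col: col.apply(pd.Series).stack())

# Position 1 exists only for agg_inds, so the other columns are NaN there.
missing = stacked.isna()
print(missing)
```

The second row is missing for text and agg_tars, which is exactly the gap that ffill fills in.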

Suppose the initial dataset looks like this:

# import libraries
import pandas as pd
import json

# read data
df = pd.DataFrame({
    'id': [1],
    'text': ['some text'],
    'agg_inds': ['[{"f1": [15], "f2": "2263"}, {"f1": [16], "f2": "2171"}]'],
    'agg_tars': ['[{"f1": [5, 6, 12], "f2": "2984"}]'],
})

# parse the JSON strings into Python lists of dictionaries
for col in ['agg_inds', 'agg_tars']:
    df[col] = df[col].apply(json.loads)

# set id column as index
df = df.set_index('id')

Then I can reuse your processing logic in a function that extracts both kinds of features and builds a DataFrame from each:

def extract_features(col: str, feat_new_names: dict):
    return (
        df[col]
        .apply(pd.Series)
        .stack()
        .apply(pd.Series)
        .reset_index()
        .drop(['level_1'], axis=1)
        .set_index('id')
        .rename(columns=feat_new_names)
    )

df_agg_inds = extract_features(col='agg_inds', feat_new_names={'f1': 'ind_pos', 'f2': 'ind_id'})
df_agg_tars = extract_features(col='agg_tars', feat_new_names={'f1': 'tar_pos', 'f2': 'tar_id'})

Because all the DataFrames have the id column as their index, I can combine them with an outer join without losing any information:

df_final = (
    pd
    .concat([df, df_agg_inds, df_agg_tars], axis=1)
    .drop(['agg_inds', 'agg_tars'], axis=1)
)

The result looks like this:

>>> print(df_final)
         text ind_pos ind_id     tar_pos tar_id
id                                             
1   some text    [15]   2263  [5, 6, 12]   2984
1   some text    [16]   2171  [5, 6, 12]   2984
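
As an alternative sketch (a different technique from the stack-based code above, not a replacement the original authors proposed), DataFrame.explode repeats the other columns for every list element, so no NaN values appear in the first place. It assumes pandas 1.1 or later for the ignore_index parameter and that the nested columns are already Python lists; the names flat, ind, tar and result are only illustrative.

```python
import pandas as pd

# Example row from the question (assumed to be already parsed into lists).
df = pd.DataFrame({
    "id": [1],
    "text": ["some text"],
    "agg_inds": [[{"f1": [15], "f2": "2263"}, {"f1": [16], "f2": "2171"}]],
    "agg_tars": [[{"f1": [5, 6, 12], "f2": "2984"}]],
})

# One row per list element; id and text are repeated automatically.
flat = (df.explode("agg_inds", ignore_index=True)
          .explode("agg_tars", ignore_index=True))

# Flatten the dictionaries and give the keys their final column names.
ind = (pd.json_normalize(flat["agg_inds"].tolist())
         .rename(columns={"f1": "ind_pos", "f2": "ind_id"}))
tar = (pd.json_normalize(flat["agg_tars"].tolist())
         .rename(columns={"f1": "tar_pos", "f2": "tar_id"}))

# All three pieces share the same default integer index, so a
# column-wise concat lines them up row by row.
result = pd.concat([flat[["id", "text"]], ind, tar], axis=1)
print(result)
```

Note that exploding both columns yields every combination of agg_inds and agg_tars elements for a row, which matches the expected output here but may differ from the ffill approach when both lists hold more than one element.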

Hope this helps <3