Extract strings and insert them as multiple rows based on the original index

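I have put the sample dataset (df), the expected output (df2), and my code so far below. I have a df in which some rows of column i2 contain a list, in JSON format, that needs to be exploded and re-inserted into the df at the position of the row it was extracted from, but entered into a different column (i1). I also need to extract a unique identifier (the 'id_2' value) from each list element and insert it into the id_2 column.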

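In my code so far, I parse the JSON-like data with pd.json_normalize, prepend the original string from column i1 to each extracted string (this should be clearer from the example below), and then re-insert the new rows based on the index. However, I have to specify the index by hand, which is not good. I would like the code to depend less on manually entered indices, in case more of these nested cells appear in the future or the index changes in some way.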

Any suggestions are very welcome, many thanks.

Sample data

import pandas as pd

df = pd.DataFrame(data={'id': [1, 2, 3, 4, 5], 'id_2': ['a','b','c','d','e'], 'i1': ['How old are you?','Over the last month have you felt','Do you live alone?','In the last week have you had','When did you last visit a doctor?'], 'i2': [0,0,0,0,0]})
df['i2'] = df['i2'].astype('object')

a = [{'id': 'b1', 'item': 'happy?', 'id_2': 'hj59'}, {'id': 'b2', 'item': 'sad?', 'id_2': 'dgb'}, {'id': 'b3', 'item': 'angry?', 'id_2':'kj9'}, {'id': 'b4', 'item': 'frustrated?','id_2':'lp7'}]
b = [{'id': 'c1', 'item': 'trouble sleeping?'}, {'id': 'c2', 'item': 'changes in appetite?'}, {'id': 'c3', 'item': 'mood swings?'}, {'id': 'c4', 'item': 'trouble relaxing?'}]

df.at[1, 'i2'] = a 
df.at[3, 'i2'] = b 

Expected output

df2 = pd.DataFrame(data={'id': [1,2,2,2,2,3,4,4,4,4,5], 
                         'id_2': ['a','hj59','dgb','kj9','lp7','c','d','d','d','d','e'],
                         'i1': ['How old are you?',
                                'Over the last month have you felt happy?',
                                'Over the last month have you felt sad?',
                                'Over the last month have you felt angry?',
                                'Over the last month have you felt frustrated?',
                                'Do you live alone?',
                                'In the last week have you had trouble sleeping?',
                                'In the last week have you had changes in appetite?',
                                'In the last week have you had mood swings?',
                                'In the last week have you had trouble relaxing?',
                                'When did you last visit a doctor?'], 
                         'i2': [0,1,1,1,1,0,1,1,1,1,0]})

My ugly code so far

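# Rows whose i2 cell holds a list of dicts rather than 0
s = df[df.i2 != 0]

def insert_row(i, d1, d2):
    # Keep the first i rows of d1 and stack d2 underneath
    # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
    return pd.concat([d1.iloc[:i], d2])

# Normalize each nested list and prefix every item with the parent question
n = {}
for i in range(len(s)):
    n[i] = pd.json_normalize(s.iloc[i]['i2']).reset_index()
    n[i]['i1'] = s.iloc[i].i1 + ' ' + n[i]['item']

# Re-insert the pieces; the positions are hard-coded, which is what I want to avoid
for i in n:
    if i == 0:
        x = insert_row(s.iloc[i].name, df, n[i])
    elif i == 1:
        x = insert_row(s.iloc[i].name + 1 + n[i]['index'].count() + 1, x, n[i])
        y = pd.concat([x, df.iloc[s.iloc[i].name + 1:]])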

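Explode the dataframe on i2, then use the str accessor to retrieve the value associated with the key item from column i2. Finally, use boolean indexing with loc to set the values in column i2 to 1 and to concatenate the strings in i1 with the retrieved item values.
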
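# One row per list element; rows whose i2 is a scalar are left as they are
df2 = df.explode('i2', ignore_index=True)
# 'item' value for the dict cells, NaN for everything else
s = df2['i2'].str['item']
df2.loc[s.notna(), 'i2'] = 1
df2.loc[s.notna(), 'i1'] += ' ' + s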

    id                                                  i1 i2
0    1                                    How old are you?  0
1    2            Over the last month have you felt happy?  1
2    2              Over the last month have you felt sad?  1
3    2            Over the last month have you felt angry?  1
4    2       Over the last month have you felt frustrated?  1
5    3                                  Do you live alone?  0
6    4     In the last week have you had trouble sleeping?  1
7    4  In the last week have you had changes in appetite?  1
8    4          In the last week have you had mood swings?  1
9    4     In the last week have you had trouble relaxing?  1
10   5                   When did you last visit a doctor?  0
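
The snippet above leaves id_2 untouched, whereas the expected output in the question also takes id_2 from the nested dicts when that key is present and keeps the parent row's value otherwise. A minimal sketch of one way to cover that, assuming a shortened version of the sample data; `pick` is a hypothetical helper introduced only for this illustration, and it uses a plain isinstance check instead of the str accessor:

```python
import pandas as pd

# Shortened sample data (assumption: same layout as the question's df)
df = pd.DataFrame({
    'id': [1, 2, 3],
    'id_2': ['a', 'b', 'c'],
    'i1': ['How old are you?',
           'Over the last month have you felt',
           'In the last week have you had'],
    'i2': [0, 0, 0],
})
df['i2'] = df['i2'].astype('object')
df.at[1, 'i2'] = [{'id': 'b1', 'item': 'happy?', 'id_2': 'hj59'},
                  {'id': 'b2', 'item': 'sad?', 'id_2': 'dgb'}]
df.at[2, 'i2'] = [{'id': 'c1', 'item': 'trouble sleeping?'}]


def pick(key):
    # Hypothetical helper: read `key` from dict cells, None for anything else
    return lambda v: v.get(key) if isinstance(v, dict) else None


# One row per nested dict; no positions are specified by hand
out = df.explode('i2', ignore_index=True)
item = out['i2'].map(pick('item'))
new_id = out['i2'].map(pick('id_2'))
nested = item.notna()

# Append the item text to the parent question on the exploded rows only
out.loc[nested, 'i1'] = out.loc[nested, 'i1'] + ' ' + item[nested]
# Use the nested id_2 where it exists, otherwise keep the parent's id_2
out['id_2'] = new_id.where(new_id.notna(), out['id_2'])
# Flag exploded rows with 1, untouched rows with 0
out['i2'] = nested.astype(int)

print(out)
```

Because the row positions come from explode itself, this should keep working if more nested cells are added or the index changes.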