Pandas: Extracting List-Dictionary column into separate columns and rows

I have this DataFrame, df:

tweet_id tweet_entites
1223395611921305601 [{'label': 'NORP', 'term': 'Chinese'}, {'label': 'ORG', 'term': 'InnoCare'}, {'label': 'GPE', 'term': 'Hong Kong'}]
1223395868092465153 NaN
1223396204093902849 [{'label': 'ORG', 'term': 'LIVE Press'}, {'label': 'ORG', 'term': 'Emergency Committee'}]
1223396269655089154 [{'label': 'CARDINAL', 'term': '83'}, {'label': 'CARDINAL', 'term': '2019nCoV'}, {'label': 'CARDINAL', 'term': '83'}]
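
For anyone who wants to reproduce this, here is a minimal sketch that builds the frame shown above. It assumes the column holds real Python lists of dictionaries rather than their string form, which the table alone does not tell you; the values are copied from the rows above.

```python
import numpy as np
import pandas as pd

# Sample data copied from the table above; NaN marks a tweet with no entities.
# Assumption: each cell is an actual list of dicts, not a string.
df = pd.DataFrame({
    "tweet_id": [
        1223395611921305601,
        1223395868092465153,
        1223396204093902849,
        1223396269655089154,
    ],
    "tweet_entites": [
        [{"label": "NORP", "term": "Chinese"},
         {"label": "ORG", "term": "InnoCare"},
         {"label": "GPE", "term": "Hong Kong"}],
        np.nan,
        [{"label": "ORG", "term": "LIVE Press"},
         {"label": "ORG", "term": "Emergency Committee"}],
        [{"label": "CARDINAL", "term": "83"},
         {"label": "CARDINAL", "term": "2019nCoV"},
         {"label": "CARDINAL", "term": "83"}],
    ],
})
print(df.shape)
```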

I want to extract the list of dictionaries into separate columns, with one row per dictionary:

tweet_id label term
1223395611921305601 NORP Chinese
1223395611921305601 ORG InnoCare
1223395611921305601 GPE Hong Kong
1223395868092465153 NaN NaN
1223396204093902849 ORG LIVE Press
1223396204093902849 ORG Emergency Committee
1223396269655089154 CARDINAL 83
1223396269655089154 CARDINAL 2019nCoV
1223396269655089154 CARDINAL 83

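The new columns should be named label and term. I have looked through existing references, but could not find one that produces output like this.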

If the column already holds lists of dictionaries, replace the missing values and then use a nested list comprehension:

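import numpy as np
import pandas as pd

# NaN is a float; swap it for a one-element list so the tweet keeps a row.
zipped = zip(df['tweet_id'],
             df['tweet_entites'].apply(lambda x: [{'label': np.nan}]
                                                 if isinstance(x, float)
                                                 else x))

# One record per entity dictionary, merged with its tweet_id.
df = [{**{'tweet_id': x}, **z} for x, y in zipped for z in y]

df = pd.DataFrame(df)
print(df)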
              tweet_id     label                 term
0  1223395611921305601      NORP              Chinese
1  1223395611921305601       ORG             InnoCare
2  1223395611921305601       GPE            Hong Kong
3  1223395868092465153       NaN                  NaN
4  1223396204093902849       ORG           LIVE Press
5  1223396204093902849       ORG  Emergency Committee
6  1223396269655089154  CARDINAL                   83
7  1223396269655089154  CARDINAL             2019nCoV
8  1223396269655089154  CARDINAL                   83
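
As an alternative to the comprehension above (this is a swapped-in technique, not part of the original answer), the same reshaping can be sketched with DataFrame.explode. It assumes pandas 1.1 or later for the ignore_index argument, and it uses three rows from the question as sample data.

```python
import numpy as np
import pandas as pd

# Sample rows taken from the question.
df = pd.DataFrame({
    "tweet_id": [1223395611921305601, 1223395868092465153, 1223396269655089154],
    "tweet_entites": [
        [{"label": "ORG", "term": "InnoCare"}, {"label": "GPE", "term": "Hong Kong"}],
        np.nan,
        [{"label": "CARDINAL", "term": "83"}],
    ],
})

# One row per list element; a NaN cell stays as a single NaN row.
exploded = df.explode("tweet_entites", ignore_index=True)

# Swap non-dict cells (the NaN row) for an empty dict so every element
# can be expanded into columns; missing keys become NaN.
records = [e if isinstance(e, dict) else {} for e in exploded["tweet_entites"]]
entities = pd.DataFrame(records, columns=["label", "term"])

# Both frames share the same 0..n-1 index, so a column-wise concat lines up.
result = pd.concat([exploded[["tweet_id"]], entities], axis=1)
print(result)
```

Compared with the comprehension, this keeps the work inside pandas and needs no zip over the two columns, at the cost of an extra intermediate frame.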

If the column holds string representations of the lists, parse them with ast.literal_eval:

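import ast

# Fill NaN with the string form of a one-element list, then parse every cell.
df['tweet_entites'] = df['tweet_entites'].fillna('[{"label":None}]').apply(ast.literal_eval)

zipped = zip(df['tweet_id'], df['tweet_entites'])

# One record per entity dictionary, merged with its tweet_id.
df = [{**{'tweet_id': x}, **z} for x, y in zipped for z in y]
df = pd.DataFrame(df)
print(df)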
              tweet_id     label                 term
0  1223395611921305601      NORP              Chinese
1  1223395611921305601       ORG             InnoCare
2  1223395611921305601       GPE            Hong Kong
3  1223395868092465153      None                  NaN
4  1223396204093902849       ORG           LIVE Press
5  1223396204093902849       ORG  Emergency Committee
6  1223396269655089154  CARDINAL                   83
7  1223396269655089154  CARDINAL             2019nCoV
8  1223396269655089154  CARDINAL                   83
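
The string case can also be tried end to end with the self-contained sketch below. The helper name parse_entities is hypothetical and the sample rows are taken from the question; the only library calls are ast.literal_eval, which accepts Python literals only and so does not execute code embedded in the string, and the DataFrame constructor.

```python
import ast

import numpy as np
import pandas as pd

# Two rows from the question, with the entity list stored as a string.
df = pd.DataFrame({
    "tweet_id": [1223395868092465153, 1223396204093902849],
    "tweet_entites": [
        np.nan,
        "[{'label': 'ORG', 'term': 'LIVE Press'}, "
        "{'label': 'ORG', 'term': 'Emergency Committee'}]",
    ],
})


def parse_entities(cell):
    """Parse one cell; NaN (or any non-string) becomes a single empty record."""
    if isinstance(cell, str):
        return ast.literal_eval(cell)
    return [{"label": None, "term": None}]


parsed = df["tweet_entites"].apply(parse_entities)

# One record per entity dictionary, tagged with its tweet_id.
rows = [
    {"tweet_id": tweet_id, **entity}
    for tweet_id, entities in zip(df["tweet_id"], parsed)
    for entity in entities
]
result = pd.DataFrame(rows)
print(result)
```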

If df['tweet_entites'] holds strings, you can use eval to convert each one into a list:

import pandas as pd

# A single placeholder entry gives each NaN tweet exactly one empty row,
# matching the desired output.
df = df.fillna("[{'label': None, 'term': None}]")

frames = []
for row in df.to_dict(orient="records"):
    # eval runs arbitrary code, so only use it on data you trust.
    for i in eval(row["tweet_entites"]):
        i["tweet_id"] = int(row["tweet_id"])
        frames.append(i)

new_df = pd.DataFrame(frames)
print(new_df)