Pandas: Extracting List-Dictionary column into separate columns and rows

Question

I have this DataFrame df.

tweet_id	tweet_entites
1223395611921305601	[{'label': 'NORP', 'term': 'Chinese'}, {'label': 'ORG', 'term': 'InnoCare'}, {'label': 'GPE', 'term': 'Hong Kong'}]
1223395868092465153	NaN
1223396204093902849	[{'label': 'ORG', 'term': 'LIVE Press'}, {'label': 'ORG', 'term': 'Emergency Committee'}]
1223396269655089154	[{'label': 'CARDINAL', 'term': '83'}, {'label': 'CARDINAL', 'term': '2019nCoV'}, {'label': 'CARDINAL', 'term': '83'}]

I want to extract the lists of dictionaries into separate columns, with one row per dictionary:

tweet_id	label	term
1223395611921305601	NORP	Chinese
1223395611921305601	ORG	InnoCare
1223395611921305601	GPE	Hong Kong
1223395868092465153	NaN	NaN
1223396204093902849	ORG	LIVE Press
1223396204093902849	ORG	Emergency Committee
1223396269655089154	CARDINAL	83
1223396269655089154	CARDINAL	2019nCoV
1223396269655089154	CARDINAL	83

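The new columns should be named label and term. I have looked through existing references but could not find one that produces the output I want.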

Answer 1

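If the column already holds lists of dictionaries, replace the missing values with a placeholder list and flatten everything with a nested list comprehension: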

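import numpy as np
import pandas as pd

# NaN is a float, so swap it for a one-item placeholder list to keep that row
zipped = zip(df['tweet_id'],
             df['tweet_entites'].apply(lambda x: [{'label': np.nan}]
                                                 if isinstance(x, float)
                                                 else x))

# One output record per dictionary, tagged with its tweet_id
df = [{**{'tweet_id': x}, **z} for x, y in zipped for z in y]

df = pd.DataFrame(df)
print(df)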
              tweet_id     label                 term
0  1223395611921305601      NORP              Chinese
1  1223395611921305601       ORG             InnoCare
2  1223395611921305601       GPE            Hong Kong
3  1223395868092465153       NaN                  NaN
4  1223396204093902849       ORG           LIVE Press
5  1223396204093902849       ORG  Emergency Committee
6  1223396269655089154  CARDINAL                   83
7  1223396269655089154  CARDINAL             2019nCoV
8  1223396269655089154  CARDINAL                   83
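
For anyone who wants to try this without the original data, here is a minimal self-contained sketch of the same list comprehension approach. It assumes the column holds real Python lists (not strings) and rebuilds the sample frame from the question by hand; the names records and out are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question; the entity column holds real lists
df = pd.DataFrame({
    'tweet_id': [1223395611921305601, 1223395868092465153,
                 1223396204093902849, 1223396269655089154],
    'tweet_entites': [
        [{'label': 'NORP', 'term': 'Chinese'},
         {'label': 'ORG', 'term': 'InnoCare'},
         {'label': 'GPE', 'term': 'Hong Kong'}],
        np.nan,
        [{'label': 'ORG', 'term': 'LIVE Press'},
         {'label': 'ORG', 'term': 'Emergency Committee'}],
        [{'label': 'CARDINAL', 'term': '83'},
         {'label': 'CARDINAL', 'term': '2019nCoV'},
         {'label': 'CARDINAL', 'term': '83'}],
    ],
})

# Replace NaN (a float) with a one-item placeholder list so the tweet is kept
entities = df['tweet_entites'].apply(
    lambda x: [{'label': np.nan}] if isinstance(x, float) else x)

# One record per dictionary, each tagged with the tweet_id it came from
records = [{'tweet_id': tweet_id, **entity}
           for tweet_id, items in zip(df['tweet_id'], entities)
           for entity in items]

out = pd.DataFrame(records)
print(out)
```

The placeholder only sets label; pandas fills the missing term key with NaN when it builds the frame.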

If the lists are stored as their string representations, parse them with ast.literal_eval first:

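import ast

import pandas as pd

# Fill NaN with a placeholder string, then parse every string into a list of dicts
df['tweet_entites'] = df['tweet_entites'].fillna('[{"label":None}]').apply(ast.literal_eval)

zipped = zip(df['tweet_id'], df['tweet_entites'])

# One output record per dictionary, tagged with its tweet_id
df = [{**{'tweet_id': x}, **z} for x, y in zipped for z in y]
df = pd.DataFrame(df)
print(df)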
              tweet_id     label                 term
0  1223395611921305601      NORP              Chinese
1  1223395611921305601       ORG             InnoCare
2  1223395611921305601       GPE            Hong Kong
3  1223395868092465153      None                  NaN
4  1223396204093902849       ORG           LIVE Press
5  1223396204093902849       ORG  Emergency Committee
6  1223396269655089154  CARDINAL                   83
7  1223396269655089154  CARDINAL             2019nCoV
8  1223396269655089154  CARDINAL                   83
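
If you are not sure which of the two cases you have (for example after reading the data back from a CSV), a small helper can cover strings, lists and missing values in one pass. This is only a sketch: parse_entities is a hypothetical name, the two-row frame is made-up sample data in the question's format, and the placeholder for missing values is an assumption you may want to change.

```python
import ast

import numpy as np
import pandas as pd

# Made-up sample: entities stored as strings, plus one missing value
df = pd.DataFrame({
    'tweet_id': [1223395868092465153, 1223396204093902849],
    'tweet_entites': [
        np.nan,
        "[{'label': 'ORG', 'term': 'LIVE Press'}, "
        "{'label': 'ORG', 'term': 'Emergency Committee'}]",
    ],
})


def parse_entities(value):
    """Return a list of dicts for a string, a list, or a missing value."""
    if isinstance(value, str):
        return ast.literal_eval(value)      # parses literals only, runs no code
    if isinstance(value, list):
        return value                        # already parsed, use as-is
    return [{'label': None, 'term': None}]  # placeholder row for NaN


# One record per dictionary, each tagged with the tweet_id it came from
records = [{'tweet_id': tweet_id, **entity}
           for tweet_id, raw in zip(df['tweet_id'], df['tweet_entites'])
           for entity in parse_entities(raw)]

result = pd.DataFrame(records)
print(result)
```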

Answer 2

If df['tweet_entites'] holds strings, you can use eval to convert each one into a list:

import pandas as pd

# A single placeholder entry keeps exactly one row for a tweet without entities
df = df.fillna("[{'label': None, 'term': None}]")

frames = []
for row in df.to_dict(orient="records"):
    # eval executes arbitrary code, so only use it on trusted data;
    # ast.literal_eval is a safer choice for plain literals like these
    for i in eval(row["tweet_entites"]):
        i["tweet_id"] = int(row["tweet_id"])
        frames.append(i)

new_df = pd.DataFrame(frames)
print(new_df)
