Pandas: Extracting List-Dictionary column into separate columns and rows
I have this DataFrame, df.

tweet_id | tweet_entites |
---|---|
1223395611921305601 | [{'label': 'NORP', 'term': 'Chinese'}, {'label': 'ORG', 'term': 'InnoCare'}, {'label': 'GPE', 'term': 'Hong Kong'}] |
1223395868092465153 | NaN |
1223396204093902849 | [{'label': 'ORG', 'term': 'LIVE Press'}, {'label': 'ORG', 'term': 'Emergency Committee'}] |
1223396269655089154 | [{'label': 'CARDINAL', 'term': '83'}, {'label': 'CARDINAL', 'term': '2019nCoV'}, {'label': 'CARDINAL', 'term': '83'}] |

I want to extract the list of dictionaries into separate columns, with one row per dictionary:

tweet_id | label | term |
---|---|---|
1223395611921305601 | NORP | Chinese |
1223395611921305601 | ORG | InnoCare |
1223395611921305601 | GPE | Hong Kong |
1223395868092465153 | NaN | NaN |
1223396204093902849 | ORG | LIVE Press |
1223396204093902849 | ORG | Emergency Committee |
1223396269655089154 | CARDINAL | 83 |
1223396269655089154 | CARDINAL | 2019nCoV |
1223396269655089154 | CARDINAL | 83 |

The new columns should be named label and term. I have looked through the references, but could not find one that produces output like this.
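
For anyone who wants to reproduce this, the input table can be built roughly as follows. This is a minimal sketch: the values are copied from the question, and it assumes the tweet_entites column holds real Python lists (not strings) with NaN in the empty row.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question's table; the column name
# 'tweet_entites' keeps the question's spelling.
df = pd.DataFrame({
    "tweet_id": [
        1223395611921305601,
        1223395868092465153,
        1223396204093902849,
        1223396269655089154,
    ],
    "tweet_entites": [
        [{"label": "NORP", "term": "Chinese"},
         {"label": "ORG", "term": "InnoCare"},
         {"label": "GPE", "term": "Hong Kong"}],
        np.nan,  # tweet with no entities
        [{"label": "ORG", "term": "LIVE Press"},
         {"label": "ORG", "term": "Emergency Committee"}],
        [{"label": "CARDINAL", "term": "83"},
         {"label": "CARDINAL", "term": "2019nCoV"},
         {"label": "CARDINAL", "term": "83"}],
    ],
})

print(df.shape)                          # (4, 2)
print(df["tweet_entites"].isna().sum())  # 1
```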
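
If the column already holds lists of dictionaries, replace the missing values with a one-element placeholder list and flatten with a nested list comprehension: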
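
    import numpy as np
    import pandas as pd

    # NaN is a float: swap it for a one-element list so the row is kept
    zipped = zip(df['tweet_id'],
                 df['tweet_entites'].apply(lambda x: [{'label': np.nan}]
                                                     if isinstance(x, float)
                                                     else x))
    # One output record per (tweet_id, entity dict) pair
    df = [{**{'tweet_id': x}, **z} for x, y in zipped for z in y]
    df = pd.DataFrame(df)
    print(df)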

                  tweet_id     label                 term
    0  1223395611921305601      NORP              Chinese
    1  1223395611921305601       ORG             InnoCare
    2  1223395611921305601       GPE            Hong Kong
    3  1223395868092465153       NaN                  NaN
    4  1223396204093902849       ORG           LIVE Press
    5  1223396204093902849       ORG  Emergency Committee
    6  1223396269655089154  CARDINAL                   83
    7  1223396269655089154  CARDINAL             2019nCoV
    8  1223396269655089154  CARDINAL                   83

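
A different route to the same shape, not the approach used in this answer, is to let pandas do the row expansion with DataFrame.explode. The sketch below assumes pandas 1.1 or newer (for ignore_index) and uses small made-up tweet_id values to stay short; the names exploded, fields and result are only illustrative.

```python
import numpy as np
import pandas as pd

# Shortened sample with placeholder ids (assumption, not the question's data)
df = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "tweet_entites": [
        [{"label": "NORP", "term": "Chinese"}, {"label": "ORG", "term": "InnoCare"}],
        np.nan,
        [{"label": "GPE", "term": "Hong Kong"}],
    ],
})

# explode() gives every list element its own row; a NaN cell stays one NaN row
exploded = df.explode("tweet_entites", ignore_index=True)

# Expand each dict into label/term; anything that is not a dict becomes an empty row
fields = pd.DataFrame(
    [d if isinstance(d, dict) else {} for d in exploded["tweet_entites"]],
    columns=["label", "term"],
)

# Both frames share the same 0..n-1 index, so a column-wise concat lines up
result = pd.concat([exploded[["tweet_id"]], fields], axis=1)
print(result)
```

Because explode keeps the NaN row on its own, no placeholder list is needed before flattening.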
If the column holds the string representation (repr) of each list instead, parse it first with ast.literal_eval:
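
    import ast
    import pandas as pd

    # Missing cells get a placeholder string, then every cell is parsed into a list
    df['tweet_entites'] = df['tweet_entites'].fillna('[{"label":None}]').apply(ast.literal_eval)
    zipped = zip(df['tweet_id'], df['tweet_entites'])
    # One output record per (tweet_id, entity dict) pair
    df = [{**{'tweet_id': x}, **z} for x, y in zipped for z in y]
    df = pd.DataFrame(df)
    print(df)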

                  tweet_id     label                 term
    0  1223395611921305601      NORP              Chinese
    1  1223395611921305601       ORG             InnoCare
    2  1223395611921305601       GPE            Hong Kong
    3  1223395868092465153      None                  NaN
    4  1223396204093902849       ORG           LIVE Press
    5  1223396204093902849       ORG  Emergency Committee
    6  1223396269655089154  CARDINAL                   83
    7  1223396269655089154  CARDINAL             2019nCoV
    8  1223396269655089154  CARDINAL                   83

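
If filling NaN with a placeholder string feels awkward, the parsing and the missing-value handling can be folded into one small function. This is only a sketch: parse_entities is a hypothetical helper name, the ids are shortened placeholders, and it assumes every non-missing cell is a valid Python literal.

```python
import ast

import numpy as np
import pandas as pd


def parse_entities(value):
    """Hypothetical helper: turn one cell into a list of entity dicts."""
    if isinstance(value, str):
        return ast.literal_eval(value)
    # Missing cell (NaN): one placeholder so the tweet_id still gets a row
    return [{"label": np.nan, "term": np.nan}]


df = pd.DataFrame({
    "tweet_id": [1, 2],
    "tweet_entites": [
        "[{'label': 'ORG', 'term': 'LIVE Press'}, "
        "{'label': 'ORG', 'term': 'Emergency Committee'}]",
        np.nan,
    ],
})

# One record per (tweet_id, entity dict) pair
records = [
    {"tweet_id": tid, **entity}
    for tid, cell in zip(df["tweet_id"], df["tweet_entites"])
    for entity in parse_entities(cell)
]
result = pd.DataFrame(records)
print(result)
```

With this version both label and term come out as NaN for the empty tweet, rather than None in one column and NaN in the other.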
If df['tweet_entites'] holds strings, you can use eval to convert each one into a list:

    import pandas as pd

    # A single placeholder entity, so a missing value yields exactly one empty row
    df = df.fillna("[{'label': None, 'term': None}]")
    frames = []
    for row in df.to_dict(orient="records"):
        # eval executes arbitrary code, so only use it on trusted input
        for i in eval(row["tweet_entites"]):
            i["tweet_id"] = int(row["tweet_id"])
            frames.append(i)
    new_df = pd.DataFrame(frames)
    print(new_df)
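
On the choice between eval and ast.literal_eval: literal_eval only accepts Python literals (strings, numbers, lists, dicts, None and so on), so a cell that contains an expression is rejected instead of being run. A small sketch, with a made-up cell value:

```python
import ast

# A well-formed cell parses into a list of dicts
cell = "[{'label': 'GPE', 'term': 'Hong Kong'}]"
parsed = ast.literal_eval(cell)
print(parsed[0]["term"])  # Hong Kong

# A cell containing a function call is not a literal: literal_eval raises
# ValueError, whereas eval would execute the call
not_a_literal = "__import__('os').getcwd()"
try:
    ast.literal_eval(not_a_literal)
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```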