Importing nested dictionary data in pandas

My JSON file looks like this:

!head test.json

{"Item":{"title":{"S":"https://medium.com/media/d40eb665beb374c0baaacb3b5a86534c/href"}}}
{"Item":{"title":{"S":"https://fasttext.cc/docs/en/autotune.html"}}}
{"Item":{"title":{"S":"https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf"}}}
{"Item":{"title":{"S":"https://github.com/avinashbarnwal/GSOC-2019/tree/master/AFT/test/data/neuroblastoma-data-master/data/H3K27ac-H3K4me3_TDHAM_BP"}}}

I can import the data into pandas using:

import pandas as pd
df = pd.read_json("test.json", lines=True, orient="columns")

But the resulting data looks like this:

Item
0   {'title': {'S': 'https://medium.com/media/d40e...
1   {'title': {'S': 'https://fasttext.cc/docs/en/a...
2   {'title': {'S': 'https://nlp.stanford.edu/~soc...
3   {'title': {'S': 'https://github.com/avinashbar...

I need all the URLs in a single column.

A valid JSON format for test.json (a single array of objects):
[{"Item":{"title":{"S":"https://medium.com/media/d40eb665beb374c0baaacb3b5a86534c/href"}}},
{"Item":{"title":{"S":"https://fasttext.cc/docs/en/autotune.html"}}},
{"Item":{"title":{"S":"https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf"}}},
{"Item":{"title":{"S":"https://github.com/avinashbarnwal/GSOC-2019/tree/master/AFT/test/data/neuroblastoma-data-master/data/H3K27ac-H3K4me3_TDHAM_BP"}}}]

Use this code:

df = pd.read_json("test.json")
df['url'] = df['Item'].apply(lambda x: x.get('title').get('S'))
print(df['url'])

Output:

0    https://medium.com/media/d40eb665beb374c0baaac...
1            https://fasttext.cc/docs/en/autotune.html
2    https://nlp.stanford.edu/~socherr/EMNLP2013_RN...
3    https://github.com/avinashbarnwal/GSOC-2019/tr...
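
The same dictionary lookup also works on the original line-delimited file, without rewriting it as a JSON array, as long as `lines=True` is kept. A minimal sketch, assuming the two records below stand in for test.json (the in-memory `StringIO` buffer is used only so the example is self-contained):

```python
import io
import pandas as pd

# Two records copied from the question, in JSON Lines form (one object per line).
sample = io.StringIO(
    '{"Item":{"title":{"S":"https://fasttext.cc/docs/en/autotune.html"}}}\n'
    '{"Item":{"title":{"S":"https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf"}}}\n'
)

# lines=True tells read_json to parse one JSON object per line.
df = pd.read_json(sample, lines=True)

# Each cell of 'Item' is a plain dict, so index into it to pull out the URL.
df["url"] = df["Item"].apply(lambda item: item["title"]["S"])
urls = df["url"].tolist()
print(urls)
```

With a real file, the path "test.json" would replace the `sample` buffer. Indexing with `item["title"]["S"]` raises a `KeyError` on a malformed record, whereas the chained `.get()` calls raise an `AttributeError` when 'title' is missing.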
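  • In this case, the simplest approach is to use pandas.json_normalize on the 'Item' column of df.
  • Since you have a column of links, I've included code to display them as clickable links in a notebook, or to save them to an HTML file.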
import pandas as pd
from IPython.display import HTML  # used to show clickable link in a notebook

# read the file in as you are already doing
df = pd.read_json("test.json", lines=True, orient="columns")

# normalize the Item column; the nested keys become a single 'title.S' column
df = pd.json_normalize(df.Item)

# optional steps
# make the link clickable
df['title.S'] = '<a href="' + df['title.S'] + '">' + df['title.S'] + '</a>'

# display clickable dataframe in notebook
HTML(df.to_html(escape=False))

# save to html file
df.to_html('test.html', escape=False)
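
To make the column naming concrete, here is a small sketch of how `pandas.json_normalize` flattens the nesting. It is a variation on the code above, not the same call: it parses the lines with the standard `json` module and normalizes the whole records, so the flattened column is named 'Item.title.S' rather than 'title.S'. The sample records and the new column name `url` are assumptions for illustration.

```python
import json
import pandas as pd

# Two records copied from the question, one JSON object per line.
raw_lines = [
    '{"Item":{"title":{"S":"https://fasttext.cc/docs/en/autotune.html"}}}',
    '{"Item":{"title":{"S":"https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf"}}}',
]

# Parse each line into a nested dict.
records = [json.loads(line) for line in raw_lines]

# json_normalize joins nested keys with '.', giving one flat column per leaf value.
flat = pd.json_normalize(records)
flat_columns = list(flat.columns)

# Rename the dotted column to something easier to work with ('url' is a hypothetical name).
flat = flat.rename(columns={"Item.title.S": "url"})
print(flat_columns)
print(flat["url"].tolist())
```

Normalizing `df.Item` instead, as in the answer above, drops the outer 'Item' level, which is why that version produces 'title.S'.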