Importing nested dictionary data in pandas
If my JSON file looks like this...
!head test.json
{"Item":{"title":{"S":"https://medium.com/media/d40eb665beb374c0baaacb3b5a86534c/href"}}}
{"Item":{"title":{"S":"https://fasttext.cc/docs/en/autotune.html"}}}
{"Item":{"title":{"S":"https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf"}}}
{"Item":{"title":{"S":"https://github.com/avinashbarnwal/GSOC-2019/tree/master/AFT/test/data/neuroblastoma-data-master/data/H3K27ac-H3K4me3_TDHAM_BP"}}}
I can import the data into pandas using...
import pandas as pd
df = pd.read_json("test.json", lines=True, orient="columns")
But the data comes out like this...
Item
0 {'title': {'S': 'https://medium.com/media/d40e...
1 {'title': {'S': 'https://fasttext.cc/docs/en/a...
2 {'title': {'S': 'https://nlp.stanford.edu/~soc...
3 {'title': {'S': 'https://github.com/avinashbar...
I need all the URLs in a single column.
A valid JSON format for test.json (one array of objects):
[{"Item":{"title":{"S":"https://medium.com/media/d40eb665beb374c0baaacb3b5a86534c/href"}}},
{"Item":{"title":{"S":"https://fasttext.cc/docs/en/autotune.html"}}},
{"Item":{"title":{"S":"https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf"}}},
{"Item":{"title":{"S":"https://github.com/avinashbarnwal/GSOC-2019/tree/master/AFT/test/data/neuroblastoma-data-master/data/H3K27ac-H3K4me3_TDHAM_BP"}}}]
Use this code:
df = pd.read_json("test.json")
df['url'] = df['Item'].apply(lambda x: x.get('title').get('S'))
print(df['url'])
Output:
0 https://medium.com/media/d40eb665beb374c0baaac...
1 https://fasttext.cc/docs/en/autotune.html
2 https://nlp.stanford.edu/~socherr/EMNLP2013_RN...
3 https://github.com/avinashbarnwal/GSOC-2019/tr...
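The same extraction should also work on the original line-delimited file, without rewriting it as a JSON array. A minimal sketch, assuming every record has the Item -> title -> S nesting shown in the question; the sample lines are inlined through io.StringIO only so the snippet runs without test.json:

```python
import io
import pandas as pd

# Two sample records in the same line-delimited layout as test.json
# (inlined here so the sketch runs without the file on disk).
raw = (
    '{"Item":{"title":{"S":"https://fasttext.cc/docs/en/autotune.html"}}}\n'
    '{"Item":{"title":{"S":"https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf"}}}\n'
)

# lines=True parses one JSON object per line
df = pd.read_json(io.StringIO(raw), lines=True)

# Walk the nested dicts: Item -> title -> S
df["url"] = df["Item"].apply(lambda item: item["title"]["S"])
urls = df["url"].tolist()
print(urls)
```

With the real file, pd.read_json("test.json", lines=True) would replace the StringIO call. Indexing with item["title"]["S"] raises a KeyError on a malformed record, which may be preferable to the chained .get calls, since those fail with a less obvious AttributeError when 'title' is missing.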
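- In this case, the simplest approach is to use pandas.json_normalize on the 'Item' column of df.
- Since you have a column of links, I've included code to display them as clickable links in a notebook, or to save them to an HTML file.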
import pandas as pd
from IPython.display import HTML # used to show clickable link in a notebook
# read the file in as you are already doing
df = pd.read_json("test.json", lines=True, orient="columns")
# normalize the Item column; the nested key becomes the column 'title.S'
df = pd.json_normalize(df.Item)
# optional steps
# make the link clickable (quote the href so the attribute is well-formed)
df['title.S'] = '<a href="' + df['title.S'] + '">' + df['title.S'] + '</a>'
# display clickable dataframe in notebook
HTML(df.to_html(escape=False))
# save to html file (to_html writes the file itself when given a path)
df.to_html('test.html', escape=False)
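As a variation on the json_normalize approach, the records can be parsed with the standard json module and flattened in one step, skipping the intermediate 'Item' column. This is only a sketch: the sample lines are inlined instead of read from test.json, and the column name "url" is my own choice, not something from the question.

```python
import json
import pandas as pd

# Sample lines in the same layout as test.json (inlined for a runnable sketch).
lines = [
    '{"Item":{"title":{"S":"https://fasttext.cc/docs/en/autotune.html"}}}',
    '{"Item":{"title":{"S":"https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf"}}}',
]

# Parse each line into a dict, then flatten the nesting.
records = [json.loads(line) for line in lines]
flat = pd.json_normalize(records)

# json_normalize joins nested keys with ".", giving the column 'Item.title.S'.
flat = flat.rename(columns={"Item.title.S": "url"})
print(flat)
```

With the real file, the list of lines would come from iterating over open("test.json") instead.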