JSON to pandas dataframe for multiple filepaths

Question

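I have a folder containing data files for 40 customers. Each customer has one JSON file holding their purchase records. An example path is ../customer_data/customer_1/transaction.json

I want to load this JSON file into a dataframe with the columns customer_id, date, instore, and rewards. The customer ID is the folder name, and I want a new row for each instore/rewards pair.

Goal: the file above should end up looking like this: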

   customer_id| date                     | instore          | rewards
   customer_1 |2018-12-21T12:02:42-08:00 |  0               | 0
   customer_1 |2018-12-24T06:19:03-08:00 |98.25211334228516 | 16.764389038085938
   customer_1 |2018-12-24T06:19:03-08:00 |99.88800811767578 | 18.61212158203125

I tried the following code, but I get a ValueError saying that a metadata name conflicts and needs a distinguishing prefix:

import json
from pathlib import Path
from pandas import json_normalize

# path to file
p = Path('../customer_data/customer_1/transaction.json')

# read json
with p.open('r', encoding='utf-8') as f:
    data = json.loads(f.read())

# create dataframe (this is the call that raises the ValueError)
df = json_normalize(data, record_path='purchase', meta=['instore', 'rewards'], errors='ignore')

Any suggestions would be helpful.

Answer 1

You can try this. customer_id is not in your JSON, so I took it from the file path:

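import json
import pandas as pd

path = '../customer_data/customer_1/transaction.json'
with open(path, 'r', encoding='utf-8') as f:
    data = json.load(f)

# each purchase entry becomes a row; date and tierLevel are carried along as metadata
df = pd.json_normalize(data, record_path=['purchase'], meta=[['date'], ['tierLevel']])
# the customer ID is the third path component, i.e. the folder name
df['customer_id'] = path.split('/')[2]
print(df)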


     instore    rewards                       date tierLevel customer_id
0  98.252113  16.764389  2018-12-24T06:19:03-08:00         7  customer_1
1  99.888008  18.612122  2018-12-24T06:19:03-08:00         7  customer_1
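
For context, here is a small sketch of why the call in the question fails and why taking the metadata from the parent level works. The sample data is hypothetical: its structure (a list of records with date, tierLevel, and a purchase list) is inferred from the answers, not copied from the real file. The names passed to meta must not clash with the column names produced by the records under record_path, and instore and rewards are exactly those columns.

```python
import pandas as pd

# Hypothetical sample; structure inferred from the answers, values shortened.
data = [
    {"date": "2018-12-21T12:02:42-08:00", "tierLevel": 7, "purchase": []},
    {"date": "2018-12-24T06:19:03-08:00", "tierLevel": 7,
     "purchase": [{"instore": 98.25, "rewards": 16.76},
                  {"instore": 99.89, "rewards": 18.61}]},
]

# Using the record columns themselves as meta reproduces the conflict.
try:
    pd.json_normalize(data, record_path="purchase",
                      meta=["instore", "rewards"], errors="ignore")
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Taking meta from the parent level (date, tierLevel) avoids the clash.
df = pd.json_normalize(data, record_path="purchase", meta=["date", "tierLevel"])
print(df)
```

Note that the record with an empty purchase list contributes no rows here, which is why the output above has only two rows.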

Answer 2

Use rglob to find all the files.
Fix data by filling in the empty lists under the purchase key.
Use parent and stem to get the customer ID from the path.
- Given p = Path('../customer_data/customer_1/transaction.json')
- p.parent.stem returns 'customer_1'
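
The path handling described above can be checked on its own with a quick sketch. The temporary directory and the customer_2 folder are made up for the demonstration; only the customer_1 path comes from the question.

```python
from pathlib import Path
import tempfile

# parent.stem picks out the name of the folder that holds the file.
p = Path('../customer_data/customer_1/transaction.json')
customer_id = p.parent.stem
print(customer_id)

# rglob searches the tree recursively; shown here on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / 'customer_data'
    for name in ('customer_1', 'customer_2'):
        (root / name).mkdir(parents=True)
        (root / name / 'transaction.json').write_text('[]', encoding='utf-8')
    found = sorted(f.parent.stem for f in root.rglob('transaction.json'))
print(found)
```

One caveat: stem drops everything after the last dot, so if a folder name could contain a dot, p.parent.name is probably the safer choice.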

import pandas as pd
import json
from pathlib import Path

file_path = Path('../customer_data')
files = file_path.rglob('transaction.json')

df_list = list()
for file in files:

    # read json
    with file.open('r', encoding='utf-8') as f:
        data = json.loads(f.read())
    
    # fix purchase where list is empty
    for x in data:
        if not x['purchase']:  # checks if list is empty
            x['purchase'] = [{'instore': 0, 'rewards': 0}]
        
    # create dataframe
    df = pd.json_normalize(data, 'purchase', ['date', 'tierLevel'])
    
    # add customer
    df['customer_id'] = file.parent.stem
    
    # add to dataframe list
    df_list.append(df)
    

df = pd.concat(df_list)
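
If you also want the exact column order from the question and a continuous index after concatenating, something along these lines should work. This is a sketch on in-memory data: records_to_frame is a hypothetical helper name, and the sample records are made up to match the structure assumed above.

```python
import pandas as pd

def records_to_frame(data, customer_id):
    """Flatten one customer's parsed JSON into rows (hypothetical helper)."""
    # Build patched copies so the caller's data is left untouched;
    # a missing or empty purchase list becomes a single zero row.
    patched = [
        {**rec, "purchase": rec.get("purchase") or [{"instore": 0, "rewards": 0}]}
        for rec in data
    ]
    df = pd.json_normalize(patched, "purchase", ["date"])
    df["customer_id"] = customer_id
    # Reorder to the layout asked for in the question.
    return df[["customer_id", "date", "instore", "rewards"]]

# Made-up sample standing in for the contents of one transaction.json
sample = [
    {"date": "2018-12-21T12:02:42-08:00", "tierLevel": 7, "purchase": []},
    {"date": "2018-12-24T06:19:03-08:00", "tierLevel": 7,
     "purchase": [{"instore": 98.25, "rewards": 16.76},
                  {"instore": 99.89, "rewards": 18.61}]},
]

frames = [records_to_frame(sample, "customer_1"),
          records_to_frame(sample, "customer_2")]
# ignore_index=True renumbers the rows instead of repeating 0, 1, 2 per file.
result = pd.concat(frames, ignore_index=True)
print(result)
```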

Answer 3

You can use my library, anyjsontodf.py.

Basically:

import anyjsontodf as jd

df = jd.jsontodf(jsonfile)

Github: https://github.com/fSEACHAD/anyjsontodf

Medium article: https://medium.com/@fernando.garcia.varela/dancing-with-the-dictionary-transforming-any-json-to-pandas-3328b49269d0

Hope this helps!

