JSON to pandas dataframe for multiple filepaths

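I have a folder containing data files for 40 customers. Each customer has one JSON file holding their purchase records. An example path is ../customer_data/customer_1/transaction.json

I want to load this JSON file into a dataframe with the columns customer_id, date, instore, and rewards. The customer ID is the folder name, and I want a new row for each instore/rewards pair.

Goal: the file above should end up looking like this: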

   customer_id | date                      | instore           | rewards
   customer_1  | 2018-12-21T12:02:42-08:00 | 0                 | 0
   customer_1  | 2018-12-24T06:19:03-08:00 | 98.25211334228516 | 16.764389038085938
   customer_1  | 2018-12-24T06:19:03-08:00 | 99.88800811767578 | 18.61212158203125

I tried the following code, but I get this error: ValueError: Conflicting metadata name instore, need distinguishing prefix

import json
from pathlib import Path

from pandas import json_normalize

# path to file
p = Path('../customer_data/customer_1/transaction.json')

# read json
with p.open('r', encoding='utf-8') as f:
    data = json.load(f)

# create dataframe (this is the call that raises the ValueError)
df = json_normalize(data, record_path='purchase', meta=['instore', 'rewards'], errors='ignore')


Any suggestions would be helpful.

You can try this. customer_id is not in your JSON, so I took it from the file path:

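import json
import pandas as pd

path = '../customer_data/customer_1/transaction.json'
with open(path, 'r', encoding='utf-8') as f:
    data = json.load(f)

# 'date' and 'tierLevel' sit next to 'purchase', so they can be used as meta
df = pd.json_normalize(data, record_path=['purchase'], meta=[['date'], ['tierLevel']])
# the customer ID is the third path component: 'customer_1'
df['customer_id'] = path.split('/')[2]
print(df)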


     instore    rewards                       date tierLevel customer_id
0  98.252113  16.764389  2018-12-24T06:19:03-08:00         7  customer_1
1  99.888008  18.612122  2018-12-24T06:19:03-08:00         7  customer_1
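
As a rough illustration of why the original call fails: the columns built from record_path already include instore and rewards, so asking for the same names again in meta clashes. The sketch below uses a small hypothetical data list whose shape (a list of records with date, tierLevel, and a purchase list) is only inferred from the code in this thread; the values are made up.

```python
import pandas as pd

# Hypothetical sample; the real transaction.json is not shown in the question.
data = [
    {"date": "2018-12-21T12:02:42-08:00", "tierLevel": 7, "purchase": []},
    {
        "date": "2018-12-24T06:19:03-08:00",
        "tierLevel": 7,
        "purchase": [
            {"instore": 98.25, "rewards": 16.76},
            {"instore": 99.89, "rewards": 18.61},
        ],
    },
]

# 'instore' is already a column produced from the 'purchase' records,
# so requesting it again as meta raises the ValueError.
try:
    pd.json_normalize(
        data, record_path="purchase", meta=["instore", "rewards"], errors="ignore"
    )
    conflict = None
except ValueError as exc:
    conflict = str(exc)

# Requesting only the parent-level keys as meta avoids the clash.
df = pd.json_normalize(data, record_path="purchase", meta=["date", "tierLevel"])
```

Note that the record with an empty purchase list contributes no rows here, which is why the output above has two rows rather than the three in the goal table.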

  • Use rglob to find all the files.
  • Fix data by filling in the empty lists under the purchase key.
  • Use parent and stem to get the customer ID from the path.
    • Given p = Path('../customer_data/customer_1/transaction.json')
    • p.parent.stem returns 'customer_1'

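The file discovery step can be tried in isolation. This is a minimal sketch that builds a throwaway folder tree with tempfile, so the folder names and the empty "[]" file contents are assumptions for illustration only.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Build a hypothetical layout: customer_data/<customer>/transaction.json
    root = Path(tmp) / "customer_data"
    for name in ("customer_1", "customer_2"):
        folder = root / name
        folder.mkdir(parents=True)
        (folder / "transaction.json").write_text("[]", encoding="utf-8")

    # rglob searches root and every subfolder for the given file name
    files = sorted(root.rglob("transaction.json"))

    # the name of the folder holding each file is the customer ID
    customer_ids = [f.parent.name for f in files]
```

Here parent.name and parent.stem give the same result; they would differ only if a folder name contained a dot, in which case stem drops the part after the last dot.
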
import pandas as pd
import json
from pathlib import Path

file_path = Path('../customer_data')
files = file_path.rglob('transaction.json')

df_list = list()
for file in files:

    # read json
    with file.open('r', encoding='utf-8') as f:
        data = json.loads(f.read())
    
    # fix purchase where list is empty
    for x in data:
        if not x['purchase']:  # checks if list is empty
            x['purchase'] = [{'instore': 0, 'rewards': 0}]
        
    # create dataframe
    df = pd.json_normalize(data, 'purchase', ['date', 'tierLevel'])
    
    # add customer
    df['customer_id'] = file.parent.stem
    
    # add to dataframe list
    df_list.append(df)
    

df = pd.concat(df_list)
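
To see the effect of the empty-list fix without touching the filesystem, here is a small sketch on in-memory data. The data list is a hypothetical stand-in for one transaction.json, with its shape inferred from the code above, and the second customer is simply a copy made up to show the concatenation.

```python
import pandas as pd

# Hypothetical contents of one transaction.json
data = [
    {"date": "2018-12-21T12:02:42-08:00", "tierLevel": 7, "purchase": []},
    {
        "date": "2018-12-24T06:19:03-08:00",
        "tierLevel": 7,
        "purchase": [
            {"instore": 98.25, "rewards": 16.76},
            {"instore": 99.89, "rewards": 18.61},
        ],
    },
]

# Replace empty purchase lists with a single zero-valued purchase,
# so the record still produces one row.
for record in data:
    if not record["purchase"]:
        record["purchase"] = [{"instore": 0, "rewards": 0}]

df = pd.json_normalize(data, "purchase", ["date"])
df["customer_id"] = "customer_1"

# Reorder the columns to match the layout requested in the question.
df = df[["customer_id", "date", "instore", "rewards"]]

# ignore_index=True gives the combined frame a fresh 0..n-1 index
# instead of repeating each per-file index.
combined = pd.concat([df, df.assign(customer_id="customer_2")], ignore_index=True)
```

Passing ignore_index=True to pd.concat is optional; without it, each file's rows keep their own index starting from 0.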

You can use my library, anyjsontodf.py.

Basically:

import anyjsontodf as jd

df = jd.jsontodf(jsonfile)

GitHub: https://github.com/fSEACHAD/anyjsontodf

Medium article: https://medium.com/@fernando.garcia.varela/dancing-with-the-dictionary-transforming-any-json-to-pandas-3328b49269d0

Hope this helps!