Explode a pandas nested array in Python

Question

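I am reading data from MongoDB and writing it to S3, then querying it with Athena.

This is my collection. It contains an items column, which is an array. How can I explode it into separate columns when saving to S3?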

{"_id":{"$oid":"11111111"},
"receiptId":"rtrtrtrttrtrtrtr",
"paymentSystem":"CARD",
"lastFourDigit":"1111",
"cardType":"ghsl",
"paidOn":{"$numberLong":"1623078706000"},
"currency":"USD",
"totalAmountInCents":{"$numberInt":"0000"},
"items":[{"title":"Jun 21 - Jun 21,2022",
"description":"Starter",
"currency":"USD",
"amountInCents":{"$numberInt":"0000"},
"itemType":"SUBSCRIPTION_PLAN",
"id":{"$numberInt":"1"},
"frequency":"YEAR",
"periodStart":{"$numberLong":"1624288306000"},
"periodEnd":{"$numberLong":"1655824306000"}}],
"subscriptionPlanTitle":"Starter",
"subscriptionPlanFrequency":"YEAR",
"uuid":"1111111111",
"createTimestamp":{"$numberLong":"1624292188650"},
"updateTimestamp":{"$numberLong":"1624292188650"}}

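The Python code I tried:

from pandas import json_normalize

# collection and query come from the existing pymongo setup
myresult = collection.find(query)
mylist = []
for x in myresult:
    mylist.append(x)
df = json_normalize(mylist)
# stringify every cell (DataFrame.applymap is named DataFrame.map in pandas >= 2.1)
df1 = df.applymap(str)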

I can save it to Parquet, but all the items end up in a single column. Is there a way to explode them dynamically?

The output schema might be:


_id                          object
id                           object
createTimestamp              object
updateTimestamp              object
deleteTimestamp              object
receiptId                    object
paymentSystem                object
lastFourDigit                object
cardType                     object
paidOn                       object
currency                     object
totalAmountInCents           object
items.title                  object
items.description            object
items.currency               object
items.amountInCents          object
items.itemType               object
items.id                     object
items.frequency              object
items.periodstart            object
items.periodend              object
subscriptionPlanTitle        object
subscriptionPlanFrequency    object
uuid                         object
consumerEmail                object
taxAmountInCents             object
gifted                       object

Answer 1

You can use json_normalize (here data is a single document held as a dict):

out = pd.json_normalize(data, ['items'], list(data.keys() - {'items'}), record_prefix = 'items.')
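The one-liner above takes its meta keys from a single dict. Since collection.find returns many documents, a minimal sketch for a list of documents might look like the following. The sample docs and the names docs and meta_keys are made up for illustration, and errors="ignore" is used on the assumption that some documents may lack some top-level keys.

```python
import pandas as pd

# Hypothetical sample documents, shaped like the one in the question.
docs = [
    {"_id": {"$oid": "a"}, "receiptId": "r1", "currency": "USD",
     "items": [{"title": "t1", "id": {"$numberInt": "1"}},
               {"title": "t2", "id": {"$numberInt": "2"}}]},
    {"_id": {"$oid": "b"}, "receiptId": "r2", "currency": "USD",
     "items": [{"title": "t3", "id": {"$numberInt": "3"}}]},
]

# Union of all top-level keys across documents, minus the array column.
meta_keys = sorted(set().union(*(d.keys() for d in docs)) - {"items"})

# One output row per element of "items"; top-level fields are repeated per row.
out = pd.json_normalize(
    docs,
    record_path="items",
    meta=meta_keys,
    record_prefix="items.",
    errors="ignore",  # tolerate documents that lack some meta keys
)
print(out)
```

Nested dicts inside each item are flattened into dotted columns such as items.id.$numberInt, while nested meta values such as _id stay as dicts, as in the output shown further down.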

Another option is to create a DataFrame from data, explode the "items" column and build a separate DataFrame from it, then join:

df = pd.json_normalize(data)
out1 = df.join(df['items'].explode().pipe(lambda x: pd.DataFrame(x.tolist())).add_prefix('items.')).drop(columns='items')
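One caveat with the explode-and-join version: pd.DataFrame(x.tolist()) gets a fresh 0..n-1 index, so the join lines up with df only when every document has exactly one item. A possible variant that keeps the exploded index is sketched below. The sample docs are hypothetical, and the sketch assumes every document has at least one item.

```python
import pandas as pd

# Hypothetical sample documents; the first one has two items.
docs = [
    {"_id": {"$oid": "a"}, "receiptId": "r1",
     "items": [{"title": "t1", "id": {"$numberInt": "1"}},
               {"title": "t2", "id": {"$numberInt": "2"}}]},
    {"_id": {"$oid": "b"}, "receiptId": "r2",
     "items": [{"title": "t3", "id": {"$numberInt": "3"}}]},
]

# Flatten top-level fields; "items" stays as a column of lists.
df = pd.json_normalize(docs)

# One row per item; the index still points at the parent document's row.
exploded = df["items"].explode()

# Flatten each item dict, then restore the parent index before joining.
items = pd.json_normalize(exploded.tolist()).add_prefix("items.")
items.index = exploded.index

# Index-based join repeats the parent fields for every item.
out = df.drop(columns="items").join(items)
print(out)
```

Because the top level is normalized first, nested fields such as _id come out as flat columns (_id.$oid) rather than dicts.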

Output:

            items.title items.description items.currency     items.itemType  \
0  Jun 21 - Jun 21,2022           Starter            USD  SUBSCRIPTION_PLAN   

  items.frequency items.amountInCents.$numberInt items.id.$numberInt  \
0            YEAR                           0000                   1   

  items.periodStart.$numberLong items.periodEnd.$numberLong cardType  ...  \
0                 1624288306000               1655824306000     ghsl  ...   

         uuid lastFourDigit                   _id currency  \
0  1111111111          1111  {'$oid': '11111111'}      USD   

       totalAmountInCents                   createTimestamp  \
0  {'$numberInt': '0000'}  {'$numberLong': '1624292188650'}   

                             paidOn                   updateTimestamp  \
0  {'$numberLong': '1623078706000'}  {'$numberLong': '1624292188650'}   

  subscriptionPlanTitle paymentSystem  
0               Starter          CARD  

[1 rows x 22 columns]

Note that some of the keys in the desired schema (e.g. "taxAmountInCents") do not exist in the sample.
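If the Parquet file needs those columns even when no document carries them, one possible approach is to reindex against the expected schema. This is only a sketch: the tiny out frame and the expected_columns list are placeholders, not the real data.

```python
import pandas as pd

# Placeholder for the flattened result; real data would have many more columns.
out = pd.DataFrame({"receiptId": ["r1"], "items.title": ["t1"]})

# Hypothetical list of columns the downstream table expects.
expected_columns = ["receiptId", "items.title", "taxAmountInCents", "gifted"]

# Columns missing from the data are added and filled with NaN;
# columns not listed in expected_columns would be dropped.
out = out.reindex(columns=expected_columns)
print(out.dtypes)
```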
