Explode a pandas nested array in Python
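I am reading data from MongoDB and writing it to S3, then querying it with Athena.
Here is a document from my collection. It contains an items field, which is an array. How can I explode it into separate columns when saving to S3?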
{"_id":{"$oid":"11111111"},
"receiptId":"rtrtrtrttrtrtrtr",
"paymentSystem":"CARD",
"lastFourDigit":"1111",
"cardType":"ghsl",
"paidOn":{"$numberLong":"1623078706000"},
"currency":"USD",
"totalAmountInCents":{"$numberInt":"0000"},
"items":[{"title":"Jun 21 - Jun 21,2022",
"description":"Starter",
"currency":"USD",
"amountInCents":{"$numberInt":"0000"},
"itemType":"SUBSCRIPTION_PLAN",
"id":{"$numberInt":"1"},
"frequency":"YEAR",
"periodStart":{"$numberLong":"1624288306000"},
"periodEnd":{"$numberLong":"1655824306000"}}],
"subscriptionPlanTitle":"Starter",
"subscriptionPlanFrequency":"YEAR",
"uuid":"1111111111",
"createTimestamp":{"$numberLong":"1624292188650"},
"updateTimestamp":{"$numberLong":"1624292188650"}}
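The Python code I tried:
import pandas as pd

# Fetch the matching documents from MongoDB into a list of dicts
myresult = collection.find(query)
mylist = list(myresult)

# Flatten nested dicts into dotted columns; list fields such as items stay in one column
df = pd.json_normalize(mylist)
# Cast every value to str before writing (applymap is named DataFrame.map in pandas 2.1+)
df1 = df.applymap(str)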
I can save this to Parquet, but all the items end up in a single column. Is there a way to explode them dynamically?
The output schema could be:
_id object
id object
createTimestamp object
updateTimestamp object
deleteTimestamp object
receiptId object
paymentSystem object
lastFourDigit object
cardType object
paidOn object
currency object
totalAmountInCents object
items.title object
items.description object
items.currency object
items.amountInCents object
items.itemType object
items.id object
items.frequency object
items.periodstart object
items.periodend object
subscriptionPlanTitle object
subscriptionPlanFrequency object
uuid object
consumerEmail object
taxAmountInCents object
gifted object
You can use json_normalize, with "items" as the record path and every other key as metadata:
out = pd.json_normalize(data, ['items'], list(data.keys() - {'items'}), record_prefix = 'items.')
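The one-liner above calls data.keys(), so it expects data to be a single dict. For a list of documents, which is what collection.find() gives you, something along these lines might work. This is only a sketch: docs is made-up sample data shaped like the question's document, and it assumes every document has an items list.

```python
import pandas as pd

# Hypothetical sample: two documents shaped like the one in the question,
# trimmed to a few fields; the second one has two items.
docs = [
    {"receiptId": "r1", "currency": "USD",
     "items": [{"title": "Starter", "id": {"$numberInt": "1"}}]},
    {"receiptId": "r2", "currency": "USD", "taxAmountInCents": 5,
     "items": [{"title": "Pro", "id": {"$numberInt": "2"}},
               {"title": "Add-on", "id": {"$numberInt": "3"}}]},
]

# Collect every top-level key except the array itself, across all documents.
meta = sorted({key for doc in docs for key in doc} - {"items"})

out = pd.json_normalize(
    docs,
    record_path="items",
    meta=meta,
    record_prefix="items.",
    errors="ignore",  # meta keys missing from a document become NaN
)
print(out)
```

errors="ignore" is what lets the first document through even though it has no taxAmountInCents; with the default errors="raise", a missing meta key raises a KeyError.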
Another option is to create a DataFrame from data, then explode the "items" column, build a separate DataFrame from it, and join it back:
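df = pd.json_normalize(data)
# index=x.index keeps each exploded item on its parent row's label, so the join stays aligned
out1 = df.join(df['items'].explode().pipe(lambda x: pd.DataFrame(x.tolist(), index=x.index)).add_prefix('items.')).drop(columns='items')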
Output:
items.title items.description items.currency items.itemType \
0 Jun 21 - Jun 21,2022 Starter USD SUBSCRIPTION_PLAN
items.frequency items.amountInCents.$numberInt items.id.$numberInt \
0 YEAR 0000 1
items.periodStart.$numberLong items.periodEnd.$numberLong cardType ... \
0 1624288306000 1655824306000 ghsl ...
uuid lastFourDigit _id currency \
0 1111111111 1111 {'$oid': '11111111'} USD
totalAmountInCents createTimestamp \
0 {'$numberInt': '0000'} {'$numberLong': '1624292188650'}
paidOn updateTimestamp \
0 {'$numberLong': '1623078706000'} {'$numberLong': '1624292188650'}
subscriptionPlanTitle paymentSystem
0 Starter CARD
[1 rows x 22 columns]
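In the output above, top-level fields such as _id and paidOn are still dicts, because meta values are not flattened. If you want those as dotted columns too, one possible approach is to flatten the top level first and handle the items separately. A rough sketch, using made-up docs and assuming every document has a non-empty items list:

```python
import pandas as pd

# Hypothetical sample documents, trimmed to a few fields from the question.
docs = [
    {"_id": {"$oid": "a1"}, "paidOn": {"$numberLong": "1623078706000"},
     "items": [{"title": "Starter", "id": {"$numberInt": "1"}}]},
    {"_id": {"$oid": "b2"}, "paidOn": {"$numberLong": "1623078707000"},
     "items": [{"title": "Pro", "id": {"$numberInt": "2"}},
               {"title": "Add-on", "id": {"$numberInt": "3"}}]},
]

# Flatten top-level dicts into dotted columns; the items lists stay in one column.
df = pd.json_normalize(docs)

# One row per item, each keeping the index label of its parent document.
exploded = df["items"].explode()

# Flatten each item dict, then restore the parent labels so the join lines up.
items = pd.json_normalize(exploded.tolist()).add_prefix("items.")
items.index = exploded.index

flat = df.drop(columns="items").join(items).reset_index(drop=True)
print(flat)
```

This gives one row per item, with columns like _id.$oid and items.id.$numberInt and no dict-valued cells left.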
Note that some of the keys in the expected schema (e.g. "taxAmountInCents") do not exist in the sample.
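If Athena needs those columns to exist in every file regardless, you could reindex the flattened frame against the schema before writing it. A minimal sketch, where expected_columns is a hypothetical subset of the schema listed in the question and out stands in for the flattened result:

```python
import pandas as pd

# Hypothetical target schema: a subset of the columns listed in the question.
expected_columns = ["receiptId", "currency", "items.title",
                    "taxAmountInCents", "gifted"]

# Stand-in for the flattened frame produced by json_normalize.
out = pd.DataFrame({"receiptId": ["r1"],
                    "currency": ["USD"],
                    "items.title": ["Starter"]})

# reindex adds any missing column (filled with NaN) and fixes the column order.
aligned = out.reindex(columns=expected_columns)
print(aligned.columns.tolist())
```

Columns that are in the frame but not in expected_columns are dropped by reindex, so the list has to be complete.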