Transforming a Pandas DataFrame into JSON with Struct and Array Structure for Upload to BigQuery
Suppose I have data from a DataFrame with the columns id, title, category, subcategory and sub-subcategory, which looks like:
________________________________________________________________
| id | title         | cat       | subcat       | subsubcat    |
|____|_______________|___________|______________|______________|
| 1  | My Book       | cat1      | subcat1      | subsubcat1   |
| 1  | My Book       | cat2      | subcat2      | subsubcat2   |
| 2  | My Other Book | othercat1 | othersubcat1 | othersubcat1 |
| 2  | My Other Book | othercat2 | othersubcat2 | null         |
| 2  | My Other Book | othercat3 | null         | null         |
|____|_______________|___________|______________|______________|
I want to turn it into (newline-delimited) JSON with the following structure:
[
{
'id' : '1',
'title' : 'My Book',
'categoryHiearchies': [
{'categories': ['category1', 'subcategory1', 'sub-subcategory1']},
{'categories': ['category2', 'subcategory2', 'sub-subcategory2']}
]
},
{
'id' : '2',
'title' : 'My Other Book',
'categoryHiearchies': [
{'categories': ['othercategory1', 'othersubcategory1', 'othersub-subcategory1']},
{'categories': ['othercategory2', 'othersubcategory2']},
{'categories': ['othercategory3']},
]
}
]
so that it can be uploaded to BigQuery correctly.
Any ideas on how to apply this transformation?
Assuming the null values are NaN:
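(df.set_index(['id','title'], append=True)
   .stack()  # long format; relies on stack() dropping the NaN entries
   # one {'categories': [...]} dict per original row
   .groupby(level=[0,1,2]).agg(lambda x: {'categories': list(x)})
   # collect the dicts of each (id, title) pair into a list
   .groupby(level=[1,2]).agg(list)
   .reset_index(name='categoryHiearchies')
   .to_json(orient='records', indent=2)
)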
This gives:
[
{
"id":1,
"title":"My Book",
"categoryHiearchies":[
{
"categories":[
"cat1",
"subcat1",
"subsubcat1"
]
},
{
"categories":[
"cat2",
"subcat2",
"subsubcat2"
]
}
]
},
{
"id":2,
"title":"My Other Book",
"categoryHiearchies":[
{
"categories":[
"othercat1",
"othersubcat1",
"othersubcat1"
]
},
{
"categories":[
"othercat2",
"othersubcat2"
]
},
{
"categories":[
"othercat3"
]
}
]
}
]
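The question asks for newline-delimited JSON, whereas the output above is a single indented JSON array. Below is a minimal sketch of one way to produce one JSON object per line instead. It is an alternative to the chained expression, not a replacement for it. It assumes the columns are named cat, subcat and subsubcat as in the table, that nulls are NaN, and that the key keeps the question's spelling categoryHiearchies. The helper to_hierarchies and the variable names are made up for illustration.

```python
import json

import numpy as np
import pandas as pd

# Sample data mirroring the table in the question; nulls are assumed to be NaN.
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 2],
    "title": ["My Book", "My Book",
              "My Other Book", "My Other Book", "My Other Book"],
    "cat": ["cat1", "cat2", "othercat1", "othercat2", "othercat3"],
    "subcat": ["subcat1", "subcat2", "othersubcat1", "othersubcat2", np.nan],
    "subsubcat": ["subsubcat1", "subsubcat2", "othersubcat1", np.nan, np.nan],
})

category_cols = ["cat", "subcat", "subsubcat"]  # assumed column names


def to_hierarchies(group):
    """Build one {'categories': [...]} dict per row, skipping NaN levels."""
    return [
        {"categories": [value for value in row if pd.notna(value)]}
        for row in group[category_cols].itertuples(index=False, name=None)
    ]


# One record per (id, title) pair, in order of first appearance.
records = [
    {
        "id": int(book_id),  # convert the NumPy integer so json can serialize it
        "title": title,
        "categoryHiearchies": to_hierarchies(group),
    }
    for (book_id, title), group in df.groupby(["id", "title"], sort=False)
]

# Newline-delimited JSON: one compact object per line, no enclosing array.
ndjson = "\n".join(json.dumps(record) for record in records)
print(ndjson)
```

Writing ndjson to a file gives one record per line. The explicit loop is slower than the groupby chain on large frames, but it does not depend on stack() dropping the NaN entries.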