AWS Glue denest postgres jsonb column
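I want to flatten a jsonb column into multiple target columns in the same table. I can't find a built-in function that does this. The Glue crawler registers the jsonb column as a string. When I put the data on S3, I can use Unbox.apply() to change it to a struct.
I have tried using Relationalize and UnnestFrame to denest the jsonb column. Neither works. Relationalize seems to work only with .json files. I'm not sure why UnnestFrame doesn't work.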
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "mycatalogdb", table_name = "sourcedb_public_tablename", transformation_ctx = "datasource0")
dfc = UnnestFrame.apply(frame = datasource0, transformation_ctx = "dfc", info="", stageThreshold=0, totalThreshold=0)
dropnullfields3 = DropNullFields.apply(frame = dfc, transformation_ctx = "dropnullfields3")
datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": "s3://mybucket"}, format = "parquet", transformation_ctx = "datasink4")
job.commit()
Given a source table with the following content
+----+------------+---------------------------------------------------------+
| id | date       | myjson                                                  |
+----+------------+---------------------------------------------------------+
| 1  | 2019-10-10 | {"url":"some-url","data":{"afield":123,"moredata":567}} |
+----+------------+---------------------------------------------------------+
I want this output (the exact column naming and table format don't matter)
+----+------------+----------+-------------+---------------+
| id | date       | url      | data_afield | data_moredata |
+----+------------+----------+-------------+---------------+
| 1  | 2019-10-10 | some-url | 123         | 567           |
+----+------------+----------+-------------+---------------+
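I eventually discovered that I was using Relationalize incorrectly, but Glue was not throwing an error. I was able to figure this out after using SageMaker interactively and realizing, while reading about it, that relationalize() returns a collection (a DynamicFrameCollection) rather than a single DynamicFrame.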
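To illustrate the point about the return value, here is a minimal sketch of picking the flattened table out of the collection that Relationalize returns. It assumes the top-level table is stored under the name passed to Relationalize; the helper name relationalize_root, the root_name default and the transformation_ctx value are made up for illustration and are not part of the original job.

```python
def relationalize_root(frame, staging_path, root_name="root"):
    """Flatten a DynamicFrame and return the top-level table as a DynamicFrame.

    relationalize_root and root_name are illustrative names, not Glue APIs.
    """
    # Imported here so the sketch can be loaded outside a Glue runtime.
    from awsglue.transforms import Relationalize

    # Relationalize returns a DynamicFrameCollection, not a single frame.
    collection = Relationalize.apply(
        frame=frame,
        staging_path=staging_path,  # S3 path where Glue stages pivoted tables
        name=root_name,
        transformation_ctx="relationalize",
    )

    # The flattened top-level table is keyed by the name passed above;
    # arrays, if any, end up as additional frames in the same collection.
    if root_name not in collection.keys():
        raise KeyError(f"{root_name!r} not in {list(collection.keys())}")
    return collection.select(root_name)
```

In the job from the question this would stand in for the UnnestFrame call, for example dfc = relationalize_root(datasource0, "s3://mybucket/tmp/"), where the temp path is a placeholder. The returned frame can then be passed to DropNullFields and the sink as before.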
Relationalize can be used on a data frame that contains JSON fields. In other words, the data frame does not have to come from pure JSON.
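For a frame read from Postgres, where the crawler registered the jsonb column as a string, one possible route is to Unbox the column into a struct before flattening it. The sketch below is hedged: unbox_json_column and target_column_name are hypothetical helpers, and the renaming step assumes that Relationalize names nested fields with dot-separated paths such as myjson.data.afield.

```python
def unbox_json_column(frame, column="myjson"):
    """Parse a JSON string column of a DynamicFrame into a struct column.

    unbox_json_column is an illustrative helper, not a Glue API.
    """
    # Imported here so the sketch can be loaded outside a Glue runtime.
    from awsglue.transforms import Unbox

    return Unbox.apply(
        frame=frame,
        path=column,      # the string column holding the JSON text
        format="json",
        transformation_ctx="unbox",
    )


def target_column_name(flat_name, json_column="myjson"):
    """Map a flattened name to the wanted one, e.g. to data_afield.

    Assumes dot-separated names after flattening; columns that do not
    belong to the JSON column are returned unchanged.
    """
    prefix = json_column + "."
    if not flat_name.startswith(prefix):
        return flat_name
    return flat_name[len(prefix):].replace(".", "_")
```

With these, the unboxed frame can be handed to Relationalize, and target_column_name("myjson.data.afield") yields data_afield, matching the column names in the desired output.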