When using Relationalize in Glue, there is no id in the root table
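I have a DynamicFrame in Glue, and I am using the Relationalize method to create 3 new dynamic frames: root_table, root_table_1 and root_table_2.
When I print the schema of the tables, or after I insert the tables into the database, I notice that the id is missing from root_table, so I cannot join root_table to the other tables.
I have tried every combination I could think of.
Is there something I am missing?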
from awsglue.transforms import Relationalize

# renameId, glueContext and args are defined earlier in the job (not shown).
# Relationalize returns a collection of flat frames keyed by table name.
datasource1 = Relationalize.apply(frame = renameId, name = "root_ds", transformation_ctx = "datasource1")
print(datasource1.keys())
print(datasource1.values())
for df_name in datasource1.keys():
    m_df = datasource1.select(df_name)
    print("Writing to Redshift table: " + df_name)
    m_df.printSchema()
    glueContext.write_dynamic_frame.from_jdbc_conf(frame = m_df, catalog_connection = "Redshift", connection_options = {"database" : "redshift", "dbtable" : df_name}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "df_to_db")
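I used the code below (with the imports removed) on your data and wrote the output to S3. The two resulting files are pasted after the code. I read the data from the Glue catalog after running a crawler over your data.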
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "sampledb", table_name = "json_aws_glue_relationalize_Whosebug", transformation_ctx = "datasource0")
# Second argument is the S3 staging path used while relationalizing.
dfc = datasource0.relationalize("advertise_root", "s3://aws-glue-temporary-009551040880-ap-southeast-2/")
for df_name in dfc.keys():
    m_df = dfc.select(df_name)
    print("Writing to S3 file: " + df_name)
    # One CSV output folder per relationalized table.
    datasink2 = glueContext.write_dynamic_frame.from_options(frame = m_df, connection_type = "s3", connection_options = {"path": "s3://aws-glue-relationalize-Whosebug/" + df_name +"/"}, format = "csv", transformation_ctx = "datasink2")
job.commit()
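To make the two output files easier to read, here is a rough pure-Python sketch of the kind of reshaping relationalize appears to perform: nested structs become dotted column names, and each array is moved to its own table with id and index columns while the root row keeps only a key. This is not Glue's implementation. The function name relationalize_sketch, the counter-based keys, the child table naming, and the sample values (including the id 1001) are all assumptions made for illustration.

```python
from itertools import count


def relationalize_sketch(record, root_name="root"):
    """Flatten one nested record into a root row plus child-table rows.

    Illustrative only: it imitates the output shape shown in this answer,
    not the real Glue transform.
    """
    tables = {root_name: []}
    key_gen = count(1)  # hypothetical generated keys: 1, 2, 3, ...

    def flatten(prefix, value, row):
        if isinstance(value, dict):
            # Struct: recurse, building dotted column names.
            for name, child_value in value.items():
                flatten(prefix + "." + name if prefix else name, child_value, row)
        elif isinstance(value, list):
            # Array: the parent row keeps only a generated key ...
            key = next(key_gen)
            row[prefix] = key
            # ... and the elements go to a child table (naming is assumed).
            child_rows = tables.setdefault(root_name + "_" + prefix, [])
            for index, item in enumerate(value):
                child_row = {"id": key, "index": index}
                flatten(prefix + ".val", item, child_row)
                child_rows.append(child_row)
        else:
            row[prefix] = value

    root_row = {}
    flatten("", record, root_row)
    tables[root_name].append(root_row)
    return tables


# Made-up sample record, loosely modeled on the columns in this thread.
record = {
    "id": 1001,
    "saleAmount": {"amount": 1.0, "currency": "EUR"},
    "transactionParts": [
        {"amount": 1.0, "commissionAmount": 1.5, "commissionGroupName": "Lead"}
    ],
}
tables = relationalize_sketch(record, "advertise_root")
for table_name, rows in tables.items():
    print(table_name, rows)
```

Note that the root row still carries its own id column, and that the id column in the child table is a different thing: it holds the generated key stored under transactionParts in the root row.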
Main table
advertiserCountry,advertiserId,amendReason,amended,clickDate,clickDevice,clickRefs.clickRef2,clickRefs.clickRef6,commissionAmount.amount,"commissionAmount.currency","commissionSharingPublisherId",commissionStatus,customParameters,customerCountry,declineReason,id,ipHash, lapseTime,oldCommissionAmount,oldSaleAmount,orderRef,originalSaleAmount,paidToPublisher,paymentId,publisherId,publisherUrl,saleAmount.amount,saleAmount.currency,siteName,transactionDate,transactionDevice,transactionParts,transactionQueryId,type,url,validationDate, voucherCode,voucherCodeUsed,partition_0
AT,123456,false,2018-09-05T16:31:00,iPhone,"asdsdedrfrgthyjukiloujhrdf45654565423212",www.website.at,1.5,EUR,pending,AT,321547896,-27670654789123380 ,68,false,0,654987,1.0,EUR,https://www.site.at,2018-09-05T16:32:00,iPhone,1,0,Lead,https://www.website.at,,,false,advertise
The other table, transactionParts
id,index,"transactionParts.val.amount","transactionParts.val.commissionAmount","transactionParts.val.commissionGroupCode","transactionParts.val.commissionGroupId","transactionParts.val.commissionGroupName"
1,0,1.0,1.5,Lead,654654,Lead
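Glue generated a primary key column named "transactionParts" in the base table, and the id column in the transactionParts table is the foreign key to that column. As you can see, the original id column is preserved in the base table.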
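The key relationship described above can be checked offline by joining the two CSV outputs on root.transactionParts = child.id. The following is a minimal sketch using only the Python standard library. The CSV snippets are heavily abbreviated copies of the files above, the root id value 1001 is a placeholder rather than a value from the real file, and join_root_to_parts is a hypothetical helper name.

```python
import csv
import io

# Abbreviated stand-ins for the two output files (most columns omitted).
# The root "id" value is a placeholder, not taken from the real data.
root_csv = """id,transactionParts,type
1001,1,Lead
"""
child_csv = """id,index,transactionParts.val.amount,transactionParts.val.commissionAmount
1,0,1.0,1.5
"""


def join_root_to_parts(root_rows, part_rows, key_column="transactionParts"):
    """Inner-join root rows to child rows on root[key_column] == child['id']."""
    parts_by_key = {}
    for part in part_rows:
        parts_by_key.setdefault(part["id"], []).append(part)

    joined = []
    for root in root_rows:
        for part in parts_by_key.get(root[key_column], []):
            merged = dict(root)
            # Skip the child's "id" (the generated key) so it does not
            # overwrite the root table's original id column.
            merged.update({k: v for k, v in part.items() if k != "id"})
            joined.append(merged)
    return joined


root_rows = list(csv.DictReader(io.StringIO(root_csv)))
part_rows = list(csv.DictReader(io.StringIO(child_csv)))
joined = join_root_to_parts(root_rows, part_rows)
print(joined)
```

The same join condition should carry over to SQL once the tables are loaded into a database: match the root table's transactionParts column against the child table's id column, not the root table's own id.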
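Could you try the Glue job code above on your data and see whether it works (change the source table name to match yours)? Try writing to S3 as CSV first to confirm that the output is correct. Please let me know what you find.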