How to add SerDe parameters in CDK?

Question

I am creating a Glue table with CDK, like this:

  const someTable = new Glue.Table(
    scope,
    "some-table",
    {
      tableName: "some-table",
      columns: [
        {
          name: "value",
          type: Glue.Schema.DOUBLE,
        },
        {
          name: "user_id",
          type: Glue.Schema.STRING,
        },
      ],
      partitionKeys: [
        {
          name: "region_id",
          type: Glue.Schema.BIG_INT,
        },
      ],
      database: glueDb,
      dataFormat: Glue.DataFormat.PARQUET,
      bucket: props.bucket,
    }
  );

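This creates my Glue table as expected, but it also does a few things behind the scenes, such as setting the SerDe serialization library (org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe). For my use case, I also need to specify some SerDe parameters in the table configuration, but I cannot find how to do this in the CDK documentation (https://docs.aws.amazon.com/cdk/api/v1/docs/@aws-cdk_aws-glue.Table.html), even though it looks like something you can configure in the console under "Edit table".

Has anyone run into this, and do you have any suggestions on how to set these parameters?

Thanks!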

Answer 1

Pass the SerDe settings to the Table (@aws-cdk/aws-glue-alpha) construct using the dataFormat prop (of type DataFormat).

// TableProps
{
  dataFormat: glue.DataFormat.PARQUET
}

For finer-grained control, use the L1 CfnTable (aws-cdk-lib) construct, whose API matches the CloudFormation AWS::Glue::Table resource. SerDe parameters go in storageDescriptor.serdeInfo.parameters.

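// CfnTableProps
tableInput: {
  // ...
  storageDescriptor: {
    // Parquet input format, consistent with the Parquet output format and SerDe below
    inputFormat: 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat',
    outputFormat: 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat',
    serdeInfo: {
      serializationLibrary: 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe',
      // SerDe parameters are key/value pairs; values are given as strings
      parameters: { 'serialization.format': '1' },
    },
  },
},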

amazon-web-services

aws-glue

aws-cdk

aws-glue-data-catalog