Is there a way to make a Glue job always read data from XML as strings?

Question

I have this XML, which I can read with AWS Glue and insert into RDS. A sample of the XML is below.

<VENDOR>
  <DETAILS>
    <RECORD>
      <VENDOR_NUMBER>123456D</VENDOR_NUMBER>
      <VENDOR_NAME>STORE 1</VENDOR_NAME>
    </RECORD>
    <RECORD>
      <VENDOR_NUMBER>123456</VENDOR_NUMBER>
      <VENDOR_NAME>STORE 2</VENDOR_NAME>
    </RECORD>
    <RECORD>
      <VENDOR_NUMBER>123456C</VENDOR_NUMBER>
      <VENDOR_NAME>STORE 3</VENDOR_NAME>
    </RECORD>
  </DETAILS>
  <TRAILER>
    <TOTAL_RECORD>00003</TOTAL_RECORD>
  </TRAILER>
</VENDOR>

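For some reason, the VENDOR_NUMBER column in the DynamicFrame created from the XML always comes out as a struct type. Below are the sample code and the printSchema() result.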

datasource = glueContext.create_dynamic_frame.from_catalog(database = "database", table_name = "table_name", transformation_ctx = "datasource")

datasource.printSchema()

root
 |-- VENDOR_NAME: string (nullable = true)
 |-- VENDOR_NUMBER: struct (nullable = true)
 |    |-- double: double (nullable = true)
 |    |-- int: integer (nullable = true)
 |    |-- string: string (nullable = true)

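I tried adding a ResolveChoice step to cast the data to string. It works for values read as int, but not for values read as double: the original value is 123456D, but it somehow becomes the double 123456.0. Below are the sample script and the result in RDS.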

resolvechoice = ResolveChoice.apply(frame = datasource, choice = "cast:string", transformation_ctx = "resolvechoice")

VENDOR_NUMBER   VENDOR_NAME
123456.0        STORE 1
123456          STORE 2
123456C         STORE 3
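
The result above is consistent with one possible explanation, offered here as an assumption rather than a confirmed description of Glue's internals: if each value is typed before the cast, a Java-style double parser would accept 123456D, because Java's Double.parseDouble allows a trailing d/D/f/F suffix. The sketch below imitates that behaviour in plain Python. The names `infer_type` and `cast_to_string` are hypothetical and exist only for this illustration.

```python
import re

# Java's Double.parseDouble accepts an optional trailing type suffix
# (d, D, f, F); Python's float() does not, so the rule is emulated here.
# Assumption: this only imitates per-value inference, it is not Glue's code.
_JAVA_DOUBLE = re.compile(r"^[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?[dDfF]?$")


def infer_type(text):
    """Return a hypothetical inferred type name for one XML text value."""
    value = text.strip()
    if re.fullmatch(r"[+-]?\d+", value):
        return "int"
    if _JAVA_DOUBLE.match(value):
        return "double"
    return "string"


def cast_to_string(text):
    """Show what a cast to string yields once the value has been typed."""
    value = text.strip()
    if infer_type(value) == "double":
        # The suffix is consumed while parsing and cannot be recovered.
        return str(float(value.rstrip("dDfF")))
    return value


for raw in ["123456D", "123456", "123456C"]:
    print(raw, infer_type(raw), cast_to_string(raw))
# 123456D double 123456.0
# 123456 int 123456
# 123456C string 123456C
```

If this is what happens, casting after the read cannot restore the original text, which is why the values need to be read as strings in the first place.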

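I also tried updating the table schema in the Data Catalog, changing the data type of every field to string, and selecting the crawler configuration option that ignores schema changes, but it did not help. Below are the crawler options.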

Configuration options
Schema updates in the data store        Ignore the change and don't update the table in the data catalog.
Inherit schema from table               Update all new and existing partitions with metadata from the table.
Object deletion in the data store       Mark the table as deprecated in the data catalog.

Is there a way to make the Glue job always read data from the XML as strings?

Answer 1

Someone on the AWS forum answered my question. I am posting the solution here in case anyone else needs it.

I used spark-xml to generate a DataFrame instead of a DynamicFrame.

df = spark.read.format('xml') \
    .option("rowTag", "RECORD") \
    .load("s3://bucket/glue/input-xml/")

df.printSchema()
df.show()

root
 |-- VENDOR_NAME: string (nullable = true)
 |-- VENDOR_NUMBER: string (nullable = true)

+-----------+-------------+
|VENDOR_NAME|VENDOR_NUMBER|
+-----------+-------------+
|    STORE 1|      123456D|
|    STORE 2|       123456|
|    STORE 3|      123456C|
+-----------+-------------+

For this to work, you need to download the spark-xml JAR file, upload it to S3, and add it to the Glue job's 'Dependent jars path'. https://mvnrepository.com/artifact/com.databricks/spark-xml_2.11/0.7.0 https://docs.aws.amazon.com/en_pv/glue/latest/dg/add-job.html
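
The read in this answer still relies on schema inference, so a file whose VENDOR_NUMBER values all look numeric could be typed as a number again. A minimal sketch of one way to guard against that is to pass an explicit all-string schema to the reader. The helper names `string_schema_ddl` and `read_records_as_strings` and the `RECORD_FIELDS` list are hypothetical, and passing a DDL string to `.schema()` assumes Spark 2.3 or later.

```python
# Field names are taken from the sample XML; every one is declared STRING.
RECORD_FIELDS = ["VENDOR_NUMBER", "VENDOR_NAME"]


def string_schema_ddl(fields):
    """Build a DDL schema string that types every listed field as STRING."""
    return ", ".join("{} STRING".format(name) for name in fields)


def read_records_as_strings(spark, path, row_tag="RECORD"):
    """Read XML rows with spark-xml, forcing every column to string.

    `spark` is an existing SparkSession, for example
    glueContext.spark_session inside a Glue job. The spark-xml JAR must
    already be on the job's classpath for format("xml") to resolve.
    """
    return (
        spark.read.format("xml")
        .option("rowTag", row_tag)
        # An explicit schema means the reader does not infer column types.
        .schema(string_schema_ddl(RECORD_FIELDS))
        .load(path)
    )
```

In a Glue job this might be called as `df = read_records_as_strings(glueContext.spark_session, "s3://bucket/glue/input-xml/")`. If later steps expect a DynamicFrame, `DynamicFrame.fromDF(df, glueContext, "vendors")` from `awsglue.dynamicframe` converts it back; the frame name "vendors" is only a placeholder.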

pyspark-sql

aws-glue