Update values in a dataframe based on JSON structure

Question

I'm having trouble using Relationalize on a Glue data frame. The problem comes from the structure of the data I receive. Below is the output of printSchema():

root
...
|-- details: struct
        ...
|    |-- reviews: struct
|    |-- customerType: choice
|    |    |-- string
|    |    |-- struct
|    |    |    |-- brandId: string
|    |    |    |-- id: string
                ...

The sample data that produces this result in the dynamic frame looks like this:

type one:
"customerType": {
   "name_JP": "管理見込",
   "id": "002",
   "brand": "XXX",
   "brandId": "XXX#002",
   "name_EN": "Managed"
 },
 
type two:
"customerType": "",

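My idea is to update the empty string to None or to an empty struct object. I tried the following code, but it fails, and it's not clear to me how to fix it.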

import pyspark.sql.functions as F
from pyspark.sql.types import *

new_df = case_details.toDF()

new_df = new_df.select('*', 'details.reviews.*') \
   .withColumn("generalReason", F.when(str(F.col("generalReason")) == F.lit(""), StructType()).otherwise(F.col("generalReason"))) \
   .drop(*new_df.select('details.reviews.*').columns)

m_df = DynamicFrame.fromDF(new_df, glueContext, "m_df")
m_df.toDF().printSchema()

Answer 1

After reading through the AWS documentation for a while, I found the right way to do it.

case_details = case_details.resolveChoice(
    specs=[
        ("details.reviews.generalReason", "project:struct"),
        ("details.reviews.rejectedList.reason", "project:struct"),
        ("details.customerType", "project:struct"),
        ("details.businessCategory", "project:struct"),
        ("details.doctor", "project:struct"),
        ("details.ownerOutletpName", "project:struct"),
        ("details.ownerOutletpName.latitude", "cast:double"),
        ("details.ownerOutletpName.longitude", "cast:double"),
    ],
    transformation_ctx="case_details_resolveChoice",
)
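
To make the effect of "project:struct" on the two sample shapes from the question easier to picture, here is a rough plain-Python sketch. It does not use the Glue API: project_struct is a hypothetical helper, and it assumes that projecting a choice column to struct keeps struct values and turns anything else (such as the empty string) into null. Check the actual behavior against your own data.

```python
from typing import Any, Optional


def project_struct(value: Any) -> Optional[dict]:
    """Hypothetical stand-in for "project:struct": keep dict (struct)
    values and map every other value, such as "", to None."""
    return value if isinstance(value, dict) else None


# The two shapes of "customerType" shown in the question.
records = [
    {"customerType": {"id": "002", "brandId": "XXX#002", "name_EN": "Managed"}},
    {"customerType": ""},
]

# Apply the projection to the ambiguous field of every record.
projected = [
    {**rec, "customerType": project_struct(rec["customerType"])}
    for rec in records
]

for rec in projected:
    print(rec)
```

Once every record has either a struct or a null in that field, the column has a single type, which is the situation Relationalize needs.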

amazon-web-services

pyspark

aws-glue