Update values in a DataFrame based on JSON structure
I'm having trouble using Relationalize on a Glue data frame. The problem is caused by the structure of the data I receive. Below is the output of printSchema():
root
...
|-- details: struct
...
| |-- reviews: struct
| |-- customerType: choice
| | |-- string
| | |-- struct
| | | |-- brandId: string
| | | |-- id: string
...
Sample data that produces this result in the dynamic frame looks like this:
type one:
"customerType": {
"name_JP": "管理見込",
"id": "002",
"brand": "XXX",
"brandId": "XXX#002",
"name_EN": "Managed"
},
type two:
"customerType": "",
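My idea is to replace the empty string with None or an empty struct object. I tried the following code, but it fails, and it is not clear to me how to fix it.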
import pyspark.sql.functions as F
from pyspark.sql.types import *
new_df = case_details.toDF()
new_df = new_df.select('*', 'details.reviews.*') \
.withColumn("generalReason", F.when(str(F.col("generalReason")) == F.lit(""), StructType()).otherwise(F.col("generalReason"))) \
.drop(*new_df.select('details.reviews.*').columns)
m_df = DynamicFrame.fromDF(new_df, glueContext, "m_df")
m_df.toDF().printSchema()
After reading the AWS documentation for a while, I found the right way to do it.
case_details = case_details.resolveChoice(
    specs=[
        ("details.reviews.generalReason", "project:struct"),
        ("details.reviews.rejectedList.reason", "project:struct"),
        ("details.customerType", "project:struct"),
        ("details.businessCategory", "project:struct"),
        ("details.doctor", "project:struct"),
        ("details.ownerOutletpName", "project:struct"),
        ("details.ownerOutletpName.latitude", "cast:double"),
        ("details.ownerOutletpName.longitude", "cast:double"),
    ],
    transformation_ctx="case_details_resolveChoice"
)
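To illustrate what I understand "project:struct" to do to a single record, here is a plain-Python sketch. It does not call Glue at all: project_struct is a hypothetical helper written only for this illustration, and it assumes that a value which is not a struct (such as the empty string in "type two") ends up as null after projection. Check the resulting schema with printSchema() on your own data rather than relying on this.

```python
from copy import deepcopy


def project_struct(record, dotted_path):
    """Hypothetical helper: keep the value at dotted_path only if it is a
    struct (a dict here); replace any other value with None.

    This mimics the assumed effect of a ("path", "project:struct") spec on
    one record. It is not Glue's implementation.
    """
    result = deepcopy(record)  # leave the input record untouched
    *parents, leaf = dotted_path.split(".")
    node = result
    for key in parents:
        node = node.get(key)
        if not isinstance(node, dict):
            return result  # path does not exist in this record
    if leaf in node and not isinstance(node[leaf], dict):
        node[leaf] = None  # non-struct value (e.g. "") is dropped
    return result


# Shortened versions of the two sample shapes from the question.
type_one = {"details": {"customerType": {"id": "002", "brandId": "XXX#002"}}}
type_two = {"details": {"customerType": ""}}

projected_one = project_struct(type_one, "details.customerType")
projected_two = project_struct(type_two, "details.customerType")

print(projected_one)
print(projected_two)
```

Under that assumption the column has a single struct type afterwards, which is what Relationalize needs in order to flatten it.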