How to create a count of nested JSON objects in a DataFrame row using Spark/Scala

Question

I have a column of JSON object strings that look like this:

"steps":{
    "step_1":{
        "conditions":{
            "complete_by":"2022-05-17",
            "requirement":100
        },
        "status":"eligible",
        "type":"buy"
    },
    "step_2":{
        "conditions":{
            "complete_by":"2022-05-27",
            "requirement":100
        },
        "status":"eligible",
        "type":"buy"
    }
}

Within the steps object there can be any number of steps (within reason).

My question is: how do I create another DataFrame column that counts the number of steps in the JSON string of each row?

I'm using Spark/Scala, so I started writing a UDF along these lines:

def jsonCount (col):

val jsonCountUDF = udf(jsonCount)

val stepDF = stepData.withColumn("NumberOfSteps", jsonCountUDF(col("steps")))

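This is where I'm stuck. I want to go through each row of the steps column and count the step objects inside the steps JSON string. Does anyone have experience with a similar task, or know of a function that simplifies it?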

Answer 1

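You can try selecting that sub-struct and then taking the number of columns. (This assumes steps has already been parsed into a struct column; the result is the number of fields in the schema, so it is the same for every row.)

  import spark.implicits._  // needed for the $"..." column syntax
  val stepSize = df.select($"steps.*").columns.size

Then add it to your df:

  import org.apache.spark.sql.functions.lit
  val df_steps = df.withColumn("NumberOfSteps", lit(stepSize))

Edit: don't use a UDF for this purpose ...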

Answer 2

# PySpark; assumes an active SparkSession named `spark`
import json
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# make some data (named json_str to avoid shadowing the built-in str)
json_str = "{\"steps\":{ \"step_1\":{\"conditions\":{ \"complete_by\":\"2022-05-17\", \"requirement\":100} }  , \"step_2\":{  \"status\":\"eligible\", \"type\":\"buy\"   }  }}"

# implement a function to return the count
def jsonCount(jsonString):
    json_obj = json.loads(jsonString)
    # the number of keys under "steps" is the number of steps
    return len(json_obj["steps"])

# define the udf
JSONCount = udf(jsonCount, IntegerType())

# create sample dataframe
df = spark.createDataFrame([[json_str]], ["json"])

# run udf on dataframe
df.select(df.json, JSONCount(df.json).alias("StepCount")).show()

+--------------------+---------+
|                json|StepCount|
+--------------------+---------+
|{"steps":{ "step_...|        2|
+--------------------+---------+
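
One caveat with the jsonCount function above: it raises an exception if a row holds null, malformed JSON, or an object without a "steps" key, and an exception inside a UDF fails the Spark job. Below is a minimal sketch of a more defensive counting function, written in plain Python so it can be checked without a cluster. The name count_steps and its key parameter are illustrative, not part of the original answer, and returning None (null in the DataFrame) for rows that cannot be counted is an assumption about the desired behavior.

```python
import json
from typing import Optional


def count_steps(json_string: Optional[str], key: str = "steps") -> Optional[int]:
    """Count the entries of the JSON object stored under `key`.

    Returns None when the count cannot be determined (null input,
    malformed JSON, missing key, or a value that is not an object).
    """
    if json_string is None:
        return None  # a null cell stays null
    try:
        parsed = json.loads(json_string)
    except (TypeError, ValueError):
        # json.JSONDecodeError is a subclass of ValueError
        return None
    if not isinstance(parsed, dict):
        return None  # top level is a list, number, etc.
    steps = parsed.get(key)
    if not isinstance(steps, dict):
        return None  # key missing, or not a nested object
    return len(steps)
```

It can be registered in the same way as jsonCount, for example udf(count_steps, IntegerType()); rows for which the function returns None then show up as null in the StepCount column.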

json

scala

apache-spark

apache-spark-sql