Filter string with quotes in Spark dataframe column
I have a DF with this data:
+--------+------------------------------------------+
|recType |value |
+--------+------------------------------------------+
|{"id": 1|{"id": 1, "user_id": 100, "price": 50} |
...
I can filter recType using contains, but how do I use === together with quotes? I seem to get some error every time I try.
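As I understand it, the columns here are strings. If so, the from_json function can parse them into structs, and you can then filter on a struct field instead of matching a quoted string.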
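import org.apache.spark.sql.types.{StructField, StructType, IntegerType}
import org.apache.spark.sql.functions.from_json
// Enables the $"col" syntax; already in scope in spark-shell.
import spark.implicits._

// Schema for the JSON text stored in the recType column.
val recTypeSchema = StructType(Array(
  StructField("id", IntegerType, true)
))

// Schema for the JSON text stored in the value column.
val valueSchema = StructType(Array(
  StructField("id", IntegerType, true),
  StructField("user_id", IntegerType, true),
  StructField("price", IntegerType, true)
))

// Replace each string column with its parsed struct.
val parsedDf = df
  .withColumn("recType", from_json($"recType", recTypeSchema))
  .withColumn("value", from_json($"value", valueSchema))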
parsedDf.printSchema
root
 |-- recType: struct (nullable = true)
 |    |-- id: integer (nullable = true)
 |-- value: struct (nullable = true)
 |    |-- id: integer (nullable = true)
 |    |-- user_id: integer (nullable = true)
 |    |-- price: integer (nullable = true)
parsedDf.filter($"recType.id" === 1).show
+-------+------------+
|recType| value|
+-------+------------+
| {1}|{1, 100, 50}|
+-------+------------+
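For the literal question of using === with quotes, here is a minimal sketch. It assumes the recType column holds the complete JSON text {"id": 1} (the table in the question appears to show it cut off), and it rebuilds a small hypothetical DataFrame purely for illustration; the second row is made up. Scala triple-quoted strings keep the inner double quotes without escaping, and get_json_object is a built-in alternative that pulls out a field without defining a schema.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.get_json_object

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("quoted-filter")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample mirroring the question's data.
// Assumption: recType contains the full JSON string.
val df = Seq(
  ("""{"id": 1}""", """{"id": 1, "user_id": 100, "price": 50}"""),
  ("""{"id": 2}""", """{"id": 2, "user_id": 200, "price": 75}""")
).toDF("recType", "value")

// Exact string match: triple quotes keep the inner double quotes literal.
val exact = df.filter($"recType" === """{"id": 1}""")

// The same literal written with escaped quotes in an ordinary string.
val escaped = df.filter($"recType" === "{\"id\": 1}")

// Extract the id by JSON path; get_json_object returns a string column,
// so the comparison value is the string "1".
val byPath = df.filter(get_json_object($"recType", "$.id") === "1")
```

Exact matching is sensitive to whitespace inside the JSON text, so the from_json or get_json_object route is likely more robust when the formatting of the strings varies.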