How do I split a dataframe column that contains strings?
I am using Spark 2.2 and am trying to read a dataset from a TSV file that looks like this:
student_id subjects result
"1001" "[physics, chemistry]" "pass"
"1001" "[biology, math]" "fail"
"1002" "[economics]" "pass"
"1002" "[physics, chemistry]" "fail"
I want the following result:
student_id subject result
"1001" "physics" "pass"
"1001" "chemistry" "pass"
"1001" "biology" "fail"
"1001" "math" "fail"
"1002" "economics" "pass"
"1002" "physics" "fail"
"1002" "chemistry" "fail"
I tried the following, but it does not seem to work:
df = spark.read.format("csv").option("header", "true").option("mode", "FAILFAST") \
    .option("inferSchema", "true").option("sep", "\t").load("ds3.tsv")
df.printSchema()
When I run printSchema, I see the following:
root
|-- student_id: integer (nullable = true)
|-- subjects: string (nullable = true)
|-- result: string (nullable = true)
When I then apply the explode function:
df.withColumn("subject", explode(col("subjects"))).select("student_id", "subject", "result").show(2)
I get the following exception:
AnalysisException: "cannot resolve 'explode(`subjects`)' due to data type mismatch: input to function explode should be array or map type, not string;;\n'Project [student_id#10, subjects#11, results#12, explode(subjects#11) AS subject#30]\n+- AnalysisBarrier\n +- Relation[student_id#10,subjects#11,result#12] csv\n"
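I read somewhere that PySpark does not support ArrayType of strings.
Would it be a good idea to write a UDF that trims the "[" and "]" characters from both ends of the "subjects" column values, and then use "split" followed by "explode"?
The second column is a String, so it can be split and then passed to "explode":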
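// Assumes a spark-shell session, where `spark` is already defined.
import org.apache.spark.sql.functions.{explode, expr, split}
import spark.implicits._

val df = List(
  ("1001", "[physics, chemistry]", "pass"),
  ("1001", "[biology, math]", "fail"),
  ("1002", "[economics]", "pass"),
  ("1002", "[physics, chemistry]", "fail")
).toDF("student_id", "subjects", "result")

df
  // Drop the leading "[" (substring positions are 1-based).
  .withColumn("clearedFromLeftBracket", expr("substring(subjects,2,length(subjects))"))
  // Drop the trailing "]".
  .withColumn("clearedFromBrackets", expr("substring(clearedFromLeftBracket,1,length(clearedFromLeftBracket)-1)"))
  // Split on ", " to get an array<string> column, then emit one row per element.
  .withColumn("splitted", split($"clearedFromBrackets", ", "))
  .withColumn("subjectResult", explode($"splitted"))
  .drop("clearedFromLeftBracket", "clearedFromBrackets", "splitted", "subjects")
  .show(false)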
Output:
+----------+------+-------------+
|student_id|result|subjectResult|
+----------+------+-------------+
|1001      |pass  |physics      |
|1001      |pass  |chemistry    |
|1001      |fail  |biology      |
|1001      |fail  |math         |
|1002      |pass  |economics    |
|1002      |fail  |physics      |
|1002      |fail  |chemistry    |
+----------+------+-------------+
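Since the question is in PySpark, here is a rough sketch of the same idea in Python. It uses only the built-in functions regexp_replace, split and explode, so no UDF should be needed; split returns an array<string> column, which is the type explode expects. The helper name explode_subjects and the two pattern constants are made up for illustration, and the sketch assumes df is the dataframe read in the question, with a string column named "subjects".

```python
# Hypothetical pattern constants, kept at module level so they are easy to adjust.
BRACKETS = r"^\[|\]$"   # a leading "[" or a trailing "]"
SEPARATOR = r",\s*"     # a comma followed by optional whitespace


def explode_subjects(df):
    """Return one row per (student_id, subject, result).

    Hypothetical helper: assumes `df` has the columns student_id, subjects
    and result, with `subjects` stored as a string such as "[physics, chemistry]".
    """
    # Imported inside the function so this module loads even where pyspark is absent.
    from pyspark.sql.functions import col, explode, regexp_replace, split

    # Strip the surrounding brackets, leaving "physics, chemistry".
    cleaned = regexp_replace(col("subjects"), BRACKETS, "")
    # split() yields an array<string> column; explode() emits one row per element.
    return (
        df.withColumn("subject", explode(split(cleaned, SEPARATOR)))
          .select("student_id", "subject", "result")
    )
```

With the dataframe from the question, explode_subjects(df).show() should print one row per subject. Splitting on a comma plus optional whitespace, rather than the literal ", ", is a small design choice that tolerates inconsistent spacing inside the brackets.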