PySpark

Question

I am trying to use PySpark's split() method on a column whose data looks like this:

[6b87587f-54d4-11eb-95a7-8cdcd41d1310, 603, landing-content, landing-content-provider]

My goal is to extract the 4th element, the one after the last comma.

The syntax I am using is:

mydf.select("primary_component").withColumn("primary_component_01",f.split(mydf.primary_component, "\,").getItem(0)).limit(10).show(truncate=False)

But I keep getting this error:

"cannot resolve 'split(mydf.primary_component, ',')' due to data type mismatch: argument 1 requires string type, however, 'mydf.primary_component' is of struct<uuid:string,id:int,project:string,component:string> type.;;\n'Project [primary_component#17, split(split(primary_component#17, ,)[1], \,)...

I also tried escaping the "," with \, \\, or not escaping it at all, and it makes no difference. Removing ".getItem(0)" does not change anything either.

What am I doing wrong? I feel dumb, but I can't figure out how to fix this... Thanks for any suggestions.

Answer 1

You are getting the error:

"cannot resolve 'split(mydf.`primary_component`, ',')' due to data
type mismatch: argument 1 requires string type, however,
'mydf.`primary_component`' is of
struct<uuid:string,id:int,project:string,component:string>

because your column primary_component is of struct type, whereas split requires a string column.

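Since primary_component is already a struct and you are interested in the value after the last comma, you can use dot notation to read that field directly. Try the following: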

mydf.withColumn("primary_component_01", f.col("primary_component.component"))

In the error message, Spark has shared the schema of your struct as

struct<uuid:string,id:int,project:string,component:string>

that is,

column	data type
uuid	string
id	int
project	string
component	string

For future debugging, you can use mydf.printSchema() to display the schema of the Spark DataFrame you are working with.


PySpark - data mismatch error when trying to split a column content

split

apache-spark