Convert StringType Column To ArrayType In PySpark
I have a dataframe with a column "EVENT_ID" whose datatype is String.
I am running the FPGrowth algorithm, but it throws the following error:
Py4JJavaError: An error occurred while calling o1711.fit.
:java.lang.IllegalArgumentException: requirement failed:
The input column must be array, but got string.
Column EVENT_ID has values such as:
E_34503_Probe
E_35203_In
E_31901_Cbc
I am using the following code to convert the string column to an array type:
df2 = df.withColumn("EVENT_ID", df["EVENT_ID"].cast(types.ArrayType(types.StringType())))
But I get the following error:
Py4JJavaError: An error occurred while calling o1874.withColumn.
: org.apache.spark.sql.AnalysisException: cannot resolve '`EVENT_ID`' due to data type mismatch: cannot cast string to array<string>;;
How can I either convert this column to array type or run the FPGrowth algorithm on a string-typed column?
Original answer
Try the following. The regexp_replace call strips any surrounding brackets and quotes, and split then turns each string into an array (for plain values like these, a single-element array):
In [0]: from pyspark.sql.types import StringType
from pyspark.sql.functions import col, regexp_replace, split
In [1]: df = spark.createDataFrame(["E_34503_Probe", "E_35203_In", "E_31901_Cbc"], StringType()).toDF("EVENT_ID")
df.show()
Out [1]: +-------------+
|     EVENT_ID|
+-------------+
|E_34503_Probe|
|   E_35203_In|
|  E_31901_Cbc|
+-------------+
In [2]: df_new = df.withColumn("EVENT_ID", split(regexp_replace(col("EVENT_ID"), r"(^\[)|(\]$)|(')", ""), ", "))
df_new.printSchema()
Out [2]: root
 |-- EVENT_ID: array (nullable = true)
 |    |-- element: string (containsNull = true)
Hope it helps.
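With EVENT_ID now an array<string> column, FPGrowth should accept it. A minimal sketch of fitting the model (the minSupport and minConfidence values below are hypothetical placeholders, not from the question; tune them for your data):
In [3]: from pyspark.ml.fpm import FPGrowth

# Hypothetical thresholds -- adjust for your dataset.
fp_growth = FPGrowth(itemsCol="EVENT_ID", minSupport=0.2, minConfidence=0.5)
model = fp_growth.fit(df_new)  # no longer fails: EVENT_ID is array<string>
model.freqItemsets.show()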
Edited answer
As @pault pointed out, a simpler solution is the following:
In [0]: from pyspark.sql.types import StringType
from pyspark.sql.functions import array
In [1]: df = spark.createDataFrame(["E_34503_Probe", "E_35203_In", "E_31901_Cbc"], StringType()).toDF("EVENT_ID")
df.show()
Out [1]: +-------------+
|     EVENT_ID|
+-------------+
|E_34503_Probe|
|   E_35203_In|
|  E_31901_Cbc|
+-------------+
In [2]: df_new = df.withColumn("EVENT_ID", array(df["EVENT_ID"]))
df_new.printSchema()
Out [2]: root
 |-- EVENT_ID: array (nullable = false)
 |    |-- element: string (containsNull = true)
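The array() function simply wraps each string in a single-element array, which is exactly the shape FPGrowth expects. As a quick sanity check (output sketched from the sample data above):
In [3]: df_new.show()
Out [3]: +---------------+
|       EVENT_ID|
+---------------+
|[E_34503_Probe]|
|   [E_35203_In]|
|  [E_31901_Cbc]|
+---------------+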