How to convert column types to match joining dataframes in pyspark?

Question

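I have an empty dataframe in pyspark that I want to use to accumulate the machine learning results coming from model.transform(test_data). But when I try to combine the dataframes with a union, I get an error saying the column types must match.

This is my code: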

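from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType
from pyspark.ml.classification import LogisticRegression

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

# Schema for the empty dataframe that should collect the predictions
schema = StructType([
    StructField("row_num", IntegerType(), True),
    StructField("label", IntegerType(), True),
    StructField("probability", DoubleType(), True),
])
empty = spark.createDataFrame(sc.emptyRDD(), schema)

model = LogisticRegression().fit(train_data)

preds = model.transform(test_data)

# This line raises the AnalysisException shown below
all_preds = empty.unionAll(preds)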

AnalysisException: Union can only be performed on tables with the compatible column types. 
struct<type:tinyint,size:int,indices:array<int>,values:array<double>> <> double at the third column of the second table;

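I have tried casting the columns of my empty dataframe to match, but I cannot get the same types. Is there any way around this? My goal is to run the machine learning inside a for loop, with each prediction output appended to a pyspark dataframe.

For reference, preds looks like: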

preds.printSchema()
root
 |-- row_num: integer (nullable = true)
 |-- label: integer (nullable = true)
 |-- probability: vector (nullable = true)

Answer 1

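The probability column in preds is an ML vector, not a double, which is why the union is rejected (the struct in the error message is the vector's underlying storage type). You can create the empty dataframe from the schema of the preds dataframe instead, so that every column type matches: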

model = LogisticRegression().fit(train_data)
preds = model.transform(test_data)
empty = spark.createDataFrame(sc.emptyRDD(), preds.schema)
all_preds = empty.unionAll(preds)
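
If the goal is to collect predictions from a loop, the empty dataframe may not be needed at all. The following is only a sketch under a few assumptions: the helper names (probability_as_double, union_all, collect_predictions) are made up for illustration, Spark 3.0 or later is assumed for vector_to_array, and index 1 of the probability vector is assumed to be the positive class of a binary classifier. It converts the vector column to a plain double, in case a DoubleType column like the one in the original schema is really what is wanted, and then unions the per-iteration results.

```python
from functools import reduce


def probability_as_double(preds, index=1):
    """Replace the vector-typed `probability` column with a single double.

    Assumption: `index` selects the class of interest (1 = positive class
    for a binary classifier).
    """
    # Imported here so the other helpers do not require the ML package.
    # vector_to_array is available in pyspark.ml.functions from Spark 3.0.
    from pyspark.ml.functions import vector_to_array
    from pyspark.sql import functions as F

    # vector -> array<double>, then pick one element, which yields a double
    return preds.withColumn(
        "probability", vector_to_array(F.col("probability"))[index]
    )


def union_all(frames):
    """Union a non-empty sequence of dataframes that share one schema."""
    frames = list(frames)
    if not frames:
        raise ValueError("need at least one dataframe to union")
    # DataFrame.union matches columns by position, so schemas must line up
    return reduce(lambda left, right: left.union(right), frames)


def collect_predictions(models, test_data):
    """Hypothetical loop: one prediction dataframe per fitted model."""
    frames = [probability_as_double(m.transform(test_data)) for m in models]
    return union_all(frames)
```

With fitted models in a list, all_preds = collect_predictions(models, test_data) would give one dataframe whose probability column is a double. If the full vector should be kept, skip probability_as_double and pass the raw model.transform(...) outputs to union_all.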

Tags: python, apache-spark, pyspark, apache-spark-ml