PySpark: How to add value to each element in a column of arrays?
I have a DF in PySpark with a column of arrays, and I want to add the number 1 to every element in each array. Here is the DF:
+--------------------+
| growth2|
+--------------------+
|[0.041305445, 0.0...|
|[0.027677462, 0.0...|
|[-0.0027841541, 0...|
|[-0.003083522, 0....|
|[0.03309798, -0.0...|
|[-0.0030860472, 0...|
|[0.01870109, -0.0...|
|[0.0, 0.0, 0.0, 0...|
|[0.030841235, 0.0...|
|[-0.07487654, 0.0...|
|[-0.0030791108, 0...|
|[0.010564512, 0.0...|
|[0.017113779, 0.0...|
|[-0.0030568982, 0...|
|[0.8942986, 0.020...|
|[0.039178953, 0.0...|
|[-0.020131985, -0...|
|[0.09150412, -0.0...|
|[0.024969723, 0.0...|
|[0.017103601, -0....|
+--------------------+
only showing top 20 rows
Here is the first row:
Row(growth2=[0.041305445, 0.046466704, 0.16028039, 0.05724156, 0.03765997, 0.103110574, 0.031785928, 0.04724884, -0.028079592, 0.009382707, -0.25695816, 0.19432063, 0.061015617, 0.09409759, 0.12152613, 0.039392408, 0.989114, 0.04910219, 0.46904725, 0.0])
So the output should look like:
Row(growth2=[1.041305445, 1.046466704, 1.16028039, 1.05724156, 1.03765997, 1.103110574, 1.031785928, 1.04724884, 0.971920408, 1.009382707, 0.74304184, 1.19432063, 1.061015617, 1.09409759, 1.12152613, 1.039392408, 1.989114, 1.04910219, 1.46904725, 1.0])
Is there a PySpark function that can do this? I want to avoid writing a Pandas UDF, because I have 50+ million rows and that would be slow compared to a native solution.
Spark provides higher-order functions that operate on arrays natively:
import pyspark.sql.functions as f

# TRANSFORM evaluates the lambda el -> el + 1 for every element of the array
df = df.withColumn('growth2', f.expr('TRANSFORM(growth2, el -> el + 1)'))
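If you are on Spark 3.1+, the same element-wise operation is also exposed directly in the DataFrame API via pyspark.sql.functions.transform, so the lambda can be written in Python instead of inside a SQL string. A minimal sketch, assuming the same DataFrame df with the array column growth2:

import pyspark.sql.functions as f

# transform() applies the lambda to each array element as a Column expression
df = df.withColumn('growth2', f.transform('growth2', lambda el: el + 1))

Both versions run entirely as native Spark expressions, so there is no per-row Python serialization overhead as there would be with a Pandas UDF.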