I am new to PySpark. How do I do the following using PySpark?

I have a Spark DataFrame like the following:

+-----------+--------------------+-----+
|      Index|                 lst|value|
+-----------+--------------------+-----+
|          1|[3,5,6,7]           |    1|
|          2|[2,6,8,1,2,3,4,5,7] |    5|
|          3|[5,6,7,5,4,3,2,3,1] |    2|
|          4|[8,7,6,4,3,2,3]     |    6|
+-----------+--------------------+-----+ 

I need to get the following:

+-----------+--------------------+-----+-----------+
|      Index|                 lst|value|index_value|
+-----------+--------------------+-----+-----------+
|          1|[3,5,6,7]           |    1|          5|
|          2|[2,6,8,1,2,3,4,5,7] |    5|          3|
|          3|[5,6,7,5,4,3,2,3,1] |    2|          7|
|          4|[8,7,6,4,3,2,3]     |    6|          3|
+-----------+--------------------+-----+-----------+

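I tried to get the value using a UDF but could not get it to work. I know this is a very basic question. I can do this in pandas, but I need to do it in PySpark. The example above is a sample of the data I have.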

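If you are essentially trying to get the value from lst at the index given in value, you can do this with getItem, which indexes into an ArrayType column. Indexing with getItem is zero-based, which matches your expected output.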

Data Preparation

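from io import StringIO

import pandas as pd
from pyspark.sql import SparkSession, functions as F

# `sql` is assumed here to be an active SparkSession (an SQLContext also works)
sql = SparkSession.builder.getOrCreate()

s = StringIO("""
Index|lst|value
1|3,5,6,7|1
2|2,6,8,1,2,3,4,5,7|5
3|5,6,7,5,4,3,2,3,1|2
4|8,7,6,4,3,2,3|6
""")

# Read the pipe-delimited sample into pandas, then convert it to a Spark DataFrame
df = pd.read_csv(s, delimiter='|')
sparkDF = sql.createDataFrame(df)

# Split the comma-separated string into an array of integers
sparkDF = sparkDF.withColumn("lst", F.split(F.col("lst"), ",").cast("array<int>"))

sparkDF.show()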

+-----+--------------------+-----+
|Index|                 lst|value|
+-----+--------------------+-----+
|    1|        [3, 5, 6, 7]|    1|
|    2|[2, 6, 8, 1, 2, 3...|    5|
|    3|[5, 6, 7, 5, 4, 3...|    2|
|    4|[8, 7, 6, 4, 3, 2...|    6|
+-----+--------------------+-----+

sparkDF.printSchema()

root
 |-- Index: long (nullable = true)
 |-- lst: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- value: long (nullable = true)

Get Item

sparkDF = sparkDF.withColumn("index_value", F.col("lst").getItem(F.col('value')))

sparkDF.show(truncate=False)

+-----+---------------------------+-----+-----------+
|Index|lst                        |value|index_value|
+-----+---------------------------+-----+-----------+
|1    |[3, 5, 6, 7]               |1    |5          |
|2    |[2, 6, 8, 1, 2, 3, 4, 5, 7]|5    |3          |
|3    |[5, 6, 7, 5, 4, 3, 2, 3, 1]|2    |7          |
|4    |[8, 7, 6, 4, 3, 2, 3]      |6    |3          |
+-----+---------------------------+-----+-----------+
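As a sanity check on the output above, the same per-row lookup can be sketched in plain Python. The helper name `get_item` and its out-of-range handling (returning None rather than raising) are assumptions made for illustration and are not part of the PySpark API. The rows are the sample data from the question.

```python
# Plain-Python sketch of the per-row lookup (hypothetical helper, not PySpark API).
def get_item(lst, value):
    """Return lst[value] using zero-based indexing, or None if out of range."""
    # Assumption: negative or out-of-range indexes yield None instead of raising.
    if 0 <= value < len(lst):
        return lst[value]
    return None


# Sample rows from the question: (Index, lst, value)
rows = [
    (1, [3, 5, 6, 7], 1),
    (2, [2, 6, 8, 1, 2, 3, 4, 5, 7], 5),
    (3, [5, 6, 7, 5, 4, 3, 2, 3, 1], 2),
    (4, [8, 7, 6, 4, 3, 2, 3], 6),
]

# Look up one element per row, mirroring the index_value column
index_values = [get_item(lst, value) for _, lst, value in rows]
print(index_values)  # [5, 3, 7, 3]
```

The result matches the index_value column, which confirms that the lookup treats value as a zero-based position in lst.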