I am new to pyspark. How do I do the following using pyspark?
I have a Spark dataframe like the following:
+-----------+--------------------+-----+
| Index| lst|value|
+-----------+--------------------+-----+
| 1|[3,5,6,7] | 1|
| 2|[2,6,8,1,2,3,4,5,7] | 5|
| 3|[5,6,7,5,4,3,2,3,1] | 2|
| 4|[8,7,6,4,3,2,3] | 6|
+-----------+--------------------+-----+
I need to get the following:
+-----------+--------------------+-----+-----------+
| Index| lst|value|index_value|
+-----------+--------------------+-----+-----------+
| 1|[3,5,6,7] | 1| 5|
| 2|[2,6,8,1,2,3,4,5,7] | 5| 3|
| 3|[5,6,7,5,4,3,2,3,1] | 2| 7|
| 4|[8,7,6,4,3,2,3] | 6| 3|
+-----------+--------------------+-----+-----------+
I have tried using a udf to fetch the value, but could not get it to work. I know this is a very basic question; I was able to do what I need with pandas, but I have to do it in pyspark. This example is a sample of the data I have.
If you are essentially trying to fetch the element of lst at the position specified in value, you can achieve this with getItem, which indexes into an ArrayType column.
Data Preparation
from io import StringIO

import pandas as pd
from pyspark.sql import SparkSession, functions as F

sql = SparkSession.builder.getOrCreate()

s = StringIO("""
Index|lst|value
1|3,5,6,7|1
2|2,6,8,1,2,3,4,5,7|5
3|5,6,7,5,4,3,2,3,1|2
4|8,7,6,4,3,2,3|6
""")

df = pd.read_csv(s, delimiter='|')
sparkDF = sql.createDataFrame(df)
sparkDF = sparkDF.withColumn("lst", F.split(F.col("lst"), ",").cast("array<int>"))
sparkDF.show()
+-----+--------------------+-----+
|Index| lst|value|
+-----+--------------------+-----+
| 1| [3, 5, 6, 7]| 1|
| 2|[2, 6, 8, 1, 2, 3...| 5|
| 3|[5, 6, 7, 5, 4, 3...| 2|
| 4|[8, 7, 6, 4, 3, 2...| 6|
+-----+--------------------+-----+
sparkDF.printSchema()
root
|-- Index: long (nullable = true)
|-- lst: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- value: long (nullable = true)
Get Item
sparkDF = sparkDF.withColumn("index_value", F.col("lst").getItem(F.col('value')))
sparkDF.show(truncate=False)
+-----+---------------------------+-----+-----------+
|Index|lst |value|index_value|
+-----+---------------------------+-----+-----------+
|1 |[3, 5, 6, 7] |1 |5 |
|2 |[2, 6, 8, 1, 2, 3, 4, 5, 7]|5 |3 |
|3 |[5, 6, 7, 5, 4, 3, 2, 3, 1]|2 |7 |
|4 |[8, 7, 6, 4, 3, 2, 3] |6 |3 |
+-----+---------------------------+-----+-----------+