Reshape THEN explode an array in a spark dataframe
I have a Spark dataframe in the following format:
+--------------------+--------------------+
| profiles|__record_timestamp__|
+--------------------+--------------------+
|[0, 1, 1, 1, 3, 1...| 1651737406300000000|
|[1, 0, 1, 2, 1, 0...| 1651736986300000000|
|[2, 1, 3, 1, 0, 0...| 1651737232300000000|
|[1, 1, 3, 1, 2, 0...| 1651737352300000000|
|[0, 1, 0, 0, 0, 1...| 1651737412300000000|
|[0, 1, 0, 1, 1, 1...| 1651737142300000000|
|[3, 1, 0, 1, 1, 1...| 1651737574300000000|
|[2, 0, 3, 1, 0, 1...| 1651737178300000000|
|[0, 0, 0, 1, 2, 1...| 1651737364300000000|
|[0, 0, 1, 0, 0, 0...| 1651737280300000000|
|[1, 0, 0, 1, 0, 0...| 1651737196300000000|
|[0, 0, 0, 0, 0, 1...| 1651737436300000000|
|[8, 2, 0, 0, 0, 3...| 1651737166300000000|
|[4, 0, 1, 2, 0, 0...| 1651737538300000000|
|[1, 2, 0, 1, 1, 0...| 1651737052300000000|
|[1, 3, 0, 1, 0, 1...| 1651737082300000000|
|[1, 1, 1, 2, 0, 0...| 1651737100300000000|
|[1, 0, 0, 0, 1, 0...| 1651736980300000000|
|[1, 1, 0, 0, 0, 0...| 1651737040300000000|
|[1, 0, 1, 0, 1, 1...| 1651737004300000000|
+--------------------+--------------------+
only showing top 20 rows
The arrays in profiles are 91260 elements long. I need to first reshape each one into 90 arrays of 1024, and then explode, so that each sub-array is paired with an integer 0-89 matching its position in the original array.
Any idea how to do this? f.explode() only gives me 1 element per row, split() seems to work only on strings, and I can't find a reshape or array_split() function or anything. TIA
I think what you're looking for is posexplode:
from pyspark.sql import Row
from pyspark.sql.functions import posexplode

eDF = spark.createDataFrame([Row(a=1, intlist=[1, 2, 3], mapfield={"a": "b"})])
eDF.select(posexplode(eDF.intlist)).collect()
[Row(pos=0, col=1), Row(pos=1, col=2), Row(pos=2, col=3)]
You can achieve this by building an array of the required slices and then using the posexplode function. To keep things practical, I created a smaller df to show how it works:
import pyspark.sql.functions as F
from pyspark.sql import Row
SPLIT_COUNT = 3
SPLIT_SIZE = 5
ARRAY_SIZE = SPLIT_COUNT * SPLIT_SIZE
df = spark.createDataFrame([
    Row(profiles=list(range(ARRAY_SIZE)), timestamp=12345)
])

slices = [F.slice(F.col('profiles'), i * SPLIT_SIZE + 1, SPLIT_SIZE) for i in range(SPLIT_COUNT)]

df.select(
    F.posexplode(F.array(*slices)),
    F.col('timestamp')
).show()
+---+--------------------+---------+
|pos| col|timestamp|
+---+--------------------+---------+
| 0| [0, 1, 2, 3, 4]| 12345|
| 1| [5, 6, 7, 8, 9]| 12345|
| 2|[10, 11, 12, 13, 14]| 12345|
+---+--------------------+---------+
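Scaling this to the question's data, the only moving parts are the two constants. One caveat: 91260 is not divisible by 1024 (90 × 1014 = 91260), so the slice size of 1014 below is an assumption; adjust it to however your arrays actually factor. The index arithmetic that `F.slice` relies on can be sketched in plain Python:

```python
# Sketch of the slice bookkeeping, in plain Python.
# ASSUMPTION: the 91260-element array splits into 90 slices of 1014
# (91260 / 90 = 1014; the 1024 mentioned in the question does not divide 91260).
SPLIT_COUNT = 90
SPLIT_SIZE = 91260 // SPLIT_COUNT  # 1014

profiles = list(range(91260))  # stand-in for one row's profiles array

# F.slice(col, start, length) uses 1-based start positions, so slice i
# starts at i * SPLIT_SIZE + 1; the 0-based Python equivalent is:
slices = [profiles[i * SPLIT_SIZE:(i + 1) * SPLIT_SIZE] for i in range(SPLIT_COUNT)]

# posexplode over F.array(*slices) would then emit one (pos, col) row
# per slice, with pos running 0..89 as the question asks:
rows = list(enumerate(slices))
```

After the posexplode, each original row becomes 90 rows, and the `pos` column carries the 0-89 slice index, alongside whatever other columns (e.g. the timestamp) you select through.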