pyspark to split array and get key values
I have a dataframe containing an array of key-value pair strings, and I only want to extract the keys.
The number of key-value pairs per row is dynamic, and the naming conventions vary.
Sample Input
+---+----+--------------------------------+
|ID |data|value                           |
+---+----+--------------------------------+
|e1 |D1  |["K1":"V1","K2":"V2","K3":"V3"] |
|e2 |D2  |["K1":"V1","K3":"V3"]           |
|e3 |D1  |["K1":"V1","K2":"V2"]           |
|e4 |D3  |["K2":"V2","K1":"V1","K3":"V3"] |
+---+----+--------------------------------+
Expected Result:
+---+----+-----------+
|ID |data|value      |
+---+----+-----------+
|e1 |D1  |[K1|K2|K3] |
|e2 |D2  |[K1|K3]    |
|e3 |D1  |[K1|K2]    |
|e4 |D3  |[K2|K1|K3] |
+---+----+-----------+
For Spark 2.4+, use the transform function. For each element of the array, extract the key with substring_index, and strip the leading and trailing quotes with trim.
from pyspark.sql.functions import array_join, expr

df.show(truncate=False)
#+---+----+------------------------------------+
#|ID |data|value |
#+---+----+------------------------------------+
#|e1 |D1 |["K1":"V1", "K2": "V2", "K3": "V3"] |
#|e2 |D2 |["K1": "V1", "K3": "V3"] |
#|e3 |D1 |["K1": "V1", "K2": "V2"] |
#|e4 |D3 |["K2": "V2", "K1": "V1", "K3": "V3"]|
#+---+----+------------------------------------+
new_value = """ transform(value, x -> trim(BOTH '"' FROM substring_index(x, ':', 1))) """
df.withColumn("value", expr(new_value)).show()
#+---+----+------------+
#|ID |data|value |
#+---+----+------------+
#|e1 |D1 |[K1, K2, K3]|
#|e2 |D2 |[K1, K3] |
#|e3 |D1 |[K1, K2] |
#|e4 |D3 |[K2, K1, K3]|
#+---+----+------------+
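To see why the expression works, here is a plain-Python sketch of what it does to each array element: substring_index(x, ':', 1) keeps everything before the first ':', and trim(BOTH '"' FROM ...) strips the surrounding double quotes. (extract_key is a hypothetical helper for illustration, not part of the Spark job.)

```python
def extract_key(pair):
    # substring_index(x, ':', 1) -> text before the first ':'
    before_colon = pair.split(":", 1)[0]
    # trim(BOTH '"' FROM ...) -> strip leading/trailing double quotes
    return before_colon.strip('"')

elems = ['"K1":"V1"', '"K2": "V2"', '"K3": "V3"']
print([extract_key(p) for p in elems])  # ['K1', 'K2', 'K3']
```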
If you want the result as a string delimited by |, you can use array_join like this:
df.withColumn("value", array_join(expr(new_value), "|")).show()
#+---+----+--------+
#|ID |data|value |
#+---+----+--------+
#|e1 |D1 |K1|K2|K3|
#|e2 |D2 |K1|K3 |
#|e3 |D1 |K1|K2 |
#|e4 |D3 |K2|K1|K3|
#+---+----+--------+
You can split each value into an array containing the key and the value, then keep the key:
df.withColumn("keys", expr('transform(value, keyValue -> trim(split(keyValue, ":")[0]))')).drop("value")
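A plain-Python mirror of this split-based expression, as a quick sanity check (split_key is a hypothetical helper, not Spark code). Note that SQL trim() without a BOTH clause removes only surrounding spaces, so the double quotes around each key survive unless they are stripped as well, e.g. with trim(BOTH '"' FROM ...) as in the earlier answer.

```python
def split_key(pair):
    # split(keyValue, ':')[0] -> first piece, then trim() -> strip spaces only
    return pair.split(":", 1)[0].strip()

elems = ['"K1":"V1"', '"K2": "V2"']
print([split_key(p) for p in elems])                # ['"K1"', '"K2"'] - quotes remain
print([split_key(p).strip('"') for p in elems])     # ['K1', 'K2']    - quotes stripped
```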