How to coalesce a list in dataframe in spark
I have code like this:
from pyspark.sql.functions import coalesce, concat, lit, regexp_replace

columns = ("language","users_count","status")
data = (("Java",None,"1"), ("Python", "100000","2"), ("Scala", "3000","3"))
rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF(columns)
df.withColumn('concat', regexp_replace(concat( coalesce(*columns)), " ", "")).show()
The result is:
+--------+-----------+------+------+
|language|users_count|status|concat|
+--------+-----------+------+------+
| Java| null| 1| Java|
| Python| 100000| 2|Python|
| Scala| 3000| 3| Scala|
+--------+-----------+------+------+
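This happens because coalesce(*columns) builds a single column holding the first non-null value among the three columns for each row, and concat then receives only that one value. A rough pure-Python sketch of coalesce's per-row behaviour (illustrative helper, not Spark code):

```python
def coalesce_py(*values):
    # Returns the first non-None value, mimicking Spark's coalesce.
    for v in values:
        if v is not None:
            return v
    return None

# For each row, coalescing over all three columns yields only the
# first non-null column, which here is always `language`:
print(coalesce_py("Java", None, "1"))  # -> Java
```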
If I want the concat column to be Java1, I have to write:
df.withColumn('concat', regexp_replace(concat(
coalesce('language',lit('')),
coalesce('users_count', lit('')),
coalesce('status', lit('')) ), " ", "")).show()
which gives:
+--------+-----------+------+-------------+
|language|users_count|status| concat|
+--------+-----------+------+-------------+
| Java| null| 1| Java1|
| Python| 100000| 2|Python1000002|
| Scala| 3000| 3| Scala30003|
+--------+-----------+------+-------------+
Can anyone help me fix coalesce(*columns) so that I don't have to write a coalesce for every column in columns? Thanks.
Use concat_ws:
>>> df.show()
+--------+-----------+------+
|language|users_count|status|
+--------+-----------+------+
| Java| null| 1|
| Python| 100000| 2|
| Scala| 3000| 3|
+--------+-----------+------+
>>> df.withColumn('concat', concat_ws("",*columns)).show()
+--------+-----------+------+-------------+
|language|users_count|status| concat|
+--------+-----------+------+-------------+
| Java| null| 1| Java1|
| Python| 100000| 2|Python1000002|
| Scala| 3000| 3| Scala30003|
+--------+-----------+------+-------------+
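concat_ws works here because, unlike concat, it silently skips null values, so no per-column coalesce is needed. A minimal pure-Python sketch of that null-skipping behaviour (illustrative helper, not Spark code):

```python
def concat_ws_py(sep, *values):
    # Mimics Spark's concat_ws: None values are dropped and the
    # remaining values are joined with the separator.
    return sep.join(str(v) for v in values if v is not None)

rows = [("Java", None, "1"), ("Python", "100000", "2"), ("Scala", "3000", "3")]
for row in rows:
    print(concat_ws_py("", *row))
# -> Java1
#    Python1000002
#    Scala30003
```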