Scala Spark: how do I merge a set of columns into a single one on a DataFrame?
I'm looking for a way to do this without a UDF, and I'm wondering if it's even possible. Say I have a DF as follows:
Buyer_name  Buyer_state  CoBuyer_name  CoBuyer_state  Price  Date
Bob         CA           Joe           CA             20     010119
Stacy       IL           Jamie         IL             50     020419
... about 3 million more rows ...
I want to turn it into:
Buyer_name  Buyer_state  Price  Date
Bob         CA           20     010119
Joe         CA           20     010119
Stacy       IL           50     020419
Jamie       IL           50     020419
...
Edit: Alternatively, I could:
1. Create two dataframes, dropping the "Buyer" columns from one and the "CoBuyer" columns from the other.
2. Rename the "CoBuyer" columns in the second dataframe to the "Buyer" column names.
3. Concatenate the two dataframes (see the sketch after this list).
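A minimal sketch of that approach, assuming df is the dataframe pictured above (the val names are illustrative):

val buyers = df.select("Buyer_name", "Buyer_state", "Price", "Date")
val coBuyers = df.select("CoBuyer_name", "CoBuyer_state", "Price", "Date").
  withColumnRenamed("CoBuyer_name", "Buyer_name").
  withColumnRenamed("CoBuyer_state", "Buyer_state")
// union concatenates the two row sets, matching columns by position
val merged = buyers.union(coBuyers)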
You can combine struct(Buyer_name, Buyer_state) and struct(CoBuyer_name, CoBuyer_state) into an Array, then flatten it with explode, as shown below:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("Bob", "CA", "Joe", "CA", 20, "010119"),
  ("Stacy", "IL", "Jamie", "IL", 50, "020419")
).toDF("Buyer_name", "Buyer_state", "CoBuyer_name", "CoBuyer_state", "Price", "Date")

df.
  withColumn("Buyers", array(
    // Give both structs identical field names (_1, _2) so that
    // array() sees a single, common element type.
    struct($"Buyer_name".as("_1"), $"Buyer_state".as("_2")),
    struct($"CoBuyer_name".as("_1"), $"CoBuyer_state".as("_2"))
  )).
  withColumn("Buyer", explode($"Buyers")).  // one output row per array element
  select(
    $"Buyer._1".as("Buyer_name"), $"Buyer._2".as("Buyer_state"), $"Price", $"Date"
  ).show
// +----------+-----------+-----+------+
// |Buyer_name|Buyer_state|Price| Date|
// +----------+-----------+-----+------+
// | Bob| CA| 20|010119|
// | Joe| CA| 20|010119|
// | Stacy| IL| 50|020419|
// | Jamie| IL| 50|020419|
// +----------+-----------+-----+------+
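For what it's worth, the same reshaping can also be written without the intermediate array column, using Spark SQL's stack generator via selectExpr. A sketch against the same df, where stack(2, ...) splits the four name/state expressions into two rows of two columns each:

df.selectExpr(
  // emits (Buyer_name, Buyer_state) and (CoBuyer_name, CoBuyer_state)
  // as two separate rows per input row
  "stack(2, Buyer_name, Buyer_state, CoBuyer_name, CoBuyer_state) as (Buyer_name, Buyer_state)",
  "Price", "Date"
).show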
This sounds like an unpivot operation to me, which can be accomplished with Scala's union function:
val df = Seq(
  ("Bob", "CA", "Joe", "CA", 20, "010119"),
  ("Stacy", "IL", "Jamie", "IL", 50, "020419")
).toDF("Buyer_name", "Buyer_state", "CoBuyer_name", "CoBuyer_state", "Price", "Date")

// stack the Buyer columns on top of the CoBuyer columns
val df_new = df.select("Buyer_name", "Buyer_state", "Price", "Date").
  union(df.select("CoBuyer_name", "CoBuyer_state", "Price", "Date"))
df_new.show
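Note that union resolves columns by position rather than by name, so the result simply keeps the column names from the first select (Buyer_name, Buyer_state) and no explicit rename is needed.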
Thanks to Leo for the dataframe definition, which I reused.