PySpark - How to insert records into dataframe 1, based on a column value in dataframe 2
I need to insert records into table1 based on the records in another table, say table2, using spark.sql() in PySpark. Currently I can get one record through a join, but I need to insert as many records into table1 as there are matching records in the second table.
I am providing sample dataframes here:
df1= sqlContext.createDataFrame([("xxx1","81A01","TERR NAME 01"),("xxx1","81A01","TERR NAME 02"), ("xxx1","81A01","TERR NAME 03")], ["zip_code","zone_code","territory_name"])
df2= sqlContext.createDataFrame([("xxx1","81A01","","NY")], ["zip_code","zone_code","territory_name","state"])
df1.show()
+--------+---------+--------------+
|zip_code|zone_code|territory_name|
+--------+---------+--------------+
|    xxx1|    81A01|  TERR NAME 01|
|    xxx1|    81A01|  TERR NAME 02|
|    xxx1|    81A01|  TERR NAME 03|
+--------+---------+--------------+
# Show the contents of df2
df2.show()
+--------+---------+--------------+-----+
|zip_code|zone_code|territory_name|state|
+--------+---------+--------------+-----+
|    xxx1|    81A01|          null|   NY|
+--------+---------+--------------+-----+
In the example above, I need to join df2 with df1 on zip_code and get as many records as there are territory_names in df1.
The expected result in df2 is:
+--------+---------+--------------+-----+
|zip_code|zone_code|territory_name|state|
+--------+---------+--------------+-----+
|    xxx1|    81A01|  TERR NAME 01|   NY|
|    xxx1|    81A01|  TERR NAME 02|   NY|
|    xxx1|    81A01|  TERR NAME 03|   NY|
+--------+---------+--------------+-----+
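In DataFrame-API terms, the expected result is just a left join on zip_code that takes territory_name from df1. A minimal sketch of that idea (my own illustration, assuming df2's empty territory_name column can simply be dropped and re-filled from df1):

expanded = (
    df2.drop("territory_name")                                        # discard the empty column
       .join(df1.select("zip_code", "territory_name"),                # borrow all names for the zip
             on="zip_code", how="left")
       .select("zip_code", "zone_code", "territory_name", "state")    # restore column order
)
expanded.show()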
I need help; currently I can only get a single record through the join.
Sample spark.sql query for getting that one record:
df1.createOrReplaceTempView('df1')
df2.createOrReplaceTempView('df2')
spark.sql("select a.zip_code,a.zone_code,b.territory_name,a.state from df1 a
left join df2 b on a.zip_code = b.zip_code where a.territory_name is null").createOrReplaceTempView('df2')
Thanks.
Posting the code snippet I ended up with, in case it's useful to someone.
df1= sqlContext.createDataFrame([("xxx1","81A01","TERR NAME 01"),("xxx1","81A01","TERR NAME 02"), ("xxx1","81A01","TERR NAME 03")], ["zip_code","zone_code","territory_name"])
df2= sqlContext.createDataFrame([("xxx1","","","NY"), ("xxx1","","TERR NAME 99","NY")], ["zip_code","zone_code","territory_name","state"])
df1.createOrReplaceTempView('df1')
df2.createOrReplaceTempView('df2')
spark.sql("select * from df1").show()
+--------+---------+--------------+
|zip_code|zone_code|territory_name|
+--------+---------+--------------+
|    xxx1|    81A01|  TERR NAME 01|
|    xxx1|    81A01|  TERR NAME 02|
|    xxx1|    81A01|  TERR NAME 03|
+--------+---------+--------------+
spark.sql("select * from df2").show()
+--------+---------+--------------+-----+
|zip_code|zone_code|territory_name|state|
+--------+---------+--------------+-----+
|    xxx1|         |              |   NY|
|    xxx1|         |  TERR NAME 99|   NY|
+--------+---------+--------------+-----+
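The idea behind the query below: rows in df2 with an empty territory_name pick up every territory_name (and zone_code) that df1 holds for the same zip_code, while rows that already carry a territory_name keep it and only borrow df1's zone_code; the UNION also deduplicates the two branches.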
spark.sql("""select a.zip_code, b.zone_code, b.territory_name, a.state from df2 a
left join df1 b
on a.zip_code = b.zip_code
where a.territory_name = ''
UNION
select a.zip_code, b.zone_code, a.territory_name, a.state from df2 a
left join df1 b
on a.zip_code = b.zip_code
where a.territory_name != ''
""").createOrReplaceTempView('df3')
spark.sql("select * from df3").show()
+--------+---------+--------------+-----+
|zip_code|zone_code|territory_name|state|
+--------+---------+--------------+-----+
|    xxx1|    81A01|  TERR NAME 03|   NY|
|    xxx1|    81A01|  TERR NAME 99|   NY|
|    xxx1|    81A01|  TERR NAME 01|   NY|
|    xxx1|    81A01|  TERR NAME 02|   NY|
+--------+---------+--------------+-----+
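If you prefer the DataFrame API over spark.sql(), here is a minimal sketch of the same two-branch logic (my own translation of the query above, so treat it as illustrative rather than the original answer):

from pyspark.sql import functions as F

# Branch 1: df2 rows with an empty territory_name take every
# (zone_code, territory_name) pair that df1 holds for the zip_code.
blank = (
    df2.filter(F.col("territory_name") == "")
       .select("zip_code", "state")
       .join(df1.select("zip_code", "zone_code", "territory_name"),
             on="zip_code", how="left")
)

# Branch 2: df2 rows that already carry a territory_name keep it and
# only borrow df1's zone_code for the same zip_code.
filled = (
    df2.filter(F.col("territory_name") != "")
       .select("zip_code", "territory_name", "state")
       .join(df1.select("zip_code", "zone_code").distinct(),
             on="zip_code", how="left")
)

cols = ["zip_code", "zone_code", "territory_name", "state"]
# union() matches columns by position, so align both branches first;
# distinct() mirrors the deduplication that SQL's UNION performs.
df3_api = blank.select(cols).union(filled.select(cols)).distinct()
df3_api.show()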
Thanks to everyone who helped.