How to retrieve rows from DataFrame 1 that are not present in DataFrame 2 in Scala
I have the following two DataFrames.
DF1 contents
+------------------------------------+------------------+
|REQ_ID                              |PRS_ID            |
+------------------------------------+------------------+
|999999000185425636asdasd12321312321 |999999000185425636|
|999999000185425636asdasd12321312321 |999999000185425636|
|999999000185425636asdasd12321312321 |999999000185425636|
|999999000185425636asdasd12321312321 |999999000185425636|
|999999000185425636asdasd12321312321 |999999000185425636|
|999999000185425636asdasd12321312321 |999999000185425636|
|999999000185392677asdasd12321312321 |999999000185392677|
|999999000185392677asdasd12321312321 |999999000185392677|
|999999000185392677asdasd12321312321 |999999000185392677|
|be1e63ce-cdf6-407d-abf3-f818e0872e92|999999000185254510|
|048022cc-9c26-4c0d-a9a8-551f4a364510|999999000185298297|
|cd66629d-14db-42df-a558-49e78c3ae320|999999000185320831|
|999999000185386838asdasd12321312321 |999999000185386838|
|999999000185386838asdasd12321312321 |999999000185386838|
|999999000185386838asdasd12321312321 |999999000185386838|
|999999000185386838asdasd12321312321 |999999000185386838|
|d2824085-65d3-432f-a4dd-73e31453733a|999999000185266094|
|ebfde7dc-9352-42d4-816b-d2f01653c1c9|999999000185266027|
|dc8b5731-8d1a-4394-ae9d-f74098462be4|999999000185250909|
|9c642932-7a95-4bfe-ae75-687af9151fc8|990000000061356494|
|6469d0dd-0d1d-454b-96f3-87ea9de6db29|999999000185048192|
+------------------------------------+------------------+
DF2 contents
+------------------------------------+------------------+
|REQ_ID                              |PRS_ID            |
+------------------------------------+------------------+
|6b65a7c7-1c88-4aa8-9a22-ae8d17d4b276|990000000061357568|
|d713ed24-cbc0-4880-89ad-cabbd65e57f2|999999000184600448|
|7c8996fc-84a4-4cf0-a429-7c809281a7cc|999999000184649344|
|fdf784ee-ba8f-4efb-ab6e-41aa483b6b70|999999000184709120|
|6469d0dd-0d1d-454b-96f3-87ea9de6db29|999999000185048192|
|5b240d5a-c76e-4a27-aaaf-781250e2beda|999999000185064192|
|0cee0936-b0e7-4331-abdb-6ab388402d0b|999999000185200256|
|33d89b0f-2ad2-43aa-82f3-730d44e03b36|999999000185200384|
|9934f51e-fc31-4f2c-915b-fd47eab029a2|999999000185206656|
|75a94671-7baf-4237-927c-b713efe10412|999999000185216128|
|29d362df-bae8-41f0-b9bd-bbd4a386b480|999999000185216256|
|95a909c5-3d9d-4c95-a567-0e296761a8e2|999999000185217920|
|cd07591c-cda2-4900-914f-8b06d39f9357|999999000185252992|
|2f2eb612-484b-4b5b-9d6f-068a689a4738|999999000185258368|
|3bef0390-6540-4105-be5d-e8978d4414b8|999999000185271168|
|09d16ad0-50db-4f32-b98b-45c848804073|999999000185274880|
|037dbce6-bb13-4404-88af-a855216e2946|999999000185306112|
|efe3e3fd-1d3f-4d41-9c9c-863c04b7d94d|999999000185307136|
|1e18f1d8-cf34-49f4-aeb9-42c00baddd90|999999000185417856|
|b999ef86-6118-4560-8d5f-157882dc1bfc|999999000185456512|
|ebfde7dc-9352-42d4-816b-d2f01653c1c9|999999000185266027|
|999999000185386838asdasd12321312321 |999999000185386838|
+------------------------------------+------------------+
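I need only the DF1 records that are not in DF2. In other words, please ignore the records common to both DataFrames as well as the records that exist only in DF2.
The final output should look like this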
[REQ_ID ,PRS_ID]
[048022cc-9c26-4c0d-a9a8-551f4a364510,999999000185298297]
[999999000185392677asdasd12321312321,999999000185392677]
[999999000185425636asdasd12321312321,999999000185425636]
[9c642932-7a95-4bfe-ae75-687af9151fc8,990000000061356494]
[be1e63ce-cdf6-407d-abf3-f818e0872e92,999999000185254510]
[cd66629d-14db-42df-a558-49e78c3ae320,999999000185320831]
[d2824085-65d3-432f-a4dd-73e31453733a,999999000185266094]
[dc8b5731-8d1a-4394-ae9d-f74098462be4,999999000185250909]
Please help me as soon as possible. Thanks for your help.
The except function should meet your requirement. Just do
df1.except(df2)
and you will get
+------------------------------------+------------------+
|REQ_ID                              |PRS_ID            |
+------------------------------------+------------------+
|048022cc-9c26-4c0d-a9a8-551f4a364510|999999000185298297|
|d2824085-65d3-432f-a4dd-73e31453733a|999999000185266094|
|9c642932-7a95-4bfe-ae75-687af9151fc8|990000000061356494|
|999999000185425636asdasd12321312321 |999999000185425636|
|cd66629d-14db-42df-a558-49e78c3ae320|999999000185320831|
|dc8b5731-8d1a-4394-ae9d-f74098462be4|999999000185250909|
|be1e63ce-cdf6-407d-abf3-f818e0872e92|999999000185254510|
|999999000185392677asdasd12321312321 |999999000185392677|
+------------------------------------+------------------+
dropDuplicates is an expensive method because it triggers a shuffle. But if you want to make sure there are no duplicates, you can do
df1.except(df2).dropDuplicates("REQ_ID", "PRS_ID")
The OP is on Spark 1.6.2, so the dropDuplicates call above does not compile there; it fails with this error
val df3 = df1.except(df2).dropDuplicates("REQ_ID","PRS_ID")
<console>:35: error: overloaded method value dropDuplicates with alternatives:
  (colNames: Array[String])org.apache.spark.sql.DataFrame <and>
  (colNames: Seq[String])org.apache.spark.sql.DataFrame <and>
  ()org.apache.spark.sql.DataFrame
 cannot be applied to (String, String)
       val df3 = df1.except(df2).dropDuplicates("REQ_ID","PRS_ID")
So on that version you should pass the column names as a Seq
df1.except(df2).dropDuplicates(Seq("REQ_ID", "PRS_ID"))
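For anyone who wants to try the two calls above end to end, here is a minimal sketch on toy data. It assumes Spark is on the classpath and runs with a local master. The object name ExceptSketch, the helpers onlyInFirst and demo, and the req-a / req-b values are made up for illustration and are not from the question. It uses SQLContext because the OP is on 1.6.x; as far as I know the varargs form dropDuplicates("REQ_ID", "PRS_ID") only became available in Spark 2.0, while the Seq form compiles on both.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

object ExceptSketch {
  // Rows of `first` that have no identical row in `second`.
  def onlyInFirst(first: DataFrame, second: DataFrame): DataFrame =
    first.except(second)

  // Runs the sketch on toy data and returns the remaining
  // (REQ_ID, PRS_ID) pairs, sorted so the order is deterministic.
  def demo(): Seq[(String, String)] = {
    // local[1] master is an assumption so the sketch runs without a cluster
    val conf = new SparkConf().setAppName("except-sketch").setMaster("local[1]")
    val sc = new SparkContext(conf)
    try {
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._

      // Hypothetical stand-ins for DF1 and DF2 (DF1 contains a repeated row)
      val df1 = Seq(("req-a", "1"), ("req-a", "1"), ("req-b", "2"), ("req-c", "3"))
        .toDF("REQ_ID", "PRS_ID")
      val df2 = Seq(("req-c", "3"), ("req-d", "4"))
        .toDF("REQ_ID", "PRS_ID")

      // Seq form of dropDuplicates, which exists on Spark 1.6.x as well
      val result = onlyInFirst(df1, df2).dropDuplicates(Seq("REQ_ID", "PRS_ID"))

      result.collect().map(r => (r.getString(0), r.getString(1))).toSeq.sorted
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = demo().foreach(println)
}
```

The row shared by both inputs (req-c) and the row that exists only in the second input (req-d) are both absent from the result, which is what the question asks for.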