Where condition in Spark

I have the following dataset:

+---------+---------+----------+-----------+-----------+-----------+
| Column1 | Column2 | Column3  | Exspense1 | Exspense2 | Exspense3 |
+---------+---------+----------+-----------+-----------+-----------+
| null    | null    | null     | 175935.40 |   2557400 |         0 |
| null    | null    | 20160511 | 94598.40  |  13050360 |         0 |
| null    | null    | 20160512 | 81337.00  |  12523645 |         0 |
| null    | Item1   | null     | 24955.20  |   4206475 |         0 |
| null    | Item1   | 20160511 | 14143.30  |   2357534 |         0 |
| null    | Item1   | 20160512 | 10811.90  |   1848941 |         0 |
| null    | Item2   | null     | 26725.20  |   2188031 |         0 |
| null    | Item2   | 20160511 | 17807.50  |   1400011 |         0 |
| null    | Item2   | 20160512 | 8917.70   |    788020 |         0 |
| null    | Item3   | null     | 19234.30  |   2787529 |         0 |
| null    | Item3   | 20160511 | 8204.30   |   1162487 |         0 |
| null    | Item3   | 20160512 | 11030.00  |   1625042 |         0 |
| null    | Item4   | null     | 85239.20  |  13848186 |         0 |
| null    | Item4   | 20160511 | 47324.10  |   7157838 |         0 |
| null    | Item4   | 20160512 | 37915.10  |   6690348 |         0 |
| null    | Item5   | null     | 19781.50  |   2543784 |         0 |
| null    | Item5   | 20160511 | 7119.209  |     72490 |         0 |
| null    | Item5   | 20160512 | 12662.30  |   1571294 |         0 |
| Shop1   | null    | null     | 35.70     |     10577 |         0 |
| Shop1   | null    | 20160512 | 35.701    |      0577 |         0 |
| Shop1   | Item1   | null     | 34.40     |     10538 |         0 |
| Shop1   | Item1   | 20160512 | 34.401    |      0538 |         0 |
| Shop1   | Item3   | null     | 1.30      |        39 |         0 |
| Shop1   | Item3   | 20160512 | 1.30      |        39 |         0 |
| Shop2   | null    | null     | 10757.30  |   2163921 |         0 |
| Shop2   | null    | 20160511 | 6672.20   |   1286947 |         0 |
| Shop2   | null    | 20160512 | 4085.10   |    876974 |         0 |
| Shop2   | Item1   | null     | 1510.30   |    370818 |         0 |
| Shop2   | Item1   | 20160511 | 752.101   |     90052 |         0 |
| Shop2   | Item1   | 20160512 | 758.201   |     80766 |         0 |
+---------+---------+----------+-----------+-----------+-----------+

For each column I have a boolean flag, sumCheck, and I have to loop over the columns and check it. Now:

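1. For Column1: if sumCheck is true, I have to filter out the rows where the column is not null and the previous column in the same row is null. Since Column1 is the first column, there is nothing to filter.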

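  1. For Column2: if the check is true, I have to filter out the rows where Column2 is not null and Column1 is null. In other words, I don't want the rows where (Column2 is not null and Column1 is null), so I should get the following: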

<table><tbody><tr><th>Column1</th><th>Column2</th><th>Column3</th><th>Exspense1</th><th>Exspense2</th><th>Exspense3</th></tr><tr><td>null</td><td>null</td><td>null</td><td>175935.40</td><td>2557400</td><td>0</td></tr><tr><td>null</td><td>null</td><td>20160511</td><td>94598.40</td><td>13050360</td><td>0</td></tr><tr><td>null</td><td>null</td><td>20160512</td><td>81337.00</td><td>12523645</td><td>0</td></tr><tr><td>Shop1</td><td>null</td><td>null</td><td>35.70</td><td>10577</td><td>0</td></tr><tr><td>Shop1</td><td>null</td><td>20160512</td><td>35.701</td><td>0577</td><td>0</td></tr><tr><td>Shop1</td><td>Item1</td><td>null</td><td>34.40</td><td>10538</td><td>0</td></tr><tr><td>Shop1</td><td>Item1</td><td>20160512</td><td>34.401</td><td>0538</td><td>0</td></tr><tr><td>Shop1</td><td>Item3</td><td>null</td><td>1.30</td><td>39</td><td>0</td></tr><tr><td>Shop1</td><td>Item3</td><td>20160512</td><td>1.30</td><td>39</td><td>0</td></tr><tr><td>Shop2</td><td>null</td><td>null</td><td>10757.30</td><td>2163921</td><td>0</td></tr><tr><td>Shop2</td><td>null</td><td>20160511</td><td>6672.20</td><td>1286947</td><td>0</td></tr><tr><td>Shop2</td><td>null</td><td>20160512</td><td>4085.10</td><td>876974</td><td>0</td></tr><tr><td>Shop2</td><td>Item1</td><td>null</td><td>1510.30</td><td>370818</td><td>0</td></tr><tr><td>Shop2</td><td>Item1</td><td>20160511</td><td>752.101</td><td>90052</td><td>0</td></tr><tr><td>Shop2</td><td>Item1</td><td>20160512</td><td>758.201</td><td>80766</td><td>0</td></tr></tbody></table>

  1. For Column3: if the check is true, I have to filter the dataset further and remove the rows where Column3 is not null and Column2 is null, so that I get the following:

<table><tbody><tr><th>Column1</th><th>Column2</th><th>Column3</th><th>Exspense1</th><th>Exspense2</th><th>Exspense3</th></tr><tr><td>null</td><td>null</td><td>null</td><td>175935.40</td><td>2557400</td><td>0</td></tr><tr><td>Shop1</td><td>null</td><td>null</td><td>35.70</td><td>10577</td><td>0</td></tr><tr><td>Shop1</td><td>Item1</td><td>null</td><td>34.40</td><td>10538</td><td>0</td></tr><tr><td>Shop1</td><td>Item1</td><td>20160512</td><td>34.401</td><td>0538</td><td>0</td></tr><tr><td>Shop1</td><td>Item3</td><td>null</td><td>1.30</td><td>39</td><td>0</td></tr><tr><td>Shop1</td><td>Item3</td><td>20160512</td><td>1.30</td><td>39</td><td>0</td></tr><tr><td>Shop2</td><td>null</td><td>null</td><td>10757.30</td><td>2163921</td><td>0</td></tr><tr><td>Shop2</td><td>Item1</td><td>null</td><td>1510.30</td><td>370818</td><td>0</td></tr><tr><td>Shop2</td><td>Item1</td><td>20160511</td><td>752.101</td><td>90052</td><td>0</td></tr><tr><td>Shop2</td><td>Item1</td><td>20160512</td><td>758.201</td><td>80766</td><td>0</td></tr></tbody></table>

I currently perform the following steps:

I loop over the columns and look at each flag, starting from the second column. For the second column:

val exceptDf = dataset.filter("Column2 is not null and Column1 is null")

For the third column:

val exceptDf3 = exceptDf.union(dataset.filter("Column3 is not null and Column2 is null"))

Finally I do:

dataset.except(exceptDf3);
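Written out as a loop, the steps above might look like the sketch below. It assumes the column names and the sumCheck flags are available as two parallel sequences; `columns`, `sumCheck`, `removeWithExcept` and the object name are hypothetical names chosen for illustration, and the rows are a small subset of the dataset above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object UnionExceptLoop {
  // Assumption: `columns` and `sumCheck` are parallel sequences (hypothetical shape).
  // For every flagged column after the first, collect the unwanted rows,
  // union them together, and subtract the union from the dataset.
  def removeWithExcept(dataset: DataFrame, columns: Seq[String], sumCheck: Seq[Boolean]): DataFrame = {
    val toRemove = columns.zip(sumCheck).sliding(2).collect {
      case Seq((prev, _), (curr, true)) =>
        dataset.filter(s"$curr is not null and $prev is null")
    }.foldLeft(dataset.limit(0))(_ union _) // start from an empty frame with the same schema
    dataset.except(toRemove)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("union-except-loop").getOrCreate()
    import spark.implicits._

    // A few rows taken from the dataset above (key columns only).
    val dataset: DataFrame = Seq[(Option[String], Option[String], Option[String])](
      (None, None, None),
      (None, Some("Item1"), None),
      (Some("Shop1"), None, Some("20160512")),
      (Some("Shop1"), Some("Item1"), Some("20160512"))
    ).toDF("Column1", "Column2", "Column3")

    removeWithExcept(dataset, Seq("Column1", "Column2", "Column3"), Seq(true, true, true)).show()
    spark.stop()
  }
}
```

One thing to keep in mind with this approach: `except` has EXCEPT DISTINCT semantics, so it also collapses duplicate rows in the result, which a plain filter would not do.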

Since I'm using union, except and filter, I'd like to know whether there is any way, or a single filter, that would let me avoid the union and except functions.

Please help me get the desired result.

You can use Spark's where or filter function.

Sample dataset:

+----+---+----+---+
|  c1| c2|  c3| c4|
+----+---+----+---+
| 2.2|v21|   1|foo|
|null|v22|   2|bar|
| 4.4|v23|   3|baz|
| 5.5|v24|null|foo|
+----+---+----+---+

The rows to exclude are those where (c2 is not null and c1 is null) or (c4 is not null and c3 is null), so the filter keeps the negation of each condition:

Using where:

df.where("(not(c2 is not null and c1 is null)) and (not(c4 is not null and c3 is null))")

Using filter:

import org.apache.spark.sql.functions.col
df.filter(!(col("c2").isNotNull && col("c1").isNull) && !(col("c4").isNotNull && col("c3").isNull))

Output:

+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
|2.2|v21|  1|foo|
|4.4|v23|  3|baz|
+---+---+---+---+
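Applied to the dataset in the question, the same idea can be folded into a single filter that is built from the per-column flags. The following is only a sketch under some assumptions: the column names and the sumCheck flags are taken to be two parallel sequences, and `keepCondition`, `columns`, `sumCheck` and the object name are hypothetical names, not part of any Spark API. The rows are a small subset of the question's data.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

object FlaggedNullFilter {
  // Assumption: `columns` and `sumCheck` are parallel sequences (hypothetical shape).
  // For each flagged column after the first, a row is kept only if it is NOT the case
  // that the column is set while the column before it is null.
  // The first column has no predecessor, so it never contributes a condition.
  def keepCondition(columns: Seq[String], sumCheck: Seq[Boolean]): Column =
    columns.zip(sumCheck).sliding(2).collect {
      case Seq((prev, _), (curr, true)) =>
        !(col(curr).isNotNull && col(prev).isNull)
    }.foldLeft(lit(true))(_ && _)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("flagged-null-filter").getOrCreate()
    import spark.implicits._

    // A few rows taken from the question's dataset.
    val dataset: DataFrame = Seq[(Option[String], Option[String], Option[String], Double)](
      (None, None, None, 175935.40),
      (None, None, Some("20160511"), 94598.40),
      (None, Some("Item1"), None, 24955.20),
      (Some("Shop1"), None, None, 35.70),
      (Some("Shop1"), None, Some("20160512"), 35.701),
      (Some("Shop1"), Some("Item1"), Some("20160512"), 34.401)
    ).toDF("Column1", "Column2", "Column3", "Exspense1")

    val columns = Seq("Column1", "Column2", "Column3")
    val sumCheck = Seq(true, true, true)

    // One pass over the data, no union or except involved.
    dataset.filter(keepCondition(columns, sumCheck)).show()
    spark.stop()
  }
}
```

With all three flags set, this should keep the rows (null, null, null), (Shop1, null, null) and (Shop1, Item1, 20160512) from the subset. Because isNull and isNotNull never evaluate to null themselves, each condition could equally be written as `col(curr).isNull || col(prev).isNotNull`.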