Filter a column based on multiple conditions: Scala Spark
I'm having trouble filtering the rows of a column based on multiple conditions. Basically, I store my conditions in an array and want to filter on all of them at once, but I keep getting an error at the end. Can anyone suggest a way to solve this? Here is some sample code of what I'm trying to do:
// Now let's filter through the ADM1 codes to select all 50 US States
val stateArray = Array("USAL", "USMD", "USCA", "USME", "USND", "USSD", "USWY", "USAK", "USWA", "USFL",
"USGA", "USSC", "USNC", "USMA", "USNH", "USVT", "USAR", "USAZ", "USTX", "USLA", "USIL", "USOR", "USNV",
"USID", "USMN", "USNM", "USNE", "USNJ", "USDE", "USVA", "USWV", "USTN", "USKY", "USNY", "USPA", "USIN",
"USOH", "USHI", "USOK", "USIA", "USMI", "USMS", "USMO", "USCO", "USKS", "USUT", "USWI", "USMT", "USRI",
"USCT")
// Let's filter through all of these conditions
val tmpDf3 = tmpDf1.filter(tmpDf1("Actor1Geo_ADM1Code") === stateArray) // this line throws an error
// I can do this with a for loop, but I want everything in one data frame
for (n <- stateArray) {
  val tmpDf2 = tmpDf1
    .filter(tmpDf1("Actor1Geo_ADM1Code") === n)
  tmpDf2.show(false)
  tmpDf2.printSchema()
}
Use `isin`:
tmpDf1.filter(tmpDf1("Actor1Geo_ADM1Code").isin(stateArray: _*))
Example:
val states = Array("USAL", "USMD")
// states: Array[String] = Array(USAL, USMD)
val df = Seq((1, "USAL"), (2, "USMD"), (3, "USGA")).toDF("id", "Actor1Geo_ADM1Code")
// df: org.apache.spark.sql.DataFrame = [id: int, Actor1Geo_ADM1Code: string]
df.show
+---+------------------+
| id|Actor1Geo_ADM1Code|
+---+------------------+
| 1| USAL|
| 2| USMD|
| 3| USGA|
+---+------------------+
df.filter(df("Actor1Geo_ADM1Code").isin(states: _*)).show
+---+------------------+
| id|Actor1Geo_ADM1Code|
+---+------------------+
| 1| USAL|
| 2| USMD|
+---+------------------+
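As a side note, when the membership list grows too large to inline into a predicate, a common alternative is to put the values in their own DataFrame and do a broadcast left-semi join. The sketch below is an assumption on my part, not part of the original answer; the `spark` session setup and the sample data are only there to make it self-contained.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("isin-vs-semijoin")
  .getOrCreate()
import spark.implicits._

val stateArray = Array("USAL", "USMD")
val tmpDf1 = Seq((1, "USAL"), (2, "USGA"), (3, "USMD")).toDF("id", "Actor1Geo_ADM1Code")

// Put the membership list in a one-column DataFrame with the same column name,
// then keep only the rows of tmpDf1 whose code appears in it. broadcast() ships
// the small lookup table to every executor, so no shuffle of tmpDf1 is needed.
val stateDf = stateArray.toSeq.toDF("Actor1Geo_ADM1Code")
val filtered = tmpDf1.join(broadcast(stateDf), Seq("Actor1Geo_ADM1Code"), "left_semi")
filtered.show()
```

For a 50-element array `isin` is perfectly fine; the semi-join mainly pays off when the list has thousands of entries or already lives in a table.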