PySpark: Multiple where conditions vs AND (&&)
.where((col('Country')==Country) & (col('Year')>startYear))
I can apply the where condition in either of two ways. I feel the one below improves readability. Is there any other difference, and which one is best?
.where(col('Country')==Country)
.where(col('Year')>startYear)
If readability is the concern, I would suggest something like this (assuming from pyspark.sql import functions as F, and using an f-string to interpolate the Python variables Country and startYear):
.where(F.expr(f"Country <=> '{Country}' and Year > {startYear}"))
Here <=> is Spark's null-safe equality operator: a plain = comparison evaluates to null when either side is null, so such rows are silently dropped by the condition, whereas <=> always returns true or false.
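A minimal sketch of that null behaviour (the demo DataFrame is my own, assuming a SparkSession named spark):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical toy data: the second row has a null Country
demo = spark.createDataFrame([("india",), (None,)], "Country string")

# '=' yields NULL when one side is NULL; '<=>' yields false (or true
# when both sides are NULL), so the predicate is never NULL
demo.selectExpr(
    "Country",
    "Country = 'india'   AS plain_eq",      # NULL on the null row
    "Country <=> 'india' AS null_safe_eq"   # false on the null row
).show()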
I tried this with an example, and both approaches gave the same result, so there is no other difference.
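The answer does not show how data was built; a sketch that reproduces it (the construction is mine, assuming a SparkSession named spark):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Reconstruction of the sample DataFrame shown by data.show() below
data = spark.createDataFrame(
    [(1, "india", 2018), (2, "usa", 2018), (3, "france", 2019),
     (4, "china", 2019), (5, "india", 2020), (6, "australia", 2021),
     (7, "india", 2016), (8, "usa", 2019)],
    ["id", "Country", "year"],
)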
data.show()
+---+---------+----+
| id| Country|year|
+---+---------+----+
| 1| india|2018|
| 2| usa|2018|
| 3| france|2019|
| 4| china|2019|
| 5| india|2020|
| 6|australia|2021|
| 7| india|2016|
| 8| usa|2019|
+---+---------+----+
# consider Country: 'india', startYear: 2017
data.where((col('Country')=='india') & (col('Year')>2017)).show()
+---+-------+----+
| id|Country|year|
+---+-------+----+
| 1| india|2018|
| 5| india|2020|
+---+-------+----+
data.where(col('Country')=='india')\
.where(col('Year')>2017).show()
+---+-------+----+
| id|Country|year|
+---+-------+----+
| 1| india|2018|
| 5| india|2020|
+---+-------+----+
The explain method helps in understanding how a query is executed: it shows the execution plan with all the steps involved, and in this case it can be used to compare the two filtering strategies. For example:
from pyspark.sql.functions import col

df = spark.createDataFrame([("Spain", 2020),
                            ("Italy", 2020),
                            ("Andorra", 2021),
                            ("Spain", 2021),
                            ("Spain", 2022)], ("Country", "Year"))
df.show()

Country = "Spain"
startYear = 2020
The extended explain output for the AND strategy is:
df.where((col('Country') == Country) & (col('Year') > startYear)).explain(True)
== Parsed Logical Plan ==
'Filter (('Country = Spain) AND ('Year > 2020))
+- LogicalRDD [Country#80, Year#81L], false
== Analyzed Logical Plan ==
Country: string, Year: bigint
Filter ((Country#80 = Spain) AND (Year#81L > cast(2020 as bigint)))
+- LogicalRDD [Country#80, Year#81L], false
== Optimized Logical Plan ==
Filter (((isnotnull(Country#80) AND isnotnull(Year#81L)) AND (Country#80 = Spain)) AND (Year#81L > 2020))
+- LogicalRDD [Country#80, Year#81L], false
== Physical Plan ==
*(1) Filter (((isnotnull(Country#80) AND isnotnull(Year#81L)) AND (Country#80 = Spain)) AND (Year#81L > 2020))
+- *(1) Scan ExistingRDD[Country#80,Year#81L]
while the plans for the multiple-where strategy are:
df.where(col('Country') == Country).where(col('Year') > startYear).explain(True)
== Parsed Logical Plan ==
'Filter ('Year > 2020)
+- Filter (Country#80 = Spain)
+- LogicalRDD [Country#80, Year#81L], false
== Analyzed Logical Plan ==
Country: string, Year: bigint
Filter (Year#81L > cast(2020 as bigint))
+- Filter (Country#80 = Spain)
+- LogicalRDD [Country#80, Year#81L], false
== Optimized Logical Plan ==
Filter (((isnotnull(Country#80) AND isnotnull(Year#81L)) AND (Country#80 = Spain)) AND (Year#81L > 2020))
+- LogicalRDD [Country#80, Year#81L], false
== Physical Plan ==
*(1) Filter (((isnotnull(Country#80) AND isnotnull(Year#81L)) AND (Country#80 = Spain)) AND (Year#81L > 2020))
+- *(1) Scan ExistingRDD[Country#80,Year#81L]
Regardless of the filtering strategy, the query engine arrives at the same optimized and physical plans (the optimizer collapses the two consecutive filters into a single one and adds isnotnull checks), so the queries are equivalent. I agree with you that the second one is better for readability.
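For completeness, where (and its alias filter) also accepts a plain SQL expression string, which some find the most readable variant of all; a sketch against the same df (the f-string interpolation is my own, not from the original answers):
# SQL-string form of the same filter; explain(True) should show the
# same optimized and physical plans as the two variants above
df.where(f"Country = '{Country}' AND Year > {startYear}").explain(True)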