Can Spark SQL not count correctly or can I not write SQL correctly?
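In a Python notebook on Databricks "Community Edition", I'm experimenting with the City of San Francisco's open data about emergency calls to 911 requesting firefighters. (An old 2016 copy of the data used in "Using Apache Spark 2.0 to Analyze the City of San Francisco's Open Data" (YouTube), made available on S3 for that tutorial.)
After mounting the data and reading it into a DataFrame fire_service_calls_df with an explicitly defined schema, I aliased that DataFrame as an SQL table: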
sqlContext.registerDataFrameAsTable(fire_service_calls_df, "fireServiceCalls")
With that and the DataFrame API, I can count the call types that occurred:
fire_service_calls_df.select('CallType').distinct().count()
Out[n]: 34
... or with SQL in Python:
spark.sql("""
SELECT count(DISTINCT CallType)
FROM fireServiceCalls
""").show()
+------------------------+
|count(DISTINCT CallType)|
+------------------------+
| 33|
+------------------------+
... or with an SQL cell:
%sql
SELECT count(DISTINCT CallType)
FROM fireServiceCalls
Why do I get two different counts? (34 appears to be the correct one, even though the talk in the video and the accompanying tutorial notebook mention "35".)
Answer
Can Spark SQL not count correctly or can I not write SQL correctly?
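From the title: I can not write SQL correctly.
Rule for writing SQL: think of NULL and UNDEFINED. count(DISTINCT CallType) skips NULL values, whereas SELECT DISTINCT (like distinct() on the DataFrame) keeps NULL as a row of its own, so counting the rows of the distinct set gives the same result as the DataFrame API: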
%sql
SELECT count(*)
FROM (
SELECT DISTINCT CallType
FROM fireServiceCalls
)
34
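The off-by-one can be reproduced in isolation. The following is only a minimal sketch: it uses Python's built-in sqlite3 module as a stand-in for Spark SQL (an assumption made so the example runs without a cluster), and the five rows are made-up sample data, not the San Francisco dataset. Both engines follow the standard SQL rule that aggregate functions skip NULL, so the same two statements should behave the same way when passed to spark.sql(...), although that is not verified here.

```python
import sqlite3

# Hypothetical miniature stand-in for the fireServiceCalls table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fireServiceCalls (CallType TEXT)")
conn.executemany(
    "INSERT INTO fireServiceCalls VALUES (?)",
    [("Alarms",), ("Marine Fire",), ("Alarms",), (None,), (None,)],
)

# Aggregate form: NULLs are skipped before the distinct values are counted.
count_distinct = conn.execute(
    "SELECT count(DISTINCT CallType) FROM fireServiceCalls"
).fetchone()[0]

# Row form: SELECT DISTINCT keeps one NULL row, and count(*) counts rows.
distinct_rows = conn.execute(
    "SELECT count(*) FROM (SELECT DISTINCT CallType FROM fireServiceCalls)"
).fetchone()[0]

print(count_distinct, distinct_rows)  # prints: 2 3
```

With two real call types plus NULL in the sample, the aggregate form reports 2 and the row form reports 3, which mirrors the 33 versus 34 seen on the real table.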
Also, I apparently can not read:
With only 30 something values, you could just sort and print all the distinct items to see where the difference is.
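Actually, I had thought of that myself. (Minus the sorting.) Except there was no visible difference: the output always listed 34 call types, whether I produced it with SQL or with DataFrame queries. I simply had not noticed that one of them was ominously named null: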
+--------------------------------------------+
|CallType |
+--------------------------------------------+
|Elevator / Escalator Rescue |
|Marine Fire |
|Aircraft Emergency |
|Confined Space / Structure Collapse |
|Administrative |
|Alarms |
|Odor (Strange / Unknown) |
|Lightning Strike (Investigation) |
|null |
|Citizen Assist / Service Call |
|HazMat |
|Watercraft in Distress |
|Explosion |
|Oil Spill |
|Vehicle Fire |
|Suspicious Package |
|Train / Rail Fire |
|Extrication / Entrapped (Machinery, Vehicle)|
|Other |
|Transfer |
|Outside Fire |
|Traffic Collision |
|Assist Police |
|Gas Leak (Natural and LP Gases) |
|Water Rescue |
|Electrical Hazard |
|High Angle Rescue |
|Structure Fire |
|Industrial Accidents |
|Medical Incident |
|Mutual Aid / Assist Outside Agency |
|Fuel Spill |
|Smoke Investigation (Outside) |
|Train / Rail Incident |
+--------------------------------------------+
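Rather than scanning the list by eye, the NULL bucket can be made explicit in the query. This is again a hedged sketch on made-up rows, with sqlite3 standing in for Spark SQL (an assumption); the statements use only count, max, and CASE WHEN, which Spark SQL also supports, but they were not run against the real table.

```python
import sqlite3

# Hypothetical sample rows; None becomes SQL NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fireServiceCalls (CallType TEXT)")
conn.executemany(
    "INSERT INTO fireServiceCalls VALUES (?)",
    [("Alarms",), ("Marine Fire",), ("Alarms",), (None,), (None,)],
)

# count(*) counts every row, count(CallType) skips NULLs,
# so the difference is the number of rows without a call type.
total_rows, non_null_rows = conn.execute(
    "SELECT count(*), count(CallType) FROM fireServiceCalls"
).fetchone()
null_rows = total_rows - non_null_rows

# NULL-inclusive distinct count in a single aggregate query:
# add 1 if at least one row has a NULL CallType.
inclusive_distinct = conn.execute(
    """
    SELECT count(DISTINCT CallType)
         + max(CASE WHEN CallType IS NULL THEN 1 ELSE 0 END)
    FROM fireServiceCalls
    """
).fetchone()[0]

print(null_rows, inclusive_distinct)  # prints: 2 3
```

A non-zero null_rows is the hint that count(DISTINCT ...) and a distinct row count will disagree by one.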