Can Spark SQL not count correctly or can I not write SQL correctly?

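In a Python notebook on Databricks "Community Edition", I'm experimenting with the City of San Francisco's open data on emergency calls to 911 requesting firefighters. (It is an old 2016 copy of the data used in "Using Apache Spark 2.0 to Analyze the City of San Francisco's Open Data" (YouTube), made available on S3 for that tutorial.)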

After mounting the data and reading it into a DataFrame fire_service_calls_df with an explicitly defined schema, I registered that DataFrame as a SQL table:

sqlContext.registerDataFrameAsTable(fire_service_calls_df, "fireServiceCalls")

With that and the DataFrame API, I can count the call types that occur:

fire_service_calls_df.select('CallType').distinct().count()
Out[n]: 34

... or with SQL in Python:

spark.sql("""
SELECT count(DISTINCT CallType)
FROM fireServiceCalls
""").show()
+------------------------+
|count(DISTINCT CallType)|
+------------------------+
|                      33|
+------------------------+

... or with an SQL cell:

%sql

SELECT count(DISTINCT CallType)
FROM fireServiceCalls

Why do I get two different counts? (It seems that 34 is the correct one, even though the talk in the video and the accompanying tutorial notebook mention "35".)

Answer

Can Spark SQL not count correctly or can I not write SQL correctly?

To answer the title: I can't write SQL correctly.

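Rule for writing SQL: think about NULL / UNDEFINED. count(DISTINCT CallType) ignores rows where CallType is NULL, whereas counting the rows of a SELECT DISTINCT subquery (like distinct().count() on the DataFrame) counts the null as one more distinct value: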

%sql
SELECT count(*)
FROM (
  SELECT DISTINCT CallType
  FROM fireServiceCalls 
)

34
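
The gap between 33 and 34 comes down to how NULL is counted. As a minimal sketch, the snippet below uses Python's built-in sqlite3 module as a stand-in for Spark SQL (an assumption, chosen only so the example runs without a cluster), with a made-up miniature of the fireServiceCalls table. SQLite follows the same standard SQL rule here: the aggregate skips NULL, while the subquery form counts the NULL row.

```python
import sqlite3

# Hypothetical miniature of the fireServiceCalls table; the rows are
# illustrative only and are not taken from the real data set.
rows = [("Alarms",), ("Alarms",), ("Medical Incident",),
        (None,), (None,), ("Other",)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fireServiceCalls (CallType TEXT)")
conn.executemany("INSERT INTO fireServiceCalls VALUES (?)", rows)

# Aggregate form: count(DISTINCT ...) ignores NULL values.
count_distinct = conn.execute(
    "SELECT count(DISTINCT CallType) FROM fireServiceCalls"
).fetchone()[0]

# Subquery form: DISTINCT keeps a single NULL row, and count(*) counts rows.
count_rows = conn.execute(
    "SELECT count(*) FROM (SELECT DISTINCT CallType FROM fireServiceCalls) AS t"
).fetchone()[0]

print(count_distinct, count_rows)  # prints: 3 4
```

The two results differ by exactly one, which mirrors the 33 versus 34 seen above.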

Also, I apparently can't read.

As suggested in a comment:

With only 30 something values, you could just sort and print all the distinct items to see where the difference is.

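Actually, I had thought of that myself. (Minus the sorting.) Except that there was no difference: the output always contained 34 call types, whether I generated it with SQL or with a DataFrame query. I simply had not noticed that one of them was ominously named null: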

+--------------------------------------------+
|CallType                                    |
+--------------------------------------------+
|Elevator / Escalator Rescue                 |
|Marine Fire                                 |
|Aircraft Emergency                          |
|Confined Space / Structure Collapse         |
|Administrative                              |
|Alarms                                      |
|Odor (Strange / Unknown)                    |
|Lightning Strike (Investigation)            |
|null                                        |
|Citizen Assist / Service Call               |
|HazMat                                      |
|Watercraft in Distress                      |
|Explosion                                   |
|Oil Spill                                   |
|Vehicle Fire                                |
|Suspicious Package                          |
|Train / Rail Fire                           |
|Extrication / Entrapped (Machinery, Vehicle)|
|Other                                       |
|Transfer                                    |
|Outside Fire                                |
|Traffic Collision                           |
|Assist Police                               |
|Gas Leak (Natural and LP Gases)             |
|Water Rescue                                |
|Electrical Hazard                           |
|High Angle Rescue                           |
|Structure Fire                              |
|Industrial Accidents                        |
|Medical Incident                            |
|Mutual Aid / Assist Outside Agency          |
|Fuel Spill                                  |
|Smoke Investigation (Outside)               |
|Train / Rail Incident                       |
+--------------------------------------------+
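
A null is easy to overlook in a listing like this, so it can help to count missing values explicitly and to give them a visible label. The following is only a sketch under the same assumptions as before: sqlite3 stands in for Spark SQL, the rows are made up, and the '<NULL>' label is an arbitrary placeholder.

```python
import sqlite3

# Hypothetical sample rows; not taken from the real data set.
rows = [("Other",), (None,), ("Alarms",), (None,), ("Alarms",)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fireServiceCalls (CallType TEXT)")
conn.executemany("INSERT INTO fireServiceCalls VALUES (?)", rows)

# Count the rows where the column is missing.
null_rows = conn.execute(
    "SELECT count(*) FROM fireServiceCalls WHERE CallType IS NULL"
).fetchone()[0]

# List the distinct values, replacing NULL with a label that stands out.
labels = [
    r[0]
    for r in conn.execute(
        "SELECT DISTINCT coalesce(CallType, '<NULL>') AS CallType "
        "FROM fireServiceCalls ORDER BY 1"
    )
]

print(null_rows)  # prints: 2
print(labels)     # prints: ['<NULL>', 'Alarms', 'Other']
```

In PySpark, the analogous check would presumably be a filter on the column's isNull() followed by count(), for example fire_service_calls_df.filter(fire_service_calls_df.CallType.isNull()).count().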