Modify SQLite query for CTE

This is my query:

    WITH desc_table(counter, hourly, current_weather_description, current_icons, time_stamp) AS (
Select count(*) AS counter, CASE WHEN  strftime('%M',  'now') < '30' 
                THEN strftime('%H', 'now')  
                ELSE strftime('%H', time_stamp, '+1 hours') END as hourly, 
                current_weather_description,
                current_icons,
                time_stamp
                From weather_events
                GROUP BY strftime('%H',  time_stamp, '+30 minutes'), current_weather_description
                UNION ALL
                Select count(*) as counter, hourly - 1, current_weather_description, current_icons, time_stamp
                From weather_events
                GROUP BY strftime('%H',  time_stamp, '+30 minutes'), current_weather_description
                Order By counter desc limit 1
                ),
        avg_temp_table(avg_temp, hour_seg, time_stamp) AS (
        select avg(current_temperatures) as avg_temp, CASE WHEN  strftime('%M',  time_stamp) < '30' 
                THEN strftime('%H', time_stamp)  
                ELSE strftime('%H', time_stamp, '+1 hours') END as hour_seg, 
                time_stamp
                from weather_events
                group by strftime('%H',  time_stamp, '+30 minutes')
                order by hour_seg desc
                )

                Select  hourly, current_weather_description
                from desc_table
                join avg_temp_table
                on desc_table.hourly=avg_temp_table.hour_seg

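Basically I have some weather data that I group into hourly intervals (offset by 30 minutes). For each interval I want to count how many times each weather description (and its matching icon) occurs, and select the description with the highest count in that interval (desc_table). Then I want the average temperature for the same interval (avg_temp_table) (maybe I need a subquery to compute this avg instead of the way I have it?) and to join the two queries on their hour columns.

I want my anchor to be based on the time the query is run (now) and to count the occurrences; the recursive member would then subtract one hour each time, move on to the next interval, count again, and so on.

Sample data, in the form {current_temperatures, current_weather_description, current_icons, time_stamp}; a regular data set would have more rows in each interval: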

"87"    "Rain"  "rainicon"  "2016-01-20 02:15:08"
"65"    "Snow"  "snowicon"  "2016-01-20 02:39:08"
"49"    "Rain"  "rainicon"  "2016-01-20 03:15:08"
"49"    "Rain"  "rainicon"  "2016-01-20 03:39:08"
"46"    "Clear" "clearicon" "2016-01-20 04:15:29"
"46"    "Clear" "clearicon" "2016-01-20 04:38:53"
"46"    "Cloudy" "cloudyicon" "2016-01-20 05:15:08"
"46"    "Clear" "clearicon" "2016-01-20 05:39:08"
"45"    "Clear" "clearicon" "2016-01-20 06:14:17"
"45"    "Clear" "clearicon" "2016-01-20 06:34:23"
"45"    "Clear" "clearicon" "2016-01-20 07:24:54"
"45"    "Rain"  "rainicon"  "2016-01-20 07:44:41"
"43"    "Rain"  "rainicon"  "2016-01-20 08:19:08"
"36"    "Clear" "clearicon" "2016-01-20 08:39:08"
"35"    "Meatballs" "meatballsicon" "2016-01-20 09:18:08"
"18"    "Cloudy" "cloudyicon" "2016-01-20 09:39:08"

The output is a join between the average temperature for each interval (avg_temp_table) and the output of the first aggregate CTE (desc_table), in the form {avg_temp, weather_description, current_icon}:

"87"    "Rain"  "rainicon"
"57"    "Rain"  "rainicon"
"47"    "Clear" "clearicon"
"46"    "Clear" "clearicon"
"46"    "Cloudy" "cloudyicon"
"45"    "Clear" "clearicon"
"44"    "Rain"  "rainicon"
"36"    "Clear" "clearicon"
"18"    "Cloudy" "cloudyicon"

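Right now I am getting a "no such column" error, because my anchor selects from my weather_events table and so does my recursive member. When I change the recursive member to select from desc_table, I get a "recursive aggregate queries not supported" error. But I don't want my recursive member to read from desc_table; I want to segment by hour, then step through each hourly interval and get the counts. I suspect I have the anchor wrong to begin with as well.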

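I'm still not sure how your desc_table recursive CTE was supposed to pick the most frequent weather description and its icon for each hour, but that doesn't matter, because going by your verbal description I think I've found a way to do the same thing without recursion.

First, group the rows by hour and description and count the rows in each group: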

SELECT
  strftime('%H', time_stamp, '+30 minutes') AS hour,
  current_weather_description,
  current_icons,
  COUNT(*) AS event_count
FROM
  weather_events
GROUP BY
  strftime('%H', time_stamp, '+30 minutes'),
  current_weather_description
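
To see which bucket a given timestamp lands in, here is a minimal sketch using Python's standard sqlite3 module. The first three timestamps come from the question's sample data; the last one is made up to show what happens near midnight. Note that '%H' keeps only the hour of day, so with data spanning several days, rows from different dates would fall into the same bucket unless the date is added to the grouping expression.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

samples = [
    "2016-01-20 02:15:08",  # from the question's sample data
    "2016-01-20 02:39:08",  # from the question's sample data
    "2016-01-20 03:15:08",  # from the question's sample data
    "2016-01-20 23:45:00",  # hypothetical, added to show the midnight wrap
]

# Shift each timestamp forward by 30 minutes, then keep only the hour.
buckets = {
    ts: conn.execute(
        "SELECT strftime('%H', ?, '+30 minutes')", (ts,)
    ).fetchone()[0]
    for ts in samples
}

for ts, hour in buckets.items():
    print(ts, "->", hour)
# 2016-01-20 02:15:08 -> 02
# 2016-01-20 02:39:08 -> 03
# 2016-01-20 03:15:08 -> 03
# 2016-01-20 23:45:00 -> 00
```

So the bucket labelled 03 covers 02:30:00 through 03:29:59, which matches the "offset by 30 minutes" intervals described in the question.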

As the next step, group the results of the above query by hour and get the maximum event count per hour:

SELECT
  hour,
  MAX(event_count) AS max_event_count
FROM
  (
    SELECT
      strftime('%H', time_stamp, '+30 minutes') AS hour,
      current_weather_description,
      current_icons,
      COUNT(*) AS event_count
    FROM
      weather_events
    GROUP BY
      strftime('%H', time_stamp, '+30 minutes'),
      current_weather_description
  ) AS s
GROUP BY
  hour

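That's still not quite what you want, because you actually want the description and icon that match the maximum count, not the count itself. Well, that's easy to fix: just add those columns to the SELECT without adding them to the GROUP BY: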

SELECT
  hour,
  current_weather_description,
  current_icons,
  MAX(event_count) AS max_event_count
FROM
  (
    SELECT
      strftime('%H', time_stamp, '+30 minutes') AS hour,
      current_weather_description,
      current_icons,
      COUNT(*) AS event_count
    FROM
      weather_events
    GROUP BY
      strftime('%H', time_stamp, '+30 minutes'),
      current_weather_description
  ) AS s
GROUP BY
  hour

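You still need to keep MAX(event_count) in the query for the trick to work. The reason it works is that in SQLite, when a SELECT statement contains a single MAX or a single MIN call, the values of any selected columns that are neither in the GROUP BY nor aggregated are taken from a row that matches said MAX or MIN value. This non-standard extension to SQL is documented in the release notes for SQLite 3.7.11.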
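
A minimal sketch of that behaviour, using Python's standard sqlite3 module. The counts table and its rows are invented for illustration; they stand in for the output of the inner query. It assumes the bundled SQLite is 3.7.11 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical pre-aggregated data: one row per (hour, description).
conn.execute(
    "CREATE TABLE counts (hour TEXT, description TEXT, event_count INTEGER)"
)
conn.executemany(
    "INSERT INTO counts VALUES (?, ?, ?)",
    [
        ("03", "Rain", 5),
        ("03", "Snow", 2),
        ("04", "Clear", 1),
        ("04", "Cloudy", 4),
    ],
)

# description is a "bare" column: neither grouped nor aggregated.
# With a single MAX() in the SELECT, SQLite takes it from the row
# that holds the maximum event_count within each hour.
rows = conn.execute(
    """
    SELECT hour, description, MAX(event_count) AS max_event_count
    FROM counts
    GROUP BY hour
    ORDER BY hour
    """
).fetchall()

print(rows)  # [('03', 'Rain', 5), ('04', 'Cloudy', 4)]
```

One thing to keep in mind: if two descriptions tie for the maximum within an hour, SQLite picks one of the tied rows arbitrarily. With only two rows per interval, as in the sample data, such ties are common.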

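So much for desc_table. As for the avg_temp_table CTE, there seems to be nothing wrong with your current approach, except that for consistency I would use the GROUP BY expression as the hour definition rather than the CASE expression you're using, and time_stamp looks redundant in the result. So the slightly modified CTE would look like this: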

SELECT
  strftime('%H', time_stamp, '+30 minutes') AS hour,
  AVG(current_temperatures) AS avg_temp
FROM
  weather_events
GROUP BY
  strftime('%H', time_stamp, '+30 minutes')

Now you just need to join the two sets on the hour column and select the relevant columns for the final output:

SELECT
  t.avg_temp,
  d.current_weather_description,
  d.current_icons
FROM
  avg_temp_table AS t
  INNER JOIN desc_table AS d on t.hour = d.hour
ORDER BY
  t.hour
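
Put together, the pieces above could be assembled as follows. This is a sketch rather than a drop-in solution: it runs the two CTEs and the join through Python's standard sqlite3 module against a handful of made-up rows (chosen so that no interval has a tie), and it assumes current_temperatures is stored as an integer and that the bundled SQLite supports CTEs (3.8.3 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE weather_events (
           current_temperatures INTEGER,
           current_weather_description TEXT,
           current_icons TEXT,
           time_stamp TEXT)"""
)
# Hypothetical rows, not the question's data: three per interval.
conn.executemany(
    "INSERT INTO weather_events VALUES (?, ?, ?, ?)",
    [
        (50, "Rain", "rainicon", "2016-01-20 02:40:00"),      # bucket 03
        (52, "Rain", "rainicon", "2016-01-20 03:10:00"),      # bucket 03
        (60, "Snow", "snowicon", "2016-01-20 03:20:00"),      # bucket 03
        (40, "Clear", "clearicon", "2016-01-20 03:40:00"),    # bucket 04
        (42, "Clear", "clearicon", "2016-01-20 04:10:00"),    # bucket 04
        (47, "Cloudy", "cloudyicon", "2016-01-20 04:25:00"),  # bucket 04
    ],
)

joined = conn.execute(
    """
    WITH desc_table AS (
        SELECT hour, current_weather_description, current_icons,
               MAX(event_count) AS max_event_count
        FROM (
            SELECT strftime('%H', time_stamp, '+30 minutes') AS hour,
                   current_weather_description,
                   current_icons,
                   COUNT(*) AS event_count
            FROM weather_events
            GROUP BY strftime('%H', time_stamp, '+30 minutes'),
                     current_weather_description
        ) AS s
        GROUP BY hour
    ),
    avg_temp_table AS (
        SELECT strftime('%H', time_stamp, '+30 minutes') AS hour,
               AVG(current_temperatures) AS avg_temp
        FROM weather_events
        GROUP BY strftime('%H', time_stamp, '+30 minutes')
    )
    SELECT t.avg_temp, d.current_weather_description, d.current_icons
    FROM avg_temp_table AS t
    INNER JOIN desc_table AS d ON t.hour = d.hour
    ORDER BY t.hour
    """
).fetchall()

print(joined)  # [(54.0, 'Rain', 'rainicon'), (43.0, 'Clear', 'clearicon')]
```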

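So there you are. Now there is just one issue I'd like to address with the resulting query, namely:

Is it possible to avoid the join?

While your approach (getting the descriptions and the average temperatures separately and then joining the two sets) is simple and makes perfect sense, it would be nice to avoid the join and do all the calculations in one go. That would likely make the query faster, since the source would be scanned only once. Can it be done?

As it happens, yes, it can. The main difficulty in combining the two parts is that the descriptions are obtained in two steps, whereas calculating the average temperature is a single-step operation. Simply putting AVG(current_temperatures) into the nested SELECT of the first CTE (grouped by hour and description) and then applying AVG to the results in the outer SELECT (grouped by hour) is not mathematically equivalent to doing a single AVG over the whole hour group.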

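Instead, you need to remember that AVG = SUM / COUNT. If you obtain SUM and COUNT in the first step, and then the SUM of the SUMs and the SUM of the COUNTs in the second step, you just need to divide the first outer SUM by the second outer SUM to get the average.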
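
A small worked example of why the two-level AVG goes wrong, written as plain Python arithmetic. The temperatures are made up: one hour interval containing two Rain readings and one Snow reading.

```python
# Hypothetical readings for a single hour interval, keyed by description.
groups = {"Rain": [50, 52], "Snow": [60]}

# Wrong: average the per-description averages. Each description gets
# equal weight no matter how many rows it has.
avg_of_avgs = sum(sum(v) / len(v) for v in groups.values()) / len(groups)

# Right: carry SUM and COUNT through the first step, then divide
# the sum of the sums by the sum of the counts.
sum_of_sums = sum(sum(v) for v in groups.values())
sum_of_counts = sum(len(v) for v in groups.values())
true_avg = sum_of_sums / sum_of_counts

print(avg_of_avgs)  # 55.5
print(true_avg)     # 54.0
```

The two only agree when every description in the interval has the same number of rows.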

Here is the new desc_table CTE, modified to combine both parts of the query (so it should no longer be a CTE but the complete query). The necessary changes are the avg_temp expression in the outer SELECT and the total_temp column in the inner SELECT:

SELECT
  SUM(total_temp) / SUM(event_count) AS avg_temp,
  current_weather_description,
  current_icons,
  MAX(event_count) AS max_event_count
FROM
  (
    SELECT
      strftime('%H', time_stamp, '+30 minutes') AS hour,
      current_weather_description,
      current_icons,
      COUNT(*) AS event_count,
      SUM(current_temperatures) AS total_temp
    FROM
      weather_events
    GROUP BY
      strftime('%H', time_stamp, '+30 minutes'),
      current_weather_description
  ) AS s
GROUP BY
  hour
ORDER BY
  hour
;
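
As a quick sanity check, the single-scan query can be run through Python's standard sqlite3 module. The rows below are made up (three per interval, no ties), and current_temperatures is assumed to be an integer column, so the division is integer division here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE weather_events (
           current_temperatures INTEGER,
           current_weather_description TEXT,
           current_icons TEXT,
           time_stamp TEXT)"""
)
# Hypothetical rows, not the question's data.
conn.executemany(
    "INSERT INTO weather_events VALUES (?, ?, ?, ?)",
    [
        (50, "Rain", "rainicon", "2016-01-20 02:40:00"),      # bucket 03
        (52, "Rain", "rainicon", "2016-01-20 03:10:00"),      # bucket 03
        (60, "Snow", "snowicon", "2016-01-20 03:20:00"),      # bucket 03
        (40, "Clear", "clearicon", "2016-01-20 03:40:00"),    # bucket 04
        (42, "Clear", "clearicon", "2016-01-20 04:10:00"),    # bucket 04
        (47, "Cloudy", "cloudyicon", "2016-01-20 04:25:00"),  # bucket 04
    ],
)

single_pass = conn.execute(
    """
    SELECT SUM(total_temp) / SUM(event_count) AS avg_temp,
           current_weather_description,
           current_icons,
           MAX(event_count) AS max_event_count
    FROM (
        SELECT strftime('%H', time_stamp, '+30 minutes') AS hour,
               current_weather_description,
               current_icons,
               COUNT(*) AS event_count,
               SUM(current_temperatures) AS total_temp
        FROM weather_events
        GROUP BY strftime('%H', time_stamp, '+30 minutes'),
                 current_weather_description
    ) AS s
    GROUP BY hour
    ORDER BY hour
    """
).fetchall()

print(single_pass)
# [(54, 'Rain', 'rainicon', 2), (43, 'Clear', 'clearicon', 2)]
```

The averages cover all three rows of each interval (162 / 3 and 129 / 3), not just the rows of the winning description, while the description and icon come from the row with the highest count.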

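Obviously, the max_event_count column is redundant in the output, yet it remains crucial to the greatest-N-per-group method the query relies on. Personally, I wouldn't worry about one redundant column in this situation, but if you have a good reason to exclude it from the result set, you can use the above query as a derived table (yes, again) and have the outermost SELECT pull every column except max_event_count, for instance like this: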

SELECT
  avg_temp,
  current_weather_description,
  current_icons
FROM
  (
    SELECT
      hour,
      SUM(total_temp) / SUM(event_count) AS avg_temp,
      current_weather_description,
      current_icons,
      MAX(event_count) AS max_event_count
    FROM
      (
        SELECT
          strftime('%H', time_stamp, '+30 minutes') AS hour,
          current_weather_description,
          current_icons,
          COUNT(*) AS event_count,
          SUM(current_temperatures) AS total_temp
        FROM
          weather_events
        GROUP BY
          strftime('%H', time_stamp, '+30 minutes'),
          current_weather_description
      ) AS s
    GROUP BY
      hour
  ) AS s
ORDER BY
  hour desc
;

As you can see, the middle-tier SELECT now includes hour as well, which the outermost ORDER BY needs. (I'm assuming here that the order matters to the calling application.)

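I only need to mention one difference between the results of the two methods. In the first, AVG(current_temperatures) gives you a floating-point result. In the second, SUM(total_temp) / SUM(event_count) gives you an integer (assuming the temperatures are stored as integers). Since your expected results show integer averages, I guess this shouldn't be a problem. However, if you later decide you want the averages to be more precise, keep in mind that you can replace the SUM function in either SUM(total_temp) or SUM(current_temperatures) with the TOTAL function, which returns the same value as SUM but always as a real. Dividing a real by an integer yields a real in SQLite, so with TOTAL you will get the same results as with AVG in the first method.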
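
A minimal sketch of the difference, using Python's standard sqlite3 module. The table t and its two integer readings are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical single-column table holding two integer temperatures.
conn.execute("CREATE TABLE t (temp INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(46,), (45,)])

int_avg, real_avg, builtin_avg = conn.execute(
    """
    SELECT SUM(temp) / COUNT(*),    -- integer / integer: truncates
           TOTAL(temp) / COUNT(*),  -- real / integer: keeps the fraction
           AVG(temp)                -- always a real
    FROM t
    """
).fetchone()

print(int_avg, real_avg, builtin_avg)  # 45 45.5 45.5
```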