MySQL Get Change From Cumulative Results in Consecutive Rows by Identifier
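I am running MySQL Community Server version 8.0.19.
I have been struggling with the following problem while working with publicly available COVID-19 data. The dataset I am using is reliable and of good quality, but the figure (total_confirmed) is reported as a cumulative total rather than as a daily infection count: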
+----------------+---------------------+-----------------+
| country_region | date | total_confirmed |
+----------------+---------------------+-----------------+
| Afghanistan | 2020-04-05 00:00:00 | 349 |
| Afghanistan | 2020-04-06 00:00:00 | 367 |
| Afghanistan | 2020-04-07 00:00:00 | 423 |
| Albania | 2020-04-05 00:00:00 | 361 |
| Albania | 2020-04-06 00:00:00 | 377 |
| Albania | 2020-04-07 00:00:00 | 383 |
| Algeria | 2020-04-05 00:00:00 | 1320 |
| Algeria | 2020-04-06 00:00:00 | 1423 |
| Algeria | 2020-04-07 00:00:00 | 1468 |
+----------------+---------------------+-----------------+
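I need both the cumulative count and the daily new cases. There is a good solution that does this, and it works well on my dataset provided I look at only one country (in this example I used a table populated with Afghanistan data only):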
SET @prev := NULL;
SELECT country_region
,`date` AS DateCreated
,total_confirmed - coalesce(@prev, total_confirmed) AS new_cases
,(@prev := total_confirmed) AS total_confirmed
FROM (
SELECT * FROM so_confirmed ORDER BY `date`
) t1
GROUP BY
country_region, total_confirmed, `date`
ORDER BY country_region, DateCreated;
Output:
+----------------+---------------------+-----------+-----------------+
| country_region | DateCreated | new_cases | total_confirmed |
+----------------+---------------------+-----------+-----------------+
| Afghanistan | 2020-04-05 00:00:00 | 0 | 349 |
| Afghanistan | 2020-04-06 00:00:00 | 18 | 367 |
| Afghanistan | 2020-04-07 00:00:00 | 56 | 423 |
+----------------+---------------------+-----------+-----------------+
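However, the minute there is more than one country_region in the data, it fails completely, and I do not know SQL well enough to understand what I need to change.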
+----------------+---------------------+-----------+-----------------+
| country_region | DateCreated | new_cases | total_confirmed |
+----------------+---------------------+-----------+-----------------+
| Afghanistan | 2020-04-05 00:00:00 | 0 | 349 |
| Afghanistan | 2020-04-06 00:00:00 | -953 | 367 |
| Afghanistan | 2020-04-07 00:00:00 | -1000 | 423 |
| Albania | 2020-04-05 00:00:00 | 12 | 361 |
| Albania | 2020-04-06 00:00:00 | 10 | 377 |
| Albania | 2020-04-07 00:00:00 | -40 | 383 |
| Algeria | 2020-04-05 00:00:00 | 959 | 1320 |
| Algeria | 2020-04-06 00:00:00 | 1046 | 1423 |
| Algeria | 2020-04-07 00:00:00 | 1085 | 1468 |
+----------------+---------------------+-----------+-----------------+
Desired output:
+----------------+---------------------+-----------+-----------------+
| country_region | DateCreated | new_cases | total_confirmed |
+----------------+---------------------+-----------+-----------------+
| Afghanistan | 2020-04-05 00:00:00 | 0 | 349 |
| Afghanistan | 2020-04-06 00:00:00 | 18 | 367 |
| Afghanistan | 2020-04-07 00:00:00 | 56 | 423 |
| Albania | 2020-04-05 00:00:00 | 0 | 361 |
| Albania | 2020-04-06 00:00:00 | 16 | 377 |
| Albania | 2020-04-07 00:00:00 | 6 | 383 |
| Algeria | 2020-04-05 00:00:00 | 0 | 1320 |
| Algeria | 2020-04-06 00:00:00 | 103 | 1423 |
| Algeria | 2020-04-07 00:00:00 | 45 | 1468 |
+----------------+---------------------+-----------+-----------------+
Any help would be appreciated. Obviously, in a real-world dataset the new_cases value for 2020-04-05 would not be 0, but it is correct for this sample dataset.
If you are running MySQL 8.0, you can use the window function lag():
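select
    sc.*,
    -- lag() is null on the first row of each country, so coalesce() turns that into 0
    coalesce(
        -- the source column is named `date`; DateCreated is only an alias in the question's query
        total_confirmed - lag(total_confirmed) over(partition by country_region order by `date`),
        0
    ) new_cases
from so_confirmed sc;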
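To try this against the sample data from the question, a minimal setup could look like the sketch below. The column types and the primary key are assumptions (the question shows only the column names and values), and the final query simply shapes the output to match the desired column order.

```sql
-- Hypothetical table definition: the types and the key are assumed, not taken from the question.
create table so_confirmed (
    country_region  varchar(64) not null,
    `date`          datetime    not null,
    total_confirmed int         not null,
    primary key (country_region, `date`)
);

-- Sample rows copied from the question.
insert into so_confirmed (country_region, `date`, total_confirmed) values
    ('Afghanistan', '2020-04-05', 349),
    ('Afghanistan', '2020-04-06', 367),
    ('Afghanistan', '2020-04-07', 423),
    ('Albania',     '2020-04-05', 361),
    ('Albania',     '2020-04-06', 377),
    ('Albania',     '2020-04-07', 383),
    ('Algeria',     '2020-04-05', 1320),
    ('Algeria',     '2020-04-06', 1423),
    ('Algeria',     '2020-04-07', 1468);

-- Each country is its own partition, so the previous row is always
-- the previous date of the same country (requires MySQL 8.0+).
select
    country_region,
    `date` as DateCreated,
    coalesce(
        total_confirmed - lag(total_confirmed) over (partition by country_region order by `date`),
        0
    ) as new_cases,
    total_confirmed
from so_confirmed
order by country_region, DateCreated;
```

On these nine rows the query should return the table shown under "Desired output" in the question: 0, 18, 56 for Afghanistan; 0, 16, 6 for Albania; 0, 103, 45 for Algeria.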
You can use the three-argument form of lag():
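select sc.*,
       (total_confirmed -
        -- the third argument is the default used when there is no previous row,
        -- so the first row of each country yields 0; the source column is named `date`
        lag(total_confirmed, 1, total_confirmed) over (partition by country_region order by `date`)
       ) as new_cases
from so_confirmed sc;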
In older versions of MySQL, you can use a join, assuming there are no missing dates:
select sc.*,
       -- no matching previous-day row (first date of a country) gives null, mapped to 0
       coalesce(sc.total_confirmed - sc_prev.total_confirmed, 0) as new_cases
from so_confirmed sc left join
     so_confirmed sc_prev
     on sc_prev.country_region = sc.country_region and
        sc_prev.`date` = sc.`date` - interval 1 day;
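If a country might be missing some dates, the join above would treat the day after a gap as a first day and report 0. One possible workaround on older versions, offered as a sketch rather than a tested solution for your data, is a correlated subquery that picks the most recent earlier row for the same country. It assumes the table and column names from the question (so_confirmed, country_region, `date`, total_confirmed); the alias sp is only illustrative.

```sql
select sc.*,
       coalesce(
           sc.total_confirmed - (
               -- most recent earlier row for the same country, whatever the gap
               select sp.total_confirmed
               from so_confirmed sp
               where sp.country_region = sc.country_region
                 and sp.`date` < sc.`date`
               order by sp.`date` desc
               limit 1
           ),
           0  -- no earlier row for this country: report 0, as in the question
       ) as new_cases
from so_confirmed sc
order by sc.country_region, sc.`date`;
```

Note that after a gap, new_cases covers all the missing days at once rather than a single day. An index on (country_region, `date`) would likely help, since the subquery runs once per row.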