Regression analysis in MySQL
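Introduction
In my project I store Facebook pages and their like counts, as well as the like counts per country. I have one table for the Facebook pages, one for the languages, one for the relation between Facebook pages and languages (which also holds the like count), and one table that saves this data as a history. What I want to do is get the pages with the largest increase in likes over a given period.
Data to work with
I have removed the irrelevant information from the CREATE queries.
Table containing all Facebook pages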
CREATE TABLE `facebook_pages` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`facebook_id` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
`facebook_name` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
`facebook_likes` int(11) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
Sample data:
INSERT INTO `facebook_pages` (`id`, `facebook_id`, `facebook_name`, `facebook_likes`)
VALUES
(1, '552825254796051', 'Mesut Özil', 28593755),
(2, '134904013188254', 'Borussia Dortmund', 13213354),
(3, '310111039010406', 'Marco Reus', 12799627);
Table containing all languages
CREATE TABLE `languages` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`language` varchar(5) COLLATE utf8_unicode_ci NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
Sample data
INSERT INTO `languages` (`id`, `language`)
VALUES
(1, 'ID'),
(2, 'TR'),
(3, 'BR');
Table containing the relation between pages and languages
CREATE TABLE `language_page_likes` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`language_id` int(10) unsigned NOT NULL,
`facebook_page_id` int(10) unsigned NOT NULL,
`likes` int(11) DEFAULT NULL,
PRIMARY KEY (`id`)
-- Foreign key constraints omitted
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
Sample data
INSERT INTO `language_page_likes` (`id`, `language_id`, `facebook_page_id`)
VALUES
(1, 1, 1),
(2, 2, 1),
(3, 3, 1),
(47, 3, 2),
(51, 1, 2),
(53, 2, 2),
(92, 3, 3),
(95, 2, 3),
(97, 1, 3);
Table containing the history
CREATE TABLE `language_page_likes_history` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`language_page_likes_id` int(10) unsigned NOT NULL,
`likes` int(11) NOT NULL,
`created_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
PRIMARY KEY (`id`)
-- Foreign key constraints omitted
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
Sample data
INSERT INTO `language_page_likes_history` (`id`, `language_page_likes_id`, `likes`, `created_at`)
VALUES
(1, 1, 3272484, '2015-09-11 08:40:23'),
(132014, 1, 3272827, '2015-09-14 08:31:00'),
(2, 2, 1581361, '2015-09-11 08:40:23'),
(132015, 2, 1580392, '2015-09-14 08:31:00'),
(3, 3, 1467090, '2015-09-11 08:40:23'),
(132016, 3, 1467329, '2015-09-14 08:31:00'),
(47, 47, 828736, '2015-09-11 08:40:23'),
(132060, 47, 828971, '2015-09-14 08:31:00'),
(51, 51, 602747, '2015-09-11 08:40:23'),
(132064, 51, 603071, '2015-09-14 08:31:00'),
(53, 53, 545484, '2015-09-11 08:40:23'),
(132066, 53, 545092, '2015-09-14 08:31:00'),
(92, 92, 916570, '2015-09-11 08:40:24'),
(132105, 92, 917032, '2015-09-14 08:31:01'),
(95, 95, 537382, '2015-09-11 08:40:24'),
(132108, 95, 537395, '2015-09-14 08:31:01'),
(97, 97, 419175, '2015-09-11 08:40:24'),
(132110, 97, 419484, '2015-09-14 08:31:01');
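As you can see, I have data for September 11 and September 14. Now I want to get the page with the largest increase in likes. Previously I did this with a column called last_like_count, but the problem is that I cannot make that dynamic over a date range. With a "normal" regression function I can stay dynamic for any date range.
Finding a solution
What I have managed so far is to join all the existing relations: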
SELECT p.id, p.facebook_name, plh.likes, l.language FROM facebook_pages p
INNER JOIN language_page_likes pl ON pl.facebook_page_id = p.id
INNER JOIN language_page_likes_history plh ON plh.language_page_likes_id = pl.id
INNER JOIN languages l ON l.id = pl.language_id
WHERE pl.language_id = 5 OR pl.language_id = 46 OR pl.language_id = 68
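With that query I get every like count in the history for the given languages. But how would I run a regression analysis on that?
I already found this question:
Identifying trend with SQL query
But my math and MySQL skills are not good enough to translate that SQL to MySQL. Any help?
This is what I can come up with for now. I could not test this query properly, because I do not have time right now to create these table structures on one of the online SQL test pages. But I think that even if it does not work at first, it can point you in the right direction.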
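-- Day-over-day increase per language_page_likes_id; returns the largest one.
-- SUM(likes) assumes one snapshot per day; use MAX(likes) if there are several.
-- A row only appears when the previous calendar day also has data.
SELECT
    cur.id,
    cur.snapshot_date AS new_date,
    cur.likes_sum - prev.likes_sum AS increase
FROM (
    SELECT
        language_page_likes_id AS id,
        DATE(created_at) AS snapshot_date,
        SUM(likes) AS likes_sum
    FROM language_page_likes_history
    GROUP BY language_page_likes_id, DATE(created_at)
) cur
INNER JOIN (
    SELECT
        language_page_likes_id AS id,
        DATE(created_at) AS snapshot_date,
        SUM(likes) AS likes_sum
    FROM language_page_likes_history
    GROUP BY language_page_likes_id, DATE(created_at)
) prev
    ON prev.id = cur.id
   AND prev.snapshot_date = DATE_SUB(cur.snapshot_date, INTERVAL 1 DAY)
ORDER BY increase DESC
LIMIT 1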
I hope this helps. When I have time I will test and improve this query further.
This may be what you are looking for:
SELECT SUM((X-AVG_X)*(Y-AVG_Y)) / SUM((X-AVG_X)*(X-AVG_X)) AS Slope,
PageId, LanguageId
FROM
(
SELECT Q0.Y,
Q0.X,
Q1.AVG_Y,
Q1.AVG_X,
Q1.PageId,
Q1.LanguageId
FROM (SELECT T0.likes AS Y,
UNIX_TIMESTAMP(T0.created_at) AS X,
T1.facebook_page_id AS PageId,
T1.language_id AS LanguageId
FROM language_page_likes_history T0 INNER JOIN
language_page_likes T1 ON
(T0.language_page_likes_id = T1.id)
WHERE T0.created_at > '2015-09-11 00:00:00' AND
T0.created_at < '2015-09-15 00:00:00') Q0 INNER JOIN
(SELECT AVG(T2.likes) AS AVG_Y,
AVG(UNIX_TIMESTAMP(T2.created_at)) AS AVG_X,
T3.facebook_page_id AS PageId,
T3.language_id AS LanguageId
FROM language_page_likes_history T2 INNER JOIN
language_page_likes T3 ON
(T2.language_page_likes_id = T3.id)
WHERE T2.created_at > '2015-09-11 00:00:00' AND
T2.created_at < '2015-09-15 00:00:00'
GROUP BY T3.facebook_page_id, T3.language_id) Q1
ON (Q0.PageId = Q1.PageId) AND (Q0.LanguageId = Q1.LanguageId)
) Q2
GROUP BY PageId, LanguageId
ORDER BY Slope DESC
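It returns the slope of the linear regression for every page and language. The Slope column is in likes per second. In your sample data there are two cases where the number of likes decreased; I don't know why. The output should look like the table below. The SQL statement has been tested, and I checked the output of two rows by hand.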
| Slope | PageId | LanguageId |
|-----------------|--------|------------|
| 0.001786287345 | 3 | 3 |
| 0.001326183029 | 1 | 1 |
| 0.001252720995 | 2 | 1 |
| 0.001194724653 | 3 | 1 |
| 0.000924075055 | 1 | 3 |
| 0.000908609364 | 2 | 3 |
| 0.000050263497 | 3 | 2 |
| -0.001515637747 | 2 | 2 |
| -0.003746563717 | 1 | 2 |
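For reference, the query computes the ordinary least-squares slope, with X the Unix timestamp and Y the like count. The following is a hand check against the sample data, added here as an illustration rather than taken from the answer. It assumes the two history rows for language_page_likes_id = 1 (page 1, language 1) are the only ones in the period, and 258637 s is the time elapsed between 2015-09-11 08:40:23 and 2015-09-14 08:31:00.

```latex
% Least-squares slope over the samples (x_i, y_i) of one page/language pair
\[
\text{Slope} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}
\]
% With exactly two samples this reduces to the difference quotient
\[
\text{Slope} = \frac{y_2 - y_1}{x_2 - x_1}
= \frac{3272827 - 3272484}{258637\ \text{s}}
= \frac{343}{258637\ \text{s}}
\approx 0.001326\ \text{likes/s}
\]
```

This agrees with the row for PageId 1, LanguageId 1 in the table.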
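There may be a problem if the table has no data for the period, so an ISNULL check may need to be added. A page/language pair with only a single row in the period, for instance, has a zero denominator, and MySQL returns NULL for that division.
If you only want the absolute values, it is simpler. You can use the following statement: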
SELECT PageId, LanguageId,
(likes_last_in_period - likes_before_period) AS Likes
FROM
(SELECT T1.facebook_page_id AS PageId,
T1.language_id AS LanguageId,
(SELECT likes
FROM language_page_likes_history
WHERE created_at < '2015-09-12 00:00:00' AND
language_page_likes_id = T1.id
ORDER BY created_at DESC LIMIT 1) likes_before_period,
(SELECT likes
FROM language_page_likes_history
WHERE created_at >= '2015-09-12 00:00:00' AND
language_page_likes_id = T1.id
ORDER BY created_at ASC LIMIT 1) likes_first_in_period,
(SELECT likes
FROM language_page_likes_history
WHERE created_at <= '2015-09-15 00:00:00' AND
language_page_likes_id = T1.id
ORDER BY created_at DESC LIMIT 1) likes_last_in_period,
(SELECT likes
FROM language_page_likes_history
WHERE created_at > '2015-09-15 00:00:00' AND
language_page_likes_id = T1.id
ORDER BY created_at ASC LIMIT 1) likes_after_period
FROM language_page_likes T1) Q0
ORDER BY Likes DESC
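It contains four subqueries. Only two are needed, and you have to choose which ones. I chose to compute the difference between the like count before the period and the last like count within the period. The result looks like this: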
| PageId | LanguageId | Likes |
|--------|------------|-------|
| 3 | 3 | 462 |
| 1 | 1 | 343 |
| 2 | 1 | 324 |
| 3 | 1 | 309 |
| 1 | 3 | 239 |
| 2 | 3 | 235 |
| 3 | 2 | 13 |
| 2 | 2 | -392 |
| 1 | 2 | -969 |
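With only two samples per pair, this difference is simply the regression slope multiplied by the elapsed time, so both approaches rank the pairs in the same order. A quick check for the top row (page 3, language 3, language_page_likes_id = 92), assuming the sample data from the question and the same elapsed time of 258637 s:

```latex
\[
\text{Likes} = 917032 - 916570 = 462,
\qquad
\frac{462}{258637\ \text{s}} \approx 0.001786\ \text{likes/s}
\]
```

Once a period contains more than two snapshots, the two measures can diverge, because the slope weighs every sample while the difference only looks at the endpoints.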