Regression analysis in MySQL

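Introduction
In my project I am storing Facebook pages with their like counts, along with the like count per country. I have one table for the Facebook pages, one for the languages, one linking Facebook pages to languages (which also holds the like count), and one that stores this data as history. What I want to do is find the pages with the largest increase in likes over a given time period.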

The data

I have removed irrelevant information from the CREATE statements.

Table containing all Facebook pages

CREATE TABLE `facebook_pages` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `facebook_id` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `facebook_name` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `facebook_likes` int(11) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

Sample data:

INSERT INTO `facebook_pages` (`id`, `facebook_id`, `facebook_name`, `facebook_likes`)
VALUES
    (1, '552825254796051', 'Mesut Özil', 28593755),
    (2, '134904013188254', 'Borussia Dortmund', 13213354),
    (3, '310111039010406', 'Marco Reus', 12799627);

Table containing all languages

CREATE TABLE `languages` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `language` varchar(5) COLLATE utf8_unicode_ci NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

Sample data:

INSERT INTO `languages` (`id`, `language`)
VALUES
    (1, 'ID'),
    (2, 'TR'),
    (3, 'BR');

Table containing the relation between pages and languages

CREATE TABLE `language_page_likes` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `language_id` int(10) unsigned NOT NULL,
  `facebook_page_id` int(10) unsigned NOT NULL,
  `likes` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
  -- Foreign key constraints omitted
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

Sample data:

INSERT INTO `language_page_likes` (`id`, `language_id`, `facebook_page_id`)
VALUES
    (1, 1, 1),
    (2, 2, 1),
    (3, 3, 1),
    (47, 3, 2),
    (51, 1, 2),
    (53, 2, 2),
    (92, 3, 3),
    (95, 2, 3),
    (97, 1, 3);

Table containing the history

CREATE TABLE `language_page_likes_history` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `language_page_likes_id` int(10) unsigned NOT NULL,
  `likes` int(11) NOT NULL,
  `created_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  PRIMARY KEY (`id`)
  -- Foreign key constraints omitted
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

Sample data:

INSERT INTO `language_page_likes_history` (`id`, `language_page_likes_id`, `likes`, `created_at`)
VALUES
    (1, 1, 3272484, '2015-09-11 08:40:23'),
    (132014, 1, 3272827, '2015-09-14 08:31:00'),
    (2, 2, 1581361, '2015-09-11 08:40:23'),
    (132015, 2, 1580392, '2015-09-14 08:31:00'),
    (3, 3, 1467090, '2015-09-11 08:40:23'),
    (132016, 3, 1467329, '2015-09-14 08:31:00'),
    (47, 47, 828736, '2015-09-11 08:40:23'),
    (132060, 47, 828971, '2015-09-14 08:31:00'),
    (51, 51, 602747, '2015-09-11 08:40:23'),
    (132064, 51, 603071, '2015-09-14 08:31:00'),
    (53, 53, 545484, '2015-09-11 08:40:23'),
    (132066, 53, 545092, '2015-09-14 08:31:00'),
    (92, 92, 916570, '2015-09-11 08:40:24'),
    (132105, 92, 917032, '2015-09-14 08:31:01'),
    (95, 95, 537382, '2015-09-11 08:40:24'),
    (132108, 95, 537395, '2015-09-14 08:31:01'),
    (97, 97, 419175, '2015-09-11 08:40:24'),
    (132110, 97, 419484, '2015-09-14 08:31:01');

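As you can see, I have data for September 11 and September 14. Now I want to get the page whose likes increased the most. Previously I did this with a column named last_like_count, but the problem is that I cannot make that dynamic over a date range. With a "normal" regression function I could stay dynamic for any date range.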

Finding a solution
What I have managed so far is to join all the existing relations:

SELECT p.id, p.facebook_name, plh.likes, l.language FROM facebook_pages p
INNER JOIN language_page_likes pl ON pl.facebook_page_id = p.id
INNER JOIN language_page_likes_history plh ON plh.language_page_likes_id = pl.id
INNER JOIN languages l ON l.id = pl.language_id
WHERE pl.language_id = 5 OR pl.language_id = 46 OR pl.language_id = 68

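With that query I get every like count in the history for the given languages. But how would I run a regression analysis on that?

I already found this link here:

Identifying trend with SQL query

But my math and MySQL skills are not good enough to translate that SQL into MySQL. Any help?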

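This is what I could come up with for now. I could not test this query properly, because I don't have time right now to set up these table structures on one of the online SQL test sites. But even if it does not work right away, I think it can point you in the right direction.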

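-- For each language_page_likes row, compare each day's likes with the
-- previous day's and return the largest day-over-day increase.
select
    cur.id,
    cur.like_date new_date,
    cur.likes_sum - prev.likes_sum increase
from (
    -- likes per relation row and calendar day
    select
        language_page_likes_id id,
        date(created_at) like_date,
        sum(likes) likes_sum
    from
        language_page_likes_history
    group by
        language_page_likes_id,
        date(created_at)
) cur
inner join (
    -- same grouping again, used as "the day before"
    select
        language_page_likes_id id,
        date(created_at) like_date,
        sum(likes) likes_sum
    from
        language_page_likes_history
    group by
        language_page_likes_id,
        date(created_at)
) prev
    on prev.id = cur.id
    and prev.like_date = date_sub(cur.like_date, interval 1 day)
order by
    increase desc
limit 1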

Hope this helps. When I have time I will test and improve this query further.

This might be what you are looking for:

SELECT SUM((X-AVG_X)*(Y-AVG_Y)) / SUM((X-AVG_X)*(X-AVG_X)) AS Slope,
       PageId, LanguageId
FROM
(
SELECT Q0.Y, 
       Q0.X, 
       Q1.AVG_Y,
       Q1.AVG_X,
       Q1.PageId,
       Q1.LanguageId
FROM   (SELECT T0.likes AS Y,
               UNIX_TIMESTAMP(T0.created_at) AS X,
               T1.facebook_page_id AS PageId,
               T1.language_id AS LanguageId
        FROM   language_page_likes_history T0 INNER JOIN
               language_page_likes T1 ON 
               (T0.language_page_likes_id = T1.id)
        WHERE  T0.created_at > '2015-09-11 00:00:00' AND
               T0.created_at < '2015-09-15 00:00:00') Q0 INNER JOIN
       (SELECT AVG(T2.likes) AS AVG_Y,
               AVG(UNIX_TIMESTAMP(T2.created_at)) AS AVG_X,
               T3.facebook_page_id AS PageId,
               T3.language_id AS LanguageId
        FROM   language_page_likes_history T2 INNER JOIN
               language_page_likes T3 ON 
               (T2.language_page_likes_id = T3.id)
        WHERE  T2.created_at > '2015-09-11 00:00:00' AND
               T2.created_at < '2015-09-15 00:00:00'
        GROUP BY T3.facebook_page_id, T3.language_id) Q1
        ON (Q0.PageId = Q1.PageId) AND (Q0.LanguageId = Q1.LanguageId)
) Q2
GROUP BY PageId, LanguageId
ORDER BY Slope DESC

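It returns the slope of a linear regression for each page and language. The Slope column is in likes per second. In your sample data there are two cases where the number of likes decreased; I don't know why. The output should look like this. The SQL statement has been tested, and I checked the output of two rows by hand.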

|           Slope | PageId | LanguageId |
|-----------------|--------|------------|
|  0.001786287345 |      3 |          3 |
|  0.001326183029 |      1 |          1 |
|  0.001252720995 |      2 |          1 |
|  0.001194724653 |      3 |          1 |
|  0.000924075055 |      1 |          3 |
|  0.000908609364 |      2 |          3 |
|  0.000050263497 |      3 |          2 |
| -0.001515637747 |      2 |          2 |
| -0.003746563717 |      1 |          2 |
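
For reference, the Slope column is the ordinary least-squares slope. Written out for one page/language pair, with x_i standing for UNIX_TIMESTAMP(created_at) and y_i for likes (these symbols are introduced here for illustration and do not appear in the query):

```latex
\[
\text{Slope} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}
\]
% With only two samples per pair, as in the sample data, this reduces to
\[
\text{Slope} = \frac{y_2 - y_1}{x_2 - x_1}
\]
% Example: PageId 1, LanguageId 1 (the two samples are 258637 seconds apart)
\[
\frac{3272827 - 3272484}{258637} = \frac{343}{258637} \approx 0.001326
\]
```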

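If there is no data in the table for the period, there might be a problem, so a NULL check may have to be added. In MySQL that would be IFNULL or COALESCE; MySQL's ISNULL only tests for NULL and does not take a replacement value the way SQL Server's ISNULL does.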
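
To make the edge case concrete, here is a minimal Python sketch of the same slope calculation, outside the database. The function name `regression_slope` and the `(created_at, likes)` input shape are assumptions made for this illustration only. It returns None where the slope is undefined (no rows, or all rows at the same timestamp), which is roughly the situation a NULL check would have to cover.

```python
from datetime import datetime

def regression_slope(samples):
    """Least-squares slope of likes over time, in likes per second.

    samples: list of (created_at, likes) pairs for one page/language pair,
    with created_at formatted like '2015-09-11 08:40:23'.
    Returns None when the slope is undefined (hypothetical convention
    chosen here to mirror a NULL result).
    """
    if not samples:
        return None
    epoch = datetime(1970, 1, 1)
    # Seconds since the epoch; only differences matter for the slope.
    xs = [(datetime.strptime(ts, "%Y-%m-%d %H:%M:%S") - epoch).total_seconds()
          for ts, _ in samples]
    ys = [float(likes) for _, likes in samples]
    avg_x = sum(xs) / len(xs)
    avg_y = sum(ys) / len(ys)
    denominator = sum((x - avg_x) ** 2 for x in xs)
    if denominator == 0:
        # A single sample, or several samples at the same instant.
        return None
    numerator = sum((x - avg_x) * (y - avg_y) for x, y in zip(xs, ys))
    return numerator / denominator

# PageId 1, LanguageId 1 from the sample data: about 0.001326 likes per second.
slope = regression_slope([
    ("2015-09-11 08:40:23", 3272484),
    ("2015-09-14 08:31:00", 3272827),
])
print(slope)
```

With a single history row the function returns None instead of raising, so the caller can decide whether to treat that as zero growth or to skip the pair.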


If you only want to know the absolute values, it is even simpler. You can use the following statement:

SELECT PageId, LanguageId,
       (likes_last_in_period - likes_before_period) AS Likes
FROM
(SELECT T1.facebook_page_id AS PageId,
       T1.language_id AS LanguageId,
       (SELECT likes 
        FROM   language_page_likes_history
        WHERE  created_at < '2015-09-12 00:00:00' AND
               language_page_likes_id = T1.id
        ORDER BY created_at DESC LIMIT 1) likes_before_period,
       (SELECT likes 
        FROM   language_page_likes_history
        WHERE  created_at >= '2015-09-12 00:00:00' AND
               language_page_likes_id = T1.id
        ORDER BY created_at ASC LIMIT 1) likes_first_in_period,
       (SELECT likes 
        FROM   language_page_likes_history
        WHERE  created_at <= '2015-09-15 00:00:00' AND
               language_page_likes_id = T1.id
        ORDER BY created_at DESC LIMIT 1) likes_last_in_period,
       (SELECT likes 
        FROM   language_page_likes_history
        WHERE  created_at > '2015-09-15 00:00:00' AND
               language_page_likes_id = T1.id
        ORDER BY created_at ASC LIMIT 1) likes_after_period

        FROM   language_page_likes T1) Q0
ORDER BY Likes DESC

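There are 4 subqueries in it. Only two are needed; you have to choose. I chose to calculate the difference using the like count before the period and the last like count within the period. The result looks like this: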

| PageId | LanguageId | Likes |
|--------|------------|-------|
|      3 |          3 |   462 |
|      1 |          1 |   343 |
|      2 |          1 |   324 |
|      3 |          1 |   309 |
|      1 |          3 |   239 |
|      2 |          3 |   235 |
|      3 |          2 |    13 |
|      2 |          2 |  -392 |
|      1 |          2 |  -969 |
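
The choice of the two values can be sketched in Python as follows. This is only an illustration of the selection logic under the assumption that the history rows of one page/language pair are already loaded; the name `likes_gained` and the input shape are hypothetical. It follows the likes_before_period and likes_last_in_period subqueries: the last row strictly before the period start, and the last row at or before the period end.

```python
from datetime import datetime

FORMAT = "%Y-%m-%d %H:%M:%S"

def likes_gained(history, period_start, period_end):
    """Likes gained by one page/language pair over a period.

    history: list of (created_at, likes) pairs, in any order.
    Returns None if either reference value is missing, the way the SQL
    difference would be NULL.
    """
    start = datetime.strptime(period_start, FORMAT)
    end = datetime.strptime(period_end, FORMAT)
    rows = sorted((datetime.strptime(ts, FORMAT), likes) for ts, likes in history)
    # Like counts strictly before the period (likes_before_period uses the last one).
    before = [likes for ts, likes in rows if ts < start]
    # Like counts up to the period end (likes_last_in_period uses the last one).
    up_to_end = [likes for ts, likes in rows if ts <= end]
    if not before or not up_to_end:
        return None
    return up_to_end[-1] - before[-1]

# PageId 1, LanguageId 2 from the sample data.
history = [
    ("2015-09-11 08:40:23", 1581361),
    ("2015-09-14 08:31:00", 1580392),
]
print(likes_gained(history, "2015-09-12 00:00:00", "2015-09-15 00:00:00"))  # -969
```

Using the value before the period as the baseline means a pair needs at least one row older than the period start; otherwise the result is None.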