Fill Missing Financial Time Series Data From One Table to Another with SQL

This is part of an open-source Python project I am building on GitHub, which can be found here:

fxcmminer

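I have a lot of financial time series data from FXCM, and it is full of gaps. These gaps need to be filled with other data that is already in the database. I am stuck on this; can anyone help?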

The database and tables are created by a Python script, which can be found here.

Below is the relevant snippet.

CREATE DATABASE IF NOT EXISTS fxcm_bar_GBPUSD;
CREATE TABLE IF NOT EXISTS fxcm_bar_GBPUSD.tbl_GBPUSD_m1 (
  `date` DATETIME NOT NULL,
  `bidopen` DECIMAL(19,6) NULL,
  `bidhigh` DECIMAL(19,6) NULL,
  `bidlow` DECIMAL(19,6) NULL,
  `bidclose` DECIMAL(19,6) NULL,
  `askopen` DECIMAL(19,6) NULL,
  `askhigh` DECIMAL(19,6) NULL,
  `asklow` DECIMAL(19,6) NULL,
  `askclose` DECIMAL(19,6) NULL,
  `volume` BIGINT NULL,
  PRIMARY KEY (`date`))
ENGINE=InnoDB;

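The following two queries cover the same one-hour window in the 1-minute and 5-minute tables respectively. You can see that the 1-minute table is missing many data points. Before I resort to 'predicting' values, the 5-minute table has some data points that can help fill the gaps.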

MariaDB [(none)]> select * from fxcm_bar_GBPUSD.tbl_GBPUSD_m1 where date >= "2002-3-31 17:00:00" and date <= "2002-3-31 18:00:00";
+---------------------+----------+----------+----------+----------+----------+----------+----------+----------+--------+
| date                | bidopen  | bidhigh  | bidlow   | bidclose | askopen  | askhigh  | asklow   | askclose | volume |
+---------------------+----------+----------+----------+----------+----------+----------+----------+----------+--------+
| 2002-03-31 17:01:00 | 1.425900 | 1.425900 | 1.425800 | 1.425800 | 1.426200 | 1.426200 | 1.426100 | 1.426100 |      0 |
| 2002-03-31 17:15:00 | 1.425800 | 1.425800 | 1.425700 | 1.425800 | 1.426100 | 1.426100 | 1.426000 | 1.426100 |      0 |
| 2002-03-31 17:17:00 | 1.425800 | 1.425800 | 1.425600 | 1.425600 | 1.426100 | 1.426100 | 1.425900 | 1.425900 |      0 |
| 2002-03-31 17:20:00 | 1.425600 | 1.425700 | 1.425500 | 1.425700 | 1.425900 | 1.426000 | 1.425800 | 1.426000 |      0 |
| 2002-03-31 17:22:00 | 1.425700 | 1.425800 | 1.425700 | 1.425800 | 1.426000 | 1.426100 | 1.426000 | 1.426100 |      0 |
| 2002-03-31 17:24:00 | 1.425800 | 1.425800 | 1.425600 | 1.425600 | 1.426100 | 1.426100 | 1.425900 | 1.425900 |      0 |
| 2002-03-31 17:29:00 | 1.425600 | 1.425800 | 1.425600 | 1.425800 | 1.425900 | 1.426100 | 1.425900 | 1.426100 |      0 |
| 2002-03-31 17:31:00 | 1.425800 | 1.425800 | 1.425600 | 1.425600 | 1.426100 | 1.426100 | 1.425900 | 1.425900 |      0 |
| 2002-03-31 17:48:00 | 1.425600 | 1.425600 | 1.425200 | 1.425200 | 1.425900 | 1.425900 | 1.425500 | 1.425500 |      0 |
+---------------------+----------+----------+----------+----------+----------+----------+----------+----------+--------+
9 rows in set (0.00 sec)

MariaDB [(none)]> select * from fxcm_bar_GBPUSD.tbl_GBPUSD_m5 where date >= "2002-3-31 17:00:00" and date <= "2002-3-31 18:00:00";
+---------------------+----------+----------+----------+----------+----------+----------+----------+----------+--------+
| date                | bidopen  | bidhigh  | bidlow   | bidclose | askopen  | askhigh  | asklow   | askclose | volume |
+---------------------+----------+----------+----------+----------+----------+----------+----------+----------+--------+
| 2002-03-31 17:00:00 | 1.425900 | 1.425900 | 1.425800 | 1.425800 | 1.426200 | 1.426200 | 1.426100 | 1.426100 |      0 |
| 2002-03-31 17:15:00 | 1.425800 | 1.425800 | 1.425600 | 1.425600 | 1.426100 | 1.426100 | 1.425900 | 1.425900 |      0 |
| 2002-03-31 17:25:00 | 1.425600 | 1.425800 | 1.425600 | 1.425800 | 1.425900 | 1.426100 | 1.425900 | 1.426100 |      0 |
| 2002-03-31 17:30:00 | 1.425800 | 1.425800 | 1.425600 | 1.425600 | 1.426100 | 1.426100 | 1.425900 | 1.425900 |      0 |
| 2002-03-31 17:45:00 | 1.425600 | 1.425600 | 1.425200 | 1.425200 | 1.425900 | 1.425900 | 1.425500 | 1.425500 |      0 |
| 2002-03-31 18:00:00 | 1.425200 | 1.425500 | 1.425200 | 1.425500 | 1.425500 | 1.425800 | 1.425500 | 1.425800 |      0 |
+---------------------+----------+----------+----------+----------+----------+----------+----------+----------+--------+
7 rows in set (0.01 sec)


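Likewise, the 1-minute table has data points that can fill gaps in the 5-minute table.

After the tables have exchanged values, they should look like this (the 1-minute table first, then the 5-minute table).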

+---------------------+----------+----------+----------+----------+----------+----------+----------+----------+--------+
| date                | bidopen  | bidhigh  | bidlow   | bidclose | askopen  | askhigh  | asklow   | askclose | volume |
+---------------------+----------+----------+----------+----------+----------+----------+----------+----------+--------+
| 2002-03-31 17:00:00 | 1.425900 | 1.425900 | 1.425800 | 1.425800 | 1.426200 | 1.426200 | 1.426100 | 1.426100 |      0 |
| 2002-03-31 17:01:00 | 1.425900 | 1.425900 | 1.425800 | 1.425800 | 1.426200 | 1.426200 | 1.426100 | 1.426100 |      0 |
| 2002-03-31 17:15:00 | 1.425800 | 1.425800 | 1.425700 | 1.425800 | 1.426100 | 1.426100 | 1.426000 | 1.426100 |      0 |
| 2002-03-31 17:17:00 | 1.425800 | 1.425800 | 1.425600 | 1.425600 | 1.426100 | 1.426100 | 1.425900 | 1.425900 |      0 |
| 2002-03-31 17:20:00 | 1.425600 | 1.425700 | 1.425500 | 1.425700 | 1.425900 | 1.426000 | 1.425800 | 1.426000 |      0 |
| 2002-03-31 17:22:00 | 1.425700 | 1.425800 | 1.425700 | 1.425800 | 1.426000 | 1.426100 | 1.426000 | 1.426100 |      0 |
| 2002-03-31 17:24:00 | 1.425800 | 1.425800 | 1.425600 | 1.425600 | 1.426100 | 1.426100 | 1.425900 | 1.425900 |      0 |
| 2002-03-31 17:25:00 | 1.425600 | 1.425800 | 1.425600 | 1.425800 | 1.425900 | 1.426100 | 1.425900 | 1.426100 |      0 |
| 2002-03-31 17:29:00 | 1.425600 | 1.425800 | 1.425600 | 1.425800 | 1.425900 | 1.426100 | 1.425900 | 1.426100 |      0 |
| 2002-03-31 17:30:00 | 1.425800 | 1.425800 | 1.425600 | 1.425600 | 1.426100 | 1.426100 | 1.425900 | 1.425900 |      0 |
| 2002-03-31 17:31:00 | 1.425800 | 1.425800 | 1.425600 | 1.425600 | 1.426100 | 1.426100 | 1.425900 | 1.425900 |      0 |
| 2002-03-31 17:45:00 | 1.425600 | 1.425600 | 1.425200 | 1.425200 | 1.425900 | 1.425900 | 1.425500 | 1.425500 |      0 |
| 2002-03-31 17:48:00 | 1.425600 | 1.425600 | 1.425200 | 1.425200 | 1.425900 | 1.425900 | 1.425500 | 1.425500 |      0 |
| 2002-03-31 18:00:00 | 1.425200 | 1.425500 | 1.425200 | 1.425500 | 1.425500 | 1.425800 | 1.425500 | 1.425800 |      0 |
+---------------------+----------+----------+----------+----------+----------+----------+----------+----------+--------+


+---------------------+----------+----------+----------+----------+----------+----------+----------+----------+--------+
| date                | bidopen  | bidhigh  | bidlow   | bidclose | askopen  | askhigh  | asklow   | askclose | volume |
+---------------------+----------+----------+----------+----------+----------+----------+----------+----------+--------+
| 2002-03-31 17:00:00 | 1.425900 | 1.425900 | 1.425800 | 1.425800 | 1.426200 | 1.426200 | 1.426100 | 1.426100 |      0 |
| 2002-03-31 17:15:00 | 1.425800 | 1.425800 | 1.425600 | 1.425600 | 1.426100 | 1.426100 | 1.425900 | 1.425900 |      0 |
| 2002-03-31 17:20:00 | 1.425600 | 1.425700 | 1.425500 | 1.425700 | 1.425900 | 1.426000 | 1.425800 | 1.426000 |      0 |
| 2002-03-31 17:25:00 | 1.425600 | 1.425800 | 1.425600 | 1.425800 | 1.425900 | 1.426100 | 1.425900 | 1.426100 |      0 |
| 2002-03-31 17:30:00 | 1.425800 | 1.425800 | 1.425600 | 1.425600 | 1.426100 | 1.426100 | 1.425900 | 1.425900 |      0 |
| 2002-03-31 17:45:00 | 1.425600 | 1.425600 | 1.425200 | 1.425200 | 1.425900 | 1.425900 | 1.425500 | 1.425500 |      0 |
| 2002-03-31 18:00:00 | 1.425200 | 1.425500 | 1.425200 | 1.425500 | 1.425500 | 1.425800 | 1.425500 | 1.425800 |      0 |
+---------------------+----------+----------+----------+----------+----------+----------+----------+----------+--------+

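There are still missing data points, but the data that has just been filled in is real data.

I will then do further interpolation outside the database using Python, which is not part of this question.

How can I make these two tables exchange and insert the missing rows without cross-contamination?

Thanks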

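I guess this is what you want. I could be wrong, though. It is hard to tell.

None of your bid* data is modified by any of these operations, so your problem seems equivalent to one with tables that have a date (here tp) to identify rows and some data, abstracted here as text (here t) for convenience.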

-- Example setup
CREATE TABLE minutes1 (tp datetime, t text, PRIMARY KEY (tp));
CREATE TABLE minutes5 (tp datetime, t text, PRIMARY KEY (tp));

-- keep common data for 00:00:00 as is
INSERT INTO minutes1 VALUES ('2017-01-01 00:00:00', 'a');
INSERT INTO minutes1 VALUES ('2017-01-01 00:01:00', 'b');
-- add this 00:05:00 row to minutes5 because it fits there and is missing
INSERT INTO minutes1 VALUES ('2017-01-01 00:05:00', 'c');
-- keep common data for 00:00:00 as is
INSERT INTO minutes5 VALUES ('2017-01-01 00:00:00', '1');
-- add this 00:10:00 row to minutes1 because it fits there and is missing
INSERT INTO minutes5 VALUES ('2017-01-01 00:10:00', '2');

table minutes1  
tp                    | t  
'2017-01-01 00:00:00' | 'a'
'2017-01-01 00:01:00' | 'b'
'2017-01-01 00:05:00' | 'c'

table minutes5
tp                    | t
'2017-01-01 00:00:00' | '1'
'2017-01-01 00:10:00' | '2'

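Solution strategy

We never change existing data in either table. We only insert what is missing, so no cross-contamination can occur:

  • If the data is in both tables, nothing happens.
  • If the data is in one table but not the other, it can be inserted safely.
  • If the data is in neither table, there is nothing to transfer anyway.
  • Always respect the granularity.
    • Inserting from 5-minute steps into 1-minute steps is always possible.
    • Inserting from 1-minute steps into 5-minute steps is only possible when the row's minute value is divisible by 5.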
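The rules above can be written down as a small predicate. This is only a sketch in Python (the language of your project); the function name may_insert and its parameters are made up for illustration, and it assumes the target step divides 60 evenly.

```python
from datetime import datetime


def may_insert(ts: datetime, target_step_minutes: int, target_timestamps: set) -> bool:
    """Return True if a row stamped `ts` may be copied into the target table.

    Hypothetical helper: `target_timestamps` is the set of timestamps the
    target table already holds, `target_step_minutes` is 1 or 5 here.
    """
    # Rule 1: existing rows are never overwritten.
    if ts in target_timestamps:
        return False
    # Rule 2: respect the granularity of the target table.
    # For a 1-minute target every whole minute qualifies (minute % 1 == 0).
    return ts.minute % target_step_minutes == 0 and ts.second == 0


existing_m5 = {datetime(2017, 1, 1, 0, 0)}
print(may_insert(datetime(2017, 1, 1, 0, 5), 5, existing_m5))  # on the 5-minute grid, missing
print(may_insert(datetime(2017, 1, 1, 0, 1), 5, existing_m5))  # off the 5-minute grid
print(may_insert(datetime(2017, 1, 1, 0, 0), 5, existing_m5))  # already present
```

The two SQL statements in this answer apply the same two checks inside the database, which is where you probably want them for large tables.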

From minutes5 into minutes1

Inserting data that is missing from minutes1 is always safe, because the granularity of minutes1 is finer than that of minutes5.

INSERT INTO minutes1
SELECT * FROM minutes5
WHERE tp NOT IN (SELECT tp FROM minutes1);
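If you want to try this step without touching your MariaDB instance, here is a minimal sketch that runs it against the example tables. It assumes an in-memory SQLite database as a stand-in for MariaDB and stores tp as text; the INSERT ... SELECT itself uses only syntax common to both.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for MariaDB, assumption for the demo
conn.executescript("""
    CREATE TABLE minutes1 (tp TEXT PRIMARY KEY, t TEXT);
    CREATE TABLE minutes5 (tp TEXT PRIMARY KEY, t TEXT);
    INSERT INTO minutes1 VALUES ('2017-01-01 00:00:00', 'a'),
                                ('2017-01-01 00:01:00', 'b'),
                                ('2017-01-01 00:05:00', 'c');
    INSERT INTO minutes5 VALUES ('2017-01-01 00:00:00', '1'),
                                ('2017-01-01 00:10:00', '2');
""")

# Copy only the rows whose timestamp is absent from minutes1.
# Rows that exist in both tables are skipped, so nothing is overwritten.
conn.execute("""
    INSERT INTO minutes1
    SELECT * FROM minutes5
    WHERE tp NOT IN (SELECT tp FROM minutes1)
""")
conn.commit()

rows_m1 = conn.execute("SELECT tp, t FROM minutes1 ORDER BY tp").fetchall()
for row in rows_m1:
    print(row)
```

The 00:00:00 row keeps its original value 'a', and only the 00:10:00 row is added.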

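From minutes1 into minutes5

We cannot insert a row stamped at, say, minute 2 into a table with 5-minute granularity. We use the same strategy as above, with an additional MINUTE(tp) % 5 = 0 condition in the WHERE clause to check the granularity.

INSERT INTO minutes5
SELECT * FROM minutes1
WHERE MINUTE(tp) % 5 = 0 AND tp NOT IN (SELECT tp FROM minutes5);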
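The same kind of sketch works for this direction. Again this assumes in-memory SQLite as a stand-in for MariaDB. SQLite has no MINUTE() function, so the granularity check is swapped for CAST(strftime('%M', tp) AS INTEGER); in MariaDB you would keep MINUTE().

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for MariaDB, assumption for the demo
conn.executescript("""
    CREATE TABLE minutes1 (tp TEXT PRIMARY KEY, t TEXT);
    CREATE TABLE minutes5 (tp TEXT PRIMARY KEY, t TEXT);
    INSERT INTO minutes1 VALUES ('2017-01-01 00:00:00', 'a'),
                                ('2017-01-01 00:01:00', 'b'),
                                ('2017-01-01 00:05:00', 'c');
    INSERT INTO minutes5 VALUES ('2017-01-01 00:00:00', '1'),
                                ('2017-01-01 00:10:00', '2');
""")

# Copy rows that sit on the 5-minute grid and are absent from minutes5.
# strftime('%M', tp) replaces MariaDB's MINUTE(tp) in this SQLite demo.
conn.execute("""
    INSERT INTO minutes5
    SELECT * FROM minutes1
    WHERE CAST(strftime('%M', tp) AS INTEGER) % 5 = 0
      AND tp NOT IN (SELECT tp FROM minutes5)
""")
conn.commit()

rows_m5 = conn.execute("SELECT tp, t FROM minutes5 ORDER BY tp").fetchall()
for row in rows_m5:
    print(row)
```

The 00:01:00 row is filtered out by the granularity check, and the 00:00:00 row is skipped because minutes5 already has it.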

Expected result

SELECT * FROM minutes1;  
SELECT * FROM minutes5;  

table minutes1  
tp                    | t  
'2017-01-01 00:00:00' | 'a'
'2017-01-01 00:01:00' | 'b'  -- not added to minutes5
'2017-01-01 00:05:00' | 'c'
'2017-01-01 00:10:00' | '2'  -- copied from minutes5

table minutes5
tp                    | t
'2017-01-01 00:00:00' | '1'
'2017-01-01 00:10:00' | '2'
'2017-01-01 00:05:00' | 'c'  -- copied from minutes1
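Because both statements only ever insert missing rows, running them again should change nothing. The sketch below checks that on the example data. The function name fill_gaps is hypothetical, and SQLite (with strftime in place of MINUTE) is again only a stand-in for MariaDB.

```python
import sqlite3


def fill_gaps(conn: sqlite3.Connection) -> int:
    """Run both insert-only statements in one transaction; return rows added."""
    before = conn.total_changes
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute("""
            INSERT INTO minutes1
            SELECT * FROM minutes5
            WHERE tp NOT IN (SELECT tp FROM minutes1)
        """)
        conn.execute("""
            INSERT INTO minutes5
            SELECT * FROM minutes1
            WHERE CAST(strftime('%M', tp) AS INTEGER) % 5 = 0
              AND tp NOT IN (SELECT tp FROM minutes5)
        """)
    return conn.total_changes - before


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE minutes1 (tp TEXT PRIMARY KEY, t TEXT);
    CREATE TABLE minutes5 (tp TEXT PRIMARY KEY, t TEXT);
    INSERT INTO minutes1 VALUES ('2017-01-01 00:00:00', 'a'),
                                ('2017-01-01 00:01:00', 'b'),
                                ('2017-01-01 00:05:00', 'c');
    INSERT INTO minutes5 VALUES ('2017-01-01 00:00:00', '1'),
                                ('2017-01-01 00:10:00', '2');
""")

first_run = fill_gaps(conn)   # copies 00:10:00 into minutes1 and 00:05:00 into minutes5
second_run = fill_gaps(conn)  # nothing left to copy
print(first_run, second_run)
```

The order of the two statements should not matter here: a row copied from minutes5 into minutes1 already exists in minutes5, so the second statement never copies it back, and the same holds the other way round.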

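Notes

You might consider adding a CHECK constraint to guarantee the integrity of the minutes5 table with MINUTE(tp) % 5 = 0. Consult your MariaDB manual for how to do this (MariaDB only enforces CHECK constraints from version 10.2.1 onward). It is probably something like this.

ALTER TABLE minutes5
ADD CONSTRAINT check_minutes5_is_multiple_of_5
CHECK (MINUTE(tp) % 5 = 0);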
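To see what such a constraint buys you, here is a rough demonstration. It assumes SQLite rather than MariaDB, so the constraint is declared in CREATE TABLE (SQLite cannot add one with ALTER TABLE) and strftime stands in for MINUTE(). The behaviour you want from MariaDB is the same: an off-grid row is rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for MariaDB, assumption for the demo
conn.execute("""
    CREATE TABLE minutes5 (
        tp TEXT PRIMARY KEY,
        t  TEXT,
        CONSTRAINT check_minutes5_is_multiple_of_5
            CHECK (CAST(strftime('%M', tp) AS INTEGER) % 5 = 0)
    )
""")

# A row on the 5-minute grid is accepted.
conn.execute("INSERT INTO minutes5 VALUES ('2017-01-01 00:05:00', 'c')")

# A row off the grid violates the CHECK constraint and is refused.
try:
    conn.execute("INSERT INTO minutes5 VALUES ('2017-01-01 00:01:00', 'b')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)
stored = conn.execute("SELECT tp, t FROM minutes5").fetchall()
```

With the constraint in place, a forgotten granularity check in an INSERT ... SELECT fails loudly instead of silently polluting the 5-minute table.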