How can I move rows from one table to another where they don't exist in a third?
I have three tables:
grade (grade_id, grade_value, grade_date) ~100M rows
grade_archive (grade_id, grade_value, grade_date) 0 rows
peer_review (grade_id, peer_review_value, peer_review_date) ~10M rows
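I want to move all rows from the grade table to grade_archive that are more than a month old and are not in the peer_review table.
The tables are in active use, so any inserts must be low priority to avoid disrupting existing and new processes while they run.
The expected row counts once finished should look like this: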
grade ~10M rows
grade_archive ~90M rows
peer_review ~10M rows
I think it is something close to:
INSERT
LOW_PRIORITY
INTO grade_archive
(grade_id,grade_value,grade_date)
SELECT
grade_id,grade_value,grade_date
FROM
grade
WHERE
grade_date < DATE_ADD(NOW(), INTERVAL -1 MONTH)
AND grade_id NOT IN
(
SELECT grade_id FROM peer_review
);
Then clean up the grade table by deleting all rows that are now in the archive table:
DELETE LOW_PRIORITY FROM grade WHERE grade_id IN (SELECT grade_id FROM grade_archive);
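But these subselects are very slow on large tables, and I'm nervous about the result. Looking for a better direction.
I have dealt with a similar problem in the past, migrating a portion of data from a large, active table to an archive table. The approach I used (modified for your use case) is as follows: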
/* Set time for calculation basis */
SET @calc_time = NOW();
/* Create empty copy of grade table */
CREATE TABLE grade_temp LIKE grade;
/* Add rows you want to save from grade into temp table */
INSERT INTO grade_temp
SELECT
g.grade_id AS grade_id,
g.grade_value AS grade_value,
g.grade_date AS grade_date
FROM grade AS g
LEFT JOIN peer_review AS pr
ON g.grade_id = pr.grade_id
WHERE
/*
To keep the record it must either have an entry in peer review
or it is less than a month old
*/
pr.grade_id IS NOT NULL
OR g.grade_date >= DATE_SUB(@calc_time, INTERVAL 1 MONTH);
/*
Switch new temp table for active table.
This happens really fast (it is just file name switching on the system).
*/
RENAME TABLE grade TO grade_old, grade_temp TO grade;
/*
You are now taking new records into new version of grade table
and free to do your much slower operations against the grade_old table
*/
/* Delete more recent rows */
DELETE FROM grade_old
WHERE grade_date >= DATE_SUB(@calc_time, INTERVAL 1 MONTH);
/* Delete rows that exist in peer review */
DELETE FROM grade_old
WHERE grade_id IN (
SELECT grade_id
FROM peer_review
);
/*
As an alternative to the DELETE above (run one or the other, not both), you could delete across a join as shown below. Which is faster will likely depend on the number of rows the subquery above returns. Try both and see which works best.
*/
DELETE go FROM grade_old AS go
INNER JOIN peer_review AS pr
ON go.grade_id = pr.grade_id
WHERE go.grade_date < DATE_SUB(@calc_time, INTERVAL 1 MONTH);
/* Add all rows from grade_old to grade_archive */
INSERT INTO grade_archive
SELECT
grade_id,
grade_value,
grade_date
FROM grade_old;
/* Drop grade_old table */
DROP TABLE grade_old;
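The key here is to get a new version of the grade table, containing only the rows you want to keep, in place as quickly as possible, and then sort out the contents of the archive table after the fact. You do not want to run any bulk delete against a live table of that size. This minimizes the time your grade table is tied up by these archiving operations.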
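Because the last step drops a table, it may be worth checking the result before running the final DROP TABLE. The queries below are only a sketch of such a check. They assume grade_archive was empty beforehand, as stated in the question, and the column aliases are made up for illustration.

```sql
/* Row counts should match if grade_archive started out empty (assumption) */
SELECT
  (SELECT COUNT(*) FROM grade_old)     AS old_rows,
  (SELECT COUNT(*) FROM grade_archive) AS archived_rows;

/* Counts archived rows that have a peer review; expected to be 0
   unless peer reviews were added after the archive ran */
SELECT COUNT(*) AS archived_with_review
FROM grade_archive AS a
INNER JOIN peer_review AS pr
  ON pr.grade_id = a.grade_id;
```

If the counts differ, or the second query returns something other than 0, keep grade_old until the difference is explained.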
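I will say, though, that your database schema looks like it could be optimized for this kind of operation. For example, you could put a peer-review flag on the grade table and use it for faster filtering instead of filtering through a join. I actually question whether the peer_review table is needed at all, unless it has a many-to-one relationship to the grade table (your question doesn't say). If there is only one peer review entry per grade_id, I think those columns should be normalized into the grade table. That would simplify this maintenance process considerably.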
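To make the flag idea concrete, here is a rough sketch. The column name has_peer_review and the index name idx_grade_review_date are hypothetical and not part of the schema in the question. Note also that an ALTER TABLE on a table of this size is itself a heavy operation, and the application would have to keep the flag up to date whenever a peer review is added.

```sql
/* Hypothetical flag column and index; names are assumptions */
ALTER TABLE grade
  ADD COLUMN has_peer_review TINYINT(1) NOT NULL DEFAULT 0,
  ADD INDEX idx_grade_review_date (has_peer_review, grade_date);

/* One-time backfill from the existing peer_review table */
UPDATE grade AS g
INNER JOIN peer_review AS pr
  ON pr.grade_id = g.grade_id
SET g.has_peer_review = 1;

/* Archive candidates can then be found without a join */
SELECT grade_id, grade_value, grade_date
FROM grade
WHERE has_peer_review = 0
  AND grade_date < DATE_SUB(NOW(), INTERVAL 1 MONTH);
```

With the composite index, the filter on has_peer_review and grade_date can be answered from one index range instead of probing peer_review for every row.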
Since NOT IN ( SELECT ... ) can be very slow, use LEFT JOIN .. IS NULL to get the same effect:
SELECT g.grade_id, g.grade_value, g.grade_date
FROM grade AS g
LEFT JOIN peer_review AS p USING(grade_id)
WHERE g.grade_date < DATE_ADD(NOW(), INTERVAL -1 MONTH)
AND p.grade_id IS NULL ;
No explicit tmp table is needed.
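Plugged into the statements from the question, the anti-join might look like the sketch below. The @cutoff variable is a name introduced here for illustration, so that both statements use the same point in time. The sketch also assumes peer_review.grade_id and grade_archive.grade_id are indexed. The two forms are only equivalent when peer_review.grade_id is never NULL, because NOT IN returns no rows at all if the subquery yields a NULL.

```sql
/* Hypothetical variable: fix the cutoff once so both statements agree */
SET @cutoff = DATE_SUB(NOW(), INTERVAL 1 MONTH);

/* Copy rows older than the cutoff that have no peer_review row */
INSERT INTO grade_archive (grade_id, grade_value, grade_date)
SELECT g.grade_id, g.grade_value, g.grade_date
FROM grade AS g
LEFT JOIN peer_review AS p
  ON p.grade_id = g.grade_id
WHERE g.grade_date < @cutoff
  AND p.grade_id IS NULL;

/* Remove the copied rows from grade with a join instead of IN (SELECT ...) */
DELETE g
FROM grade AS g
INNER JOIN grade_archive AS a
  ON a.grade_id = g.grade_id;
```

LOW_PRIORITY is left out on purpose: it only affects storage engines that use table-level locking, such as MyISAM, so on InnoDB it would not reduce the impact on other sessions.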