How to find the bad data causing this insert to fail
I have a database with 80 million records (Postgres 9.3.5), and the insert query below fails:
ERROR: invalid input syntax for integer: ""
INSERT INTO DISCOGS.TRACK_DURATION
SELECT
    track_id,
    duration,
    hours_as_seconds + minutes_as_seconds + seconds as total_seconds
FROM (
    select
        track_id,
        duration,
        CASE
            WHEN duration like '%:%:%' THEN (split_part(duration, ':', 1))::bigint * 60 * 60
            ELSE 0
        END as hours_as_seconds,
        CASE
            WHEN duration like '%:%:%' THEN (split_part(duration, ':', 2))::bigint * 60
            WHEN duration like '%:%' THEN (split_part(duration, ':', 1))::bigint * 60
            ELSE 0
        END as minutes_as_seconds,
        CASE
            WHEN duration like '%:%:%' THEN (split_part(duration, ':', 3))::bigint
            WHEN duration like '%:%' THEN (split_part(duration, ':', 2))::bigint
            ELSE 0
        END as seconds
    from discogs.track t1
    where release_id < 10000000
    and t1.duration!='' and t1.duration is not null
    and t1.position!=''
) as s1
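I can use where release_id to limit the number of records checked, and with lower values the query works fine, so the problem is bad data. But with so many records, how do I find the offending rows? Note that I already filter out rows where duration is an empty string. I also found some records containing bad data (for example %%%%) and changed them, but the query still fails.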
I would use a regular expression to search for malformed durations, like this:
create table duration (
d varchar(20)
);
insert into duration (d) values ('12:34:56');
insert into duration (d) values ('34:56');
insert into duration (d) values ('15::'); -- bad one
insert into duration (d) values (':34:56'); -- bad one
insert into duration (d) values (':34:'); -- bad one
insert into duration (d) values ('12:34:'); -- bad one
insert into duration (d) values ('34:'); -- bad one
insert into duration (d) values (':56'); -- bad one
select *
from duration
where d not similar to '([0-9]+:)?[0-9]+:[0-9]+'
Result:
d
------
15::
:34:56
:34:
12:34:
34:
:56
In your case, the query should look like this:
select track_id, duration
from discogs.track
where duration not similar to '([0-9]+:)?[0-9]+:[0-9]+';
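As a follow-up, here is a minimal sketch of one way to make the conversion itself tolerant of bad rows, instead of only locating them. The error comes from casting an empty string: for a value like '12:34:', split_part(d, ':', 3) returns '' and ''::bigint raises "invalid input syntax for integer". The sketch guards each cast with a CASE that first checks the value using the POSIX regex operator ~, so malformed rows produce NULL rather than aborting the statement. The table name duration_sample is hypothetical and is only there to keep the example self-contained; treat this as an illustration under those assumptions, not a drop-in replacement for the original insert.

```sql
-- Sketch only: duration_sample is a hypothetical table that mirrors
-- the sample table used in the answer above.
create temp table duration_sample (
    d varchar(20)
);
insert into duration_sample (d) values
    ('12:34:56'),  -- well-formed h:m:s
    ('34:56'),     -- well-formed m:s
    ('12:34:'),    -- bad one: empty seconds part
    (':56');       -- bad one: empty minutes part

-- Each WHEN checks the full shape of the value before any cast runs,
-- so rows that match neither pattern fall through to NULL.
select
    d,
    case
        when d ~ '^[0-9]+:[0-9]+:[0-9]+$' then
              split_part(d, ':', 1)::bigint * 3600
            + split_part(d, ':', 2)::bigint * 60
            + split_part(d, ':', 3)::bigint
        when d ~ '^[0-9]+:[0-9]+$' then
              split_part(d, ':', 1)::bigint * 60
            + split_part(d, ':', 2)::bigint
    end as total_seconds
from duration_sample;
```

With these four sample values, the first two rows should come out as 45296 and 2096 seconds, and the last two as NULL. To apply the same idea to the real data, the assumption is that you would replace d with duration and select from discogs.track, then decide whether to keep the NULL rows or filter them out. Note that, unlike the original query, this sketch also returns NULL rather than 0 for values that contain no colon at all.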