MySQL load ignores some records
I have a CSV file with about 16,916 records. When I load it into MySQL, it detects only 15,945 records.
This is what MySQL reports:
Records: 15945 Deleted: 0 Skipped: 0 Warnings: 0
Can anyone tell me why MySQL ignores some records, and how I can fix this?
I load the file with a LOAD DATA statement like this:
LOAD DATA LOCAL INFILE 'germany-filtered.csv'
INTO TABLE point_of_interest
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(osm_id,lat,lng,access,addr_housename,addr_housenumber,addr_interpolation,admin_level,aerialway,aeroway,amenity,area,barrier,bicycle,brand,bridge,boundary,building,capital,construction,covered,culvert,cutting,denomination,disused,ele,embankment,foot,generator_source,harbour,highway,historic,horse,intermittent,junction,landuse,layer,leisure,ship_lock,man_made,military,motorcar,name,osm_natural,office,oneway,operator,place,poi,population,power,power_source,public_transport,railway,ref,religion,route,service,shop,sport,surface,toll,tourism,tower_type,tunnel,water,waterway,wetland,width,wood);
This is the database schema I use:
CREATE TABLE point_of_interest (
`poi_id` int(10) unsigned NOT NULL auto_increment,
`lat` DECIMAL(10, 8) default NULL,
`lng` DECIMAL(11, 8) default NULL,
PRIMARY KEY (`poi_id`),
KEY `lat` (`lat`),
KEY `lng` (`lng`),
osm_id BIGINT,
access TEXT,
addr_housename TEXT,
addr_housenumber TEXT,
addr_interpolation TEXT,
admin_level TEXT,
aerialway TEXT,
aeroway TEXT,
amenity TEXT,
area TEXT,
barrier TEXT,
bicycle TEXT,
brand TEXT,
bridge TEXT,
boundary TEXT,
building TEXT,
capital TEXT,
construction TEXT,
covered TEXT,
culvert TEXT,
cutting TEXT,
denomination TEXT,
disused TEXT,
ele TEXT,
embankment TEXT,
foot TEXT,
generator_source TEXT,
harbour TEXT,
highway TEXT,
historic TEXT,
horse TEXT,
intermittent TEXT,
junction TEXT,
landuse TEXT,
layer TEXT,
leisure TEXT,
ship_lock TEXT,
man_made TEXT,
military TEXT,
motorcar TEXT,
name TEXT,
osm_natural TEXT,
office TEXT,
oneway TEXT,
operator TEXT,
place TEXT,
poi TEXT,
population TEXT,
power TEXT,
power_source TEXT,
public_transport TEXT,
railway TEXT,
ref TEXT,
religion TEXT,
route TEXT,
service TEXT,
shop TEXT,
sport TEXT,
surface TEXT,
toll TEXT,
tourism TEXT,
tower_type TEXT,
tunnel TEXT,
water TEXT,
waterway TEXT,
wetland TEXT,
width TEXT,
wood TEXT
) ENGINE=InnoDB;
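Update:
I have checked the first and the last record, and both are present. Records with many empty values, such as this one, are present as well: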
1503898236,10.5271308,52.7468051,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Update 2:
These are some of the records that are missing from the database:
4228380062,9.9386752,53.6135468,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Dammwild,,,,,,,,,,,,,,,,,,,,attraction,,,,,,,
4228278589,9.9391503,53.5960304,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Kaninchen,,,,,,,,,,,,,,,,,,,,attraction,,,,,,,
4228278483,9.9396935,53.5960729,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Onager,,,,,,,,,,,,,,,,,,,,attraction,,,,,,,
4226772791,8.8394263,54.1354887,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Familienlagune Perlebucht,,,,,,,,,,,,,,,,,,,,attraction,,,,,,,
It seems that almost all records whose osm_id starts with 4 are missing. Strange.
Check whether the file contains duplicate IDs:
Show the file
# cat mycsv.csv
6991,10.4232704,49.4970160,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Bauernhaus aus Seubersdorf,,,,,,,,,,,,,,,,,,,,attraction,,,,,,,
4228380062,9.9386752,53.6135468,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Dammwild,,,,,,,,,,,,,,,,,,,,attraction,,,,,,,
4228278589,9.9391503,53.5960304,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Kaninchen,,,,,,,,,,,,,,,,,,,,attraction,,,,,,,
4228278483,9.9396935,53.5960729,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Onager,,,,,,,,,,,,,,,,,,,,attraction,,,,,,,
4226772791,8.8394263,54.1354887,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Familienlagune Perlebucht,,,,,,,,,,,,,,,,,,,,attraction,,,,,,,
4228278589,9.9391503,53.5960304,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Kaninchen,,,,,,,,,,,,,,,,,,,,attraction,,,,,,,
Count the lines
# wc -l mycsv.csv
6 mycsv.csv
Remove duplicate IDs and count again
# cut -d',' -f1 mycsv.csv | sort | uniq | wc -l
5
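The two counts above only show that duplicates exist, not which IDs they are. A minimal sketch that lists them, assuming the ID is the first column and contains no commas or quotes (`duplicate_ids` is a hypothetical helper name):

```shell
# Print every first-column value that occurs more than once,
# followed by its number of occurrences.
duplicate_ids() {
    # $1 = path to the CSV file
    cut -d',' -f1 "$1" | sort | uniq -c | awk '$1 > 1 { print $2, $1 }'
}
```

For the sample file above, `duplicate_ids mycsv.csv` prints `4228278589 2`. If the CSV still has its header line, strip it first (for example with `tail -n +2`).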
I did not find the reason why MySQL ignores some records, so I looked for a workaround. Two approaches worked for me:
Split the CSV file into multiple parts
split -l 10 file.csv
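I found that if I split the CSV into multiple parts and load them into MySQL, every record is recognized. However, this only worked for me when the files were very small (~10 records per file), so this solution is not practical for me.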
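That small chunks load completely while the full file does not would be consistent with a few malformed lines affecting the lines after them. One possible cause, which nothing in this thread confirms, is an unbalanced double quote: with FIELDS ENCLOSED BY '"', a field that opens with a quote and is never closed can absorb the following lines into one record. A rough check, assuming one record per line (`odd_quote_lines` is a hypothetical helper name):

```shell
# Print the line number of every line that contains an odd number
# of double quotes. This is only a heuristic, not a CSV parser.
odd_quote_lines() {
    # $1 = path to the CSV file
    # gsub() replaces each quote with itself and returns the match count.
    awk '{ n = gsub(/"/, "&"); if (n % 2 == 1) print NR }' "$1"
}
```

If `odd_quote_lines germany-filtered.csv` prints nothing, this explanation does not apply to the file.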
Convert the CSV into MySQL INSERT statements
This part of my bash script converts the CSV file into an SQL file of INSERT INTO statements:
cp file.csv inserts.sql
# replace empty CSV value with NULL
sed -r 's;^,|,$;NULL,;g
:l
s;,,;,NULL,;g
t l' -i inserts.sql
# replace " with '
sed -e ':a' -e 'N' -e '$!ba' -e 's/\"/\x27/g' -i inserts.sql
# enquote every value
sed 's/[^,][^,]*/"&"/g' -i inserts.sql
# replace ,, with ,NULL,NULL,
sed 's/,,/,NULL,NULL,/g' -i inserts.sql
# replace ,, with ,
sed 's/,,/,/g' -i inserts.sql
# add INSERT INTO table_name VALUES (NULL, before each line
# Note: the first value is NULL because it is the auto-increment primary key, which the table assigns itself
sed 's/^/INSERT INTO table_name VALUES (NULL,/' -i inserts.sql
# add ); at the end of each line
sed 's/$/);/' -i inserts.sql
# replace ,); with );
sed 's/,);/);/g' -i inserts.sql
Note: I do not guarantee that this solution works for every CSV file, so check the generated SQL file before using it.
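The same conversion can also be sketched in a single awk pass, which may be easier to check than the chain of sed calls. This is an alternative sketch, not the script above, and it rests on several assumptions: the file has no header line, no field contains a comma or line break, the column order in the file matches the table, and the auto-increment key is the first column. `csv_to_inserts` is a hypothetical helper name.

```shell
# Convert a simple CSV file into INSERT statements.
# Empty fields become NULL; every other value is written as a
# single-quoted string with embedded single quotes doubled.
# Backslashes are NOT escaped, so check the output if the data has any.
csv_to_inserts() {
    # $1 = path to the CSV file, $2 = table name
    awk -F',' -v table="$2" -v q="'" '
    NF == 0 { next }                      # skip empty lines
    {
        # leading NULL lets MySQL assign the auto-increment key
        line = "INSERT INTO " table " VALUES (NULL"
        for (i = 1; i <= NF; i++) {
            v = $i
            if (v == "") {
                line = line ",NULL"
            } else {
                gsub(q, q q, v)           # double any single quote
                line = line "," q v q
            }
        }
        print line ");"
    }' "$1"
}
```

Usage would look like `csv_to_inserts file.csv table_name > inserts.sql`. Numeric values are quoted as strings here, and MySQL converts them on insert. As with the sed version, inspect the generated file before running it.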