MySQL inserts very slow into MEMORY table

I'm trying to optimize a large (68K-row) insert into a staging table. I created the table with the MEMORY engine, with no indexes or foreign keys at all. When my ETL process starts inserting, the inserts execute, but extremely slowly; a full load takes over an hour.

Here is the table definition showing how the table was created:

CREATE TABLE `pub_tair_germplasm` (
  `germplasm_id` int(12) DEFAULT NULL,
  `name` varchar(100) DEFAULT NULL,
  `original_name` varchar(100) DEFAULT NULL,
  `sp_growth_conditions` varchar(2048) DEFAULT NULL,
  `description` varchar(2048) DEFAULT NULL,
  `description_uc` varchar(2048) DEFAULT NULL,
  `is_mutant` varchar(1) DEFAULT NULL,
  `is_aneuploid` varchar(1) DEFAULT NULL,
  `ploidy` varchar(4) DEFAULT NULL,
  `species_variant_id` int(12) DEFAULT NULL,
  `taxon_id` int(12) DEFAULT NULL,
  `aneuploid_chromosome` int(10) DEFAULT NULL,
  `date_entered` date DEFAULT NULL,
  `date_last_modified` date DEFAULT NULL,
  `tair_object_id` bigint(19) DEFAULT NULL,
  `is_obsolete` varchar(1) DEFAULT NULL,
  `tair_object_type_id` int(12) DEFAULT NULL,
  `germplasm_type` varchar(20) DEFAULT NULL
) ENGINE=MEMORY DEFAULT CHARSET=latin1

The insert looks like this:

INSERT INTO pub_tair_germplasm(
   germplasm_id,
   name,
   original_name,
   sp_growth_conditions,
   description,
   description_uc,
   is_mutant,
   is_aneuploid,
   ploidy,
   species_variant_id,
   taxon_id,
   aneuploid_chromosome,
   date_entered,
   date_last_modified,
   tair_object_id,
   is_obsolete,
   tair_object_type_id,
   germplasm_type)
VALUES (
   $germplasm_id,
   $name,
   $original_name,
   $sp_growth_conditions,
   $description,
   $description_uc,
   CASE $is_mutant WHEN 'F' THEN 'n' WHEN 'T' THEN 'y' ELSE 'y' END,
   CASE $is_aneuploid WHEN 'F' THEN 'n' WHEN 'T' THEN 'y' ELSE 'y' END,
   $ploidy,
   $species_variant_id, 
   $taxon_id,
   $aneuploid_chromosome,
   $date_entered,
   $date_last_modified,
   $tair_object_id,
   $is_obsolete,
   $tair_object_type_id,
   $type)

This is done via Clover/ETL, which normally inserts very fast, using JDBC batching with a batch size of 5000. The value variables are CloverETL variable references. A similar insert on Oracle into a regular table takes just seconds. This is all done in a single transaction, with no commit until all rows are inserted (an application requirement).

While the insert is running, top shows both CPUs at about 0.3% utilization.

Edit:

For the next test run, I increased the max heap table size to 1GB, enough to hold the entire table:

mysql> select @@max_heap_table_size;
+-----------------------+
| @@max_heap_table_size |
+-----------------------+
|             999999488 |
+-----------------------+

Process list at the start:

mysql> SHOW FULL PROCESSLIST;
+----+------+-----------+-------+---------+------+-------+-----------------------+
| Id | User | Host      | db    | Command | Time | State | Info                  |
+----+------+-----------+-------+---------+------+-------+-----------------------+
|  3 | root | localhost | mysql | Query   |    0 | NULL  | SHOW FULL PROCESSLIST |
+----+------+-----------+-------+---------+------+-------+-----------------------+
1 row in set (0.00 sec)

Process list during the run:

mysql> SHOW FULL PROCESSLIST;
+----+---------+--------------------------------------------+-------+---------+------+-------+-----------------------+
| Id | User    | Host                                       | db    | Command | Time | State | Info                  |
+----+---------+--------------------------------------------+-------+---------+------+-------+-----------------------+
|  4 | pubuser | c-67-188-135-136.hsd1.ca.comcast.net:55928 | pub   | Sleep   |    0 |       | NULL                  |
|  5 | root    | localhost                                  | mysql | Query   |    0 | NULL  | SHOW FULL PROCESSLIST |
+----+---------+--------------------------------------------+-------+---------+------+-------+-----------------------+
2 rows in set (0.00 sec)

I enabled the general query log; it shows the CloverETL environment-setup commands being issued, and then it launches into the series of inserts:

150528 20:22:54     4 Connect   pubuser@c-67-188-135-136.hsd1.ca.comcast.net on pub
                    4 Query     /* mysql-connector-java-5.1.20 ( Revision: tonci.grgin@oracle.com-20111003110438-qfydx066wsbydkbw ) */SHOW VARIABLES WHERE Variable_name ='langua
ge' OR Variable_name = 'net_write_timeout' OR Variable_name = 'interactive_timeout' OR Variable_name = 'wait_timeout' OR Variable_name = 'character_set_client' OR Variable_name 
= 'character_set_connection' OR Variable_name = 'character_set' OR Variable_name = 'character_set_server' OR Variable_name = 'tx_isolation' OR Variable_name = 'transaction_isola
tion' OR Variable_name = 'character_set_results' OR Variable_name = 'timezone' OR Variable_name = 'time_zone' OR Variable_name = 'system_time_zone' OR Variable_name = 'lower_cas
e_table_names' OR Variable_name = 'max_allowed_packet' OR Variable_name = 'net_buffer_length' OR Variable_name = 'sql_mode' OR Variable_name = 'query_cache_type' OR Variable_nam
e = 'query_cache_size' OR Variable_name = 'init_connect'
                    4 Query     /* mysql-connector-java-5.1.20 ( Revision: tonci.grgin@oracle.com-20111003110438-qfydx066wsbydkbw ) */SELECT @@session.auto_increment_increment
                    4 Query     SHOW COLLATION
150528 20:22:55     4 Query     SET NAMES latin1
                    4 Query     SET character_set_results = NULL
                    4 Query     SET autocommit=1
                    4 Query     SET sql_mode='STRICT_TRANS_TABLES'
                    4 Query     SET autocommit=0
                    4 Query     SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED
150528 20:23:08     4 Query     INSERT INTO pub_tair_germplasm(
   germplasm_id,
   name,
   original_name,
   sp_growth_conditions,
   description,
   description_uc,
   is_mutant,
   is_aneuploid,
   ploidy,
   species_variant_id,
   taxon_id,
   aneuploid_chromosome,
   date_entered,
   date_last_modified,
   tair_object_id,
   is_obsolete,
   tair_object_type_id,
   germplasm_type)
VALUES (
   500689369,
   'CS2000002',
   'CS2000002',
   'none',
   'Sequence-indexed T-DNA insertion line; from the GABI-Kat project (German Plant Genomics Program - Koelner Arabidopsis T-DNA lines); generated using flanking sequence tags (F
STs) in the Columbia (Col-0) background; genomic DNA was isolated from T1 plants; plant sequences adjacent to T-DNA borders were amplified by adapter-ligation PCR; automated pur
ification and sequencing of PCR product were conducted followed by computational trimming of the resulting sequence files; for details, see the GABI-Kat web site: http://www.gab
i-kat.de; this is a T4 generation single-plant line potentially homozygous for the insertion. May be segregating for phenotypes that are not linked to the insertion; may have ad
ditional insertions potentially segregating.',
   'SEQUENCE-INDEXED T-DNA INSERTION LINE; FROM THE GABI-KAT PROJECT (GERMAN PLANT GENOMICS PROGRAM - KOELNER ARABIDOPSIS T-DNA LINES); GENERATED USING FLANKING SEQUENCE TAGS (F
STS) IN THE COLUMBIA (COL-0) BACKGROUND; GENOMIC DNA WAS ISOLATED FROM T1 PLANTS; PLANT SEQUENCES ADJACENT TO T-DNA BORDERS WERE AMPLIFIED BY ADAPTER-LIGATION PCR; AUTOMATED PUR
IFICATION AND SEQUENCING OF PCR PRODUCT WERE CONDUCTED FOLLOWED BY COMPUTATIONAL TRIMMING OF THE RESULTING SEQUENCE FILES; FOR DETAILS, SEE THE GABI-KAT WEB SITE: HTTP://WWW.GAB
I-KAT.DE; THIS IS A T4 GENERATION SINGLE-PLANT LINE POTENTIALLY HOMOZYGOUS FOR THE INSERTION. MAY BE SEGREGATING FOR PHENOTYPES THAT ARE NOT LINKED TO THE INSERTION; MAY HAVE AD
DITIONAL INSERTIONS POTENTIALLY SEGREGATING.',
   CASE null WHEN 'F' THEN 'n' WHEN 'T' THEN 'y' ELSE 'y' END,
   CASE 'F' WHEN 'F' THEN 'n' WHEN 'T' THEN 'y' ELSE 'y' END,
   '2',
   null, 
   1,
   null,
   '2015-01-06 10:49:21',
   '2015-01-06 10:40:55',
   6530679980,
   'F',
   200016,
   'individual_line')

The problem remains.

I suspect you're trying to exceed the maximum table size allowed for the MEMORY storage engine. What I don't understand is why MySQL isn't returning an error, or why Clover/ETL isn't returning an error or timing out.

I suggest you collect the output from

SHOW FULL PROCESSLIST

to see the state of the current sessions. Perhaps the INSERT statement is "hanging"?

You could also temporarily enable the general query log with

SET GLOBAL general_log = 1

and then attempt the load. That log file can grow very large very quickly, so make sure you set it back to 0 to disable it.

As I noted in a comment, I calculate the size of that table to be over 415MB. That's based on the MEMORY engine using fixed-length rows, with a row size of 6407 bytes (assuming a single-byte character set for the character columns), not including the per-row overhead for the null indicators.

Excerpt from the MySQL Reference Manual, Section 15.3, The MEMORY Storage Engine:

The maximum size of MEMORY tables is limited by the max_heap_table_size system variable, which has a default value of 16MB. To enforce different size limits for MEMORY tables, change the value of this variable. The value in effect for CREATE TABLE, or a subsequent ALTER TABLE or TRUNCATE TABLE, is the value used for the life of the table. A server restart also sets the maximum size of existing MEMORY tables to the global max_heap_table_size value. You can set the size for individual tables as described later in this section.

Reference: https://dev.mysql.com/doc/refman/5.6/en/memory-storage-engine.html
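Note that the size limit is baked in when the table is created, so the variable has to be raised before CREATE TABLE runs (or applied to an existing table via ALTER TABLE). A minimal sketch, assuming a 1GB cap is wanted:

```sql
-- Raise the cap for this session before creating the table; the value
-- in effect at CREATE TABLE time is used for the life of the table.
SET SESSION max_heap_table_size = 1024 * 1024 * 1024;

CREATE TABLE pub_tair_germplasm ( ... ) ENGINE=MEMORY;

-- To apply a newly raised limit to an already existing MEMORY table:
-- ALTER TABLE pub_tair_germplasm ENGINE=MEMORY;
```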


Also, note that the MEMORY storage engine does not support transactions.

If your requirement is that no other session sees the data while it's being loaded, you might consider using InnoDB or MyISAM and loading into a different table name, then renaming the table as soon as the load completes.
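That load-then-rename approach could be sketched like this (the _new/_old table names are illustrative):

```sql
-- Load into a shadow table that other sessions never query.
CREATE TABLE pub_tair_germplasm_new LIKE pub_tair_germplasm;

-- ... run all the INSERTs against pub_tair_germplasm_new ...

-- Swap the tables in one atomic operation; readers see either the
-- old contents or the complete new contents, never a partial load.
RENAME TABLE pub_tair_germplasm     TO pub_tair_germplasm_old,
             pub_tair_germplasm_new TO pub_tair_germplasm;
DROP TABLE pub_tair_germplasm_old;
```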

Or, use the InnoDB storage engine and one hughjass transaction: perform all of the INSERT operations as part of a single transaction. (I'm not a fan of that approach, though; I'd be afraid of generating a huge amount of rollback.)

Or, use the MyISAM storage engine. Take a lock on the table, perform all of the INSERT operations, then release the lock. (However, any other statement attempting to reference the table will "hang" waiting for the lock to be released.)

Well, I don't know what the specific problem was, but uploading a tab-delimited version of the data to the MySQL server and doing this:

LOAD DATA LOCAL INFILE '/tmp/pub_tair_grm_insert.csv' INTO TABLE pub_tair_germplasm;
Query OK, 68932 rows affected, 65535 warnings (1.26 sec)
Records: 68932  Deleted: 0  Skipped: 0  Warnings: 6

is clearly the answer, whatever the problem is. Clover/ETL must be doing something in its JDBC batching that slows the inserts down dramatically. I'll look into it if I get the chance, but for now LOAD gives me what I need.
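For anyone generating the tab-delimited file from a script rather than an existing export, here is a minimal sketch that writes rows in LOAD DATA INFILE's default text format (tab-separated fields, one row per line, \N for SQL NULL). The sample rows, column subset, and output path are illustrative, not taken from the real ETL stream:

```python
# Sketch: emit rows in LOAD DATA INFILE's default text format.
# Assumptions: default FIELDS/LINES clauses (tab separator, newline
# terminator) and \N as the NULL marker.

def render_row(values):
    """Render one row; Python None becomes the literal \\N NULL marker."""
    return "\t".join(r"\N" if v is None else str(v) for v in values)

rows = [
    (500689369, "CS2000002", "none"),
    (500689370, None, "long day"),
]

with open("/tmp/pub_tair_grm_insert.csv", "w") as f:
    for row in rows:
        f.write(render_row(row) + "\n")
```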

Broadly speaking, a batch size of 5000 statements is considered too large. The actual optimal JDBC batch size depends on multiple criteria, but a number between 50 and 100 should be fine. You can of course try larger numbers and look for one that works well for you; I would definitely play with the numbers and see how performance changes.
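Besides tuning the batch size, it may also be worth checking the MySQL Connector/J rewriteBatchedStatements connection property. By default the driver still sends each batched INSERT to the server one statement at a time; with this property enabled it rewrites a batch into multi-row INSERT packets, which often changes batch-insert throughput dramatically. A connection-URL sketch (host, port, and database name are placeholders):

```
jdbc:mysql://dbhost:3306/pub?rewriteBatchedStatements=true
```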

Regarding memory requirements (which is also one of the factors in play when choosing a batch size): based on the CREATE statement, the maximum size of one row is 6418 bytes. So with roughly 68k rows (each counted at that maximum size), the most memory this task could need is about 450MB. That, however, is less than half of the max_heap_table_size you defined, so if the table starts out empty there should be no problem. Moreover, if the MEMORY table's limit were reached, the CloverETL graph would fail with an appropriate SQLException (java.sql.SQLException: The table '<YOUR_MEMORY_TABLE>' is full), which does not seem to be the case.
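That arithmetic is easy to double-check; a quick sketch, assuming the 6418-byte fixed row size derived from the CREATE statement and the 68,932 rows reported by the LOAD DATA run above:

```python
ROW_BYTES = 6418      # max fixed-length row size implied by the CREATE statement
ROWS = 68932          # row count reported by LOAD DATA

total_bytes = ROW_BYTES * ROWS
max_heap_table_size = 999999488   # value from SELECT @@max_heap_table_size

print(f"table needs ~{total_bytes / 1024**2:.0f} MiB")
print("fits within max_heap_table_size:", total_bytes < max_heap_table_size)
```

So the full table lands around 420–450 MiB, comfortably under the 1GB cap shown earlier.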

Another option that comes to mind (again related to the note above) is that you don't actually have enough free memory to hold the data, so it's being swapped out to disk. That could explain the performance drop. However, if you ran out of all available virtual memory (RAM plus swap), you would most likely run into a different problem with different symptoms, so that doesn't appear to be the case either.

Finally, if none of the above helps, you could share with us the log of the CloverETL graph execution (run at DEBUG log level, if possible) so we can look for anything suspicious. Apart from that, the commercial version of CloverETL ships with MySQLDataWriter, which uses the MySQL native client and might avoid JDBC-related issues. I would start with finding a suitable batch size, though.