Sorting functionality optimization using MySQL and Java

Question

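I am going to generate a simple CSV file report in Java, using Hibernate and MySQL.

I am using the native SQL part of Hibernate to fetch the data (the query is too complex for HQL or Criteria queries, but that does not matter here) and simply writing the data out with any CSVWriter API (which also does not matter here).

Everything is fine so far, but this is where the problem starts.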

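Requirements:

The report size can be 5000K to 15000K records, with 25 fields.
It can be run in real time.
One of the report columns that I want to sort on (say finalValue) is computed like this: (sum(b.quantity*c.unit_gross_price) - COALESCE(sum(pai.value),0)).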

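Problem:

MySQL indexing cannot be used on the finalValue column (described above), because it is a complex combination of aggregate functions. As a result, the query takes 40 sec when executed with sorting (with or without a limit), and 0.075 sec otherwise.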

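Solutions: These are some solutions I could think of, but each has limitations.

Sorting with java.util.TreeSet: it throws an OutOfMemoryError, which is expected, because the heap space is exceeded once I put 15000K heavy objects into it.
Using limit in the MySQL query and writing to the file on each iteration: this takes a lot of time, because every query takes the same 50 sec, since limit cannot be used without sorting.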

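So the main problem here is to overcome two constraints: memory and time. I need to balance both.

Any ideas or suggestions?

Note: I have not given any code snippets here, but that does not mean the question lacks detail. Code is not needed here.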

Answer 1

I think you can use a streaming ResultSet here, as documented in the ResultSet section on this page.

Here are the main points from the documentation.

By default, ResultSets are completely retrieved and stored in memory. In most cases this is the most efficient way to operate and, due to the design of the MySQL network protocol, is easier to implement. If you are working with ResultSets that have a large number of rows or large values and cannot allocate heap space in your JVM for the memory required, you can tell the driver to stream the results back one row at a time.

To enable this functionality, create a Statement instance in the following manner:

stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
          java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);

The combination of a forward-only, read-only result set, with a fetch size of Integer.MIN_VALUE serves as a signal to the driver to stream result sets row-by-row. After this, any result sets created with the statement will be retrieved row-by-row.

There are some caveats with this approach. You must read all of the rows in the result set (or close it) before you can issue any other queries on the connection, or an exception will be thrown.

The earliest the locks these statements hold can be released (whether they be MyISAM table-level locks or row-level locks in some other storage engine such as InnoDB) is when the statement completes.

If using streaming results, process them as quickly as possible if you want to maintain concurrent access to the tables referenced by the statement producing the result set.

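So, use a streaming result set, write your ORDER BY query, and then start writing the results to your CSV file.

This may still not solve the sorting problem, but I think that if you cannot pregenerate that value and put an index on it, the sort will take some time.

However, there are some server configuration variables you could use to optimize sort performance.

From the MySQL ORDER BY Optimization page:

I think you can set the read_rnd_buffer_size value. According to the documentation: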
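As a rough illustration of the streaming approach, the sketch below applies the Statement settings quoted above and writes each row to a java.io.Writer as it arrives. The names StreamingCsvExport, escape, and export are made up for this example, and the hand-rolled CSV quoting only stands in for whatever CSVWriter API is actually used.

```java
import java.io.IOException;
import java.io.Writer;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper, not part of Hibernate or Connector/J.
final class StreamingCsvExport {

    // Quote a field only when it contains a comma, a quote, or a line break.
    static String escape(String field) {
        if (field == null) {
            return "";
        }
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        String escaped = field.replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    // Run `sql` (assumed to already contain the ORDER BY) and write one CSV
    // line per row to `out`. Returns the number of rows written.
    static long export(Connection conn, String sql, Writer out)
            throws SQLException, IOException {
        long rows = 0;
        try (Statement stmt = conn.createStatement(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            // Forward-only + read-only + Integer.MIN_VALUE is the signal
            // to the MySQL driver to stream rows instead of buffering them.
            stmt.setFetchSize(Integer.MIN_VALUE);
            try (ResultSet rs = stmt.executeQuery(sql)) {
                int columns = rs.getMetaData().getColumnCount();
                while (rs.next()) {
                    StringBuilder line = new StringBuilder();
                    for (int i = 1; i <= columns; i++) {
                        if (i > 1) {
                            line.append(',');
                        }
                        line.append(escape(rs.getString(i)));
                    }
                    out.write(line.append('\n').toString());
                    rows++;
                }
            }
        }
        return rows;
    }
}
```

With Hibernate, the underlying JDBC Connection can be reached through Session.doWork, so export could be called from inside that callback. This keeps the driver from buffering the whole result in the heap, but the cost of the ORDER BY on the server stays the same.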

Setting the variable to a large value can improve ORDER BY performance by a lot

Another is sort_buffer_size, about which the documentation says:

If you see many Sort_merge_passes per second in SHOW GLOBAL STATUS output, you can consider increasing the sort_buffer_size value to speed up ORDER BY or GROUP BY operations that cannot be improved with query optimization or improved indexing.

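Another variable that may help is innodb_buffer_pool_size. It allows InnoDB to keep as much table data in memory as possible and avoid some disk seeks.

However, all of these variables need tuning: some trial and error, and probably some kind of benchmarking, to get right.

There are some other suggestions on the MySQL ORDER BY Optimization page.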
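A minimal sketch of how the two sort-related variables could be tried per connection from Java before running the report query. SortTuning and its methods are hypothetical names, and the byte values passed in are placeholders to benchmark, not recommendations. innodb_buffer_pool_size is a global setting, so it is left to the server configuration here.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper for experimenting with session-level sort settings.
final class SortTuning {

    // Build a session-scoped SET statement; the value is given in bytes.
    static String setSessionSql(String variable, long bytes) {
        if (!variable.matches("[a-z_]+")) {
            throw new IllegalArgumentException("unexpected variable name: " + variable);
        }
        return "SET SESSION " + variable + " = " + bytes;
    }

    // Apply both buffers to the current connection only.
    static void tune(Connection conn, long sortBufferBytes, long readRndBufferBytes)
            throws SQLException {
        try (Statement stmt = conn.createStatement()) {
            stmt.execute(setSessionSql("sort_buffer_size", sortBufferBytes));
            stmt.execute(setSessionSql("read_rnd_buffer_size", readRndBufferBytes));
        }
    }

    // Read the Sort_merge_passes counter mentioned in the documentation,
    // so it can be compared before and after the report query.
    static long sortMergePasses(Connection conn) throws SQLException {
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SHOW GLOBAL STATUS LIKE 'Sort_merge_passes'")) {
            // Column 1 is the variable name, column 2 its value.
            return rs.next() ? rs.getLong(2) : 0L;
        }
    }
}
```

Comparing sortMergePasses before and after a run, for a few buffer sizes, is one way to do the trial and error described above.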

Answer 2

Use a temporary table to store the results of your select, with an index on finalValue. This stores your intermediate results and indexes them.

CREATE TEMPORARY TABLE my_temp_table (INDEX my_index_name (finalValue))
  SELECT ... -- your select

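Note that complex expressions need an alias in your SELECT in order to be used as part of CREATE TABLE ... SELECT. I assume your SELECT has the alias finalValue (the column you mentioned).

Then select from the temporary table, ordered by finalValue (the index will be used):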

SELECT * FROM my_temp_table ORDER BY finalValue;

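Finally, drop the temporary table (you can also reuse it if needed, but keep in mind that a temporary table is dropped automatically when the client session terminates).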
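From Java, the three steps might look like the sketch below. It assumes everything runs on one JDBC Connection, since a temporary table is only visible to the session that created it. TempTableReport, createSql, and RowHandler are names invented for this example.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper wrapping the create / select / drop sequence.
final class TempTableReport {

    // Callback that receives each row, for example to write a CSV line.
    interface RowHandler {
        void handle(ResultSet row) throws SQLException;
    }

    // Build CREATE TEMPORARY TABLE ... SELECT with an index on the sort column.
    // `select` must expose the sort expression under the alias `sortColumn`.
    static String createSql(String table, String sortColumn, String select) {
        return "CREATE TEMPORARY TABLE " + table
                + " (INDEX idx_" + sortColumn + " (" + sortColumn + ")) " + select;
    }

    static long run(Connection conn, String reportSelect, RowHandler handler)
            throws SQLException {
        long rows = 0;
        try (Statement stmt = conn.createStatement()) {
            // Step 1: materialize the unsorted report and index finalValue.
            stmt.execute(createSql("my_temp_table", "finalValue", reportSelect));
            try {
                // Step 2: read it back in sorted order.
                try (ResultSet rs = stmt.executeQuery(
                        "SELECT * FROM my_temp_table ORDER BY finalValue")) {
                    while (rs.next()) {
                        handler.handle(rs);
                        rows++;
                    }
                }
            } finally {
                // Step 3: drop the table once the result set is closed.
                stmt.execute("DROP TEMPORARY TABLE IF EXISTS my_temp_table");
            }
        }
        return rows;
    }
}
```

The DROP is issued only after the result set has been closed, which matters if the read is also switched to a streaming result set.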

Answer 3

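Summary tables. (Let's see more details to make sure this is data-warehouse-type data.) Summary tables are augmented periodically with subtotals and counts. Then, when a report is needed, the data is readily available almost directly from the summary table, instead of scanning a large amount of raw data and aggregating it.

My blog on Summary Tables. Let's see your schema and the report query; then we can discuss this in more detail.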
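To make the idea concrete, here is a small in-memory Java model of a summary table for the finalValue expression from the question. It is only an illustration: in MySQL these totals would live in a real table that is updated as rows arrive, and the grouping key, the class SummarySketch, and its method names are all assumptions, since the schema has not been posted.

```java
import java.math.BigDecimal;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a summary table keyed by report group.
final class SummarySketch {

    // One "summary row": running subtotals instead of the raw detail rows.
    private static final class Totals {
        BigDecimal gross = BigDecimal.ZERO;    // sum(quantity * unit_gross_price)
        BigDecimal payments = BigDecimal.ZERO; // sum(value), zero when none exist
    }

    private final Map<String, Totals> summary = new HashMap<>();

    // Called when a new order line arrives; only the subtotal is touched.
    void addLine(String key, int quantity, BigDecimal unitGrossPrice) {
        Totals t = summary.computeIfAbsent(key, k -> new Totals());
        t.gross = t.gross.add(unitGrossPrice.multiply(BigDecimal.valueOf(quantity)));
    }

    // Called when a new payment-like value arrives.
    void addPayment(String key, BigDecimal value) {
        Totals t = summary.computeIfAbsent(key, k -> new Totals());
        t.payments = t.payments.add(value);
    }

    // At report time the value is a subtraction, not a scan and aggregate.
    BigDecimal finalValue(String key) {
        Totals t = summary.get(key);
        return t == null ? BigDecimal.ZERO : t.gross.subtract(t.payments);
    }
}
```

If the summary table stored the resulting value in its own column, that column could presumably be indexed, which is what the original aggregate expression does not allow.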
