Indexing huge table with Hibernate Search

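I'm trying to add Hibernate Search to my project to improve search performance, but I'm running into a problem when indexing a huge table. I've added the Hibernate Search dependency, and I have a simple servlet from which I can trigger the indexing process: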

    FullTextEntityManager ftem = Search.getFullTextEntityManager(em);
    try {
        ftem
        .createIndexer(MyEntity.class)
        .batchSizeToLoadObjects(25)
        .cacheMode(CacheMode.NORMAL)
        .threadsToLoadObjects(5)
        .startAndWait();
    } catch (InterruptedException e) {
        e.printStackTrace();
    }

In my persistence.xml:

    <property name="hibernate.show_sql" value="false" />
    <property name="hibernate.dialect" value="org.hibernate.dialect.MySQL5InnoDBDialect" />
    <property name="hibernate.archive.autodetection" value="class" />
    <property name="hibernate.search.default.directory_provider" value="filesystem" />
    <property name="hibernate.search.default.indexBase" value="/var/lucene/indexes" />

The problem is that the MyEntity table has about 25 million rows, and after about 30 seconds I get an out-of-memory error:

2015-07-28 21:16:50,168 INFO  [stdout] (default task-60) Building index

2015-07-28 21:16:55,180 INFO  [org.hibernate.search.impl.SimpleIndexingProgressMonitor] (Hibernate Search: identifierloader-1) HSEARCH000027: Going to reindex 22593085 entities
2015-07-28 21:19:47,186 ERROR [org.jboss.as.controller.management-operation] (DeploymentScanner-threads - 2) WFLYCTL0013: Operation ("read-children-resources") failed - address: ([]): java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-07-28 21:19:58,506 WARN  [org.jboss.jca.core.connectionmanager.listener.TxConnectionListener] (Hibernate Search: identifierloader-1) IJ000305: Connection error occured: org.jboss.jca.core.connectionmanager.listener.TxConnectionListener@15a020a3[state=NORMAL managed connection=org.jboss.jca.adapters.jdbc.local.LocalManagedConnection@446189fe connection handles=1 lastReturned=1438110947536 lastValidated=1438108373971 lastCheckedOut=1438111010224 trackByTx=true pool=org.jboss.jca.core.connectionmanager.pool.strategy.OnePool@3fb3ab95 mcp=SemaphoreArrayListManagedConnectionPool@496e4f29[pool=MyProjectApiDS] xaResource=LocalXAResourceImpl@4f676ce7[connectionListener=15a020a3 connectionManager=798378ab warned=false currentXid=< formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffffc0a8010b:537a5b28:55b7cad0:167, node_name=1, branch_uid=0:ffffc0a8010b:537a5b28:55b7cad0:169, subordinatenodename=null, eis_name=java:/MyProjectApiDS > productName=MySQL productVersion=5.6.25-log jndiName=java:/MyProjectApiDS] txSync=null]: javax.resource.spi.ResourceAdapterInternalException: Unexpected error
    at org.jboss.jca.adapters.jdbc.BaseWrapperManagedConnection.broadcastConnectionError(BaseWrapperManagedConnection.java:699)
    at org.jboss.jca.adapters.jdbc.BaseWrapperManagedConnection.connectionError(BaseWrapperManagedConnection.java:665)
    at org.jboss.jca.adapters.jdbc.WrappedConnection.checkException(WrappedConnection.java:1669)
    at org.jboss.jca.adapters.jdbc.WrappedStatement.checkException(WrappedStatement.java:1267)
    at org.jboss.jca.adapters.jdbc.WrappedPreparedStatement.executeQuery(WrappedPreparedStatement.java:467)
    at org.hibernate.engine.jdbc.internal.ResultSetReturnImpl.extract(ResultSetReturnImpl.java:82)
    at org.hibernate.loader.Loader.getResultSet(Loader.java:2066)
    at org.hibernate.loader.Loader.executeQueryStatement(Loader.java:1863)
    at org.hibernate.loader.Loader.executeQueryStatement(Loader.java:1839)
    at org.hibernate.loader.Loader.scroll(Loader.java:2627)
    at org.hibernate.loader.criteria.CriteriaLoader.scroll(CriteriaLoader.java:121)
    at org.hibernate.internal.StatelessSessionImpl.scroll(StatelessSessionImpl.java:682)
    at org.hibernate.internal.CriteriaImpl.scroll(CriteriaImpl.java:394)
    at org.hibernate.search.batchindexing.impl.IdentifierProducer.loadAllIdentifiers(IdentifierProducer.java:146)
    at org.hibernate.search.batchindexing.impl.IdentifierProducer.inTransactionWrapper(IdentifierProducer.java:111)
    at org.hibernate.search.batchindexing.impl.IdentifierProducer.run(IdentifierProducer.java:95)
    at org.hibernate.search.batchindexing.impl.OptionallyWrapInJTATransaction.runWithErrorHandler(OptionallyWrapInJTATransaction.java:97)
    at org.hibernate.search.batchindexing.impl.ErrorHandledRunnable.run(ErrorHandledRunnable.java:49)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-07-28 21:19:58,514 ERROR [org.jboss.remoting.remote.connection] (XNIO-1 I/O-1) JBREM000200: Remote connection failed: java.io.IOException: Istniejące połączenie zostało gwałtownie zamknięte przez zdalnego hosta
2015-07-28 21:19:58,531 INFO  [org.jboss.as.server.deployment.scanner] (DeploymentScanner-threads - 2) WFLYDS0019: Deployment mysql-connector-java-5.1.34-bin.jar was previously deployed by this scanner but has been removed from the server deployment list by another management tool. Marker file C:\servers\wildfly-9.0.0.Final\standalone\deployments\mysql-connector-java-5.1.34-bin.jar.undeployed is being added to record this fact.
2015-07-28 21:19:58,620 WARN  [org.hibernate.engine.jdbc.spi.SqlExceptionHelper] (Hibernate Search: identifierloader-1) SQL Error: 0, SQLState: null
2015-07-28 21:19:58,621 ERROR [org.hibernate.engine.jdbc.spi.SqlExceptionHelper] (Hibernate Search: identifierloader-1) Error
2015-07-28 21:19:58,622 ERROR [org.hibernate.search.exception.impl.LogErrorHandler] (Hibernate Search: identifierloader-1) HSEARCH000058: HSEARCH000116: Unexpected error during MassIndexer operation: org.hibernate.exception.GenericJDBCException: could not extract ResultSet
    at org.hibernate.exception.internal.StandardSQLExceptionConverter.convert(StandardSQLExceptionConverter.java:54)
    at org.hibernate.engine.jdbc.spi.SqlExceptionHelper.convert(SqlExceptionHelper.java:126)
    at org.hibernate.engine.jdbc.spi.SqlExceptionHelper.convert(SqlExceptionHelper.java:112)
    at org.hibernate.engine.jdbc.internal.ResultSetReturnImpl.extract(ResultSetReturnImpl.java:91)
    at org.hibernate.loader.Loader.getResultSet(Loader.java:2066)
    at org.hibernate.loader.Loader.executeQueryStatement(Loader.java:1863)
    at org.hibernate.loader.Loader.executeQueryStatement(Loader.java:1839)
    at org.hibernate.loader.Loader.scroll(Loader.java:2627)
    at org.hibernate.loader.criteria.CriteriaLoader.scroll(CriteriaLoader.java:121)
    at org.hibernate.internal.StatelessSessionImpl.scroll(StatelessSessionImpl.java:682)
    at org.hibernate.internal.CriteriaImpl.scroll(CriteriaImpl.java:394)
    at org.hibernate.search.batchindexing.impl.IdentifierProducer.loadAllIdentifiers(IdentifierProducer.java:146)
    at org.hibernate.search.batchindexing.impl.IdentifierProducer.inTransactionWrapper(IdentifierProducer.java:111)
    at org.hibernate.search.batchindexing.impl.IdentifierProducer.run(IdentifierProducer.java:95)
    at org.hibernate.search.batchindexing.impl.OptionallyWrapInJTATransaction.runWithErrorHandler(OptionallyWrapInJTATransaction.java:97)
    at org.hibernate.search.batchindexing.impl.ErrorHandledRunnable.run(ErrorHandledRunnable.java:49)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.sql.SQLException: Error
    at org.jboss.jca.adapters.jdbc.WrappedConnection.checkException(WrappedConnection.java:1677)
    at org.jboss.jca.adapters.jdbc.WrappedStatement.checkException(WrappedStatement.java:1267)
    at org.jboss.jca.adapters.jdbc.WrappedPreparedStatement.executeQuery(WrappedPreparedStatement.java:467)
    at org.hibernate.engine.jdbc.internal.ResultSetReturnImpl.extract(ResultSetReturnImpl.java:82)
    ... 15 more
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-07-28 21:19:58,667 INFO  [org.hibernate.search.impl.SimpleIndexingProgressMonitor] (default task-60) HSEARCH000028: Reindexed 22593085 entities
2015-07-28 21:19:58,673 WARN  [com.arjuna.ats.jta] (Hibernate Search: identifierloader-1) ARJUNA016031: XAOnePhaseResource.rollback for < formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffffc0a8010b:537a5b28:55b7cad0:167, node_name=1, branch_uid=0:ffffc0a8010b:537a5b28:55b7cad0:169, subordinatenodename=null, eis_name=java:/MyProjectApiDS > failed with exception: org.jboss.jca.core.spi.transaction.local.LocalXAException: IJ001160: Could not rollback local transaction
    at org.jboss.jca.core.tx.jbossts.LocalXAResourceImpl.rollback(LocalXAResourceImpl.java:253)
    at com.arjuna.ats.internal.jta.resources.arjunacore.XAOnePhaseResource.rollback(XAOnePhaseResource.java:205)
    at com.arjuna.ats.internal.arjuna.abstractrecords.LastResourceRecord.topLevelAbort(LastResourceRecord.java:126)
    at com.arjuna.ats.arjuna.coordinator.BasicAction.doAbort(BasicAction.java:2993)
    at com.arjuna.ats.arjuna.coordinator.BasicAction.doAbort(BasicAction.java:2972)
    at com.arjuna.ats.arjuna.coordinator.BasicAction.Abort(BasicAction.java:1675)
    at com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator.cancel(TwoPhaseCoordinator.java:127)
    at com.arjuna.ats.arjuna.AtomicAction.abort(AtomicAction.java:186)
    at com.arjuna.ats.internal.jta.transaction.arjunacore.TransactionImple.rollbackAndDisassociate(TransactionImple.java:1282)
    at com.arjuna.ats.internal.jta.transaction.arjunacore.BaseTransaction.rollback(BaseTransaction.java:143)
    at com.arjuna.ats.jbossatx.BaseTransactionManagerDelegate.rollback(BaseTransactionManagerDelegate.java:114)
    at org.hibernate.search.batchindexing.impl.OptionallyWrapInJTATransaction.cleanUpOnError(OptionallyWrapInJTATransaction.java:123)
    at org.hibernate.search.batchindexing.impl.ErrorHandledRunnable.run(ErrorHandledRunnable.java:54)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.jboss.jca.core.spi.transaction.local.LocalResourceException: No operations allowed after connection closed.
    at org.jboss.jca.adapters.jdbc.local.LocalManagedConnection.rollback(LocalManagedConnection.java:139)
    at org.jboss.jca.core.tx.jbossts.LocalXAResourceImpl.rollback(LocalXAResourceImpl.java:248)
    ... 15 more
Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: No operations allowed after connection closed.
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at com.mysql.jdbc.Util.handleNewInstance(Util.java:377)
    at com.mysql.jdbc.Util.getInstance(Util.java:360)
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:956)
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:935)
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:924)
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:870)
    at com.mysql.jdbc.ConnectionImpl.throwConnectionClosedException(ConnectionImpl.java:1232)
    at com.mysql.jdbc.ConnectionImpl.checkClosed(ConnectionImpl.java:1225)
    at com.mysql.jdbc.ConnectionImpl.rollback(ConnectionImpl.java:4568)
    at org.jboss.jca.adapters.jdbc.local.LocalManagedConnection.rollback(LocalManagedConnection.java:132)
    ... 16 more

So the question is: how can I automatically index huge tables?

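Judging by the stack trace you posted, my guess is that the problem is the transaction timeout you are using. Increase the timeout in your configuration and try again. This is not recommended as a permanent fix, though, because a higher default timeout applies to every transaction in the application.

If increasing the timeout does help, you should still try another approach, such as ScrollMode or batch processing.

Take a look at this post; hopefully it helps.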

Which version of Hibernate Search are you using? If you are on the latest 5.4 release, you can configure a transaction timeout for the indexing only. Like this:

fullTextSession
 .createIndexer( User.class )
 .batchSizeToLoadObjects( 25 )
 .cacheMode( CacheMode.NORMAL )
 .threadsToLoadObjects( 12 )
 .idFetchSize( 150 )
 .transactionTimeout( 1800 )
 .startAndWait();

If you can, I recommend using the latest version.

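The developers of the MySQL JDBC driver made some poor decisions; you need to set the JDBC fetch size to Integer.MIN_VALUE to force the driver not to load the entire table into memory, and to actually page lazily as Hibernate asks it to.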

ftem
    .createIndexer(MyEntity.class)
    .batchSizeToLoadObjects(25)
    .cacheMode(CacheMode.NORMAL)
    .idFetchSize(Integer.MIN_VALUE) // Important on MySQL!
    .transactionTimeout(timeout) //also useful
    .threadsToLoadObjects(5)
    .startAndWait();
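
To show what that fetch size does underneath Hibernate, here is a minimal plain-JDBC sketch. It relies on the documented MySQL Connector/J behaviour that a forward-only, read-only statement with a fetch size of Integer.MIN_VALUE streams rows one at a time instead of buffering the whole result set. The class name, the method name and the SQL passed in are hypothetical and not part of the question's code.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.function.LongConsumer;

public class StreamingIdScan {

    /**
     * Passes the first column of every row to the consumer and returns the row count.
     * The query is assumed to select a numeric id as its first column.
     */
    public static long forEachId(Connection conn, String sql, LongConsumer consumer)
            throws SQLException {
        // Forward-only + read-only is required for Connector/J to stream.
        try (Statement st = conn.createStatement(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            // Signal to the MySQL driver: stream row by row, do not buffer everything.
            st.setFetchSize(Integer.MIN_VALUE);
            try (ResultSet rs = st.executeQuery(sql)) {
                long count = 0;
                while (rs.next()) {
                    consumer.accept(rs.getLong(1));
                    count++;
                }
                return count;
            }
        }
    }
}
```

A call could look like `StreamingIdScan.forEachId(conn, "SELECT id FROM MyEntity", id -> { /* handle id */ })`. While a streaming result set is open, the same connection cannot run other statements until all rows have been read or the result set is closed.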

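I have always had problems with startAndWait(). The latest versions are better, but there are still some issues. For very large data sets, setting the transaction timeout high is sometimes not feasible.

One approach you can try is to fetch the results in batches. It does not give the same integrity guarantees, so you may miss new records, but it will not time out.

The code looks something like this: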

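// Assumes utx is a javax.transaction.UserTransaction and em a JPA EntityManager.
int offset = 0;
int batchSize = 500;
boolean indexComplete = false;
while (!indexComplete) {
    // One short transaction per batch, so no single transaction runs into the timeout.
    utx.begin();
    FullTextEntityManager fullTextEntityManager = org.hibernate.search.jpa.Search.getFullTextEntityManager(em);
    // A stable ORDER BY keeps offset paging deterministic (assumes the id attribute is named "id").
    TypedQuery<User> query = fullTextEntityManager.createQuery("SELECT u FROM User u ORDER BY u.id", User.class);
    query.setFirstResult(offset);
    query.setMaxResults(batchSize);

    LOGGER.info("Indexing up to {} users from offset {}", batchSize, offset);
    List<User> results = query.getResultList();
    if (results == null || results.isEmpty()) {
        indexComplete = true;
    } else {
        offset += results.size();
        for (User user : results) {
            fullTextEntityManager.index(user);
        }
        // Apply the queued index work and detach the loaded entities to keep memory use flat.
        fullTextEntityManager.flushToIndexes();
        fullTextEntityManager.clear();
    }
    utx.commit();
}
LOGGER.info("Indexed {} objects", offset);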
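
A variation that is not in the code above: on a table this size, offset paging gets slower with every batch, because MySQL has to read and discard all the skipped rows each time. Paging by the last id seen ("keyset" paging) avoids that. The sketch below shows only the loop logic, with an in-memory TreeSet standing in for the table; the class and method names are hypothetical, and in a real application fetchBatch would run something like `SELECT u FROM User u WHERE u.id > :lastId ORDER BY u.id` with setMaxResults(batchSize).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.function.Consumer;

public class KeysetBatchDemo {

    /** Stand-in for a query returning up to batchSize ids greater than lastId, in ascending order. */
    public static List<Long> fetchBatch(TreeSet<Long> table, long lastId, int batchSize) {
        List<Long> batch = new ArrayList<>();
        for (Long id : table.tailSet(lastId, false)) {
            if (batch.size() == batchSize) {
                break;
            }
            batch.add(id);
        }
        return batch;
    }

    /** Visits every id in batches and returns how many ids were processed. */
    public static int processAll(TreeSet<Long> table, int batchSize, Consumer<Long> indexer) {
        long lastId = Long.MIN_VALUE; // assumes real ids are always greater than this
        int processed = 0;
        while (true) {
            List<Long> batch = fetchBatch(table, lastId, batchSize);
            if (batch.isEmpty()) {
                break; // nothing left after lastId
            }
            for (Long id : batch) {
                indexer.accept(id); // a real loop would index the entity here
            }
            processed += batch.size();
            lastId = batch.get(batch.size() - 1); // resume after the last id of this batch
        }
        return processed;
    }
}
```

Because each batch starts from the last id rather than a row count, rows deleted while the loop runs do not shift later batches, and rows inserted with a higher id are still picked up as long as the loop has not finished.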