Dataflow GZIP TextIO ZipException: too many length or distance symbols

When using the TextIO.Read transform on a large collection of compressed text files (1000+ files, each between 100MB and 1.5GB), we sometimes get the following error:

java.util.zip.ZipException: too many length or distance symbols
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
    at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:117)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at java.io.FilterInputStream.read(FilterInputStream.java:133)
    at java.io.PushbackInputStream.read(PushbackInputStream.java:186)
    at com.google.cloud.dataflow.sdk.runners.worker.TextReader$ScanState.readBytes(TextReader.java:261)
    at com.google.cloud.dataflow.sdk.runners.worker.TextReader$TextFileIterator.readElement(TextReader.java:189)
    at com.google.cloud.dataflow.sdk.runners.worker.FileBasedReader$FileBasedIterator.computeNextElement(FileBasedReader.java:265)
    at com.google.cloud.dataflow.sdk.runners.worker.FileBasedReader$FileBasedIterator.hasNext(FileBasedReader.java:165)
    at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:169)
    at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation.start(ReadOperation.java:118)
    at com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:66)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.executeWork(DataflowWorker.java:204)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.doWork(DataflowWorker.java:151)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.getAndPerformWork(DataflowWorker.java:118)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:139)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:124)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Searching online for the same ZipException turned up only this reply:

Zip file errors often happen when the hot deployer attempts to deploy an application before it is fully copied to the deploy directory. This is fairly common if it takes several seconds to copy the file. The solution is to copy the file to a temporary directory on the same disk partition as the application server, and then move the file to the deploy directory.
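
The copy-then-move approach described in that reply can be sketched in Java as follows. This is only an illustration: the class name `StagedDeploy`, the method `deploy`, and the directory parameters are hypothetical, and it assumes the staging and deploy directories sit on the same file system so that the move can be atomic.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class StagedDeploy {
    /**
     * Copies the file into a staging directory first, then moves it into the
     * deploy directory in one step, so a reader never sees a half-copied file.
     * Assumes stagingDir and deployDir are on the same file system.
     */
    static Path deploy(Path source, Path stagingDir, Path deployDir) throws IOException {
        Path staged = stagingDir.resolve(source.getFileName());
        // The slow part: a reader watching deployDir cannot see this copy.
        Files.copy(source, staged, StandardCopyOption.REPLACE_EXISTING);
        // The fast part: throws AtomicMoveNotSupportedException if the move
        // cannot be done atomically (for example across file systems).
        return Files.move(staged, deployDir.resolve(source.getFileName()),
                StandardCopyOption.ATOMIC_MOVE);
    }
}
```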

Has anyone else run into a similar exception? Or is there any way we can work around this?

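Looking at the code that produces the error message, this appears to be a problem with the zlib library (used by the JDK) not supporting the format of the gzip files you have.

It looks like the following bug in zlib: Codes for reserved symbols are rejected even if unused

Unfortunately, there is probably not much we can do other than suggest producing these compressed files with a different utility.

If you can produce a small example gzip file that we can use to reproduce the issue, we may be able to see whether it can be worked around somehow, but I would not count on that succeeding.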
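
To find which of the input files actually triggers the exception (and so which one to shrink into a small repro), one option is to decompress each file locally with the same JDK class that appears in the stack trace. The following is a minimal sketch, assuming the files have been copied to local disk; the class name `GzipChecker` and the method `check` are made up for illustration.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;

public class GzipChecker {
    /** Returns null if the file inflates cleanly, otherwise a description of the error. */
    static String check(Path file) {
        byte[] buffer = new byte[64 * 1024];
        try (InputStream in = new GZIPInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            while (in.read(buffer) != -1) {
                // Discard the data; we only care whether inflation succeeds.
            }
            return null;
        } catch (IOException e) {
            // ZipException (bad data) and EOFException (truncated file)
            // are both subclasses of IOException.
            return e.getClass().getSimpleName() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Usage: java GzipChecker file1.gz file2.gz ...
        for (String arg : args) {
            String error = check(Paths.get(arg));
            System.out.println(arg + "\t" + (error == null ? "OK" : error));
        }
    }
}
```

A file that fails here fails independently of Dataflow, which points to the file itself (or the tool that wrote it) rather than the pipeline.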

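This question may be a bit old, but it was the first result when I Googled this error yesterday:

HIVE_CURSOR_ERROR: too many length or distance symbols

Following the hints here, I came to realize that I had botched the gzip construction of the file I was trying to process. I had two processes writing gzip data to the same output file, so the output file was corrupted. Fixing the processes so that each writes to its own unique file solved the problem. I thought this answer might save someone some time.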
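
For what it's worth, here is a rough sketch of the "one writer, one file" idea in Java. It is not the code I actually used; the class name `UniqueGzipWriter`, the method `writeLines`, and the UUID-based naming scheme are just assumptions for illustration.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.UUID;
import java.util.zip.GZIPOutputStream;

public class UniqueGzipWriter {
    /**
     * Writes the lines as one complete gzip stream into a file that only
     * this call owns, and returns the path of that file.
     */
    static Path writeLines(Path dir, String prefix, List<String> lines) throws IOException {
        // A random suffix keeps concurrent writers from sharing a file name.
        Path target = dir.resolve(prefix + "-" + UUID.randomUUID() + ".gz");
        // CREATE_NEW fails instead of silently reusing an existing file.
        try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(
                target, StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE))) {
            for (String line : lines) {
                out.write((line + "\n").getBytes(StandardCharsets.UTF_8));
            }
        }
        return target;
    }
}
```

Because each call opens its own file and closes the gzip stream before returning, two writers can never interleave their compressed bytes in one file.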

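I ran into this error with Spring Boot. I have a main project that uses a library project, and I was using Spring Actuator in the main project. Once I removed Spring Actuator, it started working.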