How to parse a ZIP file from an InputStream generated on demand, without a large byte array?

Question

Background

I've been trying to figure out how to handle problematic ZIP files by reading them through a stream.

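The reasons:

Some ZIP files don't come from a file path. Some come from a Uri, and some are even inside another ZIP file.
Some ZIP files are quite problematic to open, so together with the previous point, what the framework provides can't be relied on. An example is the "XAPK" files from the APKPure website (example here).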

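As one of the possible solutions I looked into, I asked about allocating memory via JNI to hold the entire ZIP file in RAM, while using Apache's ZipFile class, which can handle a zip file in various ways and not just via a file path.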

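The problem

Such a thing seems to work well, but it has some problems:

You don't always have the memory available.
You don't know for sure how much memory you are allowed to allocate without risking a crash of the app.
If you accidentally choose to allocate too much memory, you won't be able to catch it, and the app will crash. It's not like in Java, where you can safely use try-catch and save yourself from an OOM (if I'm wrong and you can, please let me know, because that's a very useful thing to know about JNI).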

So, assuming you can always create an InputStream (via a Uri or from within an existing zip file), how could you parse it as a zip file?

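What I've found

I've made a working sample that does this by using Apache's ZipFile and letting it traverse the ZIP file as if it were all in memory.

Each time it asks to read some bytes from some position, I re-create the InputStream.

It works fine, but the problem is that I can't optimize it well to minimize the number of times I re-create the InputStream. I tried to at least cache the current InputStream and re-use it if it is good enough (skipping forward from the current position if needed), and to re-create the InputStream only if the required position is before the current one. Sadly, this fails in some cases (such as the XAPK file mentioned above), because it causes an EOFException.

Currently I use it only with a Uri, but a similar solution could be done for an InputStream re-generated from within another zip file.

Here is the inefficient solution (a sample is available here, including both the inefficient solution and the one I tried to improve), which seems to always work: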

InefficientSeekableInputStreamByteChannel.kt

@RequiresApi(Build.VERSION_CODES.N)
abstract class InefficientSeekableInputStreamByteChannel : SeekableByteChannel {
    private var position: Long = 0L
    private var cachedSize: Long = -1L
    private var buffer = ByteArray(DEFAULT_BUFFER_SIZE)
    abstract fun getNewInputStream(): InputStream

    override fun isOpen(): Boolean = true

    override fun position(): Long = position

    override fun position(newPosition: Long): SeekableByteChannel {
//        Log.d("AppLog", "position $newPosition")
        require(newPosition >= 0L) { "Position has to be positive" }
        position = newPosition
        return this
    }

    open fun calculateSize(): Long {
        return getNewInputStream().use { inputStream: InputStream ->
            if (inputStream is FileInputStream)
                return inputStream.channel.size()
            var bytesCount = 0L
            while (true) {
                val available = inputStream.available()
                if (available == 0)
                    break
                val skip = inputStream.skip(available.toLong())
                if (skip < 0)
                    break
                bytesCount += skip
            }
            bytesCount
        }
    }

    final override fun size(): Long {
        if (cachedSize < 0L)
            cachedSize = calculateSize()
//        Log.d("AppLog", "size $cachedSize")
        return cachedSize
    }

    override fun close() {
    }

    override fun read(buf: ByteBuffer): Int {
        var wanted: Int = buf.remaining()
//        Log.d("AppLog", "read wanted:$wanted")
        if (wanted <= 0)
            return wanted
        val possible = (calculateSize() - position).toInt()
        if (possible <= 0)
            return -1
        if (wanted > possible)
            wanted = possible
        if (buffer.size < wanted)
            buffer = ByteArray(wanted)
        getNewInputStream().use { inputStream ->
            inputStream.skip(position)
            //now we have an inputStream right on the needed position
            inputStream.readBytesIntoByteArray(buffer, wanted)
        }
        buf.put(buffer, 0, wanted)
        position += wanted
        return wanted
    }

    //not needed, because we don't store anything in memory
    override fun truncate(size: Long): SeekableByteChannel = this

    override fun write(src: ByteBuffer?): Int {
        //not needed, we read only
        throw  NotImplementedError()
    }
}

InefficientSeekableInUriByteChannel.kt

@RequiresApi(Build.VERSION_CODES.N)
class InefficientSeekableInUriByteChannel(someContext: Context, private val uri: Uri) : InefficientSeekableInputStreamByteChannel() {
    private val applicationContext = someContext.applicationContext

    override fun calculateSize(): Long = StreamsUtil.getStreamLengthFromUri(applicationContext, uri)

    override fun getNewInputStream(): InputStream = BufferedInputStream(
            applicationContext.contentResolver.openInputStream(uri)!!)
}

Usage:

val file = ...
val uri = Uri.fromFile(file)
parseUsingInefficientSeekableInUriByteChannel(uri)
...
    private fun parseUsingInefficientSeekableInUriByteChannel(uri: Uri): Boolean {
        Log.d("AppLog", "testing using SeekableInUriByteChannel (re-creating inputStream when needed) ")
        try {
            val startTime = System.currentTimeMillis()
            ZipFile(InefficientSeekableInUriByteChannel(this, uri)).use { zipFile: ZipFile ->
                val entriesNamesAndSizes = ArrayList<Pair<String, Long>>()
                for (entry in zipFile.entries) {
                    val name = entry.name
                    val size = entry.size
                    entriesNamesAndSizes.add(Pair(name, size))
                    Log.v("Applog", "entry name: $name - ${numberFormat.format(size)}")
                }
                val endTime = System.currentTimeMillis()
                Log.d("AppLog", "got ${entriesNamesAndSizes.size} entries data. time taken: ${endTime - startTime}ms")
                return true
            }
        } catch (e: Throwable) {
            Log.e("AppLog", "error while trying to parse using SeekableInUriByteChannel:$e")
            e.printStackTrace()
        }
        return false
    }

And here is my attempt at improving it, which doesn't work in some cases:

SeekableInputStreamByteChannel.kt

@RequiresApi(Build.VERSION_CODES.N)
abstract class SeekableInputStreamByteChannel : SeekableByteChannel {
    private var position: Long = 0L
    private var actualPosition: Long = 0L
    private var cachedSize: Long = -1L
    private var inputStream: InputStream? = null
    private var buffer = ByteArray(DEFAULT_BUFFER_SIZE)
    abstract fun getNewInputStream(): InputStream

    override fun isOpen(): Boolean = true

    override fun position(): Long = position

    override fun position(newPosition: Long): SeekableByteChannel {
//        Log.d("AppLog", "position $newPosition")
        require(newPosition >= 0L) { "Position has to be positive" }
        position = newPosition
        return this
    }

    open fun calculateSize(): Long {
        return getNewInputStream().use { inputStream: InputStream ->
            if (inputStream is FileInputStream)
                return inputStream.channel.size()
            var bytesCount = 0L
            while (true) {
                val available = inputStream.available()
                if (available == 0)
                    break
                val skip = inputStream.skip(available.toLong())
                if (skip < 0)
                    break
                bytesCount += skip
            }
            bytesCount
        }
    }

    final override fun size(): Long {
        if (cachedSize < 0L)
            cachedSize = calculateSize()
//        Log.d("AppLog", "size $cachedSize")
        return cachedSize
    }

    override fun close() {
        inputStream.closeSilently().also { inputStream = null }
    }

    override fun read(buf: ByteBuffer): Int {
        var wanted: Int = buf.remaining()
//        Log.d("AppLog", "read wanted:$wanted")
        if (wanted <= 0)
            return wanted
        val possible = (calculateSize() - position).toInt()
        if (possible <= 0)
            return -1
        if (wanted > possible)
            wanted = possible
        if (buffer.size < wanted)
            buffer = ByteArray(wanted)
        var inputStream = this.inputStream
        //skipping to required position
        if (inputStream == null) {
            inputStream = getNewInputStream()
//            Log.d("AppLog", "getNewInputStream")
            inputStream.skip(position)
            this.inputStream = inputStream
        } else {
            if (actualPosition > position) {
                inputStream.close()
                actualPosition = 0L
                inputStream = getNewInputStream()
//                Log.d("AppLog", "getNewInputStream")
                this.inputStream = inputStream
            }
            inputStream.skip(position - actualPosition)
        }
        //now we have an inputStream right on the needed position
        inputStream.readBytesIntoByteArray(buffer, wanted)
        buf.put(buffer, 0, wanted)
        position += wanted
        actualPosition = position
        return wanted
    }

    //not needed, because we don't store anything in memory
    override fun truncate(size: Long): SeekableByteChannel = this

    override fun write(src: ByteBuffer?): Int {
        //not needed, we read only
        throw  NotImplementedError()
    }
}

SeekableInUriByteChannel.kt

@RequiresApi(Build.VERSION_CODES.N)
class SeekableInUriByteChannel(someContext: Context, private val uri: Uri) : SeekableInputStreamByteChannel() {
    private val applicationContext = someContext.applicationContext

    override fun calculateSize(): Long = StreamsUtil.getStreamLengthFromUri(applicationContext, uri)

    override fun getNewInputStream(): InputStream = BufferedInputStream(
            applicationContext.contentResolver.openInputStream(uri)!!)
}

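The questions

Is there a way to improve it?

Perhaps by re-creating the InputStream as few times as possible?

Are there more possible optimizations that would let it parse ZIP files well? Maybe some caching of the data?

I'm asking because it seems quite slow compared to other solutions I've found, and I think this could help.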

Answer 1

BufferedInputStream's skip() method doesn't always skip all of the bytes you specify. In SeekableInputStreamByteChannel, change the following code

inputStream.skip(position - actualPosition)

to

var bytesToSkip = position - actualPosition
while (bytesToSkip > 0) {
    bytesToSkip -= inputStream.skip(bytesToSkip)
}

This should work.
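One caveat with a skip loop like this: InputStream.skip() is allowed to return 0, for example at the end of the stream, in which case the loop would never finish. A slightly more defensive variant could look like the sketch below. The name skipFully is a hypothetical helper, not something from the original code, and falling back to a single-byte read() to detect the end of the stream is one common approach rather than the only one.

```kotlin
import java.io.ByteArrayInputStream
import java.io.EOFException
import java.io.InputStream

/**
 * Skips exactly [count] bytes, or throws EOFException if the stream ends first.
 * Hypothetical helper, not part of the original post.
 */
fun InputStream.skipFully(count: Long) {
    var remaining = count
    while (remaining > 0) {
        val skipped = skip(remaining)
        if (skipped > 0) {
            remaining -= skipped
        } else if (read() == -1) {
            // skip() made no progress and a single-byte read hit the end of the stream
            throw EOFException("Stream ended with $remaining bytes left to skip")
        } else {
            // the single-byte read consumed one byte, which counts as skipped
            remaining--
        }
    }
}

fun main() {
    // Ten bytes with values 0..9, standing in for real ZIP data
    val stream = ByteArrayInputStream(ByteArray(10) { it.toByte() })
    stream.skipFully(4)
    println(stream.read()) // prints 4, the byte right after the skipped ones
}
```

With something like this, both inputStream.skip(position) and inputStream.skip(position - actualPosition) in the channel could be replaced by a call to skipFully, so a short skip shows up as an explicit exception instead of a read from the wrong offset.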

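As for making it more efficient, the first thing ZipFile does is seek to the end of the file to get the central directory (CD). With the CD in hand, ZipFile knows the layout of the zip file. The zip entries should be in the same order as they are laid out in the file. I would read the files you want in that same order to avoid backtracking. If you can't guarantee the read order, then multiple input streams might make sense.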
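To illustrate why the read order matters for a forward-only stream, here is a small self-contained sketch. It only simulates the bookkeeping: the offsets are made-up values, countStreamOpens is a hypothetical function, and the size of each entry is ignored for simplicity (entries don't overlap, so the comparison of start offsets gives the same result). With Apache Commons Compress, ZipFile.getEntriesInPhysicalOrder() should give the entries in on-disk order without sorting them yourself.

```kotlin
/**
 * Counts how many times a forward-only stream has to be (re-)created
 * to serve reads at the given offsets, in the given order.
 * Hypothetical simulation, not part of the original post.
 */
fun countStreamOpens(offsets: List<Long>): Int {
    var opens = 0
    var currentPosition = -1L // -1 means no stream has been opened yet
    for (offset in offsets) {
        if (currentPosition < 0 || offset < currentPosition) {
            // target is behind us (or nothing is open): start over from the beginning
            opens++
        }
        currentPosition = offset
    }
    return opens
}

fun main() {
    // Made-up local header offsets of four entries, in the order they are requested
    val requestOrder = listOf(9_000L, 100L, 5_000L, 2_500L)
    println(countStreamOpens(requestOrder))          // prints 3
    println(countStreamOpens(requestOrder.sorted())) // prints 1
}
```

Reading in physical order means the cached stream only ever needs to skip forward, so after the initial seek to the central directory a single extra stream can be enough.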

zip

android

apache-commons