How to parse a ZIP file from an on-demand-generated InputStream without having a large byte array?

Background

I've been trying to figure out how to deal with problematic ZIP files by working on a stream of their content.

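Reasons:

  1. Some ZIP files don't come from a file path. Some come from a Uri, and some are even inside another ZIP file.

  2. Some ZIP files are quite problematic to open, so together with the previous point, what the framework provides can't always be used. An example is the "XAPK" files from the APKPure website.

As one of the possible solutions I looked into, I asked about allocating memory via JNI to hold the entire ZIP file in RAM, while using Apache's ZipFile class, which can handle a zip file from various sources and not only from a file path.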

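The problem

Such an approach seems to work well, but it has some problems:

  1. You don't always have the memory available.
  2. You can't be sure what the maximum amount of memory is that you can allocate without crashing the app.
  3. If you accidentally choose to allocate too much memory, you can't catch it, and the app crashes. It's not like Java, where you can safely use try-catch and save yourself from an OOM (if I'm wrong and you can, please let me know, because that would be a very good thing to know about JNI).

So, assuming you can always create an InputStream (via a Uri or from within an existing zip file), how can you parse it as a zip file?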

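What I've found

I've made a working sample that does this by using Apache's ZipFile and letting it traverse the zip file as if everything were in memory.

Each time it asks to read some bytes from some position, I re-create the InputStream.

It works fine, but the problem is that I can't optimize it well enough to minimize the number of times I re-create the InputStream. I tried to at least cache the current InputStream and re-use it when possible (skipping forward from the current position if needed), re-creating it only if the required position is before the current one. Sadly this fails in some cases (such as the XAPK file I mentioned above), as it causes an EOFException.

Currently I've only used a Uri, but a similar solution can be made for an InputStream re-generated from within another zip file.

Here's the inefficient solution, which seems to always work (the sample project contains both the inefficient solution and my attempt to improve it):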

InefficientSeekableInputStreamByteChannel.kt

@RequiresApi(Build.VERSION_CODES.N)
abstract class InefficientSeekableInputStreamByteChannel : SeekableByteChannel {
    private var position: Long = 0L
    private var cachedSize: Long = -1L
    private var buffer = ByteArray(DEFAULT_BUFFER_SIZE)
    abstract fun getNewInputStream(): InputStream

    override fun isOpen(): Boolean = true

    override fun position(): Long = position

    override fun position(newPosition: Long): SeekableByteChannel {
//        Log.d("AppLog", "position $newPosition")
        require(newPosition >= 0L) { "Position has to be positive" }
        position = newPosition
        return this
    }

    open fun calculateSize(): Long {
        return getNewInputStream().use { inputStream: InputStream ->
            if (inputStream is FileInputStream)
                return inputStream.channel.size()
            var bytesCount = 0L
            while (true) {
                val available = inputStream.available()
                if (available == 0)
                    break
                val skip = inputStream.skip(available.toLong())
                if (skip < 0)
                    break
                bytesCount += skip
            }
            bytesCount
        }
    }

    final override fun size(): Long {
        if (cachedSize < 0L)
            cachedSize = calculateSize()
//        Log.d("AppLog", "size $cachedSize")
        return cachedSize
    }

    override fun close() {
    }

    override fun read(buf: ByteBuffer): Int {
        var wanted: Int = buf.remaining()
//        Log.d("AppLog", "read wanted:$wanted")
        if (wanted <= 0)
            return wanted
        val possible = (calculateSize() - position).toInt()
        if (possible <= 0)
            return -1
        if (wanted > possible)
            wanted = possible
        if (buffer.size < wanted)
            buffer = ByteArray(wanted)
        getNewInputStream().use { inputStream ->
            inputStream.skip(position)
            //now we have an inputStream right on the needed position
            inputStream.readBytesIntoByteArray(buffer, wanted)
        }
        buf.put(buffer, 0, wanted)
        position += wanted
        return wanted
    }

    //not needed, because we don't store anything in memory
    override fun truncate(size: Long): SeekableByteChannel = this

    override fun write(src: ByteBuffer?): Int {
        //not needed, we read only
        throw  NotImplementedError()
    }
}

InefficientSeekableInUriByteChannel.kt

@RequiresApi(Build.VERSION_CODES.N)
class InefficientSeekableInUriByteChannel(someContext: Context, private val uri: Uri) : InefficientSeekableInputStreamByteChannel() {
    private val applicationContext = someContext.applicationContext

    override fun calculateSize(): Long = StreamsUtil.getStreamLengthFromUri(applicationContext, uri)

    override fun getNewInputStream(): InputStream = BufferedInputStream(
            applicationContext.contentResolver.openInputStream(uri)!!)
}

Usage:

val file = ...
val uri = Uri.fromFile(file)
parseUsingInefficientSeekableInUriByteChannel(uri)
...
    private fun parseUsingInefficientSeekableInUriByteChannel(uri: Uri): Boolean {
        Log.d("AppLog", "testing using SeekableInUriByteChannel (re-creating inputStream when needed) ")
        try {
            val startTime = System.currentTimeMillis()
            ZipFile(InefficientSeekableInUriByteChannel(this, uri)).use { zipFile: ZipFile ->
                val entriesNamesAndSizes = ArrayList<Pair<String, Long>>()
                for (entry in zipFile.entries) {
                    val name = entry.name
                    val size = entry.size
                    entriesNamesAndSizes.add(Pair(name, size))
                    Log.v("AppLog", "entry name: $name - ${numberFormat.format(size)}")
                }
                val endTime = System.currentTimeMillis()
                Log.d("AppLog", "got ${entriesNamesAndSizes.size} entries data. time taken: ${endTime - startTime}ms")
                return true
            }
        } catch (e: Throwable) {
            Log.e("AppLog", "error while trying to parse using SeekableInUriByteChannel:$e")
            e.printStackTrace()
        }
        return false
    }

Here's my attempt at improving it, which doesn't work in some cases:

SeekableInputStreamByteChannel.kt

@RequiresApi(Build.VERSION_CODES.N)
abstract class SeekableInputStreamByteChannel : SeekableByteChannel {
    private var position: Long = 0L
    private var actualPosition: Long = 0L
    private var cachedSize: Long = -1L
    private var inputStream: InputStream? = null
    private var buffer = ByteArray(DEFAULT_BUFFER_SIZE)
    abstract fun getNewInputStream(): InputStream

    override fun isOpen(): Boolean = true

    override fun position(): Long = position

    override fun position(newPosition: Long): SeekableByteChannel {
//        Log.d("AppLog", "position $newPosition")
        require(newPosition >= 0L) { "Position has to be positive" }
        position = newPosition
        return this
    }

    open fun calculateSize(): Long {
        return getNewInputStream().use { inputStream: InputStream ->
            if (inputStream is FileInputStream)
                return inputStream.channel.size()
            var bytesCount = 0L
            while (true) {
                val available = inputStream.available()
                if (available == 0)
                    break
                val skip = inputStream.skip(available.toLong())
                if (skip < 0)
                    break
                bytesCount += skip
            }
            bytesCount
        }
    }

    final override fun size(): Long {
        if (cachedSize < 0L)
            cachedSize = calculateSize()
//        Log.d("AppLog", "size $cachedSize")
        return cachedSize
    }

    override fun close() {
        inputStream.closeSilently().also { inputStream = null }
    }

    override fun read(buf: ByteBuffer): Int {
        var wanted: Int = buf.remaining()
//        Log.d("AppLog", "read wanted:$wanted")
        if (wanted <= 0)
            return wanted
        val possible = (calculateSize() - position).toInt()
        if (possible <= 0)
            return -1
        if (wanted > possible)
            wanted = possible
        if (buffer.size < wanted)
            buffer = ByteArray(wanted)
        var inputStream = this.inputStream
        //skipping to required position
        if (inputStream == null) {
            inputStream = getNewInputStream()
//            Log.d("AppLog", "getNewInputStream")
            inputStream.skip(position)
            this.inputStream = inputStream
        } else {
            if (actualPosition > position) {
                inputStream.close()
                actualPosition = 0L
                inputStream = getNewInputStream()
//                Log.d("AppLog", "getNewInputStream")
                this.inputStream = inputStream
            }
            inputStream.skip(position - actualPosition)
        }
        //now we have an inputStream right on the needed position
        inputStream.readBytesIntoByteArray(buffer, wanted)
        buf.put(buffer, 0, wanted)
        position += wanted
        actualPosition = position
        return wanted
    }

    //not needed, because we don't store anything in memory
    override fun truncate(size: Long): SeekableByteChannel = this

    override fun write(src: ByteBuffer?): Int {
        //not needed, we read only
        throw  NotImplementedError()
    }
}

SeekableInUriByteChannel.kt

@RequiresApi(Build.VERSION_CODES.N)
class SeekableInUriByteChannel(someContext: Context, private val uri: Uri) : SeekableInputStreamByteChannel() {
    private val applicationContext = someContext.applicationContext

    override fun calculateSize(): Long = StreamsUtil.getStreamLengthFromUri(applicationContext, uri)

    override fun getNewInputStream(): InputStream = BufferedInputStream(
            applicationContext.contentResolver.openInputStream(uri)!!)
}

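The questions

Is there any way to improve this?

Perhaps by re-creating the InputStream as few times as possible?

Are there more possible optimizations that would let it parse ZIP files well? Maybe some caching of the data?

I'm asking because it seems quite slow compared to other solutions I've found, and I think this might help.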

The skip() method of BufferedInputStream does not always skip all of the bytes you ask for. In SeekableInputStreamByteChannel, change the following code

inputStream.skip(position - actualPosition)

to

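var bytesToSkip = position - actualPosition
while (bytesToSkip > 0) {
    val skipped = inputStream.skip(bytesToSkip)
    // skip() may return 0 without being at the end; a single read() tells the two cases apart
    if (skipped > 0) bytesToSkip -= skipped
    else if (inputStream.read() >= 0) bytesToSkip--
    else throw java.io.EOFException("Stream ended before reaching position $position")
}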

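That should work. The same applies to the inputStream.skip(position) call made right after a new stream is created, since it can also skip fewer bytes than requested.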
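
To make the partial-skip behaviour easier to see in isolation, here is a minimal, self-contained sketch. It is not part of the original code: skipFully and TrickleStream are hypothetical names, and TrickleStream exists only to imitate a stream whose skip() advances by at most a few bytes per call.

```kotlin
import java.io.ByteArrayInputStream
import java.io.EOFException
import java.io.FilterInputStream
import java.io.InputStream

// Hypothetical helper: keeps skipping until exactly `count` bytes are consumed,
// or throws if the stream ends first.
fun InputStream.skipFully(count: Long) {
    var remaining = count
    while (remaining > 0) {
        val skipped = skip(remaining)
        if (skipped > 0) {
            remaining -= skipped
        } else if (read() >= 0) {
            // skip() made no progress, but one byte could still be read
            remaining--
        } else {
            throw EOFException("Stream ended with $remaining bytes left to skip")
        }
    }
}

// Hypothetical test stream: skip() never advances by more than 3 bytes per call.
class TrickleStream(source: InputStream) : FilterInputStream(source) {
    override fun skip(n: Long): Long = super.skip(minOf(n, 3L))
}

fun main() {
    val data = ByteArray(100) { it.toByte() }
    val stream = TrickleStream(ByteArrayInputStream(data))
    stream.skipFully(42)
    println(stream.read()) // prints 42, the byte stored at index 42
}
```

A single skip(42) on the same TrickleStream would have advanced only 3 bytes, which is the kind of silent misalignment that can later surface as an EOFException while parsing.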

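As for making it more efficient, the first thing ZipFile does is jump to the end of the file to read the central directory (CD). With the CD, ZipFile knows the makeup of the zip file. The zip entries should be in the same order as the files are laid out in the archive. I would read the files you want in that same order to avoid backtracking. If you can't guarantee the read order, then multiple input streams may make sense.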
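
The effect of read order on a forward-only stream can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: EntryInfo and countStreamOpens are hypothetical names, the offsets are made up, and the model ignores the bytes consumed while reading each entry.

```kotlin
// Hypothetical stand-in for a ZIP entry: only the name and the offset of its data matter here.
data class EntryInfo(val name: String, val dataOffset: Long)

// Counts how many times a forward-only stream has to be (re)opened
// when the given offsets are visited in the given order.
fun countStreamOpens(offsets: List<Long>): Int {
    var opens = 0
    var streamPosition = -1L // -1 means "no stream opened yet"
    for (offset in offsets) {
        if (streamPosition < 0 || offset < streamPosition) {
            opens++ // the target lies behind the current position: start over
        }
        streamPosition = offset
    }
    return opens
}

fun main() {
    val entries = listOf(
        EntryInfo("c.bin", 9_000L),
        EntryInfo("a.bin", 100L),
        EntryInfo("b.bin", 4_000L)
    )
    val asListed = entries.map { it.dataOffset }
    val physicalOrder = entries.sortedBy { it.dataOffset }.map { it.dataOffset }
    println(countStreamOpens(asListed))      // prints 2
    println(countStreamOpens(physicalOrder)) // prints 1
}
```

With Apache Commons Compress, ZipFile.getEntriesInPhysicalOrder() returns the entries already sorted this way, so iterating over that instead of zipFile.entries is one way to keep backtracking to a minimum.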