How to parse a zipped file completely from RAM?

Question

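Background

I need to parse zip files of various types (to get the content of some inner files for one purpose or another, including their names).

Some of the files are not reachable via a file path, because Android exposes them through a Uri, and sometimes the zip file sits inside another zip file. With the push to use SAF, using a file path is even less possible in some cases.

For this, we have two main options: the ZipFile class and the ZipInputStream class.

The problem

When we have a file path, ZipFile is a perfect solution. It is also very efficient in terms of speed.

However, for the rest of the cases, ZipInputStream can run into issues. For example, one problematic zip file causes this exception: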

  java.util.zip.ZipException: only DEFLATED entries can have EXT descriptor
        at java.util.zip.ZipInputStream.readLOC(ZipInputStream.java:321)
        at java.util.zip.ZipInputStream.getNextEntry(ZipInputStream.java:124)
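
For context, here is a minimal sketch of the streaming approach that hits this exception, using only java.util.zip. The class name ZipStreamLister and its methods are hypothetical names for illustration, and sampleZip() only builds a small well-formed archive for the demo; it does not reproduce the problematic file.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ZipStreamLister {
    /** Reads the stream sequentially and returns the entry names in archive order. */
    static List<String> listNames(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zis = new ZipInputStream(in)) {
            ZipEntry entry;
            // getNextEntry() parses each local file header as it goes; this is the
            // call that appears in the stack trace above.
            while ((entry = zis.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    /** Builds a tiny, well-formed zip in memory (demo data only). */
    static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(out)) {
            for (String name : new String[] {"a.txt", "dir/b.txt"}) {
                zos.putNextEntry(new ZipEntry(name));
                zos.write(name.getBytes(StandardCharsets.UTF_8));
                zos.closeEntry();
            }
        }
        return out.toByteArray();
    }
}
```

Because the stream is read front to back, the parser never sees the central directory at the end of the archive, which is the part ZipFile relies on.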

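What I've tried

The only solution that always works is to copy the file somewhere else, where you can parse it using ZipFile, but this is inefficient, requires free storage, and requires deleting the file when you are done with it.

So I found that Apache has a nice, pure Java library (Apache Commons Compress) for parsing Zip files, and for some reason its InputStream solution (called "ZipArchiveInputStream") seems to be even more efficient than the native ZipInputStream class.

Compared with what we have in the native framework, the library offers a bit more flexibility. For example, I can load the entire zip file into a byte array and let the library handle it as usual, and this works even for the problematic Zip files I mentioned:

org.apache.commons.compress.archivers.zip.ZipFile(SeekableInMemoryByteChannel(byteArray)).use { zipFile ->
    for (entry in zipFile.entries) {
        val name = entry.name
        // ... use the zipFile like you do with the native framework
    }
}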

Gradle dependency:

// http://commons.apache.org/proper/commons-compress/ https://mvnrepository.com/artifact/org.apache.commons/commons-compress
implementation 'org.apache.commons:commons-compress:1.20'

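Sadly, this isn't always possible, because it depends on the heap being able to hold the entire zip file, and on Android that is even more limited, because the heap size can be relatively small (the heap could be 100MB while the file is 200MB). Unlike a PC, where a huge heap can be configured, Android is not flexible at all in this respect.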
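
To make the constraint concrete, here is a rough sketch of a check that could be run before choosing the byte-array route. HeapBudget, fitsInHeap and the reserveBytes parameter are hypothetical names, and the result is only an estimate, since the runtime can still fail the allocation (for example because of fragmentation).

```java
final class HeapBudget {
    /** Bytes the heap could still grow by: the configured maximum minus what is in use. */
    static long availableHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return rt.maxMemory() - used;
    }

    /**
     * Returns true if a byte[] of zipSize bytes looks like it would fit on the heap
     * while leaving reserveBytes of headroom for the rest of the app.
     */
    static boolean fitsInHeap(long zipSize, long reserveBytes) {
        if (zipSize < 0 || reserveBytes < 0) return false;
        // A Java array is indexed by int; the small margin below the int maximum
        // is an assumption, as the exact limit depends on the runtime.
        if (zipSize > Integer.MAX_VALUE - 8) return false;
        return zipSize <= availableHeapBytes() - reserveBytes;
    }
}
```

With the numbers from the example above (a 100MB heap and a 200MB file), fitsInHeap would return false, so some other route is needed.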

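So I searched for a solution that uses JNI instead, to load the entire ZIP file into a byte array there, without it going into the heap (at least not entirely). This could be a nicer workaround, because if the ZIP fits in the device's RAM rather than the heap, it could keep me from hitting OOM while also not needing an extra file.

I found a library called "larray" that seemed promising, but sadly it crashed when I tried using it, because its requirements include a full JVM, meaning it is not suitable for Android.

EDIT: Seeing that I couldn't find any library or built-in class, I tried to use JNI myself. Sadly I'm very rusty with it, so I looked at an old repository I made a long time ago for performing some operations on Bitmaps. This is what I came up with:

native-lib.cpp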

#include <jni.h>
#include <android/log.h>
#include <cstdio>
#include <android/bitmap.h>
#include <cstring>
#include <unistd.h>

class JniBytesArray {
public:
    uint32_t *_storedData;

    JniBytesArray() {
        _storedData = NULL;
    }
};

extern "C" {
JNIEXPORT jobject JNICALL Java_com_lb_myapplication_JniByteArrayHolder_allocate(
        JNIEnv *env, jobject obj, jlong size) {
    auto *jniBytesArray = new JniBytesArray();
    auto *array = new uint32_t[size];
    for (int i = 0; i < size; ++i)
        array[i] = 0;
    jniBytesArray->_storedData = array;
    return env->NewDirectByteBuffer(jniBytesArray, 0);
}
}

JniByteArrayHolder.kt

class JniByteArrayHolder {
    external fun allocate(size: Long): ByteBuffer

    companion object {
        init {
            System.loadLibrary("native-lib")
        }
    }
}

class MainActivity : AppCompatActivity() {
    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        setContentView(R.layout.activity_main)
        thread {
            printMemStats()
            val jniByteArrayHolder = JniByteArrayHolder()
            val byteBuffer = jniByteArrayHolder.allocate(1L * 1024L)
            printMemStats()
        }
    }

    fun printMemStats() {
        val memoryInfo = ActivityManager.MemoryInfo()
        (getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager).getMemoryInfo(memoryInfo)
        val nativeHeapSize = memoryInfo.totalMem
        val nativeHeapFreeSize = memoryInfo.availMem
        val usedMemInBytes = nativeHeapSize - nativeHeapFreeSize
        val usedMemInPercentage = usedMemInBytes * 100 / nativeHeapSize
        Log.d("AppLog", "total:${Formatter.formatFileSize(this, nativeHeapSize)} " +
                "free:${Formatter.formatFileSize(this, nativeHeapFreeSize)} " +
                "used:${Formatter.formatFileSize(this, usedMemInBytes)} ($usedMemInPercentage%)")
    }
}

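This doesn't seem right, because if I try to create a 1GB byte array using jniByteArrayHolder.allocate(1L * 1024L * 1024L * 1024L), it crashes without any exception or error logs.

The questions

Is it possible to use JNI with Apache's library, so that it handles ZIP file content that lives in JNI's "world"?
If so, how can I do it? Is there a sample showing how? Is there a class for it, or do I have to implement it myself? If so, can you please show how it's done in JNI?
If it's not possible, what other way is there to do it? Maybe an alternative to what Apache has?
For the JNI solution, why doesn't it work well? How can I efficiently copy the bytes from the stream into the JNI byte array (my guess is that it would be via a buffer)?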

Answer 1

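You can steal LWJGL's native memory management functions. It is BSD3 licensed, so you only have to mention somewhere that you are using code from it.

Step 1: given an InputStream is and a file size ZIP_SIZE, slurp the stream into a direct byte buffer created by LWJGL's org.lwjgl.system.MemoryUtil helper class:

ByteBuffer bb = MemoryUtil.memAlloc(ZIP_SIZE);
byte[] buf = new byte[4096]; // Play with the buffer size to see what works best
int read = 0;
while ((read = is.read(buf)) != -1) {
  bb.put(buf, 0, read);
}
bb.flip(); // Reset the position to 0 so the channel in step 2 reads from the start

Step 2: wrap the ByteBuffer in a ByteChannel. The class below is taken from a gist; you may want to rip out the writing parts.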

package io.github.ncruces.utils;

import java.nio.ByteBuffer;
import java.nio.channels.NonWritableChannelException;
import java.nio.channels.SeekableByteChannel;

import static java.lang.Math.min;

public final class ByteBufferChannel implements SeekableByteChannel {
    private final ByteBuffer buf;

    public ByteBufferChannel(ByteBuffer buffer) {
        if (buffer == null) throw new NullPointerException();
        buf = buffer;
    }

    @Override
    public synchronized int read(ByteBuffer dst) {
        if (buf.remaining() == 0) return -1;

        int count = min(dst.remaining(), buf.remaining());
        if (count > 0) {
            ByteBuffer tmp = buf.slice();
            tmp.limit(count);
            dst.put(tmp);
            buf.position(buf.position() + count);
        }
        return count;
    }

    @Override
    public synchronized int write(ByteBuffer src) {
        if (buf.isReadOnly()) throw new NonWritableChannelException();

        int count = min(src.remaining(), buf.remaining());
        if (count > 0) {
            ByteBuffer tmp = src.slice();
            tmp.limit(count);
            buf.put(tmp);
            src.position(src.position() + count);
        }
        return count;
    }

    @Override
    public synchronized long position() {
        return buf.position();
    }

    @Override
    public synchronized ByteBufferChannel position(long newPosition) {
        if ((newPosition | Integer.MAX_VALUE - newPosition) < 0) throw new IllegalArgumentException();
        buf.position((int)newPosition);
        return this;
    }

    @Override
    public synchronized long size() { return buf.limit(); }

    @Override
    public synchronized ByteBufferChannel truncate(long size) {
        if ((size | Integer.MAX_VALUE - size) < 0) throw new IllegalArgumentException();
        int limit = buf.limit();
        if (limit > size) buf.limit((int)size);
        return this;
    }

    @Override
    public boolean isOpen() { return true; }

    @Override
    public void close() {}
}

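Step 3: use ZipFile like before:

ZipFile zf = new ZipFile(new ByteBufferChannel(bb));
// getEntries() returns an Enumeration, so wrap it in a list to use for-each
for (ZipArchiveEntry ze : Collections.list(zf.getEntries())) {
    ...
}

Step 4: manually release the native buffer (preferably in a finally block):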

MemoryUtil.memFree(bb);
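
One way to get the "finally" behaviour without writing it at every call site is to wrap the buffer in an AutoCloseable. This is only a sketch: NativeBufferHandle is a hypothetical helper, and the release function is passed in so the class itself does not depend on LWJGL.

```java
import java.nio.ByteBuffer;
import java.util.Objects;
import java.util.function.Consumer;

/** Owns a manually managed buffer and releases it exactly once when closed. */
final class NativeBufferHandle implements AutoCloseable {
    private final Consumer<ByteBuffer> releaser;
    private ByteBuffer buffer;

    NativeBufferHandle(ByteBuffer buffer, Consumer<ByteBuffer> releaser) {
        this.buffer = Objects.requireNonNull(buffer);
        this.releaser = Objects.requireNonNull(releaser);
    }

    /** Returns the buffer, or throws if it has already been released. */
    synchronized ByteBuffer buffer() {
        if (buffer == null) throw new IllegalStateException("buffer already released");
        return buffer;
    }

    /** Releases the buffer; further calls are no-ops, so a double free cannot happen here. */
    @Override
    public synchronized void close() {
        if (buffer != null) {
            ByteBuffer toRelease = buffer;
            buffer = null; // Drop the reference first so it cannot be handed out again
            releaser.accept(toRelease);
        }
    }
}
```

With the calls from the steps above, usage would look something like try (NativeBufferHandle h = new NativeBufferHandle(MemoryUtil.memAlloc(ZIP_SIZE), b -> MemoryUtil.memFree(b))) { ... }, with the ZipFile work done inside the try block. Any ByteBuffer still referenced after close() must not be touched.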

Answer 2

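I looked at the JNI code you posted and made a couple of changes. Mostly, they define the size argument for NewDirectByteBuffer and use malloc().

Here is the log output after allocating 800mb: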

D/AppLog: total:1.57 GB free:1.03 GB used:541 MB (34%)
D/AppLog: total:1.57 GB free:247 MB used:1.32 GB (84%)

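The debugger shows what the buffer looks like after allocation: it reports a limit of 800mb, which is what we expect.

My C is very rusty, so I am sure there is still some work to be done. I have updated the code to be a little more robust and to allow the memory to be freed.

native-lib.cpp

#include <jni.h>
#include <android/log.h>
#include <climits>
#include <cstdlib>
#include <cstring>

extern "C" {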
static jbyteArray *_holdBuffer = NULL;
static jobject _directBuffer = NULL;
/*
    This routine is not re-entrant and can handle only one buffer at a time. If a buffer is
    allocated then it must be released before the next one is allocated.
 */
JNIEXPORT
jobject JNICALL Java_com_example_zipfileinmemoryjni_JniByteArrayHolder_allocate(
        JNIEnv *env, jobject obj, jlong size) {
    if (_holdBuffer != NULL || _directBuffer != NULL) {
        __android_log_print(ANDROID_LOG_ERROR, "JNI Routine",
                            "Call to JNI allocate() before freeBuffer()");
        return NULL;
    }

    // Max size for a direct buffer is the max of a jint even though NewDirectByteBuffer takes a
    // long. Clamp max size as follows:
    if (size > SIZE_T_MAX || size > INT_MAX || size <= 0) {
        jlong maxSize = SIZE_T_MAX < INT_MAX ? SIZE_T_MAX : INT_MAX;
        __android_log_print(ANDROID_LOG_ERROR, "JNI Routine",
                            "Native memory allocation request must be >0 and <= %lld but was %lld.\n",
                            maxSize, size);
        return NULL;
    }

    jbyteArray *array = (jbyteArray *) malloc(static_cast<size_t>(size));
    if (array == NULL) {
        __android_log_print(ANDROID_LOG_ERROR, "JNI Routine",
                            "Failed to allocate %lld bytes of native memory.\n",
                            size);
        return NULL;
    }

    jobject directBuffer = env->NewDirectByteBuffer(array, size);
    if (directBuffer == NULL) {
        free(array);
        __android_log_print(ANDROID_LOG_ERROR, "JNI Routine",
                            "Failed to create direct buffer of size %lld.\n",
                            size);
        return NULL;
    }
    // memset() is not really needed but we call it here to force Android to count
    // the consumed memory in the stats since it only seems to "count" dirty pages. (?)
    memset(array, 0xFF, static_cast<size_t>(size));
    _holdBuffer = array;

    // Get a global reference to the direct buffer so Java isn't tempted to GC it.
    _directBuffer = env->NewGlobalRef(directBuffer);
    return directBuffer;
}

JNIEXPORT void JNICALL Java_com_example_zipfileinmemoryjni_JniByteArrayHolder_freeBuffer(
        JNIEnv *env, jobject obj, jobject directBuffer) {

    if (_directBuffer == NULL || _holdBuffer == NULL) {
        __android_log_print(ANDROID_LOG_ERROR, "JNI Routine",
                            "Attempt to free unallocated buffer.");
        return;
    }

    jbyteArray *bufferLoc = (jbyteArray *) env->GetDirectBufferAddress(directBuffer);
    if (bufferLoc == NULL) {
        __android_log_print(ANDROID_LOG_ERROR, "JNI Routine",
                            "Failed to retrieve direct buffer location associated with ByteBuffer.");
        return;
    }

    if (bufferLoc != _holdBuffer) {
        __android_log_print(ANDROID_LOG_ERROR, "JNI Routine",
                            "DirectBuffer does not match that allocated.");
        return;
    }

    // Free the malloc'ed buffer and the global reference. Java can now GC the direct buffer.
    free(bufferLoc);
    env->DeleteGlobalRef(_directBuffer);
    _holdBuffer = NULL;
    _directBuffer = NULL;
}
}

I also updated the array holder:

class JniByteArrayHolder {
    external fun allocate(size: Long): ByteBuffer
    external fun freeBuffer(byteBuffer: ByteBuffer)

    companion object {
        init {
            System.loadLibrary("native-lib")
        }
    }
}

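I can confirm that this code, coupled with the ByteBufferChannel class provided by Botje, works for Android versions before API 24. The SeekableByteChannel interface was introduced in API 24 and is needed by the ZipFile utility.

The maximum buffer size that can be allocated is the size of a jint, due to a limitation of JNI. Larger data can be accommodated (if memory is available) but would require multiple buffers and a way to handle them.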
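
As a rough idea of what "a way to handle them" could look like, the sketch below presents several ByteBuffers as one read-only SeekableByteChannel. MultiBufferChannel is a hypothetical class, not part of the sample project, and it has not been tried with the Apache ZipFile class or with buffers that together exceed 2GB.

```java
import java.nio.ByteBuffer;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.NonWritableChannelException;
import java.nio.channels.SeekableByteChannel;

/** Read-only channel that exposes several buffers as one contiguous range of bytes. */
final class MultiBufferChannel implements SeekableByteChannel {
    private final ByteBuffer[] chunks; // Independent views of each buffer's remaining bytes
    private final long[] starts;       // Absolute offset of the first byte of each chunk
    private final long size;
    private long position;
    private boolean open = true;

    MultiBufferChannel(ByteBuffer... buffers) {
        chunks = new ByteBuffer[buffers.length];
        starts = new long[buffers.length];
        long total = 0;
        for (int i = 0; i < buffers.length; i++) {
            chunks[i] = buffers[i].slice(); // Covers [position, limit) of the caller's buffer
            starts[i] = total;
            total += chunks[i].capacity();
        }
        size = total;
    }

    @Override
    public synchronized int read(ByteBuffer dst) throws ClosedChannelException {
        ensureOpen();
        if (position >= size) return -1;
        int total = 0;
        // Keep copying across chunk boundaries until dst is full or the data ends.
        while (dst.hasRemaining() && position < size) {
            int i = chunkIndex(position);
            int offset = (int) (position - starts[i]);
            int count = Math.min(dst.remaining(), chunks[i].capacity() - offset);
            ByteBuffer view = chunks[i].duplicate();
            view.limit(offset + count);
            view.position(offset);
            dst.put(view);
            position += count;
            total += count;
        }
        return total;
    }

    /** Finds the chunk containing pos; callers guarantee 0 <= pos < size. */
    private int chunkIndex(long pos) {
        for (int i = 0; i < chunks.length; i++) {
            if (pos < starts[i] + chunks[i].capacity()) return i;
        }
        throw new IllegalStateException("position is past the end of the data");
    }

    @Override
    public synchronized long position() throws ClosedChannelException {
        ensureOpen();
        return position;
    }

    @Override
    public synchronized MultiBufferChannel position(long newPosition) throws ClosedChannelException {
        ensureOpen();
        if (newPosition < 0) throw new IllegalArgumentException("negative position");
        position = newPosition; // Positions past the end are allowed; reads then return -1
        return this;
    }

    @Override
    public synchronized long size() throws ClosedChannelException {
        ensureOpen();
        return size;
    }

    @Override
    public int write(ByteBuffer src) { throw new NonWritableChannelException(); }

    @Override
    public MultiBufferChannel truncate(long newSize) { throw new NonWritableChannelException(); }

    @Override
    public synchronized boolean isOpen() { return open; }

    @Override
    public synchronized void close() { open = false; }

    private void ensureOpen() throws ClosedChannelException {
        if (!open) throw new ClosedChannelException();
    }
}
```

Each native buffer stays within the jint limit, while the channel tracks the combined position as a long. Freeing the underlying native buffers remains the caller's job, and must happen only after the channel is no longer in use.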

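Here is the main activity for the sample app. An earlier version assumed that the InputStream read buffer was always filled, and errored out when trying to put it into the ByteBuffer. This has been fixed.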

MainActivity.kt

class MainActivity : AppCompatActivity() {
    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        setContentView(R.layout.activity_main)
    }

    fun onClick(view: View) {
        button.isEnabled = false
        status.text = getString(R.string.running)

        thread {
            printMemStats("Before buffer allocation:")
            var bufferSize = 0L
            // testzipfile.zip is not part of the project but any zip can be uploaded through the
            // device file manager or adb to test.
            val fileToRead = "$filesDir/testzipfile.zip"
            val inStream =
                if (File(fileToRead).exists()) {
                    FileInputStream(fileToRead).apply {
                        bufferSize = getFileSize(this)
                        close()
                    }
                    FileInputStream(fileToRead)
                } else {
                    // If testzipfile.zip doesn't exist, we will just look at this one which
                    // is part of the APK.
                    resources.openRawResource(R.raw.appapk).apply {
                        bufferSize = getFileSize(this)
                        close()
                    }
                    resources.openRawResource(R.raw.appapk)
                }
            // Allocate the buffer in native memory (off-heap).
            val jniByteArrayHolder = JniByteArrayHolder()
            val byteBuffer =
                if (bufferSize != 0L) {
                    jniByteArrayHolder.allocate(bufferSize)?.apply {
                        printMemStats("After buffer allocation")
                    }
                } else {
                    null
                }

            if (byteBuffer == null) {
                Log.d("Applog", "Failed to allocate $bufferSize bytes of native memory.")
            } else {
                Log.d("Applog", "Allocated ${Formatter.formatFileSize(this, bufferSize)} buffer.")
                val inBytes = ByteArray(4096)
                Log.d("Applog", "Starting buffered read...")
                // read() may return fewer bytes than requested, and -1 at end of stream.
                while (true) {
                    val read = inStream.read(inBytes)
                    if (read == -1) break
                    byteBuffer.put(inBytes, 0, read)
                }
                inStream.close()
                byteBuffer.flip()
                ZipFile(ByteBufferChannel(byteBuffer)).use {
                    Log.d("Applog", "Starting Zip file name dump...")
                    for (entry in it.entries) {
                        Log.d("Applog", "Zip name: ${entry.name}")
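                        // Read each entry to the end to exercise decompression, then close it.
                        it.getInputStream(entry).use { zis ->
                            while (zis.read(inBytes) != -1) {
                                // Discard the data; this demo only logs the names.
                            }
                        }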
                    }
                }
                printMemStats("Before buffer release:")
                jniByteArrayHolder.freeBuffer(byteBuffer)
                printMemStats("After buffer release:")
            }
            runOnUiThread {
                status.text = getString(R.string.idle)
                button.isEnabled = true
                Log.d("Applog", "Done!")
            }
        }
    }

    /*
        This function is a little misleading since it does not reflect the true status of memory.
        After native buffer allocation, it waits until the memory is used before counting it as
        used. After release, it doesn't seem to count the memory as released until garbage
        collection. (My observations only.) Also, see the comment for memset() in native-lib.cpp
        which is a member of this project.
    */
    private fun printMemStats(desc: String? = null) {
        val memoryInfo = ActivityManager.MemoryInfo()
        (getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager).getMemoryInfo(memoryInfo)
        val nativeHeapSize = memoryInfo.totalMem
        val nativeHeapFreeSize = memoryInfo.availMem
        val usedMemInBytes = nativeHeapSize - nativeHeapFreeSize
        val usedMemInPercentage = usedMemInBytes * 100 / nativeHeapSize
        val sDesc = desc?.run { "$this:\n" }
        Log.d(
            "AppLog", "$sDesc total:${Formatter.formatFileSize(this, nativeHeapSize)} " +
                    "free:${Formatter.formatFileSize(this, nativeHeapFreeSize)} " +
                    "used:${Formatter.formatFileSize(this, usedMemInBytes)} ($usedMemInPercentage%)"
        )
    }

    // Not a great way to do this but not the object of the demo.
    private fun getFileSize(inStream: InputStream): Long {
        var bufferSize = 0L
        while (inStream.available() > 0) {
            val toSkip = inStream.available().toLong()
            inStream.skip(toSkip)
            bufferSize += toSkip
        }
        return bufferSize
    }
}
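
The short-read fix mentioned above can also be isolated in a small helper. The sketch below is in plain Java with hypothetical names (StreamToBuffer, fill); it relies only on the return value of read() rather than on available(), and fails with an IOException if the stream holds more data than the buffer can take.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

final class StreamToBuffer {
    /**
     * Copies everything from in into dst, coping with reads that return fewer bytes
     * than requested, then flips dst so it is ready to be read from position 0.
     * Returns the number of bytes copied.
     */
    static int fill(InputStream in, ByteBuffer dst) throws IOException {
        byte[] chunk = new byte[8192];
        int total = 0;
        int read;
        // read() returns how many bytes it actually delivered, or -1 at end of stream.
        while ((read = in.read(chunk)) != -1) {
            if (read > dst.remaining()) {
                throw new IOException("stream is larger than the buffer");
            }
            dst.put(chunk, 0, read);
            total += read;
        }
        dst.flip();
        return total;
    }
}
```

The same helper works for a buffer returned by the JNI allocate() function above or for any other ByteBuffer, since it only uses the standard put() and flip() calls.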

A sample GitHub repository is available (linked from the original answer).
