How to parse a ZIP file from an InputStream that is generated on demand, without holding a large byte array?
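Background
I've been trying to figure out how to handle problematic ZIP files by reading them through a stream.
The reasons:
- Some ZIP files don't come from a file path. Some come from a Uri, and some are even inside another ZIP file.
- Some ZIP files are quite problematic to open, so, together with the previous point, what the framework provides can't always be used. An example is the "XAPK" files from the APKPure website (example here).
As one of the possible solutions I looked into, I asked about allocating memory via JNI, to hold the entire ZIP file in RAM while using Apache's ZipFile class, which can handle a zip file in various ways and not only via a file path.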
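The problem
Something like this seems to work very well, but it has some problems:
- You don't always have the memory available.
- You can't know for sure how much memory you are allowed to allocate without crashing the app.
- If you accidentally choose to allocate too much memory, you can't catch it, and the app will crash. It's not like Java, where you can safely use try-catch and save yourself from an OOM (if I'm wrong and you can, please let me know, because that's a very good thing to know about JNI).
So, assuming that you can always create an InputStream (via a Uri or from within an existing zip file), how can you parse it as a zip file?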
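What I've found
I've made a working sample that can do this, by using Apache's ZipFile and letting it traverse the ZIP file as if everything were in memory.
Every time it asks to read some bytes from some position, I re-create the InputStream.
It works fine, but the problem is that I can't optimize it well to minimize the number of times I re-create the InputStream. I tried to at least cache the current InputStream and re-use it when possible (skipping forward from the current position if needed), re-creating it only when the required position is before the current one. Sadly, this fails in some cases (such as the XAPK file I mentioned above), because it causes an EOFException.
Currently I use it only for a Uri, but a similar solution can be made for an InputStream that is re-generated from within another zip file.
Here's the inefficient solution (sample available here, including both the inefficient solution and the one I tried to improve), which seems to always work: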
InefficientSeekableInputStreamByteChannel.kt
@RequiresApi(Build.VERSION_CODES.N)
abstract class InefficientSeekableInputStreamByteChannel : SeekableByteChannel {
    private var position: Long = 0L
    private var cachedSize: Long = -1L
    private var buffer = ByteArray(DEFAULT_BUFFER_SIZE)

    abstract fun getNewInputStream(): InputStream

    override fun isOpen(): Boolean = true

    override fun position(): Long = position

    override fun position(newPosition: Long): SeekableByteChannel {
        // Log.d("AppLog", "position $newPosition")
        require(newPosition >= 0L) { "Position has to be positive" }
        position = newPosition
        return this
    }

    open fun calculateSize(): Long {
        return getNewInputStream().use { inputStream: InputStream ->
            if (inputStream is FileInputStream)
                return inputStream.channel.size()
            var bytesCount = 0L
            while (true) {
                val available = inputStream.available()
                if (available == 0)
                    break
                val skip = inputStream.skip(available.toLong())
                if (skip < 0)
                    break
                bytesCount += skip
            }
            bytesCount
        }
    }

    final override fun size(): Long {
        if (cachedSize < 0L)
            cachedSize = calculateSize()
        // Log.d("AppLog", "size $cachedSize")
        return cachedSize
    }

    override fun close() {
    }

    override fun read(buf: ByteBuffer): Int {
        var wanted: Int = buf.remaining()
        // Log.d("AppLog", "read wanted:$wanted")
        if (wanted <= 0)
            return wanted
        val possible = (calculateSize() - position).toInt()
        if (possible <= 0)
            return -1
        if (wanted > possible)
            wanted = possible
        if (buffer.size < wanted)
            buffer = ByteArray(wanted)
        getNewInputStream().use { inputStream ->
            inputStream.skip(position)
            //now we have an inputStream right on the needed position
            inputStream.readBytesIntoByteArray(buffer, wanted)
        }
        buf.put(buffer, 0, wanted)
        position += wanted
        return wanted
    }

    //not needed, because we don't store anything in memory
    override fun truncate(size: Long): SeekableByteChannel = this

    override fun write(src: ByteBuffer?): Int {
        //not needed, we read only
        throw NotImplementedError()
    }
}
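A side note on the fallback loop in calculateSize(): InputStream.available() is only an estimate of how many bytes can be read without blocking, and it may return 0 before the end of the stream, so that loop could under-count for some stream types. Below is a minimal sketch of a slower but more dependable fallback that reads until read() reports the end of the stream. The name countRemainingBytes is made up for this illustration and is not part of the sample project.

```kotlin
import java.io.ByteArrayInputStream
import java.io.InputStream

/** Counts the bytes left in the stream by reading until read() returns -1. */
fun InputStream.countRemainingBytes(): Long {
    val scratch = ByteArray(8 * 1024) // contents are discarded, only the count matters
    var total = 0L
    while (true) {
        val read = read(scratch)
        if (read == -1) break
        total += read
    }
    return total
}

fun main() {
    // An in-memory stream stands in for the stream opened from a Uri
    val size = ByteArrayInputStream(ByteArray(100_000)).use { it.countRemainingBytes() }
    println(size) // prints 100000
}
```

This costs one full pass over the data, so it probably only makes sense when the source cannot report its length in a cheaper way.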
InefficientSeekableInUriByteChannel.kt
@RequiresApi(Build.VERSION_CODES.N)
class InefficientSeekableInUriByteChannel(someContext: Context, private val uri: Uri) : InefficientSeekableInputStreamByteChannel() {
    private val applicationContext = someContext.applicationContext

    override fun calculateSize(): Long = StreamsUtil.getStreamLengthFromUri(applicationContext, uri)

    override fun getNewInputStream(): InputStream = BufferedInputStream(
            applicationContext.contentResolver.openInputStream(uri)!!)
}
Usage:
val file = ...
val uri = Uri.fromFile(file)
parseUsingInefficientSeekableInUriByteChannel(uri)
...
private fun parseUsingInefficientSeekableInUriByteChannel(uri: Uri): Boolean {
    Log.d("AppLog", "testing using SeekableInUriByteChannel (re-creating inputStream when needed) ")
    try {
        val startTime = System.currentTimeMillis()
        ZipFile(InefficientSeekableInUriByteChannel(this, uri)).use { zipFile: ZipFile ->
            val entriesNamesAndSizes = ArrayList<Pair<String, Long>>()
            for (entry in zipFile.entries) {
                val name = entry.name
                val size = entry.size
                entriesNamesAndSizes.add(Pair(name, size))
                Log.v("Applog", "entry name: $name - ${numberFormat.format(size)}")
            }
            val endTime = System.currentTimeMillis()
            Log.d("AppLog", "got ${entriesNamesAndSizes.size} entries data. time taken: ${endTime - startTime}ms")
            return true
        }
    } catch (e: Throwable) {
        Log.e("AppLog", "error while trying to parse using SeekableInUriByteChannel:$e")
        e.printStackTrace()
    }
    return false
}
Here's my attempt at improving it, which doesn't work in some cases:
SeekableInputStreamByteChannel.kt
@RequiresApi(Build.VERSION_CODES.N)
abstract class SeekableInputStreamByteChannel : SeekableByteChannel {
    private var position: Long = 0L
    private var actualPosition: Long = 0L
    private var cachedSize: Long = -1L
    private var inputStream: InputStream? = null
    private var buffer = ByteArray(DEFAULT_BUFFER_SIZE)

    abstract fun getNewInputStream(): InputStream

    override fun isOpen(): Boolean = true

    override fun position(): Long = position

    override fun position(newPosition: Long): SeekableByteChannel {
        // Log.d("AppLog", "position $newPosition")
        require(newPosition >= 0L) { "Position has to be positive" }
        position = newPosition
        return this
    }

    open fun calculateSize(): Long {
        return getNewInputStream().use { inputStream: InputStream ->
            if (inputStream is FileInputStream)
                return inputStream.channel.size()
            var bytesCount = 0L
            while (true) {
                val available = inputStream.available()
                if (available == 0)
                    break
                val skip = inputStream.skip(available.toLong())
                if (skip < 0)
                    break
                bytesCount += skip
            }
            bytesCount
        }
    }

    final override fun size(): Long {
        if (cachedSize < 0L)
            cachedSize = calculateSize()
        // Log.d("AppLog", "size $cachedSize")
        return cachedSize
    }

    override fun close() {
        inputStream.closeSilently().also { inputStream = null }
    }

    override fun read(buf: ByteBuffer): Int {
        var wanted: Int = buf.remaining()
        // Log.d("AppLog", "read wanted:$wanted")
        if (wanted <= 0)
            return wanted
        val possible = (calculateSize() - position).toInt()
        if (possible <= 0)
            return -1
        if (wanted > possible)
            wanted = possible
        if (buffer.size < wanted)
            buffer = ByteArray(wanted)
        var inputStream = this.inputStream
        //skipping to required position
        if (inputStream == null) {
            inputStream = getNewInputStream()
            // Log.d("AppLog", "getNewInputStream")
            inputStream.skip(position)
            this.inputStream = inputStream
        } else {
            if (actualPosition > position) {
                inputStream.close()
                actualPosition = 0L
                inputStream = getNewInputStream()
                // Log.d("AppLog", "getNewInputStream")
                this.inputStream = inputStream
            }
            inputStream.skip(position - actualPosition)
        }
        //now we have an inputStream right on the needed position
        inputStream.readBytesIntoByteArray(buffer, wanted)
        buf.put(buffer, 0, wanted)
        position += wanted
        actualPosition = position
        return wanted
    }

    //not needed, because we don't store anything in memory
    override fun truncate(size: Long): SeekableByteChannel = this

    override fun write(src: ByteBuffer?): Int {
        //not needed, we read only
        throw NotImplementedError()
    }
}
SeekableInUriByteChannel.kt
@RequiresApi(Build.VERSION_CODES.N)
class SeekableInUriByteChannel(someContext: Context, private val uri: Uri) : SeekableInputStreamByteChannel() {
    private val applicationContext = someContext.applicationContext

    override fun calculateSize(): Long = StreamsUtil.getStreamLengthFromUri(applicationContext, uri)

    override fun getNewInputStream(): InputStream = BufferedInputStream(
            applicationContext.contentResolver.openInputStream(uri)!!)
}
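The questions
Is there a way to improve it?
Perhaps by re-creating the InputStream as few times as possible?
Are there more possible optimizations that would let it parse ZIP files well? Perhaps some caching of the data?
I ask because it seems quite slow compared to other solutions I've found, and I think this could help.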
The skip() method of BufferedInputStream doesn't always skip all of the bytes you specify. In SeekableInputStreamByteChannel, change the following code
inputStream.skip(position - actualPosition)
to
var bytesToSkip = position - actualPosition
while (bytesToSkip > 0) {
    bytesToSkip -= inputStream.skip(bytesToSkip)
}
That should work.
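One caveat with a loop like this: skip() is allowed to return 0, for example at the end of the stream, and in that case the loop would never finish. Here is a minimal sketch of a stricter variant, assuming it is acceptable to throw an EOFException when the stream ends early. The extension name skipFully is made up for this illustration.

```kotlin
import java.io.ByteArrayInputStream
import java.io.EOFException
import java.io.InputStream

/**
 * Skips exactly [count] bytes, or throws EOFException if the stream ends first.
 * When skip() makes no progress, a single read() tells "nothing skipped yet"
 * apart from "end of stream".
 */
fun InputStream.skipFully(count: Long) {
    var remaining = count
    while (remaining > 0) {
        val skipped = skip(remaining)
        if (skipped > 0) {
            remaining -= skipped
        } else {
            // skip() made no progress: consume one byte to detect the end of the stream
            if (read() == -1)
                throw EOFException("Stream ended with $remaining bytes left to skip")
            remaining--
        }
    }
}

fun main() {
    val data = ByteArray(100) { it.toByte() }
    val stream = ByteArrayInputStream(data)
    stream.skipFully(42)
    println(stream.read()) // prints 42, the byte stored at index 42
}
```

In the channel above, the call would then look like inputStream.skipFully(position - actualPosition), and the same applies to the initial inputStream.skip(position).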
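As for improving efficiency: the first thing ZipFile does is seek to the end of the file to get the central directory (CD). With the CD in hand, ZipFile knows the layout of the zip file. The zip entries should be listed in the same order in which they are laid out in the file. I would read the files you want in that same order, to avoid backtracking. If you can't guarantee the read order, multiple input streams may make sense.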
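To illustrate why the read order matters, here is a small self-contained sketch that counts how often a forward-only stream has to be re-created. ForwardReader and the offsets are hypothetical stand-ins for the channel and for entry offsets taken from the CD; an in-memory array replaces the Uri stream.

```kotlin
import java.io.ByteArrayInputStream
import java.io.InputStream

/**
 * Reads ranges from a stream that can only move forward. A backward seek
 * forces the stream to be re-created; a forward seek reuses the open one.
 */
class ForwardReader(private val newStream: () -> InputStream) {
    private var stream: InputStream? = null
    private var streamPosition = 0L
    var reopenCount = 0
        private set

    fun read(offset: Long, length: Int): ByteArray {
        // Reuse the open stream only if it has not already passed the offset
        val current = stream?.takeIf { streamPosition <= offset } ?: run {
            stream?.close()
            reopenCount++
            streamPosition = 0L
            newStream().also { stream = it }
        }
        // Move forward to the requested offset, discarding bytes on the way
        while (streamPosition < offset) {
            if (current.read() == -1) break
            streamPosition++
        }
        val result = ByteArray(length)
        var filled = 0
        while (filled < length) {
            val n = current.read(result, filled, length - filled)
            if (n == -1) break
            filled += n
        }
        streamPosition += filled
        return result.copyOf(filled)
    }
}

fun main() {
    val data = ByteArray(1000) { it.toByte() }
    val offsets = listOf(700L, 400L, 100L) // hypothetical entry offsets

    val unordered = ForwardReader { ByteArrayInputStream(data) }
    offsets.forEach { unordered.read(it, 50) }

    val ordered = ForwardReader { ByteArrayInputStream(data) }
    offsets.sorted().forEach { ordered.read(it, 50) }

    println("unordered: ${unordered.reopenCount} streams") // prints "unordered: 3 streams"
    println("ordered: ${ordered.reopenCount} streams")     // prints "ordered: 1 streams"
}
```

With Apache's ZipFile, getEntriesInPhysicalOrder() may be a convenient way to get the entries sorted by their position in the file, though the initial jump to the CD at the end still costs one pass through the stream.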