Read scattered data from multiple files in Java

Question

I am writing a reader/writer for DNG/TIFF files. Since there are generally several options for working with files (FileInputStream, FileChannel, RandomAccessFile), I am wondering which strategy fits my needs.

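A DNG/TIFF file consists of:

some (5-20) small blocks (tens to hundreds of bytes)
very few (1-3) large contiguous blocks of image data (up to 100 MiB)
several (perhaps 20-50) very small blocks (4-16 bytes)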

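The overall file size ranges from 15 MiB (compressed 14-bit raw data) to about 100 MiB (uncompressed float data). The number of files to process is 50-400.

There are two usage patterns:

Reading all metadata from all files (everything except the image data)
Reading all image data from all files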

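I am currently using a FileChannel and calling map() to obtain a MappedByteBuffer covering the whole file. This seems quite wasteful if I am only interested in reading the metadata. Another problem is freeing the mapped memory: when I pass slices of the mapped buffer around for parsing etc., the underlying MappedByteBuffer will not be collected.

I have now decided to copy the smaller chunks out of the FileChannel using the various read() methods and to map only the large raw-data regions. The downside is that reading a single value looks extremely complex, because there is no readShort() or the like: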

short readShort(long offset) throws IOException, InterruptedException {
    return read(offset, Short.BYTES).getShort();
}

ByteBuffer read(long offset, long byteCount) throws IOException, InterruptedException {
    ByteBuffer buffer = ByteBuffer.allocate(Math.toIntExact(byteCount));
    buffer.order(GenericTiffFileReader.this.byteOrder);
    GenericTiffFileReader.this.readInto(buffer, offset);
    return buffer;
}

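// Fills the buffer completely from the channel, starting at startOffset,
// then flips it so it is ready for reading.
private void readInto(ByteBuffer buffer, long startOffset)
        throws IOException, InterruptedException {

    long offset = startOffset;
    while (buffer.hasRemaining()) {
        // Positional read: does not change the channel's own position.
        int bytesRead = this.channel.read(buffer, offset);
        switch (bytesRead) {
        case 0:
            // Nothing was read; back off briefly before retrying.
            Thread.sleep(10);
            break;
        case -1:
            throw new EOFException("unexpected end of file");
        default:
            offset += bytesRead;
        }
    }
    buffer.flip();
}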
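
For the other half of that plan, mapping only the large image region, a minimal sketch might look like the following. The class and method names are hypothetical, and in a real reader the offset and length would come from the parsed TIFF metadata rather than from constants.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class RegionMapping {

    /** Maps only [offset, offset + length) of the file, read-only, in the given byte order. */
    static ByteBuffer mapRegion(Path file, long offset, long length, ByteOrder order)
            throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer region = channel.map(FileChannel.MapMode.READ_ONLY, offset, length);
            // A mapping remains valid after the channel that created it is closed.
            return region.order(order);
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical stand-in for a TIFF: 16 bytes of "metadata", then 8 bytes of "image data".
        Path file = Files.createTempFile("region", ".bin");
        file.toFile().deleteOnExit();
        byte[] content = new byte[24];
        content[16] = 0x34;
        content[17] = 0x12;
        Files.write(file, content);

        ByteBuffer image = mapRegion(file, 16, 8, ByteOrder.LITTLE_ENDIAN);
        // Index 0 of the mapped region corresponds to file offset 16.
        System.out.println(Integer.toHexString(image.getShort(0)));
    }
}
```

Indices into the returned buffer are relative to the start of the mapped region, not to the start of the file.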

RandomAccessFile provides useful methods like readShort() or readFully(), but cannot handle little-endian byte order.
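
One possible workaround, shown here only as a sketch, is to let RandomAccessFile decode the value as big-endian and then swap the bytes with Short.reverseBytes() or Integer.reverseBytes() when the file is little-endian. The helper class and its method signatures are hypothetical.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteOrder;

final class EndianReads {

    /** Reads a 16-bit value at offset; RandomAccessFile decodes big-endian, so swap if needed. */
    static short readShort(RandomAccessFile raf, long offset, ByteOrder order) throws IOException {
        raf.seek(offset);
        short value = raf.readShort();
        return order == ByteOrder.LITTLE_ENDIAN ? Short.reverseBytes(value) : value;
    }

    /** Reads a 32-bit value at offset, swapping bytes for little-endian files. */
    static int readInt(RandomAccessFile raf, long offset, ByteOrder order) throws IOException {
        raf.seek(offset);
        int value = raf.readInt();
        return order == ByteOrder.LITTLE_ENDIAN ? Integer.reverseBytes(value) : value;
    }
}
```

This keeps readFully() and friends available, at the cost of threading the byte order through every call.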

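So, is there an idiomatic way to handle scattered reads of single bytes and huge blocks? Is memory-mapping an entire 100 MiB file just to read a few hundred bytes wasteful or slow?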

Answer 1

OK, I finally did some rough benchmarks:

Flush all read caches: echo 3 > /proc/sys/vm/drop_caches
Repeat 8 times: read 8 bytes 1000 times from each file (about 20 files, from 20 MiB to 1 GiB).

The sum of the file sizes exceeded my installed system memory.
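
The timing loop itself is not shown in this answer. A rough harness along these lines would produce one wall-clock figure per round; the ReadMethod interface and class name are assumptions that merely match the shape of the methods listed here, and the page cache has to be dropped outside the JVM (as root) before the first round.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

final class RoughBenchmark {

    /** Shape shared by the read methods being compared (hypothetical interface). */
    interface ReadMethod {
        long run(Path file, long dummyUsage) throws IOException;
    }

    /** Runs the method over all files `rounds` times; returns wall-clock milliseconds per round. */
    static long[] time(ReadMethod method, List<Path> files, int rounds) throws IOException {
        long[] millis = new long[rounds];
        long dummyUsage = 0;
        for (int round = 0; round < rounds; round++) {
            long start = System.nanoTime();
            for (Path file : files) {
                dummyUsage = method.run(file, dummyUsage);
            }
            millis[round] = (System.nanoTime() - start) / 1_000_000;
        }
        // Print the accumulated value so the reads cannot be discarded as dead code.
        System.out.println("dummyUsage = " + dummyUsage);
        return millis;
    }
}
```

This is deliberately crude: there is no JIT warm-up phase, which is one reason the first round is expected to be much slower than the rest even apart from the cold page cache.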

Method 1, FileChannel and temporary ByteBuffers:

private static long method1(Path file, long dummyUsage) throws IOException, Error {
    try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {

        for (int i = 0; i < 1000; i++) {
            ByteBuffer dst = ByteBuffer.allocate(8);

            if (channel.position(i * 10000).read(dst) != dst.capacity())
                throw new Error("partial read");
            dst.flip();
            dummyUsage += dst.order(ByteOrder.LITTLE_ENDIAN).getInt();
            dummyUsage += dst.order(ByteOrder.BIG_ENDIAN).getInt();
        }
    }
    return dummyUsage;
}

Results:

1. 3422 ms
2. 56 ms
3. 24 ms
4. 24 ms
5. 27 ms
6. 25 ms
7. 23 ms
8. 23 ms

Method 2, MappedByteBuffer covering the entire file:

private static long method2(Path file, long dummyUsage) throws IOException {

    final MappedByteBuffer buffer;
    try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
        buffer = channel.map(MapMode.READ_ONLY, 0L, Files.size(file));
    }
    for (int i = 0; i < 1000; i++) {
        dummyUsage += buffer.order(ByteOrder.LITTLE_ENDIAN).getInt(i * 10000);
        dummyUsage += buffer.order(ByteOrder.BIG_ENDIAN).getInt(i * 10000 + 4);
    }
    return dummyUsage;
}

Results:

1. 749 ms
2. 21 ms
3. 17 ms
4. 16 ms
5. 18 ms
6. 13 ms
7. 15 ms
8. 17 ms

Method 3, RandomAccessFile:

private static long method3(Path file, long dummyUsage) throws IOException {

    try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
        for (int i = 0; i < 1000; i++) {

            raf.seek(i * 10000);
            dummyUsage += Integer.reverseBytes(raf.readInt());
            raf.seek(i * 10000 + 4);
            dummyUsage += raf.readInt();
        }
    }
    return dummyUsage;
}

Results:

1. 3479 ms
2. 104 ms
3. 81 ms
4. 84 ms
5. 78 ms
6. 81 ms
7. 81 ms
8. 81 ms

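Conclusion: The MappedByteBuffer method occupies more page cache memory (340 MB instead of 140 MB), but performs significantly better on the first and all following runs and seems to have the lowest overhead. As a bonus, this approach provides a really comfortable interface regarding byte order, scattered small data and huge data blocks. RandomAccessFile performs worst.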

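To answer my own question: a MappedByteBuffer covering the entire file seems to be the idiomatic and fastest way to handle random access to big files without wasting memory.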
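
To illustrate what that interface can look like for TIFF, here is a minimal sketch with hypothetical class and method names. It relies on the TIFF header convention that the first two bytes are "II" for little-endian or "MM" for big-endian, followed by the 16-bit magic number 42. Note that duplicate() and slice() always return big-endian buffers, so the order has to be re-applied on every view.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappedTiffHeader {

    /** Maps the whole file read-only; the mapping outlives the closed channel. */
    static ByteBuffer mapWholeFile(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            return channel.map(FileChannel.MapMode.READ_ONLY, 0L, channel.size());
        }
    }

    /** Sets the buffer's byte order from the TIFF header: "II" = little-endian, "MM" = big-endian. */
    static ByteBuffer applyTiffByteOrder(ByteBuffer buffer) throws IOException {
        byte b0 = buffer.get(0);
        byte b1 = buffer.get(1);
        if (b0 == 'I' && b1 == 'I') {
            return buffer.order(ByteOrder.LITTLE_ENDIAN);
        }
        if (b0 == 'M' && b1 == 'M') {
            return buffer.order(ByteOrder.BIG_ENDIAN);
        }
        throw new IOException("missing TIFF byte-order mark");
    }

    /** Returns a view of [offset, offset + length) that shares the mapped memory. */
    static ByteBuffer view(ByteBuffer buffer, int offset, int length) {
        ByteBuffer dup = buffer.duplicate(); // independent position/limit, same memory
        dup.position(offset);
        dup.limit(offset + length);
        return dup.slice().order(buffer.order()); // slice() resets the order, so restore it
    }
}
```

Small values can then be read with absolute getters such as getShort(2) or getInt(4), while view() hands out a large image block without copying. As the question points out, any such view keeps the whole mapping reachable until the view itself is unreferenced.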

Tags: java, memory-mapped-files, randomaccessfile, filechannel