How to optimize binary file (over 1MB) read in C?

Question

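I need to read two 1MB+ binary files byte by byte and compare them. If they are not equal, I need to print the next 16 bytes, starting at the first byte that differs. The requirement is that it all runs in under 5 ms. Currently my program takes 19 ms if the differing bit is at the end of the two files. Are there any suggestions on how to optimize it?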

#include <stdio.h>  //printf
#include <unistd.h> //read, access
#include <fcntl.h>  //open
#include <stdlib.h> //exit()
#include <time.h>   //clock

#define SIZE 4096

void compare_binary(int fd1, int fd2)
{   
    int cmpflag = 0;
    int errorbytes = 1;
    char c1[SIZE], c2[SIZE];
    int numberofbytesread = 1;

    while(read(fd1, &c1, SIZE) == SIZE && read(fd2, &c2, SIZE) == SIZE && errorbytes < 17){
        for (int i=0 ; i < SIZE ; i++) {
            if (c1[i] != c2[i] && cmpflag == 0){
                printf("Bytes not matching at offset %d\n",numberofbytesread);
                cmpflag = 1;
            }
            if (cmpflag == 1){
                printf("Byte Output %d: 0x%02x 0x%02x\n", errorbytes, c1[i], c2[i]);
                errorbytes++;
            }
            if (errorbytes > 16){
                break;
            }
            numberofbytesread++;
        }
    }
}

int main(int argc, char *argv[])
{
    int fd[2];
    if (argc < 3){
        printf("Check the number of arguments passed.\n");
        printf("Usage: ./compare_binary <binaryfile1> <binaryfile2>\n");
        exit(0);
    }
    if (!((access(argv[1], F_OK) == 0) && (access(argv[2], F_OK) == 0))){
        printf("Please check if the files passed in the argument exist.\n");
        exit(0);
    }

    fd[0] = open(argv[1], O_RDONLY);
    fd[1] = open(argv[2], O_RDONLY);

    if (fd[0] < 0 || fd[1] < 0){
        printf("Can't open file.\n");
        exit(0);
    }
    clock_t t;
    t = clock();
    compare_binary(fd[0], fd[1]);
    t = clock() - t;
    double time_taken = ((double)t)/(CLOCKS_PER_SEC/1000);
    printf("compare_binary took %f milliseconds to execute \n", time_taken);
}

Basically, I need an optimized way to read binary files over 1MB so that the whole comparison finishes in under 5 ms.

Answer 1

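First, try reading larger blocks. There is no point in making so many read calls when you can read everything at once. Using 2 MB of memory is not a big deal nowadays. Disk I/O calls are inherently expensive and their per-call overhead is significant, but it can be reduced.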

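Second, try comparing integers (or even 64-bit longs) instead of bytes in each iteration; this significantly reduces the number of loop iterations you need. Once you find a mismatch, you can still switch to a byte-by-byte comparison. (Of course, some extra care is needed if the file length is not a multiple of 4 or 8.)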
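A minimal sketch of that word-at-a-time comparison, assuming both files are already in memory as byte buffers of equal length. The function name `first_mismatch` is made up for illustration. It compares 8 bytes per iteration, then falls back to single bytes to pin down the exact offset and to cover a length that is not a multiple of 8.

```c
#include <stddef.h> // size_t
#include <stdint.h> // uint64_t
#include <string.h> // memcpy

// Hypothetical helper: index of the first differing byte in a and b,
// or n if the first n bytes are identical.
size_t first_mismatch(const unsigned char *a, const unsigned char *b, size_t n)
{
    size_t i = 0;

    // Compare 8 bytes per iteration. Copying into a uint64_t with memcpy
    // avoids unaligned or aliasing-violating loads; optimizing compilers
    // typically reduce it to a plain 64-bit load.
    for (; i + 8 <= n; i += 8) {
        uint64_t wa, wb;
        memcpy(&wa, a + i, sizeof wa);
        memcpy(&wb, b + i, sizeof wb);
        if (wa != wb)
            break; // the mismatch is somewhere in these 8 bytes
    }

    // Byte by byte: locates the mismatch inside the differing word and
    // also handles the tail when n is not a multiple of 8.
    for (; i < n; i++) {
        if (a[i] != b[i])
            return i;
    }
    return n;
}
```

If the returned index is less than `n`, the caller can print up to 16 bytes from that offset, clamped to the end of the buffer.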

Answer 2

The first thing that caught my eye is this:

    if (cmpflag == 1){
        printf("Byte Output %d: 0x%02x 0x%02x\n", errorbytes, c1[i], c2[i]);
        errorbytes++;
    }
    if (errorbytes > 16){
        break;
    }

Your cmpflag check looks unnecessary; maybe something like this gives a small optimization:

    if (c1[i] != c2[i] && cmpflag == 0){
        printf("Bytes not matching at offset %d\n",numberofbytesread);
        printf("Byte Output %d: 0x%02x 0x%02x\n", errorbytes, c1[i], c2[i]);
        errorbytes++;
        if (errorbytes > 16){
            break;
        }
    }

You could use a built-in function for the array comparison, or you could also increase the buffer size.
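The answer does not name the built-in function; assuming it means the standard `memcmp`, a sketch could look like the following. The names `find_first_difference`, `dump_16` and the `BLOCK` constant are hypothetical. `memcmp` skips identical blocks quickly, and only the block that differs is scanned byte by byte.

```c
#include <stddef.h> // size_t
#include <stdio.h>  // printf
#include <string.h> // memcmp

#define BLOCK 4096 // assumed block size, same value as SIZE in the question

// Hypothetical helper: offset of the first differing byte, or n if the
// first n bytes of a and b are equal.
size_t find_first_difference(const unsigned char *a, const unsigned char *b,
                             size_t n)
{
    for (size_t off = 0; off < n; off += BLOCK) {
        size_t len = (n - off < BLOCK) ? n - off : BLOCK;
        if (memcmp(a + off, b + off, len) != 0) {
            // Only this block differs, so scan it byte by byte.
            for (size_t i = 0; i < len; i++) {
                if (a[off + i] != b[off + i])
                    return off + i;
            }
        }
    }
    return n;
}

// Hypothetical helper: print up to 16 byte pairs starting at offset off.
// unsigned char keeps %02x from printing sign-extended values.
void dump_16(const unsigned char *a, const unsigned char *b, size_t n,
             size_t off)
{
    for (size_t i = 0; i < 16 && off + i < n; i++)
        printf("Byte Output %zu: 0x%02x 0x%02x\n", i + 1,
               (unsigned)a[off + i], (unsigned)b[off + i]);
}
```

A caller would run `find_first_difference` over the two buffers and, if the result is less than `n`, pass it to `dump_16`.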
