Which is better in the following case: fread() or mmap()?

Question

I want to read two files from a single process. The first file is about 2 GB and the second is about 20 MB. They look like this:

1   1217907
1   1217908
1   1517737
1   2
1   3
1   4
1   5

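My plan is to read the whole file into memory, then use strtok_r() to extract each number and store the values in an array of structs. Four processes may read these two files at almost the same time on the same machine. The machine is 64-bit, with perhaps 4 GB of physical memory or even less. My question is: which method is more efficient, fread() or mmap()?

Here is the key part of my program that reads the whole file (some people asked to see it, though I am not sure it is relevant to my question):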

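#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct My_Edge
{
    int first;
    int second;
} Edge;

Edge *myEdge;

/* wholeFile, my_edge_num, EDGE_NUM_ADJUST, p_id and partitionID
   are defined elsewhere in the program. */
int Read_Whole_File()
{
    fseek(wholeFile, 0, SEEK_END);
    long int fileSize = ftell(wholeFile);
    fseek(wholeFile, 0, SEEK_SET);
    if (fileSize < 0)
        return -1;

    /* One extra byte for the terminating NUL that strtok_r() needs. */
    char *buffer = malloc((size_t)fileSize + 1);
    if (buffer == NULL)
        return -1;

    size_t bytesRead = fread(buffer, 1, (size_t)fileSize, wholeFile);
    buffer[bytesRead] = '\0';

    char *string_first;
    char *string_second;
    char *save_ptr;

    int temp_first;
    int temp_second;

    string_first = strtok_r(buffer, " \t\n", &save_ptr);
    int i = 0;

    int temp_edge_num;
    Edge *temp_edge;

    while (string_first != NULL)
    {
        temp_first = atoi(string_first);

        string_second = strtok_r(NULL, " \t\n", &save_ptr);
        if (string_second == NULL)   /* odd number of tokens: stop */
            break;
        temp_second = atoi(string_second);

        /* Grow the array before writing past its end. */
        if (i >= my_edge_num)
        {
            temp_edge_num = i + EDGE_NUM_ADJUST;
            temp_edge = realloc(myEdge, sizeof(Edge) * temp_edge_num);
            if (temp_edge == NULL)
            {
                free(buffer);
                return -1;
            }
            myEdge = temp_edge;
            my_edge_num = temp_edge_num;
        }

        /* Keep only edges that touch this partition. */
        if ((p_id[temp_first] == *partitionID) || (p_id[temp_second] == *partitionID))
        {
            myEdge[i].first = temp_first;
            myEdge[i].second = temp_second;
            i++;
        }

        string_first = strtok_r(NULL, " \t\n", &save_ptr);
    }

    free(buffer);
    return 0;
}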

Now I am trying mmap(), but I get EXC_BAD_ACCESS when strtok_r() processes the buffer returned by mmap():

char *buffer;

struct stat fileStat;

fstat(wholeFile, &fileStat);

buffer = mmap(NULL, fileStat.st_size, PROT_READ, MAP_SHARED, wholeFile, 0);

char *string_first;
char *string_second;
char *save_ptr;

int temp_first;
int temp_second;

string_first = strtok_r(buffer, " \t\n", &save_ptr);//EXC_BAD_ACCESS here, the content of buffer is correct.

Answer 1

My question is: which method is more efficient, fread() or mmap()?

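First, let's look at how fread and mmap work on Linux.
fread:
Assume we are using the ext4 file system (without encryption). fread uses an internal buffer; if that buffer holds no data,
it calls read, read makes a system call, and after a while we end up in
fs/read_write.c::vfs_read and, after some more work, in mm/filemap.c::generic_file_read_iter.

In that function we fill the inode page cache and read the data from that page cache.

So we are doing basically the same thing as with mmap.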
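For comparison, the stdio path described above can be sketched as a streaming loop. This is only an illustration: read_edges_streaming is a hypothetical helper, the 1 MiB buffer size is an arbitrary assumption, and fscanf() stands in for whatever tokenizer the real program uses.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct { int first; int second; } Edge;

/* Hypothetical helper: stream "first second" pairs through stdio.
   Returns the number of edges stored, or (size_t)-1 on error. */
size_t read_edges_streaming(const char *path, Edge *out, size_t cap)
{
    assert(path != NULL && out != NULL);

    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return (size_t)-1;

    /* Assumed size: a larger stdio buffer means fewer read() system calls. */
    size_t bufSize = (size_t)1 << 20;
    char *ioBuf = malloc(bufSize);
    if (ioBuf != NULL)
        setvbuf(fp, ioBuf, _IOFBF, bufSize);

    /* Each fscanf() is served from the stdio buffer; stdio calls read()
       only when that buffer runs empty. */
    size_t n = 0;
    while (n < cap && fscanf(fp, "%d %d", &out[n].first, &out[n].second) == 2)
        n++;

    fclose(fp);      /* close before freeing the buffer stdio was using */
    free(ioBuf);
    return n;
}
```

Because the loop consumes the file through the stdio buffer, the process does not need its own copy of the whole 2 GB file on top of the kernel's page cache.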

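The difference is that with fread we do not use the pages directly; we only copy part of the data from the kernel's inode page cache into a user-space buffer,
whereas with mmap the page cache pages are mapped directly into the program's address space. Also, when a page is not yet in the inode page cache, fread simply reads it in, while with mmap the access causes a page fault and only then is the page read.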
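Because the mapped pages are the page cache itself, a PROT_READ mapping cannot be modified in place. strtok_r() writes NUL bytes into the string it scans, so it faults on such a mapping, and a mapping is also not guaranteed to end in a NUL byte. Below is a minimal sketch of one way to parse the mapping without writing to it, assuming non-negative decimal numbers as in the sample data. next_int, parse_edges and parse_edges_from_file are hypothetical helpers, not part of the original program.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <ctype.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

typedef struct { int first; int second; } Edge;

/* Parse one non-negative decimal integer from [*p, end) without ever
   reading at or past end. No overflow check (assumption: values fit in int).
   Returns 1 and advances *p on success, 0 if no digits were found. */
static int next_int(const char **p, const char *end, int *out)
{
    const char *s = *p;
    while (s < end && isspace((unsigned char)*s))
        s++;
    if (s == end || !isdigit((unsigned char)*s)) {
        *p = s;
        return 0;
    }
    int v = 0;
    while (s < end && isdigit((unsigned char)*s))
        v = v * 10 + (*s++ - '0');
    *p = s;
    *out = v;
    return 1;
}

/* Read "first second" pairs from a read-only byte range.
   Returns the number of edges stored in out (at most cap). */
size_t parse_edges(const char *data, size_t len, Edge *out, size_t cap)
{
    assert(data != NULL && out != NULL);
    const char *p = data;
    const char *end = data + len;
    size_t n = 0;
    while (n < cap) {
        Edge e = {0, 0};
        if (!next_int(&p, end, &e.first) || !next_int(&p, end, &e.second))
            break;
        out[n++] = e;
    }
    return n;
}

/* Map a file read-only and parse it in place.
   Returns the number of edges stored, or (size_t)-1 on error. */
size_t parse_edges_from_file(const char *path, Edge *out, size_t cap)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return (size_t)-1;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return (size_t)-1;
    }
    size_t len = (size_t)st.st_size;
    char *data = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping stays valid after close() */
    if (data == MAP_FAILED)
        return (size_t)-1;
    size_t n = parse_edges(data, len, out, cap);
    munmap(data, len);
    return n;
}
```

The parser takes an explicit length rather than relying on a terminator, so it works on a buffer filled by fread() as well as on a mapping.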

Both variants use a read-ahead strategy. A possible difference lies in the cache policy, which in the mmap case we can control with madvise and with the flags passed to mmap.
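As an illustration of such a cache hint, here is a hedged sketch that maps a file and tells the kernel the access will be sequential. madvise() and MADV_SEQUENTIAL are existing calls and constants; count_newlines_sequential is a hypothetical helper, and whether the hint helps measurably depends on the kernel and the workload.

```c
#define _DEFAULT_SOURCE   /* exposes madvise() on glibc */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: count '\n' bytes in a file through a read-only
   mapping that is marked as sequentially accessed. Returns -1 on error. */
long count_newlines_sequential(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror("open");
        return -1;
    }

    struct stat st;
    if (fstat(fd, &st) != 0) {
        perror("fstat");
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {          /* mmap() rejects a zero length */
        close(fd);
        return 0;
    }

    size_t len = (size_t)st.st_size;
    char *data = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (data == MAP_FAILED) {
        perror("mmap");
        return -1;
    }

    /* Advisory only: a failure here does not affect correctness. */
    if (madvise(data, len, MADV_SEQUENTIAL) != 0)
        perror("madvise");

    long lines = 0;
    for (size_t i = 0; i < len; i++)
        if (data[i] == '\n')
            lines++;

    munmap(data, len);
    return lines;
}
```

On Linux, the MAP_POPULATE flag to mmap() is another control of this kind; it asks the kernel to prefault the mapping up front instead of on first access.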

So I think the answer is: they are almost the same in terms of speed for a sequential read like yours.
