Usage of getc with a file
To print the contents of a file, one can use getc:
int ch;
FILE *file = fopen("file.txt", "r");
while ((ch = getc(file)) != EOF) {
// do something
}
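How efficient is the getc function? That is, how frequently does it actually do operating system calls or something that would take a non-trivial amount of time? For example, let's say I had a 10TB file -- would calling this function trillions of times be a poor way to get through data?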
That is, how frequently does it actually do operating system calls or something that would take a non-trivial amount of time?
You can look at the source code of GNU libc or of musl-libc to study the implementation of getc. You should also study the implementation of cat(1) and wc(1). Both are open source. GNU as (part of GNU binutils) is also free software (used internally by most compilations by GCC); in practice it runs very quickly and does textual manipulation (transforming textual assembler input into binary object files). You can take inspiration from its source code.
You could change the buffer size with setvbuf(3).
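As a minimal sketch of the setvbuf(3) suggestion: the function below counts the bytes of a file with getc, after giving the stream a caller-supplied buffer. The name count_bytes_getc and the idea of passing the buffer size as a parameter are assumptions made for illustration. The buffer is allocated explicitly because some C libraries ignore the size argument when the buffer pointer is NULL.

```c
#include <stdio.h>
#include <stdlib.h>

/* Counts the bytes in `path` with getc, using a stdio buffer of `bufsize`
   bytes (bufsize should be greater than zero). Returns -1 on error.
   The function name and interface are hypothetical. */
long long count_bytes_getc(const char *path, size_t bufsize)
{
    FILE *file = fopen(path, "r");
    if (file == NULL)
        return -1;

    char *buf = malloc(bufsize);
    if (buf == NULL) {
        fclose(file);
        return -1;
    }

    /* setvbuf must be called before any other operation on the stream. */
    if (setvbuf(file, buf, _IOFBF, bufsize) != 0) {
        fclose(file);
        free(buf);
        return -1;
    }

    long long count = 0;
    int ch;
    while ((ch = getc(file)) != EOF)
        count++;

    int failed = ferror(file);
    fclose(file);
    free(buf); /* the buffer must stay alive until the stream is closed */
    return failed ? -1 : count;
}
```

A call such as count_bytes_getc("file.txt", 1 << 20) would use a 1 MiB buffer; whether that is faster than the default is something to measure, not assume.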
You may want to read several bytes at once using fread(3) or fgets(3), probably in data pieces of several kilobytes.
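Reading in chunks with fread(3) could look like the sketch below, which counts newline characters as a stand-in for the "do something" step. The 64 KiB chunk size and the name count_newlines_fread are illustrative assumptions, not recommendations from the answer.

```c
#include <stdio.h>

enum { CHUNK_SIZE = 64 * 1024 }; /* assumed chunk size; tune by benchmarking */

/* Counts '\n' bytes in `path`, reading CHUNK_SIZE bytes at a time.
   Returns -1 on error. The function name is hypothetical. */
long long count_newlines_fread(const char *path)
{
    FILE *file = fopen(path, "rb");
    if (file == NULL)
        return -1;

    char chunk[CHUNK_SIZE];
    long long newlines = 0;
    size_t n;

    /* fread returns the number of bytes actually read; 0 means EOF or error. */
    while ((n = fread(chunk, 1, sizeof chunk, file)) > 0) {
        for (size_t i = 0; i < n; i++) {
            if (chunk[i] == '\n')
                newlines++;
        }
    }

    int failed = ferror(file);
    fclose(file);
    return failed ? -1 : newlines;
}
```

The inner loop works on memory that is already in the process, so the per-byte cost no longer includes a library call.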
You can also use the debugger gdb(1) or the strace(1) utility to find out when syscalls(2) are used and which ones.
For example, let's say I had a 10TB file -- would calling this function trillions of times be a poor way to get through data?
Probably not, because of the kernel's page cache.
You should profile and benchmark your program to find out its bottleneck. Most of the time, it won't be getc. See time(7) and gprof(1) (and compile all your code with GCC invoked as gcc -O3 -pg -Wall).
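For a quick benchmark before reaching for gprof(1), you could time one pass over the file with clock_gettime (one of the clocks described in time(7)). This is only a sketch: the name time_getc_pass and its interface are assumptions, and a single run says little, since the result depends heavily on whether the file is already in the page cache.

```c
#define _POSIX_C_SOURCE 200809L /* for clock_gettime */
#include <stdio.h>
#include <time.h>

/* Reads `path` byte by byte with getc and returns the elapsed wall-clock
   time in seconds, or -1.0 if the file cannot be opened. The byte count is
   stored in *bytes_out when it is not NULL. The function is hypothetical. */
double time_getc_pass(const char *path, long long *bytes_out)
{
    FILE *file = fopen(path, "r");
    if (file == NULL)
        return -1.0;

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    long long count = 0;
    int ch;
    while ((ch = getc(file)) != EOF)
        count++;

    clock_gettime(CLOCK_MONOTONIC, &end);
    fclose(file);

    if (bytes_out != NULL)
        *bytes_out = count;
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Running it several times, and comparing against the same loop with a different buffer size or with fread(3), gives a first idea of where the time goes.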
If raw input performance is critical in your program, consider also using directly and cleverly open(2), read(2), mmap(2), madvise(2), readahead(2), posix_fadvise(2), close(2). Most of these syscalls could fail, see errno(3).
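A minimal sketch of the direct approach, using only open(2), posix_fadvise(2), read(2) and close(2), might look like this. The 1 MiB buffer size and the name count_newlines_read are assumptions for illustration; mmap(2) with madvise(2) would be an alternative that is not shown here.

```c
#define _POSIX_C_SOURCE 200809L /* for posix_fadvise */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

enum { READ_BUF_SIZE = 1 << 20 }; /* assumed 1 MiB buffer; tune by benchmarking */

/* Counts '\n' bytes in `path` using raw read(2) calls.
   Returns -1 on error. The function name is hypothetical. */
long long count_newlines_read(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* Purely advisory hint that access will be sequential; failure is ignored. */
    (void)posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    char *buf = malloc(READ_BUF_SIZE);
    if (buf == NULL) {
        close(fd);
        return -1;
    }

    long long newlines = 0;
    for (;;) {
        ssize_t n = read(fd, buf, READ_BUF_SIZE);
        if (n == 0)
            break; /* end of file */
        if (n < 0) {
            if (errno == EINTR)
                continue; /* interrupted by a signal: retry */
            newlines = -1;
            break;
        }
        for (ssize_t i = 0; i < n; i++) {
            if (buf[i] == '\n')
                newlines++;
        }
    }

    free(buf);
    close(fd);
    return newlines;
}
```

Each read(2) here is one system call per buffer, which is roughly what stdio does behind getc with its own, usually smaller, buffer.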
You could also change your file system (e.g. from Ext4 to XFS, see ext4(5) and xfs(5)), buy better SSD disks or more physical RAM, or play with mount(2) options, to improve performance.
See also the /proc pseudo-file system (so proc(5)...) and this answer.
You may want to use databases like sqlite or PostgreSQL.
Your program could generate C code at runtime (as manydl.c does), try various approaches (compiling the generated C code /tmp/generated-c-1234.c as a plugin using gcc -O3 -fPIC /tmp/generated-c-1234.c -shared -o /tmp/generated-plugin-1234.so, then dlopen(3)-ing and dlsym(3)-ing that generated plugin /tmp/generated-plugin-1234.so), and use machine learning techniques to find a very good one (specific to the current hardware and computer). It could also generate machine code more directly using asmjit or libgccjit, try several approaches, and choose the best one for the particular situation.
Pitrat's book Artificial Beings and his blog (still here) explain this approach in more detail. The conceptual framework is called partial evaluation. See also this.
You could also use existing parser generators, such as GNU bison or ANTLR. They generate parser source code (C code, in the case of bison).
Ian Taylor's libbacktrace could also be useful in such a dynamic metaprogramming approach (generating various forms of C code, and choosing the best ones according to the call stack inspected with dladdr(3)).
Your problem is very probably a parsing problem. So read the first half of the Dragon book.
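Before attempting any experimentation, discuss with your manager/boss/client whether it is worth spending months of full-time work to gain a few percent of performance. Take into account that the same gain can often be obtained by upgrading the hardware.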
If your terabyte input text file does not change often (e.g. it is delivered every week, as in bioinformatics software), it may be worthwhile to preprocess it and transform it, in batch mode, into a binary file, or some sqlite database, or some GDBM indexed file, or some REDIS thing. Documenting the format of that binary file or database (using EBNF notation, taking inspiration from elf(5)) is then very important.