Usage of getc with a file

debugcn Published at Dev

David542

To print the contents of a file one can use getc:

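int ch;
FILE *file = fopen("file.txt", "r");
if (file == NULL) {
    perror("file.txt");  // fopen can fail; do not pass NULL to getc
    return 1;
}
while ((ch = getc(file)) != EOF) {
    putchar(ch);  // do something with each byte
}
fclose(file);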

How efficient is the getc function? That is, how frequently does it actually do operating system calls or something that would take a non-trivial amount of time? For example, let's say I had a 10TB file -- would calling this function trillions of times be a poor way to get through data?

Basile Starynkevitch

"That is, how frequently does it actually do operating system calls or something that would take a non-trivial amount of time?"

You could look into the source code of GNU libc or of musl-libc to study the implementation of getc. You should also study the implementation of cat(1) and wc(1). Both are open source. GNU as (part of GNU binutils) is also free software (used internally by most GCC compilations); in practice it runs very quickly and does textual manipulation (transforming textual assembler input into binary object files). You could take inspiration from its source code.

You could change the buffer size with setvbuf(3).
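As a minimal sketch of the setvbuf(3) suggestion: the helper name count_bytes_buffered and the idea of counting bytes are assumptions made for illustration, not part of the original answer. It supplies its own buffer so that the requested size is actually used, and keeps reading with getc.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: counts the bytes in `path` with getc, after giving
   the stream a fully buffered, caller-allocated buffer of `bufsize` bytes.
   Returns -1 on any error. */
long long count_bytes_buffered(const char *path, size_t bufsize)
{
    assert(path != NULL && bufsize > 0);

    FILE *file = fopen(path, "r");
    if (file == NULL)
        return -1;

    char *buf = malloc(bufsize);
    if (buf == NULL) {
        fclose(file);
        return -1;
    }

    /* setvbuf must be called before any other operation on the stream. */
    if (setvbuf(file, buf, _IOFBF, bufsize) != 0) {
        fclose(file);
        free(buf);
        return -1;
    }

    long long count = 0;
    int ch;
    while ((ch = getc(file)) != EOF)
        count++;

    int failed = ferror(file);
    fclose(file);
    free(buf);  /* the buffer must outlive the stream, so free it last */
    return failed ? -1 : count;
}
```

Whether a larger buffer helps at all depends on the system, so it is worth measuring several sizes rather than assuming.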

You may want to read several bytes at once using fread(3) or fgets(3), probably in chunks of several kilobytes.
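Reading in chunks could look roughly like the following sketch. The function name count_lines_chunked, the 64 KiB chunk size, and the choice to count newlines are all assumptions for illustration.

```c
#include <assert.h>
#include <stdio.h>

enum { CHUNK_SIZE = 64 * 1024 };  /* assumed size; tune it by measurement */

/* Hypothetical helper: counts '\n' bytes in `path`, reading with fread
   in chunks instead of one getc call per byte. Returns -1 on error.
   The static buffer makes this function non-reentrant. */
long long count_lines_chunked(const char *path)
{
    assert(path != NULL);

    FILE *file = fopen(path, "rb");
    if (file == NULL)
        return -1;

    static char chunk[CHUNK_SIZE];
    long long lines = 0;
    size_t n;

    /* fread returns the number of bytes read; 0 means EOF or error. */
    while ((n = fread(chunk, 1, sizeof chunk, file)) > 0) {
        for (size_t i = 0; i < n; i++) {
            if (chunk[i] == '\n')
                lines++;
        }
    }

    int failed = ferror(file);
    fclose(file);
    return failed ? -1 : lines;
}
```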

You can also use the debugger gdb(1) or the strace(1) utility to find out when syscalls(2) are used and which ones.

"For example, let's say I had a 10TB file -- would calling this function trillions of times be a poor way to get through data?"

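Very probably not. getc reads from a stdio buffer in user space, so most calls make no system call at all, and the kernel's page cache keeps the underlying reads cheap.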

You should profile and benchmark your program to find out its bottleneck.

Most of the time it won't be getc. See time(7) and gprof(1) (and compile all your code with GCC invoked as gcc -O3 -pg -Wall).
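Before reaching for a profiler, a crude wall-clock measurement of the getc loop itself can already be informative. The sketch below assumes a POSIX system with clock_gettime; the helper name time_getc_loop is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: reads `path` byte by byte with getc, stores the
   number of bytes read in *bytes, and returns the elapsed wall-clock
   time in seconds. Returns a negative value if the file cannot be opened. */
double time_getc_loop(const char *path, long long *bytes)
{
    assert(path != NULL && bytes != NULL);

    FILE *file = fopen(path, "r");
    if (file == NULL)
        return -1.0;

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    long long count = 0;
    int ch;
    while ((ch = getc(file)) != EOF)
        count++;

    clock_gettime(CLOCK_MONOTONIC, &end);
    fclose(file);

    *bytes = count;
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Running it twice on the same file gives a rough idea of how much the page cache helps: the second run usually reads from memory rather than from disk.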

If raw input performance is critical in your program, also consider using open(2), read(2), mmap(2), madvise(2), readahead(2), posix_fadvise(2), and close(2) directly and carefully. Most of these system calls can fail; see errno(3).
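A minimal sketch of the direct approach, assuming a POSIX system: it combines open(2), posix_fadvise(2), read(2) and close(2). The name count_bytes_read and the 1 MiB buffer size are assumptions for illustration, and the advice call is only a hint that the kernel may ignore.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

enum { READ_BUF_SIZE = 1 << 20 };  /* assumed 1 MiB buffer; measure to tune */

/* Hypothetical helper: counts the bytes in `path` using read(2) directly.
   Returns -1 on error. The static buffer makes it non-reentrant. */
long long count_bytes_read(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror(path);
        return -1;
    }

    /* Advisory hint that access will be sequential; failure is harmless. */
    (void)posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    static char buf[READ_BUF_SIZE];
    long long total = 0;
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            total += n;          /* process buf[0..n-1] here */
        } else if (n == 0) {
            break;               /* end of file */
        } else if (errno == EINTR) {
            continue;            /* interrupted by a signal: retry */
        } else {
            perror("read");
            total = -1;
            break;
        }
    }

    close(fd);
    return total;
}
```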

You may also change your file system (e.g. from Ext4 to XFS, see ext4(5) and xfs(5)), buy better SSDs or more physical RAM, or play with mount(2) options, to improve performance.

See also the /proc pseudo-file system (so proc(5)...); and this answer.

You may want to use databases like SQLite or PostgreSQL.

Your program could generate C code at runtime (as manydl.c does) and try various approaches: compile the generated C code /tmp/generated-c-1234.c as a plugin using gcc -O3 -fPIC /tmp/generated-c-1234.c -shared -o /tmp/generated-plugin-1234.so, then load the generated plugin /tmp/generated-plugin-1234.so with dlopen(3) and dlsym(3). It could then use machine learning techniques to find a very good variant (specific to the current hardware and computer). It could also generate machine code more directly using asmjit or libgccjit, try several approaches, and choose the best one for the particular situation. Pitrat's book Artificial Beings and his blog (still here) explain this approach in more detail. The conceptual framework is called partial evaluation. See also this.

You could also use existing parser generators like GNU bison or ANTLR. They generate C code.

Ian Taylor's libbacktrace could also be useful in such a dynamic metaprogramming approach (generating various forms of C code, and choosing the best ones according to the call stack inspected with dladdr(3)).

Very probably your problem is a parsing problem. So read the first half of the Dragon book.

Before attempting any experimentation, discuss with your manager/boss/client whether it is worth spending months of full-time work to gain a few percent of performance. Take into account that the same gain can often be obtained by upgrading the hardware.

If your terabyte textual input file does not change often (e.g. it is delivered every week, as in bioinformatics software), it may be worthwhile to preprocess it and transform it (in batch mode) into a binary file, an SQLite database, a GDBM indexed file, or a Redis store. Documenting the format of that binary file or database (using EBNF notation, taking inspiration from elf(5)) is then very important.
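As an illustration of such batch preprocessing, here is a sketch under a strong assumption that is not in the original answer: the textual input holds whitespace-separated decimal integers. The function name text_to_binary and the output format (raw int64_t values in host byte order) are hypothetical, and a real format should be documented as described above.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical converter: reads whitespace-separated decimal integers
   from `text_path` and writes them to `bin_path` as raw int64_t values
   in host byte order. Conversion stops at the first token that is not
   an integer. Returns the number of values written, or -1 on I/O error. */
long long text_to_binary(const char *text_path, const char *bin_path)
{
    assert(text_path != NULL && bin_path != NULL);

    FILE *in = fopen(text_path, "r");
    if (in == NULL)
        return -1;

    FILE *out = fopen(bin_path, "wb");
    if (out == NULL) {
        fclose(in);
        return -1;
    }

    long long count = 0;
    int64_t value;
    int ok = 1;
    while (fscanf(in, "%" SCNd64, &value) == 1) {
        if (fwrite(&value, sizeof value, 1, out) != 1) {
            ok = 0;
            break;
        }
        count++;
    }

    if (ferror(in))
        ok = 0;
    fclose(in);
    if (fclose(out) != 0)  /* fclose flushes, so it can report write errors */
        ok = 0;

    return ok ? count : -1;
}
```

Later runs can then fread the binary file directly, or mmap(2) it, without parsing any text.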


Edited at 2021-05-29
