Usage of getc with a file

David542

To print the contents of a file one can use getc:

int ch;
FILE *file = fopen("file.txt", "r");
if (file != NULL) {  // fopen returns NULL on failure
    while ((ch = getc(file)) != EOF) {
        // do something with ch
    }
    fclose(file);
}

How efficient is the getc function? That is, how frequently does it actually do operating system calls or something that would take a non-trivial amount of time? For example, let's say I had a 10TB file -- would calling this function trillions of times be a poor way to get through data?

Basile Starynkevitch

"That is, how frequently does it actually do operating system calls or something that would take a non-trivial amount of time?"

You could look into the source code of GNU libc or of musl-libc to study the implementation of getc. You should also study the implementation of cat(1) and wc(1). Both are open source. GNU as (part of GNU binutils) is free software, used internally by most GCC compilations, which in practice runs very quickly and does textual manipulation (transforming textual assembler input into binary object files). You could take inspiration from its source code.

You could change the buffer size with setvbuf(3).
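As a minimal sketch of the setvbuf(3) suggestion: the helper below (count_bytes_buffered is a hypothetical name, not a library function) installs a caller-chosen, fully buffered stdio buffer before reading with getc. The buffer size to use is an assumption you would have to measure on your own system.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: counts the bytes in `path` with getc, using a
   caller-chosen stdio buffer size. Returns -1 on error. */
long long count_bytes_buffered(const char *path, size_t buf_size)
{
    assert(path != NULL && buf_size > 0);

    FILE *file = fopen(path, "r");
    if (file == NULL)
        return -1;

    char *buf = malloc(buf_size);
    /* setvbuf must be called before any other operation on the stream. */
    if (buf == NULL || setvbuf(file, buf, _IOFBF, buf_size) != 0) {
        fclose(file);
        free(buf);
        return -1;
    }

    long long count = 0;
    int ch;
    while ((ch = getc(file)) != EOF)
        count++;

    int failed = ferror(file);
    fclose(file);   /* close first: the stream uses buf until it is closed */
    free(buf);
    return failed ? -1 : count;
}
```

A call such as count_bytes_buffered("file.txt", 1 << 20) would read through a 1 MiB buffer. Whether that beats the default buffer size is something to benchmark, not assume.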

You may want to read several bytes at once using fread(3) or fgets(3), probably in chunks of several kilobytes.
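A rough sketch of chunked reading with fread(3), assuming a 16 KiB chunk and a hypothetical helper name (count_newlines_chunked). Counting newlines stands in for whatever per-byte work your program really does.

```c
#include <assert.h>
#include <stdio.h>

enum { CHUNK_SIZE = 16 * 1024 };  /* assumed chunk size; tune by measuring */

/* Hypothetical helper: counts '\n' bytes in `path`, reading one chunk per
   fread call instead of one byte per getc call. Returns -1 on error. */
long long count_newlines_chunked(const char *path)
{
    assert(path != NULL);

    FILE *file = fopen(path, "rb");
    if (file == NULL)
        return -1;

    char chunk[CHUNK_SIZE];
    long long lines = 0;
    size_t n;
    /* fread returns the number of bytes actually read; 0 means EOF or error. */
    while ((n = fread(chunk, 1, sizeof chunk, file)) > 0) {
        for (size_t i = 0; i < n; i++) {
            if (chunk[i] == '\n')
                lines++;
        }
    }

    int failed = ferror(file);
    fclose(file);
    return failed ? -1 : lines;
}
```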

You can also use the debugger gdb(1) or the strace(1) utility to find out when syscalls(2) are used and which ones.

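"For example, let's say I had a 10TB file -- would calling this function trillions of times be a poor way to get through data?"

Very probably not. getc is buffered: most calls just take the next byte from the stream's user-space buffer, and a read(2) system call happens only when that buffer needs refilling. The kernel's page cache and readahead further reduce the cost of those reads.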

You should profile and benchmark your program to find out its bottleneck.

Most of the time it won't be getc. See time(7) and gprof(1) (and compile all your code with GCC invoked as gcc -O3 -pg -Wall).
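Before reaching for a profiler, a crude measurement can already tell you something. The sketch below is only an illustration with a hypothetical helper name (time_getc_pass). It uses the standard clock() function, which reports CPU time rather than wall-clock time, so time spent waiting for the disk is not counted.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: reads all of `path` with getc and returns the CPU
   seconds consumed by the loop, or -1.0 on error. The number of bytes read
   is stored in *bytes_out. */
double time_getc_pass(const char *path, long long *bytes_out)
{
    assert(path != NULL && bytes_out != NULL);

    FILE *file = fopen(path, "r");
    if (file == NULL)
        return -1.0;

    long long bytes = 0;
    int ch;
    clock_t start = clock();
    while ((ch = getc(file)) != EOF)
        bytes++;
    clock_t end = clock();

    int failed = ferror(file);
    fclose(file);
    if (failed)
        return -1.0;

    *bytes_out = bytes;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Comparing this figure with the total run time of your real program on the same file gives a first hint of how much of the work is the getc loop itself.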

If raw input performance is critical in your program, also consider using open(2), read(2), mmap(2), madvise(2), readahead(2), posix_fadvise(2) and close(2) directly and wisely. Most of these syscalls can fail; see errno(3).
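One possible shape of the direct-syscall route, as a sketch under assumptions: a POSIX system (Linux with glibc is assumed for the feature-test macro), a 64-bit address space so that a huge file can be mapped, and a hypothetical helper name (count_newlines_mmap). It maps the file read-only with mmap(2) and hints sequential access with madvise(2).

```c
#define _DEFAULT_SOURCE 1   /* assumed glibc: exposes madvise(2) */
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: counts '\n' bytes in `path` through a read-only
   memory mapping. Returns -1 on error. */
long long count_newlines_mmap(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {      /* mmap rejects a zero-length mapping */
        close(fd);
        return 0;
    }

    size_t len = (size_t)st.st_size;
    const char *data = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                  /* the mapping stays valid after close */
    if (data == MAP_FAILED)
        return -1;

    /* Only a hint to the kernel; a failure here is safe to ignore. */
    madvise((void *)data, len, MADV_SEQUENTIAL);

    long long lines = 0;
    for (size_t i = 0; i < len; i++) {
        if (data[i] == '\n')
            lines++;
    }

    munmap((void *)data, len);
    return lines;
}
```

Whether this is actually faster than buffered stdio depends on the workload and the machine, so it is worth benchmarking both.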

You may also change your file system (e.g. from Ext4 to XFS, see ext4(5) and xfs(5)), buy faster SSDs or more physical RAM, or play with mount(2) options, to improve performance.

See also the /proc pseudo-file system (so proc(5)...); and this answer.

You may want to use databases like SQLite or PostgreSQL.

Your program could generate C code at runtime (like manydl.c does) and try various approaches: compile the generated C code /tmp/generated-c-1234.c as a plugin using gcc -O3 -fPIC /tmp/generated-c-1234.c -shared -o /tmp/generated-plugin-1234.so, then load that generated plugin /tmp/generated-plugin-1234.so with dlopen(3) and dlsym(3), and use machine learning techniques to find a very good one (specific to the current hardware and computer). It could also generate machine code more directly using asmjit or libgccjit, try several approaches, and choose the best one for the particular situation. Pitrat's book Artificial Beings and his blog (still here) explain this approach in more detail. The conceptual framework is called partial evaluation. See also this.

You could also use existing parser generators like GNU bison or ANTLR. They generate C code.

Ian Taylor's libbacktrace could also be useful in such a dynamic metaprogramming approach (generating various forms of C code, and choosing the best ones according to the call stack inspected with dladdr(3)).

Very probably your problem is a parsing problem. So read the first half of the Dragon book.

Before attempting any experimentation, discuss with your manager/boss/client whether it is worth spending months of full-time work to gain a few percent of performance. Take into account that the same gain might be obtained by upgrading the hardware.

If your terabyte textual input file does not change often (e.g. it is delivered once a week, as happens in bioinformatics software), it may be worthwhile to preprocess it in batch mode and transform it into a binary file, an SQLite database, a GDBM indexed file, or a Redis store. Documenting the format of that binary file or database (using EBNF notation, taking inspiration from elf(5)) is then very important.
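To illustrate the preprocessing idea, here is a sketch under loud assumptions: the textual input is taken to hold whitespace-separated decimal integers, and the binary format (a raw sequence of int64_t values in host byte order) is invented for this example. The helper name text_to_binary is hypothetical too. A real format would need a documented header and a fixed byte order.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>

/* Hypothetical batch converter: reads decimal integers from `text_path` and
   writes each as a raw int64_t (host byte order) to `bin_path`.
   Returns the number of records written, or -1 on error. */
long long text_to_binary(const char *text_path, const char *bin_path)
{
    assert(text_path != NULL && bin_path != NULL);

    FILE *in = fopen(text_path, "r");
    if (in == NULL)
        return -1;
    FILE *out = fopen(bin_path, "wb");
    if (out == NULL) {
        fclose(in);
        return -1;
    }

    long long records = 0;
    int64_t value;
    int ok = 1;
    while (fscanf(in, "%" SCNd64, &value) == 1) {
        if (fwrite(&value, sizeof value, 1, out) != 1) {
            ok = 0;
            break;
        }
        records++;
    }
    /* The loop must have ended at end of file, not on malformed input
       or a read error. */
    if (ok && !feof(in))
        ok = 0;

    fclose(in);
    if (fclose(out) != 0)   /* buffered writes can still fail at close */
        ok = 0;
    return ok ? records : -1;
}
```

Later runs can then fread the fixed-size records directly, with no text parsing at all.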
