I wrote the following script to test the speed of Python's sort functionality:
from sys import stdin, stdout
lines = list(stdin)
lines.sort()
stdout.writelines(lines)
I then compared this to the coreutils sort command on a file containing 10 million lines:
$ time python sort.py <numbers.txt >s1.txt
real 0m16.707s
user 0m16.288s
sys 0m0.420s
$ time sort <numbers.txt >s2.txt
real 0m45.141s
user 2m28.304s
sys 0m0.380s
The built-in command used all four CPUs (Python only used one) but took about 3 times as long to run! What gives?
I am using Ubuntu 12.04.5 (32-bit), Python 2.7.3, and sort 8.13.
Izkata's comment revealed the answer: locale-specific comparisons. The sort command uses the locale indicated by the environment, whereas Python defaults to a byte order comparison. Comparing strings under a UTF-8 locale's collation rules is much more expensive than comparing raw bytes. Forcing the C locale makes sort compare bytes as well:
$ time (LC_ALL=C sort <numbers.txt >s2.txt)
real 0m5.485s
user 0m14.028s
sys 0m0.404s
How about that.
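To see the same distinction from the Python side, here is a minimal sketch (written for Python 3, not the 2.7.3 used above) that sorts a list once in plain code point order and once with the collation rules of a locale. The helper names byte_order_sorted and locale_sorted are made up for illustration; locale.setlocale and locale.strxfrm are the standard library calls. It is only roughly analogous to what sort does internally, and the locale-aware result depends on which locales are installed on the machine.

```python
import locale


def byte_order_sorted(lines):
    """Sort by code point, which is what list.sort() does by default."""
    return sorted(lines)


def locale_sorted(lines, locale_name=""):
    """Sort with the collation rules of locale_name.

    An empty name means "use the locale from the environment",
    which is what the sort command does when LC_ALL is not overridden.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        # Requested locale is not available: fall back to plain ordering.
        return sorted(lines)
    try:
        # strxfrm turns each string into a key that compares
        # according to the active LC_COLLATE rules.
        return sorted(lines, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


if __name__ == "__main__":
    words = ["b", "A", "a", "B"]
    print(byte_order_sorted(words))   # uppercase before lowercase
    print(locale_sorted(words, "C"))  # same order as above
    print(locale_sorted(words))       # order depends on the environment
```

Under a locale such as en_US.UTF-8 the last call would typically interleave upper and lower case, and the extra work done per comparison is the kind of cost that the LC_ALL=C run above avoids.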