Different results when sorting character vectors

wujohn1990

I am wondering how R's sorting algorithm works when sorting a character vector:

a = c("aa(150)", "aa(1)S")
sort(a)
# [1] "aa(150)" "aa(1)S" 
a = c("aa(150)", "aa(1)")
sort(a)
# [1] "aa(1)" "aa(150)"

Doesn't R compare the integer values of the characters one by one, from left to right? Why can adding a character change the result?

I thought the order would be determined by the "5" and ")" characters, and that any characters after them would be ignored.

For comparison, here is the same thing in Python:

In [1]: a=["aa(150)","aa(1)"]
In [2]: sorted(a)
Out[2]: ['aa(1)', 'aa(150)']
In [3]: a=["aa(150)","aa(1)S"]
In [4]: sorted(a)
Out[4]: ['aa(1)S', 'aa(150)']

Pierre L

Set the collation locale to "C", which turns off locale-specific sorting in most cases:

Sys.setlocale("LC_COLLATE", "C")
a=c("aa(150)","aa(1)S")
sort(a)
#[1] "aa(1)S"  "aa(150)"
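
If changing the collation locale for the whole session is undesirable, two alternatives may help. The following is a minimal sketch, not part of the original answer. It assumes R >= 3.3.0, where sort() accepts method = "radix", which is documented to sort character vectors in the C locale regardless of the current LC_COLLATE setting. It also shows saving and restoring the locale around a single sort. The variable names by_radix, by_locale and old are chosen here for illustration.

```r
a <- c("aa(150)", "aa(1)S")

# The two strings first differ at position 5: ")" versus "5".
# In ASCII, ")" has the lower code point, so "aa(1)S" sorts first in C.
utf8ToInt(")")  # 41
utf8ToInt("5")  # 53

# Option 1: radix sort uses C-locale collation for character vectors
by_radix <- sort(a, method = "radix")
by_radix
# [1] "aa(1)S"  "aa(150)"

# Option 2: switch LC_COLLATE temporarily and restore it afterwards
old <- Sys.getlocale("LC_COLLATE")
invisible(Sys.setlocale("LC_COLLATE", "C"))
by_locale <- sort(a)
invisible(Sys.setlocale("LC_COLLATE", old))
by_locale
# [1] "aa(1)S"  "aa(150)"
```

Option 1 leaves the session locale untouched, which is usually the safer choice inside functions or packages.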

String collation is locale-specific because languages order their characters differently. From the help for ?sort:

The sort order for character vectors will depend on the collating sequence of the locale in use: see Comparison.

We can then go to ?Comparison for:

Comparison of strings in character vectors is lexicographic within the strings using the collating sequence of the locale in use: see locales. The collating sequence of locales such as en_US is normally different from C (which should use ASCII) and can be surprising. Beware of making any assumptions about the collation order: e.g. in Estonian Z comes between S and T, and collation is not necessarily character-by-character – in Danish aa sorts as a single letter, after z. In Welsh ng may or may not be a single sorting unit: if it is it follows g.

As mentioned, because each language uses letters in different ways, the locale matters for sorting.
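
As for why en_US produces the order in the question: a likely explanation (an assumption about typical en_US collation rules, not something stated in the R help) is that punctuation such as "(" and ")" is ignored on the first comparison pass, so the strings are effectively compared as "aa150" versus "aa1S", and digits sort before letters. The sketch below only approximates that first pass by stripping punctuation and then comparing by code point. The helper strip_punct_sort is hypothetical, and real locale collation is multi-level and more involved.

```r
# Hypothetical helper: approximate "ignore punctuation" collation by
# removing punctuation from the sort key, then ordering by code point.
strip_punct_sort <- function(x) {
  key <- gsub("[[:punct:]]", "", x)   # "aa(150)" -> "aa150"
  x[order(key, method = "radix")]     # radix = C-locale comparison of keys
}

strip_punct_sort(c("aa(150)", "aa(1)S"))
# [1] "aa(150)" "aa(1)S"
strip_punct_sort(c("aa(150)", "aa(1)"))
# [1] "aa(1)"   "aa(150)"
```

Both results match what sort() returned in the question, which is consistent with the explanation, though it does not prove that this is exactly what the locale does.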
