I am wondering how R's sorting algorithm works when sorting a character vector:
a = c("aa(150)", "aa(1)S")
sort(a)
# [1] "aa(150)" "aa(1)S"
a = c("aa(150)", "aa(1)")
sort(a)
# [1] "aa(1)" "aa(150)"
Doesn't R compare the integer values of the characters one by one, from left to right? Why can adding a character change the result?
I thought the order would be determined by the "5" and ")" characters, and that any characters after them would be ignored.
For comparison, here is the same thing in Python:
In [1]: a=["aa(150)","aa(1)"]
In [2]: sorted(a)
Out[2]: ['aa(1)', 'aa(150)']
In [3]: a=["aa(150)","aa(1)S"]
In [4]: sorted(a)
Out[4]: ['aa(1)S', 'aa(150)']
Set the collation locale to "C", which turns off locale-specific sorting in most cases:
Sys.setlocale("LC_COLLATE", "C")
a=c("aa(150)","aa(1)S")
sort(a)
#[1] "aa(1)S" "aa(150)"
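To see why the "C" result matches what the question expected, here is a small Python sketch (Python already appears in the question, and its default string comparison goes by code point, much like C collation on ASCII text). The helper name first_difference is made up for illustration; it only finds the first position where the two strings differ and shows the character codes there.

```python
def first_difference(s, t):
    """Return (index, char_from_s, char_from_t) at the first differing position, or None."""
    for i, (x, y) in enumerate(zip(s, t)):
        if x != y:
            return i, x, y
    return None  # one string is a prefix of the other (or they are equal)

i, x, y = first_difference("aa(150)", "aa(1)S")
# Show the position and the code points of the two characters that decide the order.
print(i, repr(x), ord(x), repr(y), ord(y))

# ")" has a smaller code than "5", so "aa(1)S" comes first under code-point ordering.
print(sorted(["aa(150)", "aa(1)S"]))
```

The trailing "S" never gets looked at here, because the comparison is already decided at the ")" versus "5" position.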
String collation has to be internationally specific due to language differences. From the help for ?sort:
The sort order for character vectors will depend on the collating sequence of the locale in use: see Comparison.
We can then go to ?Comparison for:
Comparison of strings in character vectors is lexicographic within the strings using the collating sequence of the locale in use: see locales. The collating sequence of locales such as en_US is normally different from C (which should use ASCII) and can be surprising. Beware of making any assumptions about the collation order: e.g. in Estonian Z comes between S and T, and collation is not necessarily character-by-character – in Danish aa sorts as a single letter, after z. In Welsh ng may or may not be a single sorting unit: if it is it follows g.
As mentioned, because each language uses letters in different ways, the locale matters for sorting.
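The same locale dependence can be reproduced outside R. Below is a rough sketch using Python's standard locale module, where locale.strxfrm turns a string into a key that sorts according to the current LC_COLLATE setting. The helper sort_with_locale is a hypothetical name, and "en_US.UTF-8" is only an assumed locale name: it may not be installed on your system, in which case the helper returns None. What a non-"C" locale prints depends on the platform's collation tables, so no particular output is claimed for it.

```python
import locale

def sort_with_locale(strings, locale_name):
    """Sort strings under the given collation locale; return None if it is unavailable."""
    previous = locale.setlocale(locale.LC_COLLATE)  # remember the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None  # locale not installed on this system
    try:
        # strxfrm builds sort keys that follow the active collation rules
        return sorted(strings, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # restore the old setting

a = ["aa(150)", "aa(1)S"]
c_sorted = sort_with_locale(a, "C")
print(c_sorted)                              # plain code-point order
print(sort_with_locale(a, "en_US.UTF-8"))    # platform-dependent, or None
```

If the two printed lines differ on your machine, that is the same effect as in the R example: the comparison is no longer a simple character-by-character comparison of codes.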