Collating order strangenitude

Am I missing something, or is sort broken?

= echo "< a\n< z\n> d" | LC_ALL='' sort
< a
> d
< z
= echo "< a\n< z\n> d" | LC_ALL=C sort
< a
< z
> d

My understanding is that LC_ALL is supposed to affect the sorting order, but as we see above, it appears to affect which field is the key. WT?

Everything is OK: in the first case sort uses your locale's collation rules, in the second it uses plain byte comparison (so < comes first, because it is 60 in ASCII, while > is 62). You can add the --debug flag to sort to see which rules were used:

$ echo "< a\n< z\n> d" | LC_ALL='' sort --debug
sort: text ordering performed using `pl_PL.UTF-8' sorting rules
< a
___
> d
___
< z
___

$ echo "< a\n< z\n> d" | LC_ALL='C' sort --debug
sort: text ordering performed using simple byte comparison
< a
___
< z
___
> d
___
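
To make the byte-comparison case concrete, here is a small sketch, assuming a POSIX shell and an ASCII-compatible character set. The helper name `sample` is made up for this example. It uses printf instead of echo because not every shell's echo expands \n.

```shell
# Hypothetical helper: emit the three sample lines, one per line.
sample() {
    printf '%s\n' '< a' '< z' '> d'
}

# A leading quote makes printf print the character's numeric code.
printf '%d\n' "'<" "'>"    # 60 and 62 in ASCII

# With LC_ALL=C, sort compares raw bytes, so both '<' lines
# come before the '>' line.
c_sorted=$(sample | LC_ALL=C sort)
printf '%s\n' "$c_sorted"
```

Under a locale other than C the order depends on that locale's collation data, so no output is assumed for that case.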

I don’t understand how using my locale means the first byte can be ignored.

The sort man page does have a warning:

*** WARNING *** The locale specified by the environment affects
sort order. Set LC_ALL=C to get the traditional sort order that
uses native byte values.
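
Note that the first command in the thread sets LC_ALL to an empty string. An empty LC_ALL is treated as if it were unset, so sort falls back to LC_COLLATE and then LANG, while a non-empty LC_ALL overrides both. A small sketch, assuming a POSIX shell; pl_PL.UTF-8 is only a placeholder here and does not need to be installed for the point to hold:

```shell
# LC_ALL=C takes precedence over LC_COLLATE and LANG,
# so this is always sorted in byte order.
forced_c=$(printf '%s\n' '< a' '< z' '> d' |
    LC_ALL=C LC_COLLATE=pl_PL.UTF-8 LANG=pl_PL.UTF-8 sort)
printf '%s\n' "$forced_c"

# An empty LC_ALL is ignored; the order now depends on whatever
# LC_COLLATE or LANG are set to in your environment, so no
# particular output is assumed here.
printf '%s\n' '< a' '< z' '> d' | LC_ALL='' sort
```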

Locale sort orders can ignore, for instance, punctuation (and I suppose < and > can be thought of as punctuation). If so, the lines are effectively compared as a, d, z, which would explain the first output. The coreutils FAQ has more detail:

https://www.gnu.org/software/coreutils/faq/#Sort-does-not-sort-in-normal-order_0021

If you have the time to spare, you could also review locale sorting rules: UTS #10: Unicode Collation Algorithm

