[Python-ideas] PEP 540: Add a new UTF-8 mode

INADA Naoki songofacandy at gmail.com
Wed Jan 11 03:17:46 EST 2017


Here is one example of a locale pitfall.

---
# from http://unix.stackexchange.com/questions/169739/why-is-coreutils-sort-slower-than-python

$ cat letters.py
import string
import random

def main():
    for _ in range(1_000_000):
        c = random.choice(string.ascii_letters)
        print(c)

main()

$ python3 letters.py > letters.txt

$ LC_ALL=C time sort letters.txt > /dev/null
        0.35 real         0.32 user         0.02 sys

$ LC_ALL=C.UTF-8 time sort letters.txt > /dev/null
        0.36 real         0.33 user         0.02 sys

$ LC_ALL=ja_JP.UTF-8 time sort letters.txt > /dev/null
       11.03 real        10.95 user         0.04 sys

$ LC_ALL=en_US.UTF-8 time sort letters.txt > /dev/null
       11.05 real        10.97 user         0.04 sys
---

This is why some engineers, including me, use the C locale on Linux,
at least when no C.UTF-8 locale is available.

Of course, we can set only LC_CTYPE=en_US.UTF-8, instead of LANG or LC_ALL.
(I wonder if we can use LC_CTYPE=UTF-8...)
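
For illustration, the LC_CTYPE-only setup could be sketched in Python as
below. This is only a sketch under assumptions: the helper name
utf8_ctype_env is hypothetical, and en_US.UTF-8 is assumed to be installed
on the system. LC_ALL is dropped because it overrides every LC_* category,
and LC_COLLATE is pinned to C because it would otherwise fall back to LANG.

```python
import os


def utf8_ctype_env(base_env=None, ctype="en_US.UTF-8"):
    """Return a copy of the environment where only LC_CTYPE selects UTF-8.

    Hypothetical helper for illustration; `ctype` is assumed to name a
    locale that is actually installed on the target system.
    """
    env = dict(os.environ if base_env is None else base_env)
    # LC_ALL takes precedence over every LC_* category, so remove it.
    env.pop("LC_ALL", None)
    # Character classification and encoding come from the UTF-8 locale.
    env["LC_CTYPE"] = ctype
    # Keep collation in the C locale even if LANG names another locale.
    env["LC_COLLATE"] = "C"
    return env
```

A child process could then be started with something like
subprocess.run(["sort", "letters.txt"], env=utf8_ctype_env()), so that sort
keeps byte-order collation while character handling stays UTF-8.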

But I dislike the current situation, where "people have to learn
how to configure the locale properly, and the pitfalls of non-C locales,
just to use UTF-8 in Python".

