Newbie question about text encoding
Marko Rauhamaa
marko at pacujo.net
Sat Mar 7 14:01:27 EST 2015
Dan Sommers <dan at tombstonezero.net>:
> I think we're all agreeing: not all file systems are the same, and
> Python doesn't smooth out all of the bumps, even for something that
> seems as simple as displaying the names of files in a directory. And
> that's *after* we've agreed that filesystems contain files in
> hierarchical directories.
A whole new set of problems took root with Unicode. There were gains but
there were losses, too.
Python is not alone in the conceptual difficulties. Guile 2's (readdir)
simply converts bad UTF-8 in a filename into a question mark:
    scheme@(guile-user) [1]> (readdir s)
    $3 = "?"
    scheme@(guile-user) [4]> (equal? $3 "?")
    $4 = #t
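For comparison, here is a minimal Python sketch of the same trade-off. It assumes a one-byte name, b"\xff", as a stand-in for a filename that is not valid UTF-8; the variable names are illustrative only. Decoding with errors="replace" loses the original byte much as Guile's question mark does, while the surrogateescape handler (PEP 383, which Python uses for filenames on POSIX) keeps enough information to recover it.

```python
# A single byte that is not valid UTF-8, standing in for a bad filename.
raw_name = b"\xff"

# Lossy decoding: the invalid byte becomes U+FFFD and the original is gone.
lossy = raw_name.decode("utf-8", errors="replace")

# PEP 383 style: the invalid byte is carried along as a lone surrogate.
escaped = raw_name.decode("utf-8", errors="surrogateescape")

# The surrogate form encodes back to the original bytes...
round_trip = escaped.encode("utf-8", errors="surrogateescape")

# ...whereas the lossy form encodes to the bytes of U+FFFD instead.
lossy_round_trip = lossy.encode("utf-8")

print(repr(lossy), repr(escaped), round_trip, lossy_round_trip)
```

The price of the round trip is that the escaped string contains a lone surrogate, so it cannot be encoded with the default strict error handler (for example when printing it to a UTF-8 terminal).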
So does lxterminal:
    $ ls
    ?
even though it's all bytes on the inside:
    $ [ $(ls) = "?" ]
    $ echo $?
    1
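The "bytes on the inside" view is also reachable from Python. The sketch below assumes a POSIX system whose filesystem accepts arbitrary bytes in filenames (typical on Linux, not guaranteed elsewhere); the helper name list_names_both_ways is hypothetical. Passing a bytes path to os.listdir returns the raw names, while a str path returns decoded names that os.fsencode can turn back into the same bytes.

```python
import os
import tempfile


def list_names_both_ways(directory):
    """Return (bytes_names, str_names) for the entries in directory."""
    # A bytes argument makes os.listdir return the raw, undecoded names.
    bytes_names = sorted(os.listdir(os.fsencode(directory)))
    # A str argument makes it return names decoded with the filesystem encoding.
    str_names = sorted(os.listdir(directory))
    return bytes_names, str_names


with tempfile.TemporaryDirectory() as tmp:
    # Create a file whose name is a single byte that is not valid UTF-8.
    bad_path = os.path.join(os.fsencode(tmp), b"\xff")
    with open(bad_path, "wb"):
        pass
    bytes_names, str_names = list_names_both_ways(tmp)

print(bytes_names, [ascii(name) for name in str_names])
```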
Scripts that make use of standard text utilities must now be very
careful:
    $ ls | egrep "^.$" | wc -l
    0
You are well advised to sprinkle LANG=C in your scripts (or LC_ALL=C,
which also overrides any LC_* variables already set in the environment):
    $ ls | LANG=C egrep "^.$" | wc -l
    1
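A rough Python analogue of the two egrep runs, offered as a sketch rather than a description of what grep does internally: a bytes pattern treats "." as "any single byte", much like the C locale, while text matching first needs a decode step that strict UTF-8 rejects. The name b"\xff" and the variable names are assumptions for illustration.

```python
import re

# A one-byte filename that is not valid UTF-8.
raw_name = b"\xff"

# Byte-oriented matching: "." matches any single byte except a newline.
matches_as_bytes = re.fullmatch(rb".", raw_name) is not None

# Text-oriented matching: the name must decode before "." can match
# a character, and strict UTF-8 decoding refuses this byte.
try:
    text = raw_name.decode("utf-8")
    matches_as_text = re.fullmatch(r".", text) is not None
except UnicodeDecodeError:
    matches_as_text = False

print(matches_as_bytes, matches_as_text)
```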
Nasty locale-related bugs plague installation scripts, whose writers are
not accustomed to running their tests in a wide range of locales. The
topic is of course larger than just Unicode.
Marko