Newbie question about text encoding

Marko Rauhamaa marko at pacujo.net
Sat Mar 7 14:01:27 EST 2015


Dan Sommers <dan at tombstonezero.net>:

> I think we're all agreeing: not all file systems are the same, and
> Python doesn't smooth out all of the bumps, even for something that
> seems as simple as displaying the names of files in a directory. And
> that's *after* we've agreed that filesystems contain files in
> hierarchical directories.

A whole new set of problems took root with Unicode. There were gains but
there were losses, too.

Python is not alone in the conceptual difficulties. Guile 2's (readdir)
simply converts bad UTF-8 in a filename into a question mark:

   scheme@(guile-user) [1]> (readdir s)
   $3 = "?"
   scheme@(guile-user) [4]> (equal? $3 "?")
   $4 = #t

So does lxterminal:

   $ ls
   ?

even though it's all bytes on the inside:

   $ [ $(ls) = "?" ]
   $ echo $?
   1
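
For comparison, here is a minimal Python 3 sketch of the same situation.
It assumes a hypothetical one-byte file name, b"\xff", which is not valid
UTF-8, and it uses explicit decode() calls rather than a real directory
so that it behaves the same on any platform. The variable names are made
up for illustration:

```python
# Hypothetical file name as raw bytes; 0xFF is never valid in UTF-8.
raw_name = b"\xff"

# Strict decoding refuses the name outright.
try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors="replace" is lossy: the bad byte becomes U+FFFD and the
# original byte can no longer be recovered from the string.
lossy = raw_name.decode("utf-8", errors="replace")

# errors="surrogateescape" (PEP 383) maps the bad byte to a lone
# surrogate, which encodes back to the original byte.
escaped = raw_name.decode("utf-8", errors="surrogateescape")
round_trip = escaped.encode("utf-8", errors="surrogateescape")

print(strict_ok, ascii(lossy), ascii(escaped), round_trip == raw_name)
```

The "replace" variant is roughly what the question mark above amounts to,
while the surrogateescape variant keeps the bytes recoverable at the cost
of producing a string that is not valid Unicode text.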

Scripts that make use of standard text utilities must now be very
careful:

   $ ls | egrep "^.$" | wc -l
   0

You are well advised to sprinkle LANG=C in your scripts (or LC_ALL=C,
which also overrides any LC_* variables already set in the environment):

   $ ls | LANG=C egrep "^.$" | wc -l
   1
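
A rough Python analogue of the two pipelines, as a sketch only: the
helper names count_text_matches and count_byte_matches are hypothetical,
and treating an undecodable name as "no match" is my assumption about
how to model the UTF-8 locale case, not a description of egrep's
internals.

```python
import re

def count_text_matches(names):
    """Count names that are exactly one character when read as UTF-8."""
    count = 0
    for name in names:
        try:
            text = name.decode("utf-8")
        except UnicodeDecodeError:
            # Assumption: an undecodable name simply never matches.
            continue
        if re.fullmatch(r".", text):
            count += 1
    return count

def count_byte_matches(names):
    """Count names that are exactly one byte long, as in the C locale."""
    return sum(1 for name in names if re.fullmatch(rb".", name))

bad = [b"\xff"]                  # hypothetical invalid UTF-8 name
accented = ["é".encode("utf-8")]  # one character, two bytes

print(count_text_matches(bad), count_byte_matches(bad))
print(count_text_matches(accented), count_byte_matches(accented))
```

The second pair shows the price of the byte-level approach: a valid
one-character name that takes two bytes no longer matches a
single-character pattern.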

Nasty locale-related bugs plague installation scripts, whose writers are
not accustomed to running their tests under many different locales. The
topic is of course larger than just Unicode.


Marko


