[Tutor] Unicode? UTF-8? UTF-16? WTF-8? ;)

Peter Otten __peter__ at web.de
Wed Sep 5 12:33:46 CEST 2012


Ray Jones wrote:

> I have directory names that contain Russian characters, Romanian
> characters, French characters, et al. When I search for a file using
> glob.glob(), I end up with stuff like \x93\x8c\xd1 in place of the
> directory names. I thought simply identifying them as Unicode would
> clear that up. Nope. Now I have stuff like \u0456\u0439\u043e.

That's the repr() form, which is guaranteed to be all-ASCII. Python 
automatically applies repr() to a unicode string when it is printed as part 
of a list:

>>> files = [u"\u0456\u0439\u043e"] # files = glob.glob(unicode_pattern)
>>> print files
[u'\u0456\u0439\u043e']

To see the actual characters, print the unicode strings individually:

>>> for file in files:
...     print file
... 
ійо
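If you want the whole list on one line without the repr() escapes, joining 
the strings first also works. A small sketch (the file names here are 
made-up stand-ins for glob.glob() results):

```python
# -*- coding: utf-8 -*-
# Hypothetical file names standing in for glob.glob() results:
files = [u"\u0456\u0439\u043e", u"\u042f"]

# Printing the list shows repr() escapes; joining the unicode strings
# into one string first prints the actual characters:
print(u", ".join(files))
```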

> These representations of directory names are eventually going to be
> passed to Dolphin (my file manager). Will they pass to Dolphin properly?

How exactly do you "pass" these names?

> Do I need to run a conversion? 

When you write them to a file you need to pick an encoding.
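For example, io.open() (available in Python 2.6+ and 3) takes an explicit 
encoding, so the unicode names are converted on the way out. A sketch -- the 
file name and the choice of UTF-8 are mine, not from your setup:

```python
# -*- coding: utf-8 -*-
import io  # io.open() accepts an encoding on Python 2.6+ and 3

names = [u"\u0456\u0439\u043e"]  # hypothetical glob.glob() result

# Each unicode string is encoded to UTF-8 bytes as it is written:
with io.open("names.txt", "w", encoding="utf-8") as f:
    for name in names:
        f.write(name + u"\n")
```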

> Can that happen automatically within the
> script considering that the various types of characters are all mixed
> together in the same directory (i.e. # coding: Latin-1 at the top of the
> script is not going to address all the different types of characters).

The coding cookie tells Python how to interpret the bytes in the source 
file, so

# -*- coding: utf-8 -*-
s = u"äöü"

and

# -*- coding: latin1 -*-
s = u"äöü"

contain a different byte sequence on disc, but once imported the two strings 
are equal (and have the same in-memory layout):

>>> import codecs
>>> for encoding in "latin-1", "utf-8":
...     with codecs.open("tmp_%s.py" % encoding.replace("-", ""), "w",
...                      encoding=encoding) as f:
...         f.write(u'# -*- coding: %s\ns = u"äöü"' % encoding)
... 
>>> for encoding in "latin1", "utf8":
...     open("tmp_%s.py" % encoding).read()
... 
'# -*- coding: latin-1\ns = u"\xe4\xf6\xfc"'
'# -*- coding: utf-8\ns = u"\xc3\xa4\xc3\xb6\xc3\xbc"'
>>> from tmp_latin1 import s
>>> from tmp_utf8 import s as t
>>> s == t
True


> While on the subject, I just read through the Unicode info for Python
> 2.7.3. The history was interesting, but the implementation portion was
> beyond me. I was looking for a way for a Russian 'backward R' to look
> like a Russian 'backward R' - not for a bunch of \xxx and \uxxxxx stuff.

>>> ya = u"\N{CYRILLIC CAPITAL LETTER YA}"
>>> ya
u'\u042f'
>>> print ya
Я
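The \N{...} escapes use the official Unicode character names, and the 
unicodedata module lets you look them up in both directions:

```python
# -*- coding: utf-8 -*-
import unicodedata

ya = u"\N{CYRILLIC CAPITAL LETTER YA}"  # same character as u"\u042f"

# name() goes from character to name, lookup() from name to character:
print(unicodedata.name(ya))
print(unicodedata.lookup("CYRILLIC CAPITAL LETTER YA") == ya)
```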

This only works because Python correctly guesses the terminal encoding. If 
you pipe the output to a file or to another program, Python 2 assumes ascii 
and you will see an encoding error:
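You can see what Python guessed by inspecting sys.stdout.encoding; on 
Python 2 it is None when stdout is not a terminal, which is why ascii is 
used as the fallback:

```python
import sys

# On a terminal this is typically something like "UTF-8"; on Python 2
# it is None when stdout is a pipe or a file, and print then falls
# back to the ascii codec.
print(sys.stdout.encoding)
```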

$ cat tmp.py
# -*- coding: utf-8 -*-
print u"Я"
$ python tmp.py
Я
$ python tmp.py | cat
Traceback (most recent call last):
  File "tmp.py", line 2, in <module>
    print u"Я"
UnicodeEncodeError: 'ascii' codec can't encode character u'\u042f' in position 0: ordinal not in range(128)

You can work around that by specifying the appropriate encoding explicitly:
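tmp2.py itself isn't shown here; judging from the traceback below, its core 
is `print u"Я".encode(encoding)` with the encoding name taken from the 
command line. A version-portable sketch of the same idea (the details are my 
reconstruction, not the original script):

```python
# -*- coding: utf-8 -*-
# Sketch of a tmp2.py-style script: encode u"Я" with an encoding named
# on the command line instead of relying on the guessed terminal encoding.
import sys

def emit(encoding, stream=None):
    # Raises UnicodeEncodeError if the encoding lacks the character,
    # e.g. latin-1; iso-8859-5 and utf-8 both succeed.
    data = u"\u042f".encode(encoding)  # u"Я"
    if stream is None:
        # Write raw bytes: sys.stdout.buffer on Python 3, sys.stdout on 2.
        stream = getattr(sys.stdout, "buffer", sys.stdout)
    stream.write(data)

if __name__ == "__main__" and len(sys.argv) > 1:
    emit(sys.argv[1])
```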

$ python tmp2.py iso-8859-5 | cat
�
$ python tmp2.py latin1 | cat
Traceback (most recent call last):
  File "tmp2.py", line 4, in <module>
    print u"Я".encode(encoding)
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u042f' in position 0: ordinal not in range(256)
