[Tutor] Unicode? UTF-8? UTF-16? WTF-8? ;)
Peter Otten
__peter__ at web.de
Wed Sep 5 12:33:46 CEST 2012
Ray Jones wrote:
> I have directory names that contain Russian characters, Romanian
> characters, French characters, et al. When I search for a file using
> glob.glob(), I end up with stuff like \x93\x8c\xd1 in place of the
> directory names. I thought simply identifying them as Unicode would
> clear that up. Nope. Now I have stuff like \u0456\u0439\u043e.
That's the repr() form, which is guaranteed to be all-ASCII. Python
automatically applies repr() to a unicode string when it is part of a list:
>>> files = [u"\u0456\u0439\u043e"] # files = glob.glob(unicode_pattern)
>>> print files
[u'\u0456\u0439\u043e']
To see the actual characters, print the unicode strings individually:
>>> for file in files:
...     print file
...
ійо
> These representations of directory names are eventually going to be
> passed to Dolphin (my file manager). Will they pass to Dolphin properly?
How exactly do you "pass" these names?
> Do I need to run a conversion?
When you write them to a file you need to pick an encoding.
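For example, you can either encode the name by hand and work with the raw
bytes, or let the file object do the encoding for you. A minimal sketch using
the Cyrillic name from above ("names.txt" is a made-up output file):

```python
# -*- coding: utf-8 -*-
import io

name = u"\u0456\u0439\u043e"           # u"ійо", as glob.glob() returns it

# Encoding by hand gives you the raw bytes:
data = name.encode("utf-8")            # b"\xd1\x96\xd0\xb9\xd0\xbe"

# Or open the file with an explicit encoding and write unicode directly
# (io.open here; codecs.open works the same way in older code):
with io.open("names.txt", "w", encoding="utf-8") as f:
    f.write(name + u"\n")
```

UTF-8 is usually the safe pick here because it can represent every character,
no matter how many languages are mixed in one directory.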
> Can that happen automatically within the
> script considering that the various types of characters are all mixed
> together in the same directory (i.e. # coding: Latin-1 at the top of the
> script is not going to address all the different types of characters).
The coding cookie tells Python how to interpret the bytes in the source file, so
# -*- coding: utf-8 -*-
s = u"äöü"
and
# -*- coding: latin1 -*-
s = u"äöü"
contain a different byte sequence on disc, but once imported the two strings
are equal (and have the same in-memory layout):
>>> import codecs
>>> for encoding in "latin-1", "utf-8":
...     with codecs.open("tmp_%s.py" % encoding.replace("-", ""), "w",
...                      encoding=encoding) as f:
...         f.write(u'# -*- coding: %s\ns = u"äöü"' % encoding)
...
>>> for encoding in "latin1", "utf8":
...     open("tmp_%s.py" % encoding).read()
...
'# -*- coding: latin-1\ns = u"\xe4\xf6\xfc"'
'# -*- coding: utf-8\ns = u"\xc3\xa4\xc3\xb6\xc3\xbc"'
>>> from tmp_latin1 import s
>>> from tmp_utf8 import s as t
>>> s == t
True
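The same equality can be checked without writing module files at all, by
decoding the byte sequences from the transcript above directly:

```python
# -*- coding: utf-8 -*-
# The bytes that ended up on disk in the two module files above:
latin1_bytes = b"\xe4\xf6\xfc"                # "äöü" in latin-1
utf8_bytes = b"\xc3\xa4\xc3\xb6\xc3\xbc"      # "äöü" in utf-8

# Decoding each with its own encoding yields equal unicode strings:
s = latin1_bytes.decode("latin-1")
t = utf8_bytes.decode("utf-8")
assert s == t == u"\xe4\xf6\xfc"
```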
> While on the subject, I just read through the Unicode info for Python
> 2.7.3. The history was interesting, but the implementation portion was
> beyond me. I was looking for a way for a Russian 'backward R' to look
> like a Russian 'backward R' - not for a bunch of \xxx and \uxxxxx stuff.
>>> ya = u"\N{CYRILLIC CAPITAL LETTER YA}"
>>> ya
u'\u042f'
>>> print ya
Я
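The unicodedata module maps between characters and those \N{...} names in both
directions, which helps when you know what a character is called but not its
code point. A small sketch:

```python
import unicodedata

ya = u"\N{CYRILLIC CAPITAL LETTER YA}"        # the Russian "backward R"

# Character -> name, name -> character, and the numeric code point:
assert unicodedata.name(ya) == "CYRILLIC CAPITAL LETTER YA"
assert unicodedata.lookup("CYRILLIC CAPITAL LETTER YA") == u"\u042f"
assert ord(ya) == 0x042F
```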
This only works because Python correctly guesses the terminal encoding. If
you are piping the output to another program it will assume ascii and you
will see an encoding error:
$ cat tmp.py
# -*- coding: utf-8 -*-
print u"Я"
$ python tmp.py
Я
$ python tmp.py | cat
Traceback (most recent call last):
  File "tmp.py", line 2, in <module>
    print u"Я"
UnicodeEncodeError: 'ascii' codec can't encode character u'\u042f' in position 0: ordinal not in range(128)
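The encoding Python guessed (or failed to guess) is visible as
sys.stdout.encoding; under Python 2 it is None when stdout is not a terminal,
which is why the pipe falls back to ascii. A quick check:

```python
import sys

# A terminal reports its encoding, e.g. "UTF-8"; under Python 2 a pipe
# reports None, which triggers the ascii fallback seen above.
print(sys.stdout.encoding)
```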
You can work around that by specifying the appropriate encoding explicitly:
$ python tmp2.py iso-8859-5 | cat
�
$ python tmp2.py latin1 | cat
Traceback (most recent call last):
  File "tmp2.py", line 4, in <module>
    print u"Я".encode(encoding)
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u042f' in position 0: ordinal not in range(256)