[Tutor] Unicode? UTF-8? UTF-16? WTF-8? ;)

Wed Sep 5 13:05:30 CEST 2012

On 09/05/2012 03:33 AM, Peter Otten wrote:
> Ray Jones wrote:
>
>> I have directory names that contain Russian characters, Romanian
>> characters, French characters, etc. When I search for a file using
>> glob.glob(), I end up with stuff like \x93\x8c\xd1 in place of the
>> directory names. I thought simply identifying them as Unicode would
>> clear that up. Nope. Now I have stuff like \u0456\u0439\u043e.
>
>>>> files = [u"\u0456\u0439\u043e"] # files = glob.glob(unicode_pattern)
>>>> print files
> [u'\u0456\u0439\u043e']
>
> To see the actual characters print the unicode strings individually:
>
>>>> for file in files:
> ...     print file
> ... 
> ійо
Aha! That works.
>> These representations of directory names are eventually going to be
>> passed to Dolphin (my file manager). Will they pass to Dolphin properly?
> How exactly do you "pass" these names?
I will be calling Dolphin with subprocess.call() and passing the
directories as command line arguments.
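
Roughly what I have in mind is sketched below. Treat it as an untested
guess rather than working code: build_command and open_in_dolphin are
names I made up, and I am assuming that Dolphin is on the PATH as
"dolphin" and that it wants the names in the filesystem encoding.

```python
# -*- coding: utf-8 -*-
import subprocess
import sys


def build_command(program, directories):
    """Return an argv list for subprocess.call().

    build_command is a hypothetical helper, not part of any library.
    """
    # Assumption: the file manager expects names in the filesystem encoding.
    encoding = sys.getfilesystemencoding() or "utf-8"
    args = [program]
    for name in directories:
        # Python 2 only: turn unicode objects into encoded byte strings.
        # On Python 3 the first test is False, so `unicode` is never looked up.
        if sys.version_info[0] == 2 and isinstance(name, unicode):
            name = name.encode(encoding)
        args.append(name)
    return args


def open_in_dolphin(directories):
    # Assumption: an executable named "dolphin" is on the PATH.
    return subprocess.call(build_command("dolphin", directories))
```

Passing a list instead of one long string means no shell is involved,
so spaces or quotes in the directory names should not need escaping.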

> $ cat tmp.py
> # -*- coding: utf-8 -*-
> print u"Я"
> $ python tmp.py
> Я
> $ python tmp.py | cat
> Traceback (most recent call last):
>   File "tmp.py", line 2, in <module>
>     print u"Я"
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u042f' in position 0: ordinal not in range(128)
>
> You can work around that by specifying the appropriate encoding explicitly:
>
> $ python tmp2.py iso-8859-5 | cat
> �
> $ python tmp2.py latin1 | cat
> Traceback (most recent call last):
>   File "tmp2.py", line 4, in <module>
>     print u"Я".encode(encoding)
> UnicodeEncodeError: 'latin-1' codec can't encode character u'\u042f' in position 0: ordinal not in range(256)
>
But doesn't that entail knowing in advance which encoding you will be
working with? How would you automate the process while reading existing
files?
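
The only approach I can think of is to try a list of likely encodings
in order and keep the first one that decodes without an error. A
minimal sketch, assuming the raw bytes are already in hand;
decode_with_fallback and the candidate list are my own invention, not
anything from the standard library:

```python
# -*- coding: utf-8 -*-

def decode_with_fallback(data, candidates=("utf-8", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes data.

    Hypothetical helper: this is a heuristic, not real detection.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding; try the next one.
            continue
    raise ValueError("none of the candidate encodings fit")


# The UTF-8 bytes for U+042F decode on the first attempt.
text, used = decode_with_fallback(b"\xd0\xaf")

# A lone 0xCF byte is invalid UTF-8, so the second candidate is used.
text2, used2 = decode_with_fallback(b"\xcf", ("utf-8", "iso-8859-5"))
```

The order matters: a single-byte codec such as latin-1 accepts every
byte value, so it never raises and anything listed after it is never
tried. A "successful" decode can therefore still be the wrong text.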


Ray