[Tutor] Unicode? UTF-8? UTF-16? WTF-8? ;)

Ray Jones crawlzone at gmail.com
Wed Sep 5 13:05:30 CEST 2012


On 09/05/2012 03:33 AM, Peter Otten wrote:
> Ray Jones wrote:
>
>> I have directory names that contain Russian characters, Romanian
>> characters, French characters, et al. When I search for a file using
>> glob.glob(), I end up with stuff like \x93\x8c\xd1 in place of the
>> directory names. I thought simply identifying them as Unicode would
>> clear that up. Nope. Now I have stuff like \u0456\u0439\u043e.
>
> >>> files = [u"\u0456\u0439\u043e"] # files = glob.glob(unicode_pattern)
> >>> print files
> [u'\u0456\u0439\u043e']
>
> To see the actual characters print the unicode strings individually:
>
> >>> for file in files:
> ...     print file
> ... 
> ійо
Aha! That works.
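
For the record, a minimal sketch of why the two displays differ. It only assumes the three-character name from the example above; the variable names are mine. The \uXXXX form is just the escaped spelling that a list display uses for each item, not different data. It is written so that it should run on Python 2.6+ and Python 3.3+.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Same three characters as in the example above, written with escapes.
name = u"\u0456\u0439\u043e"

# The escaped spelling and the readable spelling are one and the same string.
same = (name == u"ійо")

# The unicode_escape codec reproduces the backslash form seen in the list.
escaped = name.encode("unicode_escape")

# Only ASCII is printed here, so this is safe on any terminal or pipe.
print(same, len(name), escaped)
```

So nothing is lost or mangled in the list; only its display is escaped.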
>> These representations of directory names are eventually going to be
>> passed to Dolphin (my file manager). Will they pass to Dolphin properly?
> How exactly do you "pass" these names?
I will be calling Dolphin with subprocess.call() and passing the
directories as command line arguments.
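
Roughly what I have in mind, as a sketch only. build_argv() and open_in_dolphin() are hypothetical helpers of mine, and I am assuming a POSIX system where "dolphin" is on the PATH and accepts directories as plain arguments. The names are encoded to bytes with the filesystem encoding before they reach subprocess.call(), so the implicit ASCII conversion never gets a chance to fail.

```python
import subprocess
import sys


def to_bytes(text, encoding):
    """Encode unicode text to bytes; pass byte strings through unchanged."""
    if isinstance(text, bytes):
        return text
    return text.encode(encoding)


def build_argv(program, directories, encoding=None):
    """Build an argv list of byte strings for subprocess.call().

    Defaults to the filesystem encoding, falling back to UTF-8 (assumption).
    """
    encoding = encoding or sys.getfilesystemencoding() or "utf-8"
    return [to_bytes(program, encoding)] + [
        to_bytes(d, encoding) for d in directories
    ]


def open_in_dolphin(directories):
    # Assumes "dolphin" is on PATH; nothing runs until this is called.
    return subprocess.call(build_argv(u"dolphin", directories))
```

Building the argument list separately from the call makes it possible to inspect exactly which bytes Dolphin would receive.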

> $ cat tmp.py
> # -*- coding: utf-8 -*-
> print u"Я"
> $ python tmp.py
> Я
> $ python tmp.py | cat
> Traceback (most recent call last):
>   File "tmp.py", line 2, in <module>
>     print u"Я"
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u042f' in position 0: ordinal not in range(128)
>
> You can work around that by specifying the appropriate encoding explicitly:
>
> $ python tmp2.py iso-8859-5 | cat
> $ python tmp2.py latin1 | cat
> Traceback (most recent call last):
>   File "tmp2.py", line 4, in <module>
>     print u"Я".encode(encoding)
> UnicodeEncodeError: 'latin-1' codec can't encode character u'\u042f' in position 0: ordinal not in range(256)
>
But doesn't that entail knowing in advance which encoding you will be
working with? How would you automate the process while reading existing
files?
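
For the output side at least, one possible way to avoid hard-coding the encoding is to ask the stream and fall back to the locale. This is a sketch under my own assumptions: output_encoding() and encode_for_output() are hypothetical helper names, and the final "utf-8" fallback is a guess. It does not solve the harder problem of detecting the encoding of the contents of an arbitrary existing file, which generally cannot be known for certain.

```python
import locale
import sys


def output_encoding(stream=None):
    """Pick an encoding for a stream without hard-coding it.

    stream.encoding can be None (e.g. Python 2 writing to a pipe), so fall
    back to the locale's preferred encoding, then to UTF-8 (assumption).
    """
    if stream is None:
        stream = sys.stdout
    return (getattr(stream, "encoding", None)
            or locale.getpreferredencoding()
            or "utf-8")


def encode_for_output(text, stream=None):
    """Encode text for the stream, replacing unencodable characters with '?'."""
    return text.encode(output_encoding(stream), "replace")
```

With "replace", a character the target encoding lacks turns into "?" rather than raising UnicodeEncodeError, which trades accuracy for robustness.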


Ray

