[Tutor] Unicode? UTF-8? UTF-16? WTF-8? ;)

eryksun eryksun at gmail.com
Wed Sep 5 16:31:16 CEST 2012


On Wed, Sep 5, 2012 at 5:42 AM, Ray Jones <crawlzone at gmail.com> wrote:
> I have directory names that contain Russian characters, Romanian
> characters, French characters, et al. When I search for a file using
> glob.glob(), I end up with stuff like \x93\x8c\xd1 in place of the
> directory names. I thought simply identifying them as Unicode would
> clear that up. Nope. Now I have stuff like \u0456\u0439\u043e.

This is just an FYI in case you were decoding the names manually.
glob.glob() calls os.listdir(dirname), and os.listdir returns Unicode
names when it is given a Unicode argument. So if you pass glob a
Unicode pattern, you get Unicode output:

    >>> t = u"\u0456\u0439\u043e"
    >>> open(t, 'w').close()

    >>> import glob

    >>> glob.glob('*')  # UTF-8 output
    ['\xd1\x96\xd0\xb9\xd0\xbe']

    >>> glob.glob(u'*')
    [u'\u0456\u0439\u043e']
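If you do end up with the byte-string form, the two results above are
the same name, one as UTF-8 bytes and one as text. Here is a minimal
sketch of converting between them. It assumes the file system encoding
is UTF-8, as in the session above, and to_text is a hypothetical helper
name of mine, not something provided by glob or os:

```python
def to_text(name, encoding="utf-8"):
    """Return name as text, decoding it first if it is a byte string.

    Hypothetical helper; the UTF-8 default is an assumption about the
    file system encoding.
    """
    if isinstance(name, bytes):
        return name.decode(encoding)
    return name


byte_name = b"\xd1\x96\xd0\xb9\xd0\xbe"  # the form glob.glob('*') returned
text_name = u"\u0456\u0439\u043e"        # the form glob.glob(u'*') returned

# Decoding the bytes gives the text form, and encoding goes back again.
assert to_text(byte_name) == text_name
assert text_name.encode("utf-8") == byte_name
```

Passing a Unicode pattern to glob is still the simpler route, since it
avoids guessing the encoding yourself.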

Regarding subprocess.Popen, just use Unicode -- at least on a POSIX
system. Popen calls an exec function, such as posix.execv, which
handles encoding Unicode arguments to the file system encoding.

On Windows, the _subprocess C extension in 2.x is limited to calling
CreateProcessA with char* 8-bit strings. Unicode arguments get
converted using the default encoding (ASCII), so any character beyond
ASCII triggers a UnicodeEncodeError.
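To illustrate the difference, here is a rough sketch that only mimics
the encoding step. It does not call execv or CreateProcessA, encode_arg
is a hypothetical name, and UTF-8 stands in for the file system
encoding as an assumption:

```python
def encode_arg(arg, encoding):
    """Convert a text argument to an 8-bit string (hypothetical helper).

    Byte strings are passed through unchanged.
    """
    if isinstance(arg, bytes):
        return arg
    return arg.encode(encoding)


name = u"\u0456\u0439\u043e"

# Assumed UTF-8 file system encoding: the conversion succeeds.
posix_style = encode_arg(name, "utf-8")
assert posix_style == b"\xd1\x96\xd0\xb9\xd0\xbe"

# ASCII default encoding: the same name cannot be converted.
try:
    encode_arg(name, "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
assert ascii_failed
```

One workaround on the 2.x Windows side is to encode the argument
yourself before handing it to Popen, with an encoding that can actually
represent the characters.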

