[Tutor] Opening filenames with unicode characters

Jerry Hill malaclypse2 at gmail.com
Thu Jun 28 21:06:03 CEST 2012


On Thu, Jun 28, 2012 at 2:55 PM, James Chapman <james at uplinkzero.com> wrote:
> Why can I not convert my existing byte string into a unicode string?

That would work fine, as long as you tell Python which encoding the
bytes are in.

> In the mean time I'll create my original string as unicode and see if that
> solves my problem.
>
> >>> fileName = unicode(filename)
>
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c in position 35:
> invalid start byte

Here are a couple of questions that you'll need to answer 'Yes' to
before you're going to get this to work reliably:

Are you familiar with the differences between byte strings and unicode
strings?  Do you understand how to convert from one to the other,
using a particular encoding?  Do you know what encoding your source
file is saved in?  If your string is not coming from a source file,
but some other source of bytes, do you know what encoding those bytes
are using?
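
To make the byte string / unicode string distinction concrete, here is a
minimal sketch.  The sample text is made up for illustration.  It uses
the .encode() and .decode() methods, which run unchanged on Python 2.6+
and on Python 3; in Python 2, raw.decode(enc) does the same job as
unicode(raw, enc).

```python
# Made-up sample text: 'cafe' with an accented final e (U+00E9).
text = u'caf\xe9'

# The same text becomes different bytes under different encodings.
utf8_bytes = text.encode('utf-8')      # accented e -> two bytes, 0xc3 0xa9
latin1_bytes = text.encode('latin-1')  # accented e -> one byte, 0xe9

# Decoding with the encoding that produced the bytes restores the text.
assert utf8_bytes.decode('utf-8') == text
assert latin1_bytes.decode('latin-1') == text

# Decoding with a different encoding can raise (as here) or, worse,
# succeed and quietly give you the wrong characters.
try:
    latin1_bytes.decode('utf-8')
    wrong_decode_failed = False
except UnicodeDecodeError:
    wrong_decode_failed = True
```

The point is that the bytes alone don't tell you what text they stand
for; you always need the encoding as well.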

Try the following.  Before trying to convert filename to unicode, do a
"print repr(filename)".  That will show you the byte string, along
with the numeric codes for the non-ASCII parts.  Then convert those
bytes to a unicode object using the appropriate encoding.  If the
bytes are UTF-8, then you'd do something like this:

    unicode_filename = unicode(filename, 'utf-8')

If your byte string is actually Shift-JIS encoded, you'd do this instead:

    unicode_filename = unicode(filename, 'shift-jis')
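
Here is a rough sketch of that inspect-then-decode sequence.  The byte
string is hypothetical (I don't know what your real filename contains),
and treating it as cp1252 is purely an assumption for the sake of the
example.  It is written with .decode() so it runs on both Python 2.6+
and Python 3.

```python
# Hypothetical byte string containing a 0x9c byte, like the one the
# traceback complains about.
filename = b'c\x9cur.txt'

# repr() shows the raw bytes, with the non-ASCII ones as \x escapes.
print(repr(filename))

try:
    unicode_filename = filename.decode('utf-8')
except UnicodeDecodeError as exc:
    # 0x9c cannot start a UTF-8 sequence, so this branch is taken.
    print('utf-8 failed: %s' % exc)
    # Assumption for this example only: the bytes are really cp1252,
    # where 0x9c is the oe ligature (U+0153).
    unicode_filename = filename.decode('cp1252')

print(repr(unicode_filename))
```

If the repr() output of your own filename shows bytes that make sense in
some other encoding, substitute that encoding's name instead.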

If you don't know what encoding your byte string is in, you either
have to give up, guess, or try a bunch of likely possibilities until
something works.  If you really, really have to guess and there's no
way for you to know for sure what encoding a particular byte string is
in, the third-party chardet module may be able to help.
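
The "try a bunch of likely possibilities" approach could look something
like the sketch below.  The helper name decode_with_candidates, the
candidate list, and the sample bytes are all made up for illustration;
pick candidates that are plausible for wherever your bytes come from.

```python
def decode_with_candidates(raw, encodings=('utf-8', 'shift-jis', 'cp1252')):
    """Return (text, encoding) for the first candidate that decodes raw.

    Hypothetical helper: candidates are tried in order, so list the
    strictest encodings first and the most permissive ones last.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings could decode the bytes')


# Sample bytes for the demo: a Japanese filename encoded as Shift-JIS.
raw = u'\u65e5\u672c\u8a9e.txt'.encode('shift-jis')
text, used = decode_with_candidates(raw)
```

Keep in mind that a decode that doesn't raise is not proof that the
guess was right.  Permissive single-byte encodings such as cp1252 accept
almost any byte sequence, so check that the result actually looks like
the text you expected.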

-- 
Jerry

