[Tutor] Opening filenames with unicode characters
Jerry Hill
malaclypse2 at gmail.com
Thu Jun 28 21:06:03 CEST 2012
On Thu, Jun 28, 2012 at 2:55 PM, James Chapman <james at uplinkzero.com> wrote:
> Why can I not convert my existing byte string into a unicode string?
That would work fine.
> In the mean time I'll create my original string as unicode and see if that
> solves my problem.
>
> >>> fileName = unicode(filename)
>
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c in position 35:
> invalid start byte
Here are a few questions that you'll need to answer 'Yes' to
before you're going to get this to work reliably:
Are you familiar with the differences between byte strings and unicode
strings? Do you understand how to convert from one to the other,
using a particular encoding? Do you know what encoding your source
file is saved in? If your string is not coming from a source file,
but some other source of bytes, do you know what encoding those bytes
are using?
Try the following. Before trying to convert filename to unicode, do a
"print repr(filename)". That will show you the byte string, along
with the numeric codes for the non-ascii parts. Then convert those
bytes to a unicode object using the appropriate encoding. If the
bytes are utf-8, then you'd do something like this:

    unicode_filename = unicode(filename, 'utf-8')

If your bytestring is actually shift-jis encoded, you'd do this instead:

    unicode_filename = unicode(filename, 'shift-jis')

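To make the repr-then-decode steps concrete, here is a minimal sketch. The byte strings are made-up stand-ins for your filename (I don't know what your real bytes are), and it uses the .decode() method rather than unicode(), since bytes.decode(encoding) does the same job and also runs unchanged on Python 3:

```python
# Hypothetical stand-in for `filename`: the UTF-8 bytes of u'caf\xe9.txt'.
raw_name = b'caf\xc3\xa9.txt'

# Step 1: look at the raw bytes, including the escape codes for the
# non-ascii parts.
print(repr(raw_name))

# Step 2: decode using the encoding the bytes are actually in.
# On Python 2 this is equivalent to unicode(raw_name, 'utf-8').
utf8_name = raw_name.decode('utf-8')

# A second made-up example: two kanji encoded as shift-jis.
sjis_raw = b'\x93\xfa\x96\x7b.txt'
sjis_name = sjis_raw.decode('shift-jis')

# Decoding with the wrong encoding fails the same way your traceback did.
try:
    sjis_raw.decode('utf-8')
    wrong_guess_failed = False
except UnicodeDecodeError:
    wrong_guess_failed = True
```

The key point is that the encoding name has to match how the bytes were produced; the decode call itself can't work that out for you.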
If you don't know what encoding your byte string is in, you either
have to give up, guess, or try a bunch of likely possibilities until
something works. If you really, really have to guess and there's no
way for you to know for sure what encoding a particular byte string is
in, the third-party chardet module may be able to help.
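If you do end up trying a bunch of likely possibilities, something along these lines would do it. This is only a sketch: decode_with_candidates is a name I made up, and the candidate list is an assumption you'd replace with the encodings that are plausible for your data.

```python
def decode_with_candidates(raw, candidates):
    """Return (text, encoding) for the first candidate that decodes `raw`.

    `candidates` is an ordered sequence of encoding names to try; put
    the most likely (and the strictest) encodings first.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            # This guess was wrong; move on to the next one.
            continue
    raise ValueError('none of the candidate encodings could decode %r' % (raw,))


# Made-up bytes containing 0x9c, which is not a valid UTF-8 start byte.
text, used = decode_with_candidates(b'file\x9c.txt', ('utf-8', 'cp1252'))
print(repr(text), used)
```

Keep in mind that a decode that doesn't raise is not proof the guess was right. Single-byte encodings like cp1252 accept almost any byte, so they belong at the end of the list. If you have chardet installed, chardet.detect(raw) returns a dictionary with a guessed 'encoding' and a 'confidence' value, which is still only a guess.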
--
Jerry