how can I convert invalid ASCII string to Unicode?

skip at pobox.com
Tue May 8 23:29:20 EDT 2001


I have been blissfully ignoring Unicode.  Alas, my bliss has been so rudely
interrupted...

Suppose I have this string:

    s = "ö"    # "o" with an umlaut

and I'd like to convert it to Unicode (and from there to UTF-8).  (I know I
can preface string literals with 'u', but that's not an option here.
Pretend s was assigned from a file read.)

Simply executing

    u = unicode(s)

fails with a UnicodeError because ord(s) is > 127 and the default encoding
is ASCII.  I eventually figured out that the following would work:

    u = "".join([unichr(ord(c)) for c in s])

but this seems a bit obscure.  Is there a cleaner way to convert plain
strings containing characters with ordinals > 127 to Unicode (and then to
UTF-8)?  Ideally I guess I'd like plain strings to be interpreted as Latin-1
instead of ASCII by default, even though my locale is 'murican.
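
To make the question concrete, here is a minimal sketch of the behaviour I'm
after.  It assumes the bytes really are Latin-1, and it is written for a
Python 3 interpreter (where a byte string is spelled bytes and the Unicode
type is str) so that it runs as-is.  The names raw, text, per_char and utf8
are illustrative only.

```python
# Assumption: the bytes read from the file are Latin-1 (ISO-8859-1).
raw = b"\xf6"  # the single Latin-1 byte for "o" with an umlaut

# Decode with an explicitly named codec instead of relying on the
# ASCII default.
text = raw.decode("latin-1")

# The per-character workaround gives the same result, because Latin-1
# maps each byte value to the code point with the same number.
per_char = "".join(chr(b) for b in raw)
assert per_char == text

# Producing UTF-8 is a separate, second step: encode the decoded text.
utf8 = text.encode("utf-8")
assert utf8 == b"\xc3\xb6"
```

In the 2.x interpreter I'm actually using, the same idea would be spelled
unicode(s, "latin-1"), followed by .encode("utf-8") to get the UTF-8 bytes.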

Thx,

-- 
Skip Montanaro (skip at pobox.com)
(847)971-7098



