BUG? Python 2.0 chokes on international characters in Unicode strings

Wed Jan 31 09:01:43 EST 2001

Jurie Horneman wrote:
> Is this a known bug? (If so, AAARRGGHHH - could it really be that Python
> basically doesn't work outside of the US? Hard to believe: these characters
> are used in the Dutch language...)

it's not a bug -- it's just that there are only 128 ASCII code
points, and 40,000+ unicode characters, so the default (ASCII)
conversion to an 8-bit string can't represent everything...

> Is there some workaround? Could I convert Unicode strings to ASCII? If so,
> how?

use the encode method:

    u = u"caf\xe9" # any unicode string
    s = str(u) # uses the default (ascii) encoding; will fail for non-ascii characters
    s = u.encode("utf-8") # will always work
    s = u.encode("iso-8859-1") # may fail for non-latin-1 characters
    s = u.encode("ascii", "ignore") # won't fail, but may lose chars
    s = u.encode("ascii", "replace") # won't fail, will replace non-ascii chars

    import locale
    # get the most likely output encoding (works on most unix,
    # windows, and macintosh installations)
    loc, enc = locale.getdefaultlocale()
    enc = enc or "ascii" # enc is None if the encoding can't be determined
    print u.encode(enc, "replace")
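
to see how the different error handlers behave, here is a minimal sketch. the try_encode helper and the sample string are made up for illustration (they're not part of Python), and the code is written so that it should also run on later Python releases, where encode returns a bytes object:

```python
def try_encode(u, encoding, errors="strict"):
    """Hypothetical helper: return the encoded string, or None on failure."""
    try:
        return u.encode(encoding, errors)
    except UnicodeError:
        # raised in "strict" mode when a character has no representation
        return None

# Dutch "een" with two e-acutes, a space, and a euro sign.
# e-acute (U+00E9) exists in latin-1; the euro sign (U+20AC) does not.
sample = u"\xe9\xe9n \u20ac"

as_utf8 = try_encode(sample, "utf-8")                # always works
as_latin1 = try_encode(sample, "iso-8859-1")         # None: no euro sign in latin-1
as_ascii = try_encode(sample, "ascii")               # None: three non-ascii characters
ascii_ignored = try_encode(sample, "ascii", "ignore")    # non-ascii characters dropped
ascii_replaced = try_encode(sample, "ascii", "replace")  # non-ascii characters become "?"
```

with "ignore", only the "n" and the space survive; with "replace", each of the three non-ascii characters turns into a question mark, so the length is preserved but the text isn't.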

see
http://www.python.org/doc/current/lib/string-methods.html
for more info on the encode method, and

http://www.python.org/doc/current/lib/module-codecs.html
for more info on codecs (including stream codecs)
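
as a rough sketch of what a stream codec does: codecs.getwriter returns a writer class that encodes unicode strings as they are written to an underlying byte stream. the in-memory buffer below is only a stand-in for a real file opened in binary mode, and the io module used for it only exists in later Python releases, so treat this as an illustration rather than Python 2.0 code:

```python
import codecs
import io

# In-memory byte stream; stands in for a file opened in binary mode.
buf = io.BytesIO()

# Wrap the stream in a UTF-8 StreamWriter; every write is encoded first.
writer = codecs.getwriter("utf-8")(buf)
writer.write(u"\xe9\xe9n \u20ac\n")

# The underlying stream now holds the UTF-8 encoded bytes.
encoded = buf.getvalue()
```

the advantage over calling encode on each string is that the encoding is chosen once, when the stream is wrapped, instead of at every write.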

Cheers /F