Trouble with unicode

Mon May 14 18:02:33 EDT 2001

>they look about 3 characters long but are only 1 really, I already have
>experience converting Unix characters over.

Sounds like UTF-8. If it is, you can just replace 'latin-1' below with
'utf-8' :-)
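
For illustration, here is a minimal sketch of that substitution. It assumes a current Python 3 interpreter (the session quoted in this thread is Python 2), where raw bytes and text are separate types. The helper name to_ascii and the sample bytes are made up for the example:

```python
def to_ascii(raw, encoding):
    """Decode raw bytes with the given encoding, then replace
    every non-ASCII character with '?'."""
    text = raw.decode(encoding)
    return text.encode("ascii", "replace").decode("ascii")

# Hypothetical sample: a-umlaut, comma, o-umlaut stored as UTF-8.
# Each umlaut takes two bytes in this encoding.
utf8_bytes = b"\xc3\xa4, \xc3\xb6"
result = to_ascii(utf8_bytes, "utf-8")
print(result)
```

Only the encoding name changes between the Latin-1 and UTF-8 cases; the decode-then-encode steps stay the same.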

> >>> toASCII(u"123\555", "replace")
> ('123?', 4)
> >>> text
> '\xe4, \xc4, \xf6, \xd6, \xfc, \xdc, \xdf'
> >>> toASCII(text, "replace")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeError: ASCII decoding error: ordinal not in range(128)
>
> But I'm no closer, am I? I don't quite understand what the codecs
> module is and how it works.

You're closer :-)

OK, it looks like you are starting with a string containing Latin-1
characters. If I understand correctly, you want to get rid of the characters
that are not in the ASCII set (i.e., those with an ordinal > 127) by
replacing them with '?'. There are two ways to do that:

1. Fancy (change 'latin-1' to the actual encoding):

>>> from codecs import lookup
>>> fromLatin1 = lookup( 'latin-1' )[1]   # index 1 is the decoder
>>> toASCII = lookup( 'ASCII' )[0]        # index 0 is the encoder
>>> asLatin1, dummy = fromLatin1( '\xe4, \xc4, \xf6, \xd6, \xfc, \xdc, \xdf' )
>>> toASCII( asLatin1, 'replace' )
('?, ?, ?, ?, ?, ?, ?', 19)
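
The same two steps can be sketched for a current Python 3 interpreter (an assumption; the session above is Python 2). There, codecs.lookup() returns an object that still behaves like the tuple indexed above, but also exposes the functions as named .encode and .decode attributes. The variable names are illustrative only:

```python
import codecs

latin1 = codecs.lookup("latin-1")
ascii_codec = codecs.lookup("ascii")

# In Python 3 the undecoded input is a bytes object.
raw = b"\xe4, \xc4, \xf6, \xd6, \xfc, \xdc, \xdf"

# decode() returns a pair: (text, number of bytes consumed).
text, consumed = latin1.decode(raw)

# encode() returns a pair: (bytes, number of characters consumed).
# 'replace' substitutes '?' for anything outside ASCII.
encoded, length = ascii_codec.encode(text, "replace")
print(encoded, length)
```

Both functions return a pair, which is why the session above unpacks the decoder's result into two names.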

2. Simple (will not work with UTF-8, where each non-ASCII character takes
more than one byte!):

>>> test = '\xe4, \xc4, \xf6, \xd6, \xfc, \xdc, \xdf'
>>> testOut = ''
>>> for i in test:
...     if ord(i) > 127:
...         testOut += '?'
...     else:
...         testOut += i
...
>>> testOut
'?, ?, ?, ?, ?, ?, ?'
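
To see why the simple loop is not suitable for UTF-8, here is a small sketch for a current Python 3 interpreter (an assumption; the loop above is Python 2). The function name replace_high_bytes and the sample text are made up. It runs the same byte-by-byte test over one string encoded two ways:

```python
def replace_high_bytes(raw):
    """Byte-wise loop: swap every byte above 127 for '?'."""
    out = ""
    for byte in raw:  # iterating over bytes yields integers in Python 3
        out += "?" if byte > 127 else chr(byte)
    return out

sample = "\xe4, \xf6"                      # a-umlaut, comma, o-umlaut
latin1_bytes = sample.encode("latin-1")    # one byte per umlaut
utf8_bytes = sample.encode("utf-8")        # two bytes per umlaut

print(replace_high_bytes(latin1_bytes))
print(replace_high_bytes(utf8_bytes))
```

With Latin-1 input each umlaut becomes a single '?', but with UTF-8 input each one turns into two, because the loop sees bytes rather than characters.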