Trouble with unicode
Brian Quinlan
BrianQ at ActiveState.com
Mon May 14 18:02:33 EDT 2001
>they look about 3 characters long but are only 1 really, I already have
>experience converting Unix characters over.
Sounds like UTF-8. If it is, you can just replace 'latin-1' below with
'utf-8' :-)
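
If the data really is UTF-8, every non-ASCII character is stored as two to four bytes, which would explain a single character looking several characters long. A minimal Python 3 sketch of that effect (the helper name utf8_length and the sample characters are chosen for illustration only, they are not taken from your data):

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes one character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))

# Illustrative samples: a-umlaut (U+00E4) and the euro sign (U+20AC).
for ch in ["\u00e4", "\u20ac"]:
    # Each sample is one character, but more than one byte in UTF-8.
    print(repr(ch), "characters:", len(ch), "bytes:", utf8_length(ch))
```

Viewed byte by byte, the a-umlaut takes 2 bytes and the euro sign takes 3.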
> >>> toASCII(u"123\555", "replace")
> ('123?', 4)
> >>> text
> '\xe4, \xc4, \xf6, \xd6, \xfc, \xdc, \xdf'
> >>> toASCII(text, "replace")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeError: ASCII decoding error: ordinal not in range(128)
>
> But I'm no closer, am I? I don't quite understand what the codecs module is
> and how it works.
You're closer :-)
OK, it looks like you are starting with a string containing Latin-1
characters. If I understand correctly, you want to get rid of the characters
that are not in the ASCII set (i.e., those with ordinals > 127). There are two
ways to do that; both replace each such character with '?':
1. Fancy (change 'latin-1' to the actual encoding):
>>> from codecs import lookup
>>> fromLatin1 = lookup( 'latin-1' )[1]
>>> toASCII = lookup( 'ASCII' )[0]
>>> asLatin1, dummy = fromLatin1( '\xe4, \xc4, \xf6, \xd6, \xfc, \xdc, \xdf' )
>>> toASCII( asLatin1, 'replace' )
('?, ?, ?, ?, ?, ?, ?', 19)
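
The session above is Python 2. As a rough sketch of the same decode-then-encode idea in Python 3, where bytes and text are separate types (the helper name to_ascii and its default encoding are hypothetical, chosen only for illustration):

```python
def to_ascii(raw: bytes, source_encoding: str = "latin-1") -> str:
    """Decode raw bytes, then replace every non-ASCII character with '?'."""
    # Step 1: bytes -> text, using the encoding the data was written in.
    text = raw.decode(source_encoding)
    # Step 2: text -> ASCII bytes; "replace" substitutes '?' for anything
    # that ASCII cannot represent. Decode again to get a plain str back.
    return text.encode("ascii", "replace").decode("ascii")

raw = b"\xe4, \xc4, \xf6, \xd6, \xfc, \xdc, \xdf"
print(to_ascii(raw))  # ?, ?, ?, ?, ?, ?, ?
```

Passing "utf-8" as source_encoding should handle UTF-8 input the same way, one '?' per character rather than one per byte.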
2. Simple (will not work with UTF-8!):
>>> test = '\xe4, \xc4, \xf6, \xd6, \xfc, \xdc, \xdf'
>>> testOut = ''
>>> for i in test:
...     if ord(i) > 127:
...         testOut += '?'
...     else:
...         testOut += i
...
>>> testOut
'?, ?, ?, ?, ?, ?, ?'
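
To illustrate why the simple loop does not work with UTF-8, here is a small Python 3 sketch of the same byte-by-byte replacement (replace_high_bytes is a hypothetical name, and the sample text is made up for the example). Because UTF-8 spreads one non-ASCII character over several bytes, each of them above 127, you get several '?' per character:

```python
def replace_high_bytes(raw: bytes) -> str:
    """Replace every byte above 127 with '?', keeping ASCII bytes as-is."""
    return "".join("?" if b > 127 else chr(b) for b in raw)

sample = "\u00e4, \u00f6"  # a-umlaut, comma, space, o-umlaut

# Latin-1 stores each of these characters in exactly one byte.
print(replace_high_bytes(sample.encode("latin-1")))  # ?, ?

# UTF-8 stores each of them in two bytes, so two '?' appear per character.
print(replace_high_bytes(sample.encode("utf-8")))    # ??, ??
```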