is this a unicode/string bug?

Fri Dec 9 16:51:37 EST 2005

olsongt at verizon.net wrote:

> I was going to submit to sourceforge, but my unicode skills are weak.
> I was trying to strip characters from a string that contained values
> outside of ASCII.  I thought I could just encode as 'ascii' in 'replace'
> mode but it threw an error.  Strangely enough, if I decode via the
> ascii codec and then encode via the ascii codec, I get what I want.
> That being said, this may be operating correctly.

encode on 8-bit strings and decode on unicode strings aren't exactly
obvious operations...

> >>> print 'aaa\xae'
> aaa®
> >>> 'aaa\xae'.encode('ascii','replace') #should return 'aaa?'

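encode("ascii") is a unicode operation, so when you do this, Python first
attempts to turn your 8-bit string into a unicode string, using the default
encoding (ascii) in strict mode.  the 'replace' argument only applies to the
encode step, so it cannot help here, and the implicit decode fails: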

Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 3:
ordinal not in range(128)
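
a rough model of that two-step behaviour, written for Python 3, where bytes
plays the role of the 8-bit string.  py2_style_encode is a hypothetical helper
made up for this sketch, not something from the standard library, and it only
approximates what the old str.encode did:

```python
def py2_style_encode(data, encoding, errors="strict"):
    """Hypothetical helper: mimic encode() called on an 8-bit string."""
    # step 1: implicit decode with the default encoding, always strict;
    # the caller's error handler is not used here
    text = data.decode("ascii")
    # step 2: the encode the caller actually asked for
    return text.encode(encoding, errors)


failure = None
try:
    py2_style_encode(b"aaa\xae", "ascii", "replace")
except UnicodeDecodeError as exc:
    # the error comes from step 1, before 'replace' is ever consulted
    failure = exc
    print(type(exc).__name__, "at position", exc.start)
```

pure-ASCII input passes through both steps unchanged, which is why the problem
only shows up once a byte above 0x7f appears.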

>>> 'aaa\xae'.decode('ascii','replace') #but this doesn't throw an error?
u'aaa\ufffd'

this converts the encoded stream to Unicode, using a "suitable replacement
character" for characters that cannot be converted.  U+FFFD is 'REPLACEMENT
CHARACTER', which, I assume, is about as suitable as you can get.
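
a small check of that claim, sketched for Python 3 (bytes stands in for the
8-bit string; the variable names are only illustrative):

```python
import unicodedata

raw = b"aaa\xae"

# 'replace' on decode substitutes U+FFFD for each undecodable byte
text = raw.decode("ascii", "replace")

print(ascii(text))
print(unicodedata.name(text[-1]))
```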

>>> 'aaa\xae'.decode('ascii','replace').encode('ascii','replace') #this does what I wanted
'aaa?'

this converts the unicode string from the previous step back to ascii, using
a "suitable replacement character" for characters that cannot be converted.
for 8-bit strings, "?" is a suitable character.
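
the full round trip, again as a Python 3 sketch with bytes in place of the
8-bit string.  to_ascii is a hypothetical name chosen for this example:

```python
def to_ascii(data):
    """Hypothetical helper: replace every non-ASCII byte with b'?'."""
    # decode: undecodable bytes become U+FFFD
    text = data.decode("ascii", "replace")
    # encode: U+FFFD cannot be encoded as ASCII, so it becomes "?"
    return text.encode("ascii", "replace")


print(to_ascii(b"aaa\xae"))
```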

instead of playing codec games, you could use translate or a simple regular
expression:

    import re
    outstring = re.sub("[\x80-\xff]", "?", instring)
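
both alternatives side by side, sketched for Python 3, where the input is a
bytes object.  the function names and the TABLE constant are assumptions made
up for this example, and the translate table is just one way to build it:

```python
import re

# 256-entry table: bytes 0x00-0x7f map to themselves, 0x80-0xff map to "?"
TABLE = bytes(range(128)) + b"?" * 128


def strip_with_translate(data):
    """Hypothetical helper: byte-for-byte lookup, no codecs involved."""
    return data.translate(TABLE)


def strip_with_regex(data):
    """Hypothetical helper: same result using a bytes regular expression."""
    return re.sub(rb"[\x80-\xff]", b"?", data)


print(strip_with_translate(b"aaa\xae"))
print(strip_with_regex(b"aaa\xae"))
```

either way the data never leaves the byte domain, so no decode step can fail.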

</F>