unicode question

Mon Feb 27 16:37:09 EST 2006

Walter Dörwald wrote:
> Edward Loper wrote:
> 
>> [...]
>> Surely there's a better way than converting back and forth 3 times?  Is
>> there a reason that the 'backslashreplace' error mode can't be used 
>> with codecs.decode?
>>
>>  >>> 'abc \xff\xe8 def'.decode('ascii', 'backslashreplace')
>> Traceback (most recent call last):
>>    File "<stdin>", line 1, in ?
>> TypeError: don't know how to handle UnicodeDecodeError in error callback
> 
> The backslashreplace error handler is an *error* *handler*, i.e. it 
> gives you a replacement text if an input character can't be encoded. But 
> a backslash character in an 8bit string is no error, so it won't get 
> replaced on decoding.

I'm not sure I follow exactly -- the input string I gave as an example 
did not contain any backslash characters.  Unless by "backslash 
character" you mean a character c such that ord(c)>127.  I guess it 
depends on which class of errors you think the error handler should be 
handling. :)  The codec system's pretty complex, so I'm willing to 
accept on faith that there may be a good reason to have error handlers 
only make replacements in the encode direction, and not in the decode 
direction.
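
For contrast, here is the encode direction, where the handler does apply. This is only a small sketch, written in Python 3 syntax (an assumption; the sessions in this thread are Python 2), and the variable names are made up for illustration:

```python
# Encode direction: 'backslashreplace' substitutes an escape sequence
# for every character that the target encoding cannot represent.
text = 'abc \xff\xe8 def'   # two non-ASCII characters (U+00FF, U+00E8)
encoded = text.encode('ascii', 'backslashreplace')
print(encoded)
```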

> What you want is a different codec (try e.g. "string-escape" or 
> "unicode-escape").

This is very close, but unfortunately won't quite work for my purposes, 
because it also puts backslashes before "'" and "\\" and maybe a few 
other characters.  :-/

 >>> print "test: '\xff'".encode('string-escape').decode('ascii')
test: \'\xff\'

 >>> print do_what_i_want("test: '\xff'")
test: '\xff'

I think I'll just have to stick with rolling my own.
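
Roughly, rolling my own could look like the sketch below. It is not a definitive implementation: it is written in Python 3 syntax (an assumption; the sessions above are Python 2), the handler name 'hexescape' and the function hexescape_errors are hypothetical, and do_what_i_want is just the placeholder name from the example above. It registers a decode-side error handler through codecs.register_error that replaces each undecodable byte with a \xNN escape and leaves quotes and backslashes alone:

```python
import codecs

def hexescape_errors(exc):
    # Hypothetical handler: only decoding errors are handled here;
    # anything else is re-raised unchanged.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    # The bytes that the codec could not decode.
    bad = exc.object[exc.start:exc.end]
    # bytearray() yields integers, one per offending byte.
    replacement = ''.join('\\x%02x' % b for b in bytearray(bad))
    # Return the replacement text and the position to resume decoding at.
    return replacement, exc.end

# 'hexescape' is an illustrative name, not a built-in error mode.
codecs.register_error('hexescape', hexescape_errors)

def do_what_i_want(data):
    # Decode as ASCII; bytes above 127 become \xNN escapes,
    # while "'" and "\\" pass through untouched.
    return data.decode('ascii', 'hexescape')

print(do_what_i_want(b"test: '\xff'"))
```

Unlike the string-escape round trip, nothing is re-escaped here, because the handler only runs for bytes the ASCII codec rejects.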

-Edward