unicode question

Wed Mar 1 10:44:31 EST 2006

Edward Loper wrote:

> Walter Dörwald wrote:
>> Edward Loper wrote:
>>
>>> [...]
>>> Surely there's a better way than converting back and forth 3 times?  Is
>>> there a reason that the 'backslashreplace' error mode can't be used 
>>> with codecs.decode?
>>>
>>>  >>> 'abc \xff\xe8 def'.decode('ascii', 'backslashreplace')
>>> Traceback (most recent call last):
>>>    File "<stdin>", line 1, in ?
>>> TypeError: don't know how to handle UnicodeDecodeError in error callback
>>
>> The backslashreplace error handler is an *error* *handler*, i.e. it 
>> gives you a replacement text if an input character can't be encoded. 
>> But a backslash character in an 8bit string is no error, so it won't 
>> get replaced on decoding.
> 
> I'm not sure I follow exactly -- the input string I gave as an example 
> did not contain any backslash characters.  Unless by "backslash 
> character" you mean a character c such that ord(c)>127.  I guess it 
> depends on which class of errors you think the error handler should be 
> handling. :)  The codec system's pretty complex, so I'm willing to
> accept on faith that there may be a good reason to have error handlers 
> only make replacements in the encode direction, and not in the decode 
> direction.

The two directions are not symmetric. On encoding, an error can only 
happen when a character is unencodable (e.g. for charmap codecs, 
anything outside the set of 256 characters). On decoding, an error means 
that the byte stream violates the internal format of the encoding. But a 
0x5c byte (i.e. a backslash) in, e.g., a latin-1 byte sequence doesn't 
violate the internal format of the latin-1 encoding (nothing does), so 
the error handler never kicks in.

>> What you want is a different codec (try e.g. "string-escape" or 
>> "unicode-escape").
> 
> This is very close, but unfortunately won't quite work for my purposes, 
> because it also puts backslashes before "'" and "\\" and maybe a few 
> other characters.  :-/

OK, seems you're stuck with your decode/encode/decode call.

>  >>> print "test: '\xff'".encode('string-escape').decode('ascii')
> test: \'\xff\'
> 
>  >>> print do_what_i_want("test: '\xff'")
> test: '\xff'
> 
> I think I'll just have to stick with rolling my own.
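
One possible way to roll your own, as a rough sketch only: register a custom error handler with codecs.register_error that accepts UnicodeDecodeError and returns backslash escapes for the offending bytes. The handler name "backslashreplace_decode" is invented for this example and is not part of the standard library; the code is written for Python 3.

```python
import codecs

def backslashreplace_decode(exc):
    """Hypothetical handler: replace undecodable bytes with \\xNN escapes."""
    if not isinstance(exc, UnicodeDecodeError):
        # Only decoding errors are handled here; re-raise anything else.
        raise exc
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("\\x%02x" % b for b in bytearray(bad))
    # Return the replacement text and the position at which to resume.
    return replacement, exc.end

# The name used here is an assumption chosen for this sketch.
codecs.register_error("backslashreplace_decode", backslashreplace_decode)

text = b"test: '\xff'".decode("ascii", "backslashreplace_decode")
print(text)
```

Unlike the string-escape codec, this leaves quotes and existing backslashes alone, because the handler only sees bytes that the codec itself rejects. In Python 3.5 and later the built-in 'backslashreplace' handler also accepts decoding errors, so a custom handler like this may not be needed there.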

Bye,
    Walter Dörwald