unicode question
Edward Loper
edloper at gradient.cis.upenn.edu
Mon Feb 27 16:37:09 EST 2006
Walter Dörwald wrote:
> Edward Loper wrote:
>
>> [...]
>> Surely there's a better way than converting back and forth 3 times? Is
>> there a reason that the 'backslashreplace' error mode can't be used
>> with codecs.decode?
>>
>> >>> 'abc \xff\xe8 def'.decode('ascii', 'backslashreplace')
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in ?
>> TypeError: don't know how to handle UnicodeDecodeError in error callback
>
> The backslashreplace error handler is an *error* *handler*, i.e. it
> gives you a replacement text if an input character can't be encoded. But
> a backslash character in an 8bit string is no error, so it won't get
> replaced on decoding.
I'm not sure I follow exactly -- the input string I gave as an example
did not contain any backslash characters. Unless by "backslash
character" you mean a character c such that ord(c)>127. I guess it
depends on which class of errors you think the error handler should be
handling. :) The codec system's pretty complex, so I'm willing to
accept on faith that there may be a good reason to have error handlers
only make replacements in the encode direction, and not in the decode
direction.
> What you want is a different codec (try e.g. "string-escape" or
> "unicode-escape").
This is very close, but unfortunately won't quite work for my purposes,
because it also puts backslashes before "'" and "\\" and maybe a few
other characters. :-/
>>> print "test: '\xff'".encode('string-escape').decode('ascii')
test: \'\xff\'
>>> print do_what_i_want("test: '\xff'")
test: '\xff'
I think I'll just have to stick with rolling my own.
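For reference, here is a minimal sketch of one way to roll your own, assuming a custom error callback registered through codecs.register_error is acceptable. The handler name "backslashreplace_decode" is made up for illustration and is not a built-in error mode. The callback receives the UnicodeDecodeError and returns the replacement text plus the position at which decoding should resume, so quotes and existing backslashes are left untouched.

```python
import codecs

def backslash_decode_errors(exc):
    # Only decoding errors are handled here; anything else is re-raised.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    # exc.object is the input byte string; start:end marks the bad bytes.
    bad = exc.object[exc.start:exc.end]
    # bytearray() yields integers on both old and new Python versions.
    replacement = u"".join(u"\\x%02x" % b for b in bytearray(bad))
    # Return the replacement and the index where decoding continues.
    return replacement, exc.end

# "backslashreplace_decode" is a hypothetical name chosen for this sketch.
codecs.register_error("backslashreplace_decode", backslash_decode_errors)

def do_what_i_want(data):
    # Decode as ASCII, turning each undecodable byte into a \xNN escape.
    return data.decode("ascii", "backslashreplace_decode")

print(do_what_i_want(b"test: '\xff'"))  # test: '\xff'
```

This only needs a single pass over the data, at the cost of registering a handler globally under a name that must not clash with anything else.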
-Edward