Unicode conversion problem (codec can't decode)

Eric S. Johansson esj at harvee.org
Fri Apr 4 00:35:08 EDT 2008


I'm having a problem (Python 2.4) converting strings with random 8-bit
characters into an escaped form which is 7-bit clean for storage in a database.
Here's an example:

body = meta['mini_body'].encode('unicode-escape')

When given an 8-bit string (in meta['mini_body']), the code fragment above
yields the error below.

'ascii' codec can't decode byte 0xe1 in position 13: ordinal not in range(128)

The string that generates that error is:

<br>Reduce Whát You Owe by 50%. Get out of debt today!<br>Reduuce Interest & 
|V|onthlyy Paymeñts Easy, we will show you how..<br>Freee Quote in 10 
Min.<br>http://www.freefromdebtin.net.cn
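
As far as I can tell, the error comes from a hidden step: in Python 2, calling
.encode() on a byte string first decodes it to Unicode with the default 'ascii'
codec, and that decode is what fails on byte 0xe1 (á in Latin-1) at position 13.
The sketch below makes that decode explicit. The helper name ascii_decode_error
and the shortened sample are my own, and it is written in the syntax of newer
Pythons; on 2.4, drop the b prefix and write "except UnicodeDecodeError, exc".

```python
def ascii_decode_error(raw):
    """Return the message raised by an ASCII decode of raw, or None."""
    try:
        raw.decode('ascii')
    except UnicodeDecodeError as exc:
        return str(exc)
    return None

# Shortened form of the sample above; \xe1 is the byte behind the accented a.
sample = b'<br>Reduce Wh\xe1t You Owe by 50%.'
print(ascii_decode_error(sample))
```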

I've read a lot of stuff about Unicode and Python and I'm pretty comfortable
with how you can convert between different encoding types. What I don't
understand is how to go from a byte string with 8-bit characters to an encoded
string where 8-bit characters are turned into two-character hexadecimal escape
sequences.

I really don't care about the character set used. I'm looking for a matched set
of operations that converts the string to a 7-bit form and back to its
original form. Since I need the ability to match a substring of the original
text while the string is in its encoded state, something like Unicode-escape
encoding would work well for me. Unfortunately, I am missing some knowledge
about encoding and decoding. I wish I knew what cjson was doing because it does
the right things for my project. It takes strings or Unicode, stores everything
as Unicode and then returns everything as Unicode. Quite frankly, I'd love to
have my entire system run using Unicode strings but, again, I'm missing some
knowledge on how to force all of my modules to be Unicode by default.
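
One matched pair that seems to fit this description, sketched under the
assumption that the bytes can be treated as Latin-1 (which maps every byte
0x00-0xFF to the code point of the same value, so the decode can never fail):
decode explicitly before escaping, and reverse both steps to get the original
bytes back. The names to_7bit and from_7bit are made up for illustration, and
the b prefix is for newer Pythons only (drop it on 2.4).

```python
def to_7bit(raw):
    """Escape a byte string so that every resulting byte is below 0x80."""
    # The explicit Latin-1 decode replaces the implicit ASCII decode.
    return raw.decode('latin-1').encode('unicode_escape')

def from_7bit(escaped):
    """Invert to_7bit and return the original byte string."""
    return escaped.decode('unicode_escape').encode('latin-1')

raw = b'<br>Reduce Wh\xe1t You Owe by 50%. Easy Payme\xf1ts'
escaped = to_7bit(raw)

# The round trip gives back the original bytes.
assert from_7bit(escaped) == raw

# To search while encoded, escape the needle the same way first.
assert to_7bit(b'Wh\xe1t') in escaped
```

One caveat with searching the escaped form: a needle can match inside an escape
sequence (a plain "n" would match the tail of an escaped newline), so this is
only a rough substring test. Python 2 also has a 'string_escape' codec that
works on byte strings directly, which may be worth a look.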

Any enlightenment would be most appreciated.

---eric


-- 
Speech-recognition in use.  It makes mistakes, I correct some.


