not quite 1252

John Machin sjmachin at lexicon.net
Thu Apr 27 04:23:04 EDT 2006


On 27/04/2006 12:49 AM, Anton Vredegoor wrote:
> Fredrik Lundh wrote:
> 
>> Anton Vredegoor wrote:
>>
>>> I'm trying to import text from an open office document (save as .sxw and
>>>  read the data from content.xml inside the sxw-archive using
>>> elementtree and such tools).
>>>
>>> The encoding that gives me the least problems seems to be cp1252,
>>> however it's not completely perfect because there are still characters
>>> in it like \93 or \94. Has anyone handled this before?
>>
>> this might help:
>>
>>     http://effbot.org/zone/unicode-gremlins.htm
> 
> Thanks a lot! The code below not only made the strange chars go away, 
> but it also fixed the xml-parsing errors 

What XML-parsing errors were they?

> ... Maybe it's useful to 
> someone else too, use at own risk though.
> 
> Anton
> 
> from gremlins import kill_gremlins
> from zipfile import ZipFile, ZIP_DEFLATED
> 
> def repair(infn,outfn):
>     zin  = ZipFile(infn, 'r', ZIP_DEFLATED)
>     zout = ZipFile(outfn, 'w', ZIP_DEFLATED)
>     for x in zin.namelist():
>         data = zin.read(x)
>         if x == 'contents.xml':

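Firstly, this should be 'content.xml', not 'contents.xml'. As written, 
the comparison never matches the member named content.xml, so the 
kill_gremlins branch is never taken.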

Secondly, as pointed out by Sergei, the data is encoded by OOo as UTF-8. 
For example, what is '\x94' in cp1252 is \u201d, which is '\xe2\x80\x9d' 
in UTF-8. The kill_gremlins function is intended to fix Unicode strings 
that have been obtained by decoding 8-bit strings using 'latin1' instead 
of 'cp1252'. When you pump '\xe2\x80\x9d' through the kill_gremlins 
function, it changes the \x80 to a Euro symbol, and leaves the other two 
characters alone. Because \x9d is not defined in cp1252, it then causes 
your code to die in a hole when you attempt to encode it as cp1252:

UnicodeEncodeError: 'charmap' codec can't encode character u'\x9d' in 
position 1761: character maps to <undefined>

I don't see how this code repairs anything (quite the contrary!), unless 
there's some side effect of just read/writestr. Enlightenment, please.
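
For anyone who wants to reproduce the byte-level argument, here is a 
minimal sketch. It is written for Python 3 (the code in this thread is 
Python 2), and the single replace() call is only an assumed stand-in for 
the one mapping of kill_gremlins that matters here (U+0080 to the Euro 
sign); it is not the effbot function itself.

```python
# Python 3 sketch of the failure described above.
quote = "\u201d"  # RIGHT DOUBLE QUOTATION MARK

# The same character in the two encodings under discussion.
cp1252_bytes = quote.encode("cp1252")  # b'\x94'
utf8_bytes = quote.encode("utf-8")     # b'\xe2\x80\x9d'

# Treating the UTF-8 bytes as latin1 yields three separate characters:
# U+00E2, U+0080 and U+009D.
misread = utf8_bytes.decode("latin-1")

# Assumed stand-in for the gremlin fix: only U+0080 is remapped here,
# to the Euro sign. U+009D has no cp1252 counterpart and stays as it is.
partial_fix = misread.replace("\x80", "\u20ac")

# Encoding the result as cp1252 now fails on the leftover U+009D.
error = None
try:
    partial_fix.encode("cp1252")
except UnicodeEncodeError as exc:
    error = exc

# Decoding with the codec the data was actually written in just works.
recovered = utf8_bytes.decode("utf-8")
```

The point of the last line is that nothing needs repairing if the bytes 
are decoded as UTF-8 in the first place.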

>             zout.writestr(x,kill_gremlins(data).encode('cp1252'))
>         else:
>             zout.writestr(x,data)
>     zout.close()
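
A hedged sketch of the alternative, again in Python 3: read content.xml 
from the archive and let the XML parser decode it according to its own 
declaration. The helper name read_content_text and the tiny in-memory 
archive are made up for illustration; a real .sxw file has more members 
and a much larger content.xml.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def read_content_text(archive):
    """Return the concatenated text of content.xml (hypothetical helper)."""
    with zipfile.ZipFile(archive) as zin:
        raw = zin.read("content.xml")  # raw bytes, no manual decoding
    root = ET.fromstring(raw)          # parser applies the declared encoding
    return "".join(root.itertext())


# Stand-in archive so the sketch runs without a real .sxw file.
xml_doc = '<?xml version="1.0" encoding="UTF-8"?><doc>\u201cquoted\u201d</doc>'
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zout:
    zout.writestr("content.xml", xml_doc.encode("utf-8"))
buf.seek(0)

text = read_content_text(buf)

# Only encode to cp1252 at the very end, if a legacy consumer needs it.
legacy = text.encode("cp1252")
```

Here the curly quotes come out as proper Unicode characters, and 
encoding them to cp1252 afterwards gives the \x93 and \x94 bytes 
mentioned at the start of the thread.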



