not quite 1252

Wed Apr 26 10:49:23 EDT 2006

Fredrik Lundh wrote:

> Anton Vredegoor wrote:
> 
>> I'm trying to import text from an OpenOffice document (save as .sxw and
>> read the data from content.xml inside the sxw-archive using
>> elementtree and such tools).
>>
>> The encoding that gives me the least problems seems to be cp1252,
>> however it's not completely perfect because there are still characters
>> in it like \93 or \94. Has anyone handled this before?
> 
> this might help:
> 
>     http://effbot.org/zone/unicode-gremlins.htm

Thanks a lot! The code below not only made the strange chars go away,
but it also fixed the xml-parsing errors. Maybe it's useful to
someone else too; use at your own risk, though.

Anton

from gremlins import kill_gremlins
from zipfile import ZipFile, ZIP_DEFLATED

def repair(infn,outfn):
     # the compression argument only matters when writing
     zin  = ZipFile(infn, 'r')
     zout = ZipFile(outfn, 'w', ZIP_DEFLATED)
     for x in zin.namelist():
         data = zin.read(x)
         # the document text lives in content.xml; copy everything else as-is
         if x == 'content.xml':
             zout.writestr(x,kill_gremlins(data).encode('cp1252'))
         else:
             zout.writestr(x,data)
     zout.close()
     zin.close()

def test():
     infn = "xxxx.sxw"
     outfn = 'dg.sxw'
     repair(infn,outfn)

if __name__=='__main__':
     test()
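
For anyone who cannot get hold of the gremlins module: the idea, as far as I understand it, is that characters in the range U+0080 to U+009F are almost never meant as control characters, but are cp1252 bytes (such as \93 and \94, the curly quotes) that were decoded as iso-8859-1. Below is a minimal sketch of that mapping for Python 3. It is not the effbot code; the name fix_c1 is made up for this example, and it leans on Python's own cp1252 codec instead of a hand-written table.

```python
import re

# Characters U+0080..U+009F: C1 controls in Unicode, printable in cp1252.
_C1 = re.compile("[\x80-\x9f]")

def fix_c1(text):
    """Replace stray C1 characters with their cp1252 meaning.

    Hypothetical helper for illustration; not part of any library.
    """
    def fixup(match):
        ch = match.group(0)
        try:
            # reinterpret the code point as a single cp1252 byte
            return bytes([ord(ch)]).decode("cp1252")
        except UnicodeDecodeError:
            # a few bytes (e.g. 0x81) are undefined in cp1252; keep them
            return ch
    return _C1.sub(fixup, text)

if __name__ == "__main__":
    print(repr(fix_c1("\x93quoted\x94")))
```

Text that contains no characters in that range is returned unchanged, so it should be safe to run over a string that is already clean.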