not quite 1252
Anton Vredegoor
anton.vredegoor at gmail.com
Wed Apr 26 10:49:23 EDT 2006
Fredrik Lundh wrote:
> Anton Vredegoor wrote:
>
>> I'm trying to import text from an open office document (save as .sxw and
>> read the data from content.xml inside the sxw-archive using
>> elementtree and such tools).
>>
>> The encoding that gives me the least problems seems to be cp1252,
>> however it's not completely perfect because there are still characters
>> in it like \93 or \94. Has anyone handled this before?
>
> this might help:
>
> http://effbot.org/zone/unicode-gremlins.htm
Thanks a lot! The code below not only made the strange chars go away,
but it also fixed the XML parsing errors. Maybe it's useful to
someone else too; use at your own risk, though.
Anton

# gremlins.py is presumably the code from the effbot page above, saved locally
from gremlins import kill_gremlins
from zipfile import ZipFile, ZIP_DEFLATED

def repair(infn, outfn):
    # Copy the archive member by member, cleaning only content.xml.
    zin = ZipFile(infn, 'r', ZIP_DEFLATED)
    zout = ZipFile(outfn, 'w', ZIP_DEFLATED)
    for x in zin.namelist():
        data = zin.read(x)
        if x == 'content.xml':
            # Replace the stray \x80-\x9f characters, then re-encode.
            zout.writestr(x, kill_gremlins(data).encode('cp1252'))
        else:
            zout.writestr(x, data)
    zout.close()
    zin.close()

def test():
    infn = "xxxx.sxw"
    outfn = 'dg.sxw'
    repair(infn, outfn)

if __name__ == '__main__':
    test()
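
For anyone without the gremlins module at hand, here is a minimal sketch of the idea behind it, as I understand it: characters in the range \x80-\x9f are really cp1252 bytes that were decoded as Latin-1, so each one is reinterpreted as a cp1252 byte. This is not the effbot code itself. It is written for Python 3, and the names replace_c1_gremlins and _fixup are made up for this illustration.

```python
import re

# Matches the C1 control range, which is where mis-decoded cp1252 bytes end up.
_C1_RE = re.compile("[\x80-\x9f]")

def _fixup(match):
    ch = match.group(0)
    try:
        # Reinterpret the code point as a single cp1252 byte.
        return bytes([ord(ch)]).decode("cp1252")
    except UnicodeDecodeError:
        # A few bytes (e.g. 0x81) are unassigned in cp1252; leave them alone.
        return ch

def replace_c1_gremlins(text):
    """Return text with \\x80-\\x9f replaced by their cp1252 meanings."""
    return _C1_RE.sub(_fixup, text)

if __name__ == "__main__":
    # \x93 and \x94 are the curly double quotes in cp1252.
    print(replace_c1_gremlins("\x93quoted\x94") == "\u201cquoted\u201d")  # True
```

Doing the lookup through the codec avoids keeping a hand-written mapping table, at the cost of a try/except for the unassigned bytes.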