Why isn't my re.sub replacing the contents of my MS Word file?

scottcabit at gmail.com
Fri May 9 16:49:56 EDT 2014


On Friday, May 9, 2014 4:09:58 PM UTC-4, Tim Chase wrote:

> A Word doc (as your subject mentions) is a binary format.  There's
> the older .doc and the newer .docx (which is actually a .zip file
> with a particular content-structure renamed to .docx).
> 
   I am using .doc files only.

> 
> For the older .doc file, it's a binary format, so even if you can
> successfully find & swap out sequences of 7 chars for a single char,
> it might screw up the internal offsets, breaking your file.

   I do not save the file out again. I only try to change all en-dashes and em-dashes to plain dashes in memory, then search the result and print what I find to another file. The searched file is closed without being written.

> 
> Additionally, I vaguely remember sparring with them using some 16-bit
> wide characters in .doc files so you might have to search for
> atrocious things like b"\x00&\x00#\x00x\x002\x000\x001\x002" (each
> character being prefixed with "\x00").

  Hmm, I thought that was what I was doing. Can anyone figure out why the syntax is wrong for the binary data of a Word 2007 document?
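
  For reference, here is a minimal sketch of the 16-bit search described in the quoted text. It rests on assumptions: that the dashes show up either as entity strings such as "&#x2012" or as the raw dash characters, and that the text is stored as UTF-16. The names dash_pattern and normalize_dashes are made up for illustration. Note that the quoted pattern, with "\x00" before each character, is the big-endian byte order. As far as I know .doc text is usually little-endian, which puts the "\x00" after each character, so the sketch takes the encoding as a parameter.

```python
import re

# Strings to treat as dashes. Which of these a given file really contains
# is an assumption; check against your own data.
DASH_TARGETS = ["&#x2012", "&#x2013", "&#x2014", "\u2012", "\u2013", "\u2014"]


def dash_pattern(encoding):
    """Compile a bytes regex matching every dash target as encoded in `encoding`."""
    # re.escape keeps the embedded NUL bytes and "&#" from being read as regex syntax.
    alternatives = [re.escape(t.encode(encoding)) for t in DASH_TARGETS]
    return re.compile(b"|".join(alternatives))


def normalize_dashes(data, encoding="utf-16-le"):
    """Return a copy of `data` with each dash target replaced by an encoded "-"."""
    return dash_pattern(encoding).sub("-".encode(encoding), data)


# Small in-memory check; no file is read or written here.
sample = "pages 3&#x2013" "5 \u2014 done".encode("utf-16-le")
cleaned = normalize_dashes(sample).decode("utf-16-le")
print(cleaned)
```

  Reading the document with open(path, "rb").read() and passing the bytes to normalize_dashes leaves the file on disk untouched, since only the in-memory copy changes. A byte-level regex can in principle match across a character boundary in 16-bit text, so treat this as a rough search aid rather than a parser for the .doc format.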



