Why isn't my re.sub replacing the contents of my MS Word file?

Tim Chase python.list at tim.thechases.com
Fri May 9 16:09:58 EDT 2014


On 2014-05-09 12:51, scottcabit at gmail.com wrote:
>  here is a snippet of code that opens a file (fn contains the
> path\name) and first tried to replace all endash, emdash etc
> characters with simple dash characters, before doing a search. But
> the replaces are not having any effect. Obviously a syntax
> problem....what silly thing am I doing wrong?
> 
> fn = 'z:\Documentation\Software'
> def processdoc(fn,outfile):
>     fStr = open(fn, 'rb').read()
>     re.sub(b'&#x2012','-',fStr)
>     re.sub(b'&#x2013','-',fStr)
>     re.sub(b'&#x2014','-',fStr)
>     re.sub(b'&#x2015','-',fStr)
>     re.sub(b'&#x2E3A','-',fStr)
>     re.sub(b'&#x2E3B','-',fStr)
>     re.sub(b'&#x002D','-',fStr)
>     re.sub(b'&#x00AD','-',fStr)

A Word doc (as your subject mentions) is a binary format.  There's
the older .doc and the newer .docx (which is actually a .zip file
with a particular content-structure renamed to .docx).

Your example doesn't show the extension, so it's hard to tell whether
you're working with the old format or the new format.
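
When the extension is missing, the first few bytes of the file
usually settle the question. The sketch below is only an
illustration: guess_word_format is a hypothetical helper name, and it
assumes the file starts with either the legacy compound-file
signature or the zip signature.

```python
# Hypothetical helper, not part of the original poster's code.
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy compound file (.doc)
ZIP_MAGIC = b"PK\x03\x04"                          # zip container (.docx)


def guess_word_format(data):
    """Guess the container type from the leading bytes of a file."""
    if data.startswith(OLE2_MAGIC):
        return "doc"
    if data.startswith(ZIP_MAGIC):
        return "docx"
    return "unknown"
```

Something like guess_word_format(open(fn, 'rb').read(8)) would then
say which of the two cases applies. A "docx" answer only means "some
zip file", so it is a hint rather than proof.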

That said, a simple replacement *certainly* won't work for a .docx
file, as you'd have to uncompress the contents, open up the various
files inside, perform the replacements, then zip everything back up,
and save the result back out.
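
A minimal sketch of that unzip/replace/rezip cycle, using only the
standard library. It rests on several assumptions: the XML parts are
UTF-8 encoded, the dashes are stored as literal characters rather
than as "&#x2013;"-style references, and replace_dashes_in_docx is a
hypothetical name chosen for illustration.

```python
import io
import zipfile

# Dash-like characters from the original question, mapped to a plain hyphen.
DASHES = "\u2012\u2013\u2014\u2015\u2e3a\u2e3b\u00ad"
DASH_TABLE = {ord(ch): "-" for ch in DASHES}


def replace_dashes_in_docx(docx_bytes):
    """Return new .docx bytes with dash variants in the XML parts replaced."""
    src = zipfile.ZipFile(io.BytesIO(docx_bytes))
    out_buf = io.BytesIO()
    with src, zipfile.ZipFile(out_buf, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            # Only touch XML parts; images and other binaries pass through.
            if info.filename.endswith(".xml"):
                text = data.decode("utf-8")  # assumption: parts are UTF-8
                data = text.translate(DASH_TABLE).encode("utf-8")
            dst.writestr(info.filename, data)
    return out_buf.getvalue()
```

Working on bytes in memory keeps the original file untouched until
the result has been checked and written to a new name.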

For the older .doc file, it's a binary format, so even if you can
successfully find & swap out sequences of 7 chars for a single char,
the change in length might throw off the file's internal offsets and
break it. Additionally, I vaguely remember fighting with .doc files
that stored text as 16-bit wide characters, so you might have to
search for atrocious things like
b"\x00&\x00#\x00x\x002\x000\x001\x002" (each character being prefixed
with "\x00").
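
Rather than typing such a byte string by hand, it can be generated
with str.encode. The sketch below is illustrative only: which byte
order a given .doc actually uses is an assumption to verify against
the file, so the hypothetical find_wide helper tries both.

```python
needle = "&#x2012"  # the 7-character sequence from the original question

# Big-endian UTF-16: a zero byte precedes each ASCII character.
wide_be = needle.encode("utf-16-be")
# Little-endian UTF-16: the zero byte follows each ASCII character instead.
wide_le = needle.encode("utf-16-le")


def find_wide(data, text):
    """Return the offset of text in data for each UTF-16 byte order (-1 if absent)."""
    return {
        "be": data.find(text.encode("utf-16-be")),
        "le": data.find(text.encode("utf-16-le")),
    }
```

Either form is 14 bytes long, so swapping it for a one-character
replacement still changes the length of the data and runs into the
same offset problem.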

-tkc




