Why isn't my re.sub replacing the contents of my MS Word file?

Steven D'Aprano steve+comp.lang.python at pearwood.info
Fri May 9 20:31:39 EDT 2014


On Fri, 09 May 2014 13:49:56 -0700, scottcabit wrote:

> On Friday, May 9, 2014 4:09:58 PM UTC-4, Tim Chase wrote:
> 
>> A Word doc (as your subject mentions) is a binary format.  There's the
>> older .doc and the newer .docx (which is actually a .zip file with a
>> particular content-structure renamed to .docx).
>> 
>    I am using .doc files only......

Ah, my previous email missed the fact that you are operating on Word docs.

>> For the older .doc file, it's a binary format, so even if you can
>> successfully find & swap out sequences of 7 chars for a single char, it
>> might screw up the internal offsets, breaking your file.
> 
>    I do not save the file out again, only try to change all en-dash and
>    em-dash to dashes, then search and print things to another file,
>    closing the searched file without writing it.
> 
> 
>> Additionally, I vaguely remember sparring with them using some 16-bit
>> wide characters in .doc files so you might have to search for atrocious
>> things like b"\x00&\x00#\x00x\x002\x000\x001\x002" (each character
>> being prefixed with "\x00").
> 
>   Hmmm..thought that was what I was doing. Can anyone figure out why the
>   syntax is wrong for Word 2007 document binary file data?

You are searching for the literal "&#x2012", in other words:

    ampersand hash x two zero one two

*not* a FIGURE DASH. Compare:


py> import re
py> source = b'aaaa&#x2012aaaa'
py> print(source)
b'aaaa&#x2012aaaa'
py> re.sub(b'&#x2012', b'Z', source)
b'aaaaZaaaa'

But if the source contains an *actual* FIGURE DASH:

py> source = u'aaaa\u2012aaaa'.encode('utf-8')
py> print(source)
b'aaaa\xe2\x80\x92aaaa'
py> re.sub(b'&#x2012', b'Z', source)
b'aaaa\xe2\x80\x92aaaa'
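
To match the real character, build the pattern from the character itself, encoded the same way as the data. A minimal sketch, assuming the text is UTF-8 (which is only a guess for a Word file) and using names of my own choosing:

```python
import re

# The character we actually want to find, not its HTML-style spelling.
FIGURE_DASH = u'\u2012'

# Assumption for illustration only: the data is UTF-8 encoded.
source = u'aaaa\u2012aaaa'.encode('utf-8')

# Encode the pattern with the same encoding as the data, and escape it
# so that no byte is treated as a regex metacharacter.
pattern = re.escape(FIGURE_DASH.encode('utf-8'))
result = re.sub(pattern, b'-', source)
print(result)
```

If the encoding guess is wrong, the pattern simply never matches, which is the same symptom you are seeing now.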


You're dealing with a binary file format, and a complicated one. Unless
you actually parse it, you don't know which parts of the file represent
text, metadata, formatting and layout information, or images. Even if you
identify which parts are text, you don't know what encoding is used
internally:

py> u'aaaa\u2012aaaa'.encode('utf-8')
b'aaaa\xe2\x80\x92aaaa'
py> u'aaaa\u2012aaaa'.encode('utf-16be')
b'\x00a\x00a\x00a\x00a \x12\x00a\x00a\x00a\x00a'
py> u'aaaa\u2012aaaa'.encode('utf-16le')
b'a\x00a\x00a\x00a\x00\x12 a\x00a\x00a\x00a\x00'

or something else.
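
One way to narrow it down is to count how often each candidate byte sequence turns up in the raw data. This is only a rough sketch: the list of encodings is my guess at plausible candidates, not something taken from the Word format, and the helper name is made up for this example.

```python
# Dashes of interest, by Unicode name.
DASHES = {
    'FIGURE DASH': u'\u2012',
    'EN DASH': u'\u2013',
    'EM DASH': u'\u2014',
}

# Assumed candidates only; the file may use something else entirely.
CANDIDATE_ENCODINGS = ['utf-8', 'utf-16le', 'utf-16be', 'cp1252']

def count_dash_encodings(data):
    """Return {(dash name, encoding): count} for each byte pattern found."""
    hits = {}
    for name, char in DASHES.items():
        for encoding in CANDIDATE_ENCODINGS:
            try:
                pattern = char.encode(encoding)
            except UnicodeEncodeError:
                continue  # e.g. cp1252 has no FIGURE DASH
            count = data.count(pattern)
            if count:
                hits[(name, encoding)] = count
    return hits

# Stand-in for the bytes of a real file.
sample = u'a\u2013b\u2014c'.encode('utf-16le')
print(count_dash_encodings(sample))
```

Treat the counts as hints, not proof. One- and two-byte patterns will also match by coincidence inside formatting data or images.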

You're on *extremely* thin ice here.

If you *must* do this, then you'll need to identify how Word stores 
various dashes in the file. If you're lucky, the textual parts of the doc 
file will be obvious to the eye, so open a few sample files using a hex 
editor and you might be able to identify what Word is using to store the 
various forms of dash.
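
If you don't have a hex editor handy, a few lines of Python will do the same job. The sketch below prints the bytes around each occurrence of a candidate pattern so you can judge whether it sits inside readable text. The function name and the file name 'sample.doc' are placeholders of mine.

```python
def hex_context(data, pattern, width=8):
    """Yield (offset, hex string) for the bytes surrounding each match."""
    start = data.find(pattern)
    while start != -1:
        lo = max(0, start - width)
        hi = min(len(data), start + len(pattern) + width)
        yield start, ' '.join('%02x' % byte for byte in data[lo:hi])
        start = data.find(pattern, start + 1)

# With a real file you would read the bytes first, for example:
#     with open('sample.doc', 'rb') as f:   # placeholder file name
#         data = f.read()
# Here a small UTF-16LE string stands in for the file contents.
data = u'ab\u2013cd'.encode('utf-16le')
for offset, context in hex_context(data, u'\u2013'.encode('utf-16le'), width=4):
    print(offset, context)
```

Nothing is written back to the file, so the worst that can happen is a misleading match.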



-- 
Steven D'Aprano
http://import-that.dreamwidth.org/


