Why isn't my re.sub replacing the contents of my MS Word file?

Steven D'Aprano steve+comp.lang.python at pearwood.info
Tue May 13 09:49:12 EDT 2014


On Mon, 12 May 2014 10:35:53 -0700, scottcabit wrote:

> On Friday, May 9, 2014 8:12:57 PM UTC-4, Steven D'Aprano wrote:
> 
>> Good:
>> 
>> 
>> 
>>     fStr = re.sub(b'&#x2012', b'-', fStr)
>> 
>> 
>   Doesn't work...the document has been verified to contain endash and
>   emdash characters, but this does NOT replace them.

You may have missed my follow up post, where I said I had not noticed you 
were operating on a binary .doc file.

The text content of your doc file might look like:

   This – is an n-dash.


when viewed in Microsoft Word, but that is not what is stored on disk. 
Word .doc files are a proprietary binary format. Apart from the rest of 
the document structure and metadata, the text itself could be stored any 
old way, and I don't know offhand how. Microsoft has published a 
specification for the format, but it is not light reading. A few open 
source projects like OpenOffice, LibreOffice and Abiword have 
reverse-engineered the file format. Taking a wild guess, I think it could 
be something like:

    This \xe2\x80\x93 is an n-dash.

or possibly:

    \x00T\x00h\x00i\x00s\x00  \x13\x00 \x00i\x00s\x00 \x00a
    \x00n\x00 \x00n\x00-\x00d\x00a\x00s\x00h\x00.

or:

    This {EN DASH} is an n-dash.

or:

    x\x9c\x0b\xc9\xc8,V\xa8v\xf5Spq\x0c\xf6\xa8U\x00r\x12
    \xf3\x14\xf2tS\x12\x8b3\xf4\x00\x82^\x08\xf8


(that last one is the text passed through the zlib compressor), but 
really I'm just making up vaguely conceivable possibilities.
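
To make the point concrete, here is a minimal sketch (Python 3) that 
encodes the same sentence a few different ways and checks whether the 
literal entity text ever shows up in the bytes. The choice of encodings 
is only an illustration, not a claim about what Word actually does.

```python
import zlib

text = "This \u2013 is an n-dash."   # U+2013 is EN DASH

# A few ways the same sentence could end up as bytes on disk.
candidates = {
    "utf-8": text.encode("utf-8"),
    "utf-16-be": text.encode("utf-16-be"),
    "utf-16-le": text.encode("utf-16-le"),
    "cp1252": text.encode("cp1252"),
    "zlib(utf-8)": zlib.compress(text.encode("utf-8")),
}

for name, data in candidates.items():
    # The literal entity text does not occur in any of these byte strings,
    # so a pattern such as b'&#x2013' has nothing to match.
    print(f"{name:12} entity present: {b'&#x2013' in data}  {data!r}")
```

Whatever the real storage turns out to be, a bytes pattern passed to 
re.sub only matches if those exact bytes are in the file.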

If you're not willing or able to use a full-blown doc parser, say by 
controlling Word or LibreOffice, the other alternative is to do something 
quick and dirty that might work most of the time. Open a doc file, or 
multiple doc files, in a hex editor and *hopefully* you will be able to 
see chunks of human-readable text where you can identify how en-dashes 
and similar are stored.
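
If a hex editor is not handy, the same quick and dirty inspection can be 
sketched in Python. The function name find_candidates, the PATTERNS 
table and the file name in the usage comment are all hypothetical, and 
the list of encodings is a guess to be checked against real files. Expect 
false positives: a single byte such as \x96 will turn up by chance in 
binary data.

```python
import re

# Byte sequences an en dash or em dash would have under a few encodings.
# Which of these, if any, a given .doc file uses is an assumption to verify.
PATTERNS = {
    "en dash, UTF-16-LE": "\u2013".encode("utf-16-le"),
    "em dash, UTF-16-LE": "\u2014".encode("utf-16-le"),
    "en dash, cp1252": "\u2013".encode("cp1252"),
    "em dash, cp1252": "\u2014".encode("cp1252"),
    "en dash, UTF-8": "\u2013".encode("utf-8"),
    "em dash, UTF-8": "\u2014".encode("utf-8"),
}

def find_candidates(data, context=16):
    """Yield (label, offset, surrounding bytes) for every pattern hit in data."""
    for label, pattern in PATTERNS.items():
        for m in re.finditer(re.escape(pattern), data):
            start = max(0, m.start() - context)
            yield label, m.start(), data[start:m.end() + context]

# Usage against the raw bytes of a file (the path is hypothetical):
# with open("example.doc", "rb") as f:
#     for label, offset, snippet in find_candidates(f.read()):
#         print(f"{offset:#010x}  {label:20}  {snippet!r}")
```

Hits that sit inside recognisable runs of text are the ones worth 
trusting. Work on a copy of the file, since replacing bytes in a binary 
format can easily corrupt it.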



-- 
Steven D'Aprano


