Why isn't my re.sub replacing the contents of my MS Word file?
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Tue May 13 09:49:12 EDT 2014
On Mon, 12 May 2014 10:35:53 -0700, scottcabit wrote:
> On Friday, May 9, 2014 8:12:57 PM UTC-4, Steven D'Aprano wrote:
>
>> Good:
>>
>> fStr = re.sub(b'‒', b'-', fStr)
>>
> Doesn't work...the document has been verified to contain endash and
> emdash characters, but this does NOT replace them.
You may have missed my follow-up post, where I said I had not noticed
that you were operating on a binary .doc file.
The text content of your doc file might look like:
This – is an n-dash.
when viewed in Microsoft Word, but that is not the contents on disk.
Word .doc files are a proprietary binary format. Apart from the
rest of the document structure and metadata, the text itself could be
stored any old way, and I don't know how. Microsoft has published a
specification for the format, but it is long and far from easy reading.
A few open source projects like OpenOffice, LibreOffice and Abiword have
reverse-engineered the file format. Taking a wild guess, I think it
could be something like:
This \xe2\x80\x93 is an n-dash.
or possibly:
\x00T\x00h\x00i\x00s\x00  \x13\x00 \x00i\x00s\x00 \x00a
\x00n\x00 \x00n\x00-\x00d\x00a\x00s\x00h\x00.
or:
This {EN DASH} is an n-dash.
or:
x\x9c\x0b\xc9\xc8,V\xa8v\xf5Spq\x0c\xf6\xa8U\x00r\x12
\xf3\x14\xf2tS\x12\x8b3\xf4\x00\x82^\x08\xf8
(that last one is the text passed through the zlib compressor), but
really I'm just making up vaguely conceivable possibilities.
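To make the guesses above concrete, here is a minimal Python 3 sketch that encodes the same sentence three ways and shows that a pattern written for one encoding does nothing to the others. The encodings (UTF-8, big-endian UTF-16, zlib) are picked only as illustrations and are not a claim about what Word actually writes; `replace_dash` is a made-up helper name.

```python
import re
import zlib

text = "This \u2013 is an n-dash."   # U+2013 is EN DASH

# Three hypothetical on-disk representations of the same sentence.
as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16-be")
as_zlib = zlib.compress(as_utf8)

# The en-dash itself in the two plain encodings.
dash_utf8 = "\u2013".encode("utf-8")        # bytes 0xE2 0x80 0x93
dash_utf16 = "\u2013".encode("utf-16-be")   # bytes 0x20 0x13


def replace_dash(data, dash, hyphen):
    """Replace every occurrence of the byte string `dash` with `hyphen`."""
    # re.escape protects any bytes that happen to be regex metacharacters.
    return re.sub(re.escape(dash), hyphen, data)


# Right pattern for the encoding: the dash is replaced.
print(replace_dash(as_utf8, dash_utf8, b"-"))

# UTF-8 pattern against UTF-16 data: no match, data comes back unchanged.
print(replace_dash(as_utf16, dash_utf8, b"-") == as_utf16)

# UTF-16 needs a UTF-16 pattern *and* a two-byte replacement.
fixed = replace_dash(as_utf16, dash_utf16, b"\x00-")
print(fixed.decode("utf-16-be"))

# Compressed data has to be decompressed before any search makes sense.
print(zlib.decompress(as_zlib) == as_utf8)
```

The point is only that re.sub works on whatever bytes it is given. If the pattern is not the byte sequence actually stored in the file, it silently replaces nothing, which matches the symptom described.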
If you're not willing or able to use a full-blown doc parser, say by
controlling Word or LibreOffice, the other alternative is to do something
quick and dirty that might work most of the time. Open a doc file, or
multiple doc files, in a hex editor and *hopefully* you will be able to
see chunks of human-readable text where you can identify how en-dashes
and similar are stored.
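If a hex editor isn't handy, the same quick and dirty inspection can be sketched in Python 3. The code below looks for the en-dash and em-dash under a few candidate encodings and prints a hex-editor style line around each hit. The list of encodings is a guess, the function names are invented for this example, and a hit is only a hint: short patterns, especially the single-byte cp1252 ones, will also turn up by chance in binary data.

```python
DASHES = {"en dash": "\u2013", "em dash": "\u2014"}
# Candidate encodings to try; purely a guess, extend as needed.
ENCODINGS = ["utf-8", "utf-16-le", "utf-16-be", "cp1252"]


def find_dash_candidates(data):
    """Return {(dash name, encoding): [offsets]} for every pattern found."""
    hits = {}
    for name, char in DASHES.items():
        for enc in ENCODINGS:
            pattern = char.encode(enc)
            offsets = []
            start = data.find(pattern)
            while start != -1:
                offsets.append(start)
                start = data.find(pattern, start + 1)
            if offsets:
                hits[(name, enc)] = offsets
    return hits


def show_context(data, offset, width=16):
    """Format the bytes around `offset` as hex plus printable ASCII."""
    lo = max(0, offset - width)
    chunk = data[lo:offset + width]
    hexpart = " ".join(f"{b:02x}" for b in chunk)
    textpart = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
    return f"{lo:08x}  {hexpart}  {textpart}"


# Stand-in for real file contents: some junk around UTF-16-LE text.
sample = (b"\x00\x01junk"
          + "This \u2013 is an n-dash.".encode("utf-16-le")
          + b"\xff\xfe")

for (name, enc), offsets in find_dash_candidates(sample).items():
    for off in offsets:
        print(name, enc, show_context(sample, off))
```

To try it on a real document, replace `sample` with the result of reading the file in binary mode, e.g. `open(path, "rb").read()`, where `path` is whatever file you are inspecting. If one encoding shows hits surrounded by readable text, that is the byte pattern to use in re.sub, bearing in mind that changing bytes inside a binary format may still corrupt the file.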
--
Steven D'Aprano