Why isn't my re.sub replacing the contents of my MS Word file?

Chris Angelico rosuav at gmail.com
Fri May 9 16:08:01 EDT 2014


On Sat, May 10, 2014 at 5:51 AM,  <scottcabit at gmail.com> wrote:
>   But the replaces are not having any effect. Obviously a syntax problem....what silly thing am I doing wrong?
>
>   Thanks!
>
> fn = 'z:\Documentation\Software'
> def processdoc(fn,outfile):
>     fStr = open(fn, 'rb').read()
>     re.sub(b'&#x2012','-',fStr)
>     re.sub(b'&#x2013','-',fStr)
>     re.sub(b'&#x2014','-',fStr)
>     re.sub(b'&#x2015','-',fStr)
>     re.sub(b'&#x2E3A','-',fStr)
>     re.sub(b'&#x2E3B','-',fStr)
>     re.sub(b'&#x002D','-',fStr)
>     re.sub(b'&#x00AD','-',fStr)

I can see several things that might be wrong, but it's hard to say
what *is* wrong without trying it.

1) Is the file close enough to text that you can even do this sort of
parsing? You say it's an MS Word file; that, unfortunately, could mean
a lot of things. Some of the newer formats are basically zipped XML,
so translations like this won't work. Other forms of Word document may
be closer to text, but you majorly risk corrupting the binary content.

2) How are characters represented? Are they actually stored in the
file with ampersands, hashes, etc? Your source strings are all seven
bytes long, and will look for exactly those bytes. There must be some
form of character encoding used; possibly, instead of the &#x
notation, you need to UTF-8 or UTF-16LE encode the characters to look
for.

3) You're doing simple string replacements using regular expressions.
I don't think any of your symbols here is a metacharacter, but I might
be wrong. If you're simply replacing one stream of bytes with another,
don't use regex at all, just use plain replacement (the .replace()
method of str or bytes).

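4) There's nothing in your current code to actually write the contents
anywhere. re.sub() returns the modified data rather than changing fStr
in place, and each result here is thrown away; even if you kept it,
nothing is written to outfile. Or is this just part of the code?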

5) Similarly, there's nothing in this fragment that actually calls
processdoc(). Did you elide that? The fragment you wrote will do a
whole lot of nothing, on its own.

6) There's no file extension on your input file name; be sure you
really have the file you want, and not (for instance) a directory. Or
if you need to iterate over all the files in a directory, you'll need
to do that explicitly.

7) This one isn't technically a problem, but it's a risk. The string
'z:\Documentation\Software' has two backslash escapes, \D and \S, which
the parser doesn't recognize and therefore passes through literally.
So it works, currently. However, if you were to change the path to,
say, 'z:\Documentation\backups', then it would suddenly fail, because
\b is a recognized escape (backspace). There are several solutions to
this:
7a) fn = r'z:\Documentation\Software'
7b) fn = 'z:\\Documentation\\Software'
7c) fn = 'z:/Documentation/Software'
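
To see the risk from point 7 without touching the disk, here is a
small sketch. It only builds strings; the 'backups' path is the
hypothetical one mentioned above, and the variable names are mine.

```python
# '\b' is a recognized escape (backspace, 0x08), so this path is
# silently mangled: the backslash and the 'b' collapse into one character.
broken = 'z:\\Documentation\backups'

# The three safe spellings suggested above.
raw = r'z:\Documentation\backups'        # 7a: raw string
doubled = 'z:\\Documentation\\backups'   # 7b: doubled backslashes
forward = 'z:/Documentation/backups'     # 7c: forward slashes

print(repr(broken))            # shows \x08 where '\b' was written
print(raw == doubled)          # True: both spell the same path
print('\x08' in raw)           # False: the raw string kept the backslash
```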

Hope that helps some, at least! A more complete program would be
easier to work with.
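
Putting points 1 to 4 together, a minimal sketch of what I'd try
first might look like the following. It assumes Python 3, and it
assumes the file really is UTF-8 encoded text rather than a binary or
zipped Word format, which may well not be true for your files. The
names DASHES and replace_dashes are mine; processdoc keeps your
signature. U+002D is left out of the list because it already is a
plain hyphen.

```python
import zipfile

# Dash-like characters from the original post (minus U+002D, which is
# already '-'). Each is encoded to its real bytes instead of '&#x....'.
DASHES = '\u2012\u2013\u2014\u2015\u2e3a\u2e3b\u00ad'


def replace_dashes(data):
    """Return a copy of data (bytes, assumed UTF-8) with dashes as b'-'."""
    for ch in DASHES:
        # bytes.replace returns a new object, so keep the result.
        data = data.replace(ch.encode('utf-8'), b'-')
    return data


def processdoc(fn, outfile):
    # Newer Word formats are zip archives; byte replacement would not
    # find anything useful there and could corrupt the file.
    if zipfile.is_zipfile(fn):
        raise ValueError('%r looks like a zip archive, not plain text' % fn)
    with open(fn, 'rb') as f:
        data = f.read()
    # Actually write the changed bytes somewhere.
    with open(outfile, 'wb') as f:
        f.write(replace_dashes(data))


sample = 'a\u2013b\u2014c'.encode('utf-8')
print(replace_dashes(sample))  # b'a-b-c'
```

This deliberately avoids regular expressions, since every search term
is a fixed byte sequence. If the text turns out to be UTF-16LE
instead, byte-level replacement can match across character
boundaries, so decoding to str first would be the safer route.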

ChrisA


