Program inefficiency?

hall.jeff at gmail.com
Sat Sep 29 14:24:53 EDT 2007


The search is trying to replace the spaces in our bookmarks (and the
links that go to those bookmarks)...

The link tag looks like this:

<a href="Web_Sites.htm#A Web Sites">

and the bookmark tag looks like this:

<a name="A Web Sites"></a>
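
To make the goal concrete, here is a minimal sketch of the replacement itself. The function name normalize_bookmark is made up for illustration, and it assumes "special character" means anything that is not a letter, digit or underscore, which may be broader or narrower than what the real site needs.

```python
import re

def normalize_bookmark(name):
    """Replace every character that is not a letter, digit or underscore with '_'.

    Assumption: this is what "replace all special characters with _'s" means.
    """
    return re.sub(r'[^A-Za-z0-9_]', '_', name)

print(normalize_bookmark("A Web Sites"))  # A_Web_Sites
```

Both the link and the bookmark have to go through the same function, otherwise the links stop pointing at their bookmarks.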

Some pitfalls I've already run up against:

- SOMETIMES (but not often) the "<a" and the href (or name) are split
  across lines, which led me to just drop the "<a" from the front of
  the pattern.
- If there are no spaces, SOME (but again, not all) of the "<a name"
  tags don't have quotes around the value. This is a problem because
  we're having to replace all special characters with _'s.
- Some of our bookmarks are quite wordy (we found one yesterday with 11
  spaces).
- href is sometimes all caps (HREF).
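
Most of these pitfalls can probably be absorbed by one tolerant pattern instead of one regex per case. The sketch below only finds the bookmark names; NAME_ATTR and find_bookmark_names are hypothetical names, and it assumes an unquoted value runs until the next whitespace, quote or ">". Like the approach of dropping the "<a", it does not check which tag the attribute sits on, so it would also pick up name= on other tags.

```python
import re

# One pattern for: any case (name/NAME), optional spaces around "=",
# and values with or without double quotes. The "<a" is deliberately
# left out, so a line break between "<a" and the attribute is harmless.
NAME_ATTR = re.compile(
    r'''\bname\s*=\s*
        (?:"([^"]*)"        # quoted value, may contain spaces
          |([^\s">]+))      # unquoted value, cannot contain spaces
    ''',
    re.IGNORECASE | re.VERBOSE,
)

def find_bookmark_names(html):
    """Return the value of every name attribute found in html."""
    return [m.group(1) if m.group(1) is not None else m.group(2)
            for m in NAME_ATTR.finditer(html)]

sample = '<a\nNAME=Intro></a> <a name="A Web Sites"></a>'
print(find_bookmark_names(sample))  # ['Intro', 'A Web Sites']
```

The number of spaces in a bookmark no longer matters here, because the whole value is captured at once rather than one space at a time.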

As you can imagine, there are a lot of corner cases, and I felt it was
easier just to be inefficient: write out all the regex cases and loop
through them repeatedly. I've also got to work around the stuff already
in the system (for example, I need to make certain I'm only looking
behind the #'s in the bookmark links, otherwise I'll end up replacing
legitimate -'s in external web site addresses).
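
One way to stay behind the # is to capture the part of the href before it separately and put it back untouched. This is only a sketch: HREF_FRAGMENT and fix_links are names invented here, it assumes the href value is double-quoted, and it would also rewrite the fragment of an external link that happens to have one.

```python
import re

# group 1: everything up to and including the "#"  (left alone)
# group 2: the bookmark part after the "#"          (cleaned up)
# group 3: the closing quote
HREF_FRAGMENT = re.compile(r'(\bhref\s*=\s*"[^"#]*#)([^"]*)(")', re.IGNORECASE)

def fix_links(html):
    """Replace special characters only in the part of each href after '#'."""
    def repl(m):
        fragment = re.sub(r'[^A-Za-z0-9_]', '_', m.group(2))
        return m.group(1) + fragment + m.group(3)
    return HREF_FRAGMENT.sub(repl, html)

print(fix_links('<a href="Web_Sites.htm#A Web Sites">'))
# <a href="Web_Sites.htm#A_Web_Sites">
```

A link with no # in it never matches, so hyphens in plain external addresses are not touched.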

I think Pablo is correct that a single (or perhaps two) RE statements
are all that is needed... perhaps:

p1 = re.compile(r'(href=|HREF=)+(.*)(#)+(.*)([\w\'\?\-<: ])+(.*)(">)+')

plus the corresponding replace for name, and then the one corner case we
ran into:

p100 = re.compile(r'(a name=)+(.*)(-)+(.*)(></a>)+')
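
For what it's worth, here is a rough sketch of what a single-statement version might look like, using a replacement function so the file is scanned once instead of once per corner case. It is not the patterns above: ATTR, clean and fix_bookmarks are hypothetical names, "special character" is assumed to mean anything other than a letter, digit or underscore, and the output always gets double quotes around the value.

```python
import re

# One pattern for both attributes, any case, quoted or unquoted values.
ATTR = re.compile(
    r'''\b(?P<attr>href|name)\s*=\s*
        (?:"(?P<quoted>[^"]*)"|(?P<bare>[^\s">]+))''',
    re.IGNORECASE | re.VERBOSE,
)

def clean(text):
    # Assumption: every non-alphanumeric character becomes an underscore.
    return re.sub(r'[^A-Za-z0-9_]', '_', text)

def fix_bookmarks(html):
    """Clean name values and the part of href values after '#' in one pass."""
    def repl(m):
        value = m.group('quoted') if m.group('quoted') is not None else m.group('bare')
        if m.group('attr').lower() == 'href':
            # Only touch what comes after the '#'; no '#' means no change.
            page, sep, fragment = value.partition('#')
            value = page + sep + clean(fragment)
        else:
            value = clean(value)
        return '%s="%s"' % (m.group('attr'), value)
    return ATTR.sub(repl, html)

page = '<a href="Web_Sites.htm#A Web Sites">text</a><a name="A Web Sites"></a>'
print(fix_bookmarks(page))
```

Since the cleaning happens inside the callback, a bookmark with 11 spaces costs the same single match as one with none, which is where the repeated looping was going.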



