RegExp performance?
John Machin
sjmachin at lexicon.net
Mon Feb 26 02:42:33 EST 2007
On Feb 26, 2:01 pm, Kirk Sluder <k... at nospam.jobsluder.net> wrote:
> In article <45e1d367$0$90273$14726... at news.sunsite.dk>,
> Christian Sonne <FreakC... at gmail.com> wrote:
>
> > Thanks to all of you for your replies - they have been most helpful, and
> > my program is now running at a reasonable pace...
>
> > I ended up using r"\b\d{9}[0-9X]\b" which seems to do the trick - if it
> > turns out to misbehave in further testing, I'll know where to turn :-P
>
> Anything with variable-length wildcard matching (*+?) is going to
> drag your performance down. There was an earlier thread on this very
> topic. Another stupid question is how are you planning on handling
> ISBNs formatted with hyphens for readability?
According to the OP's first message, 2nd paragraph:
"""
(it should be noted that I've removed all '-'s in the string, because
they have a tendency to be mixed into ISBN's)
"""
Given a low density of ISBNs in the text, it may well be better to
avoid the preliminary pass to rip out the '-'s, and instead:
1. use an RE like r"\b\d[-\d]{8,11}[\dX]\b" (allows up to 3 '-'s
inside the number)
2. post-process the matches: strip out any '-'s, check for remaining
length == 10.
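A minimal sketch of those two steps, assuming plain ISBN-10s embedded in ordinary text. The names CANDIDATE_RE and find_isbn10_candidates are made up for illustration and are not from the OP's program:

```python
import re

# Pattern from step 1: a digit, then 8 to 11 digits or hyphens,
# then a final digit or 'X', all bounded by word boundaries.
CANDIDATE_RE = re.compile(r"\b\d[-\d]{8,11}[\dX]\b")

def find_isbn10_candidates(text):
    """Return hyphen-free 10-character candidates found in text."""
    results = []
    for match in CANDIDATE_RE.finditer(text):
        # Step 2: strip hyphens, keep only what is left at length 10.
        stripped = match.group().replace("-", "")
        if len(stripped) == 10:
            results.append(stripped)
    return results
```

The length check matters because the RE by itself also accepts, for example, a run of 12 digits or a string with more than 3 hyphens.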
Another thought for the OP: Consider (irrespective of how you arrive
at a candidate ISBN) validating the ISBN check-digit.
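A rough sketch of such a check for ISBN-10, assuming the candidate has already had its '-'s removed. The function name is_valid_isbn10 is hypothetical. The standard ISBN-10 rule weights the characters 10 down to 1 from left to right, counts a trailing 'X' as 10, and requires the weighted sum to be divisible by 11:

```python
def is_valid_isbn10(candidate):
    """Return True if candidate is a 10-character ISBN with a valid check digit."""
    if len(candidate) != 10:
        return False
    total = 0
    for position, char in enumerate(candidate):
        if char == "X" and position == 9:
            value = 10  # 'X' is only legal as the check digit
        elif char in "0123456789":
            value = int(char)
        else:
            return False
        # Weights run 10, 9, ..., 1 from left to right.
        total += (10 - position) * value
    return total % 11 == 0
```

This should weed out most random 10-digit numbers (phone numbers, serial numbers and the like) that happen to match the RE.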
Cheers,
John