RegExp performance?

John Machin sjmachin at lexicon.net
Mon Feb 26 02:42:33 EST 2007


On Feb 26, 2:01 pm, Kirk  Sluder <k... at nospam.jobsluder.net> wrote:
> In article <45e1d367$0$90273$14726... at news.sunsite.dk>,
>  Christian Sonne <FreakC... at gmail.com> wrote:
>
> > Thanks to all of you for your replies - they have been most helpful, and
> > my program is now running at a reasonable pace...
>
> > I ended up using r"\b\d{9}[0-9X]\b" which seems to do the trick - if it
> > turns out to misbehave in further testing, I'll know where to turn :-P
>
> Anything with variable-length wildcard matching (*+?) is going to
> drag your performance down. There was an earlier thread on this very
> topic.  Another stupid question is how are you planning on handling
> ISBNs formatted with hyphens for readability?

According to the OP's first message, 2nd paragraph:
"""
(it should be noted that I've removed all '-'s in the string, because
they have a tendency to be mixed into ISBN's)
"""

Given a low density of ISBNs in the text, it may well be better to
avoid the preliminary pass to rip out the '-'s, and instead:

1. use an RE like r"\b\d[-\d]{8,11}[\dX]\b" (allows up to 3 '-'s
inside the number)

2. post-process the matches: strip out any '-'s, then check that the
remaining length == 10 (the RE by itself also accepts, e.g., a run of
11 to 13 digits with no '-'s at all).
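
A rough sketch of those two steps, in case it helps. The names
CANDIDATE_RE and find_isbn10_candidates are made up for illustration;
only the pattern itself comes from step 1 above.

```python
import re

# Pattern from step 1: a digit, then 8 to 11 digits or hyphens,
# then a final digit or 'X', all between word boundaries.
CANDIDATE_RE = re.compile(r"\b\d[-\d]{8,11}[\dX]\b")


def find_isbn10_candidates(text):
    """Return hyphen-free, 10-character ISBN candidates found in text.

    Hypothetical helper: it only applies the length test from step 2
    and does not validate the check digit.
    """
    candidates = []
    for match in CANDIDATE_RE.finditer(text):
        stripped = match.group().replace("-", "")  # step 2: drop '-'s
        if len(stripped) == 10:                    # step 2: length test
            candidates.append(stripped)
    return candidates


sample = "See ISBN 0-306-40615-2 and 030640615X, but not 1234567890123."
print(find_isbn10_candidates(sample))
```

The 13-digit run matches the RE but is thrown away by the length test,
which is the reason for step 2.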

Another thought for the OP: Consider (irrespective of how you arrive
at a candidate ISBN) validating the ISBN check-digit.
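
For what it's worth, a minimal sketch of such a check, assuming the
candidate is a 10-character ISBN-10 with the '-'s already removed. The
function name is_valid_isbn10 is hypothetical. The rule used is the
usual ISBN-10 one: weight the characters 10, 9, ..., 1 from left to
right, count a trailing 'X' as 10, and require the sum to be divisible
by 11.

```python
def is_valid_isbn10(candidate):
    """Return True if candidate passes the ISBN-10 check-digit test.

    Assumes hyphens have already been stripped; 'X' is accepted only
    as the last character, where it stands for the value 10.
    """
    if len(candidate) != 10:
        return False
    total = 0
    for position, char in enumerate(candidate):
        if char == "X" and position == 9:
            value = 10
        elif char in "0123456789":
            value = int(char)
        else:
            return False  # stray character, or 'X' in the wrong place
        total += (10 - position) * value  # weights run 10 down to 1
    return total % 11 == 0


print(is_valid_isbn10("0306406152"))
print(is_valid_isbn10("030640615X"))
```

This should weed out most false positives such as phone numbers or
other 10-digit runs, since only about one in eleven random candidates
will pass.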

Cheers,
John




More information about the Python-list mailing list