a regular expression question
Alex Martelli
aleax at aleax.it
Sat Mar 22 03:14:21 EST 2003
Luke wrote:
> I suppose this isn't really a python question as much a R.E. question,
> but I'm using python to do it, so... I'm trying to parse link data
> from a webpage that looks like this:
>
> <a href="foo1">1</a> abc <a href="foo2">2</a> def <a href="foo3">3</a>
> ghi <a href="foo4">4</a> jkl
Using RE's to parse HTML, when Python already offers wonderful tools
such as HTMLParser to do that, is rather absurd, of course.
> With a regular expression like below (where the variable 'text' is the
> sample above), re1 saves the numbers, but not the text. Why is that?
...
>>>> re1 = re.compile("<a .*?>([0-9]+?)</a>(.*?)")
I don't understand the question. You're matching one or more digits
quite explicitly with [0-9]+? -- why would you expect that to match
any non-digits? BTW, you may as well use + rather than +? here --
making the repetition non-greedy makes no difference, since the
pattern then explicitly requires a non-digit (the < of </a>). After
the </a> you match "zero or more of any character, non-greedy" and
there the pattern ends -- a non-greedy match with nothing after it
stops as early as it can, so of course the last group will always
match zero characters.
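To see the empty trailing group concretely, here is a minimal sketch that
runs the original pattern over the sample. It assumes a current Python 3
interpreter, and that the sample is joined onto one logical line with a
space where the quoted message wrapped; the names `text` and `pairs` are
just for illustration.

```python
import re

# Sample from the question, joined with a space where the email wrapped.
text = ('<a href="foo1">1</a> abc <a href="foo2">2</a> def '
        '<a href="foo3">3</a> ghi <a href="foo4">4</a> jkl')

# The original pattern: the trailing (.*?) is non-greedy and nothing
# follows it, so it always settles for zero characters.
re1 = re.compile("<a .*?>([0-9]+?)</a>(.*?)")

pairs = re1.findall(text)
print(pairs)
# -> [('1', ''), ('2', ''), ('3', ''), ('4', '')]
```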
If I divine correctly what you're trying to do, then:
"<a[^>]*>([0-9]+)</a>([^<]*)"
may come closer to your purposes. [^>]* means, zero or more
characters that aren't right-angle-brackets; and similarly
[^<]* means, zero or more that aren't left-angle-brackets.
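As a quick check of the negated-character-class version, the sketch below
applies it to the same sample. It assumes Python 3 and the sample joined
onto one line; `text` and `pairs` are illustrative names, and stripping
the surrounding whitespace is an optional extra, not part of the pattern.

```python
import re

# Sample from the question, joined with a space where the email wrapped.
text = ('<a href="foo1">1</a> abc <a href="foo2">2</a> def '
        '<a href="foo3">3</a> ghi <a href="foo4">4</a> jkl')

# [^>]* skips the tag's attributes; [^<]* grabs the text up to the next tag.
re2 = re.compile("<a[^>]*>([0-9]+)</a>([^<]*)")

pairs = re2.findall(text)
print(pairs)
# -> [('1', ' abc '), ('2', ' def '), ('3', ' ghi '), ('4', ' jkl')]

# Optional: drop the surrounding whitespace from the trailing text.
cleaned = [(number, trailing.strip()) for number, trailing in pairs]
print(cleaned)
# -> [('1', 'abc'), ('2', 'def'), ('3', 'ghi'), ('4', 'jkl')]
```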
But you're still better off using HTMLParser or the like.
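For comparison, here is a rough sketch of the HTMLParser route. It assumes
Python 3, where the class lives in html.parser; the subclass name
LinkTextParser and its attributes are hypothetical, made up for this
example, and pairing each link with the text that follows it is my guess
at what is wanted.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Hypothetical helper: pair each <a> label with the text after it."""

    def __init__(self):
        super().__init__()
        self.pairs = []       # each entry is [link_text, trailing_text]
        self.in_link = False  # True while between <a ...> and </a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True
            self.pairs.append(["", ""])

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if not self.pairs:
            return  # text before the first link is ignored
        # Data may arrive in several chunks, so accumulate it.
        index = 0 if self.in_link else 1
        self.pairs[-1][index] += data


text = ('<a href="foo1">1</a> abc <a href="foo2">2</a> def '
        '<a href="foo3">3</a> ghi <a href="foo4">4</a> jkl')

parser = LinkTextParser()
parser.feed(text)
parser.close()  # flush any buffered trailing text

result = [(label, trailing.strip()) for label, trailing in parser.pairs]
print(result)
# -> [('1', 'abc'), ('2', 'def'), ('3', 'ghi'), ('4', 'jkl')]
```

Unlike the regular expression, this keeps working if the anchors gain
extra attributes or nested markup, at the cost of a few more lines.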
Alex