Weird problem matching with REs

John S jstrickler at gmail.com
Sun May 29 11:48:34 EDT 2011


On May 29, 10:35 am, Andrew Berg <bahamutzero8... at gmail.com> wrote:
> On 2011.05.29 09:18 AM, Steven D'Aprano wrote:
> >> What makes you think it shouldn't match?
>
> > > AFAIK, dots aren't supposed to match carriage returns or any other
> > > whitespace characters.
>
> I got things mixed up there (was thinking whitespace instead of
> newlines), but I thought dots aren't supposed to match '\r' (carriage
> return). Why is '\r' not considered a newline character?

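I think of it as dots not matching end-of-line, but strictly speaking
Python's re module only excludes '\n' from what a dot matches (unless
re.DOTALL is set). '\r' is not treated as a newline, so a dot matches
it.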
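
A quick way to check what a dot matches, as a minimal sketch. The
pattern and test strings here are made up for illustration, and the
print() call is written for Python 3; the dot's behavior itself is the
same in Python 2.

```python
import re

# By default '.' matches any character except '\n'.
dot = re.compile(r"a.b")

# '\r' (carriage return) is not a newline as far as '.' is concerned.
matches_cr = dot.search("a\rb") is not None

# '\n' (line feed) is the one character '.' skips by default.
matches_lf = dot.search("a\nb") is not None

# With re.DOTALL, '.' matches '\n' as well.
matches_lf_dotall = re.search(r"a.b", "a\nb", re.DOTALL) is not None

print(matches_cr, matches_lf, matches_lf_dotall)
```

So a '\r\n' line ending in the middle of the text can get past a single
dot only if the pattern also accounts for the '\n'.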

While I usually nod my head at Steven D'Aprano's comments, in
this case I have to say that if you just want to grab something from a
chunk of HTML, full-blown HTML parsers are overkill. True, malformed
HTML can throw a regex off, but it can also throw a parser off.

I could not make your regex work on my Linux box with Python 2.6.

In your case, and because x264 might change their HTML, I suggest the
following code, which works great on my system. YMMV. I changed your
newline matches to use \s and put some capturing parentheses around
the revision number, so you could grab it.

>>> import urllib2
>>> import re
>>>
>>> content = urllib2.urlopen("http://x264.nl/x264_main.php").read()
>>>
>>> rx_x264version = re.compile(r"http://x264\.nl/x264/64bit/8bit_depth/revision\s*(\d{4})\s*/x264\s*\.exe")
>>>
>>> m = rx_x264version.search(content)
>>> if m:
...     print m.group(1)
...
1995
>>>


\s is your friend -- it matches space, tab, newline, or carriage return
(among other whitespace characters). \s* says match 0 or more whitespace
characters, which is what's needed here in case the web site decides to
*not* put whitespace in the middle of a URL...
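
To see \s* cope with both cases without hitting the network, here is a
small sketch. The sample strings are invented stand-ins for what the
page might contain, not copies of the real HTML, and the pattern is a
shortened version of the one above.

```python
import re

# Shortened form of the pattern above; \s* tolerates any run of
# whitespace (including none) at each point where the page might break
# the URL across lines.
rx = re.compile(r"revision\s*(\d{4})\s*/x264\s*\.exe")

# Hypothetical fragments: no whitespace, Windows line endings, and
# spaces/tabs in the middle of the URL.
samples = [
    "revision1995/x264.exe",
    "revision\r\n1995\r\n/x264\r\n.exe",
    "revision \t1995 /x264 .exe",
]

# group(1) is the four-digit number captured by the parentheses.
revisions = [rx.search(s).group(1) for s in samples]
print(revisions)
```

All three samples yield the same captured number, which is the point of
using \s* rather than matching a specific newline sequence.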

As Steven said, when you want to match a literal dot, it needs to be
escaped; an unescaped dot will work by accident much of the time,
because it matches a literal dot too. Also, be sure to use a raw string
when composing REs, so you don't run into backslash issues.
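
Both points can be checked in a few lines. This is only a sketch; the
file names and the \b pattern are examples I picked to show the effect,
not anything from the original regex.

```python
import re

# An unescaped dot matches any character, so it also accepts "x264_exe".
unescaped = re.compile(r"x264.exe")
# An escaped dot matches only a literal '.'.
escaped = re.compile(r"x264\.exe")

names = ("x264.exe", "x264_exe")
unescaped_hits = [bool(unescaped.search(n)) for n in names]
escaped_hits = [bool(escaped.search(n)) for n in names]

# In a raw string, \b reaches the regex engine as a word boundary.
raw_word = re.search(r"\bexe\b", "x264.exe") is not None
# In a plain string, "\b" is a backspace character before re ever sees
# it, so the pattern looks for backspaces around "exe".
plain_word = re.search("\bexe\b", "x264.exe") is not None

print(unescaped_hits, escaped_hits, raw_word, plain_word)
```

The unescaped pattern accepts both names, the escaped one only the real
file name, and only the raw-string version of \b finds the word.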

HTH,
John Strickler
