Regular expression problem

MRAB google at mrabarnett.plus.com
Mon Jun 23 11:14:08 EDT 2008


On Jun 22, 10:13 pm, abranches <pedrof.abranc... at gmail.com> wrote:
> Hello everyone.
>
> I'm having a problem when extracting data from HTML with regular
> expressions.
> This is the source code:
>
> You are ready in the next<br /><span id="counter_jt_minutes"
> style="display: inline;"><span id="counter_jt_minutes_value">12</span>M</span>
> <span id="counter_jt_seconds"
> style="display: inline;"><span id="counter_jt_seconds_value">48</span>S</span>
>
> I need to get the remaining time. So far, getting it isn't a problem,
> but if the remaining time is less than 60 seconds then the source
> becomes something like this:
>
> You are ready in the next<br /><span id="counter_jt_seconds"
> style="display: inline;"><span id="counter_jt_seconds_value">36</span>S</span>
>
> I'm using this regular expression, but the minutes are always None...
> You are ready in the next.*?(?:>(\d+)</span>M</span>)?.*?(?:>(\d+)</span>S</span>)
>
> If I remove the ? from the first group, then it will work, but if
> there are only seconds it won't work.
> I could solve this problem in a couple of lines of Python, but I would
> really like to solve it with regular expressions.
>
Your regex is working like this:

1. Match 'You are ready in the next'.
2. Match an increasing number of characters, starting with none
('.*?').
3. Try to match the optional minutes pattern ('(?:>...)?') from where
the previous step left off. This doesn't match, because the next
character is the '<' of '<br />' rather than '>', but the group is
optional anyway, so continue to the next step. (No characters
consumed.)
4. Match an increasing number of characters, starting with none
('.*?'). It's this step that consumes the minutes, because step 3 has
already succeeded by matching nothing and is never retried.

It then goes on to match the seconds, and the minutes are always None
as you've found.
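
The steps above can be reproduced with a minimal sketch like the one
below. It assumes the HTML from the original post sits on a single line
(the line breaks in the post look like mail wrapping), and the variable
names are made up for illustration.

```python
import re

# HTML from the original post, joined onto one line (assumed layout).
html = (
    'You are ready in the next<br /><span id="counter_jt_minutes" '
    'style="display: inline;"><span id="counter_jt_minutes_value">12</span>'
    'M</span> <span id="counter_jt_seconds" style="display: inline;">'
    '<span id="counter_jt_seconds_value">48</span>S</span>'
)

# The pattern from the original post, split over two string literals.
original = re.compile(
    r'You are ready in the next.*?(?:>(\d+)</span>M</span>)?'
    r'.*?(?:>(\d+)</span>S</span>)'
)

match = original.search(html)
# The optional minutes group matched nothing, so its capture is None;
# the second '.*?' skipped over the minutes on its way to the seconds.
minutes, seconds = match.groups()
print(minutes, seconds)  # None 48
```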

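I've come up with this regex, which moves each '.*?' inside its group
so that the optional minutes group has to scan ahead for the minutes
before it can give up:

You are ready in the next(?:.*?>(\d+)</span>M</span>)?(?:.*?>(\d+)</span>S</span>)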
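
As a rough check, here is a sketch that runs this regex against both
samples from the original post. The names TIME_RE and remaining_seconds
are hypothetical helpers, not part of any library, and the HTML is
again assumed to sit on a single line.

```python
import re

# The revised pattern: each '.*?' now lives inside its own group.
TIME_RE = re.compile(
    r'You are ready in the next'
    r'(?:.*?>(\d+)</span>M</span>)?'
    r'(?:.*?>(\d+)</span>S</span>)'
)

def remaining_seconds(html):
    """Return the remaining time in seconds, or None if nothing matches."""
    match = TIME_RE.search(html)
    if match is None:
        return None
    minutes, seconds = match.groups()
    # The minutes group is None when the page shows only seconds.
    return int(minutes or 0) * 60 + int(seconds)

with_minutes = (
    'You are ready in the next<br /><span id="counter_jt_minutes" '
    'style="display: inline;"><span id="counter_jt_minutes_value">12</span>'
    'M</span> <span id="counter_jt_seconds" style="display: inline;">'
    '<span id="counter_jt_seconds_value">48</span>S</span>'
)
seconds_only = (
    'You are ready in the next<br /><span id="counter_jt_seconds" '
    'style="display: inline;"><span id="counter_jt_seconds_value">36</span>'
    'S</span>'
)

print(TIME_RE.search(with_minutes).groups())  # ('12', '48')
print(TIME_RE.search(seconds_only).groups())  # (None, '36')
print(remaining_seconds(with_minutes))        # 768
print(remaining_seconds(seconds_only))        # 36
```

If the real page has line breaks between the tags, the pattern would
also need the re.DOTALL flag so that '.' can match newlines.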

Hope that helps.


