Regular expressions, help?
azrazer
azra at glop.com
Thu Apr 19 09:15:22 EDT 2012
On 19/04/2012 14:02, Sania wrote:
> On Apr 19, 2:48 am, Jussi Piitulainen<jpiit... at ling.helsinki.fi>
[...]
>>> text="accounts put the death toll at 637 and those missing at
>>> 653 , but the total number is likely to be much bigger"
>>> dead=re.match(r".*death toll.*(\d[,\d\.]*)", text)
>>> deadnum=dead.group(1)
>>> deaths.append(deadnum)
>>> print deaths
>>
>> It's the regexp. The .* after "death toll" eats the input as far as it
>> can without making the whole match fail. The group matches only the
>> last digit in the text.
>>
>> You could allow only non-digits before the number. Or you could look
>> up the variant of * that only matches as much as it must.
>
> Hey Thanks,
> So now my regex is
>
> dead=re.match(r".*death toll.{0,20}(\d[,\d\.]*)", text)
Hi,

With that regex you match:
<something>death toll<anything of length <= 20>, followed by what you
capture (which must contain at least one digit).

There are at least two issues here:
- the number of characters between "death toll" and the figure may be > 20
- your {0,20} is greedy: .{0,20} matches as many characters as it can
  while still leaving one digit for (\d[,\d\.]*), since your group
  captures a digit optionally followed by digits, commas, or dots
  => so " at 63" is swallowed by .{0,20} and (\d[,\d\.]*) matches only
  the remaining digit "7"
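
To see the greedy behaviour for yourself, here is a small sketch. It wraps
.{0,20} in an extra group, which is my addition for illustration and not
part of your regex, so you can inspect what it swallowed. The names m,
swallowed and number are only illustrative.

```python
import re

text = ("accounts put the death toll at 637 and those missing at "
        "653 , but the total number is likely to be much bigger")

# The extra group around .{0,20} is an assumption added for inspection only;
# it does not change what the rest of the pattern matches.
m = re.match(r".*death toll(.{0,20})(\d[,\d\.]*)", text)

swallowed = m.group(1)  # what the greedy .{0,20} consumed
number = m.group(2)     # what was left for the capturing group

print(repr(swallowed))
print(repr(number))
```

The greedy quantifier first grabs 20 characters, then backs off one
character at a time until a digit can follow it, which leaves only the
last digit of the figure for the group.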

A solution would be to follow what Jussi suggested and allow only
non-digits before the number:
=> dead=re.match(r".*death toll\D*(\d*)", text)
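
As a quick check, here is a sketch of both of Jussi's suggestions: the \D*
version, and the non-greedy quantifier .*? (which I assume is the "variant
of *" he meant). The variable names are made up for the example.

```python
import re

text = ("accounts put the death toll at 637 and those missing at "
        "653 , but the total number is likely to be much bigger")

# Option 1: only non-digits may sit between "death toll" and the number.
non_digit = re.match(r".*death toll\D*(\d*)", text)
fixed_number = non_digit.group(1)

# Option 2: .*? is non-greedy, so it matches as little as it must
# before the group can start on the first digit.
lazy = re.match(r".*death toll.*?(\d[,\d\.]*)", text)
lazy_number = lazy.group(1)

print(fixed_number)
print(lazy_number)
```

Note that (\d*) stops at the first non-digit, so a figure written like
1,200 would come back as 1. And since \d* may match nothing at all, \d+ is
the safer choice if you want the match to fail when no number follows.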
>
> But I only find 7 not 637. How is it that the group is only matching
> the last digit?
=> the greediness of .{0,20}
> The whole thing is in parentheses, not just the last part?
Yes, but only one digit remains by the time your group gets to match...
Good luck with regexes, they are a powerful tool! :)
best,
azra.