Regular expressions, help?

Jussi Piitulainen jpiitula at ling.helsinki.fi
Thu Apr 19 09:21:00 EDT 2012


Sania writes:

> On Apr 19, 2:48 am, Jussi Piitulainen <jpiit... at ling.helsinki.fi>
> wrote:
> > Sania writes:
> > > So I am trying to get the number of casualties in a text. After 'death
> > > toll' in the text the number I need is presented as you can see from
> > > the variable called text. Here is my code
> > > I'm pretty sure my regex is correct, I think it's the group part
> > > that's the problem.
> > > I am using nltk by python. Group grabs the string in parenthesis and
> > > stores it in deadnum and I make deadnum into a list.
> >
> > >  text="accounts put the death toll at 637 and those missing at
> > > 653 , but the total number is likely to be much bigger"
> > >       dead=re.match(r".*death toll.*(\d[,\d\.]*)", text)
> > >       deadnum=dead.group(1)
> > >       deaths.append(deadnum)
> > >       print deaths
> >
> > It's the regexp. The .* after "death toll" eats the input as far as it
> > can without making the whole match fail. The group matches only the
> > last digit in the text.
> >
> > You could allow only non-digits before the number. Or you could look
> > up the variant of * that only matches as much as it must.
> 
> Hey Thanks,
> So now my regex is
> 
>     dead=re.match(r".*death toll.{0,20}(\d[,\d\.]*)", text)
> 
> But I only find 7 not 657. How is it that the group is only matching
> the last digit? The whole thing is parenthesis not just the last
> part. ?

It's still consuming the digits in the text that comes _before_ the
parenthesised group: the .{0,20} matches as _much_ as it _can_ without
making the whole regex fail, and the . in it also matches digits. So
it swallows " at 63" and leaves only the 7 for the group.
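
A minimal sketch of that behaviour, assuming the text is a single
line (the wrapped halves joined with a space) and using Python 3
print calls; the variable names are only for illustration:

```python
import re

text = ("accounts put the death toll at 637 and those missing at "
        "653 , but the total number is likely to be much bigger")

# Greedy .* runs to the end of the text and then backs off one
# character at a time, so the group gets only the last digit.
greedy_star = re.match(r".*death toll.*(\d[,\d\.]*)", text)
print(greedy_star.group(1))

# Greedy .{0,20} takes 20 characters, then backs off only as far as
# it must, which leaves a single digit of the first number.
greedy_bounded = re.match(r".*death toll.{0,20}(\d[,\d\.]*)", text)
print(greedy_bounded.group(1))
```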

Try \D{0,20} to limit its matching ability to non-digits.
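
A sketch of the non-digit version, under the same assumption that the
text is one line; non_digit is just an illustrative name:

```python
import re

text = ("accounts put the death toll at 637 and those missing at "
        "653 , but the total number is likely to be much bigger")

# \D cannot match a digit, so the filler must stop right before the
# first number and the group receives all of its digits.
non_digit = re.match(r".*death toll\D{0,20}(\d[,\d\.]*)", text)
print(non_digit.group(1))
```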

Try .{0,20}? to limit it to matching as _little_ as it can.

(The variant of * I referred to is *?; {} and {}? are similar.)
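
A sketch of the non-greedy quantifiers applied to an unescaped dot,
that is .{0,20}? and .*?, again assuming the text is one line; the
variable names are illustrative only:

```python
import re

text = ("accounts put the death toll at 637 and those missing at "
        "653 , but the total number is likely to be much bigger")

# The trailing ? makes the quantifier lazy: it starts with zero
# characters and grows only until the rest of the pattern can match.
lazy_bounded = re.match(r".*death toll.{0,20}?(\d[,\d\.]*)", text)
lazy_star = re.match(r".*death toll.*?(\d[,\d\.]*)", text)

print(lazy_bounded.group(1))
print(lazy_star.group(1))
```

The group itself is still greedy, so once its first digit matches it
takes the rest of the number as well.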

The simplicity of regexen is deceptive. Be careful. Be surprised.
<http://docs.python.org/library/re.html>. Keep them simple. Consider
also other means instead or in addition.
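
One possible reading of "other means", as a rough sketch only: find
the phrase with plain string methods and keep the regex small. The
names found, rest and number are made up for this example, and the
text is again assumed to be one line.

```python
import re

text = ("accounts put the death toll at 637 and those missing at "
        "653 , but the total number is likely to be much bigger")

# Split on the phrase with str.partition; no regex is needed for this.
_, found, rest = text.partition("death toll")

# A short regex then picks the first number in what follows.
number = re.search(r"\d[\d,.]*", rest) if found else None
print(number.group(0) if number else None)
```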


