a re question

Mon Sep 9 19:26:48 EDT 2002

On Mon, 09 Sep 2002 16:10:39 -0400, Rajarshi Guha <rajarshi at presidency.com> wrote:

>Hi,
>  I have a file with lines of the format:
>
>001 Abc D Efg 123456789   7 100 09/05/2002 20:23:23
>001 Xya FGh   143557789   7 100 09/05/2002 20:23:23
>
>I am trying to extract the 9 digit field and the single digit field
>immediately after that.
>
>When I use Visual Regexp to try out the regexp 
>
>(\d{9,} {3,}\d)
>
>it highlights the 2 fields exactly. 
>
>But when I use the following Python code I get None:
>
>>> s='001 Abc D Efg 123456789   7 100 09/05/2002 20:23:23'
>>> p = re.compile(r'(\d{9,} {3,}\d)')
>>> print p.match(s)
>>> None
>
>Could anybody point out where I'm going wrong?
>
>Thanks,

 >>> import re
 >>> s='001 Abc D Efg 123456789   7 100 09/05/2002 20:23:23'
 >>> p = re.compile(r'(\d{9,} {3,}\d)')
 >>> print p.match(s)
 None
 >>> print p.search(s).groups()
 ('123456789   7',)

But if you want them separately,

 >>> p = re.compile(r'(\d{9,}) {3,}(\d)')
 >>> print p.search(s).groups()
 ('123456789', '7')

Or as actual integers,

 >>> map(int, p.search(s).groups())
 [123456789, 7]

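match only succeeds if the pattern matches starting at the very beginning
of the string, which is why you got None; search scans the whole string for
the first place the pattern matches. See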

    http://www.python.org/doc/current/lib/matching-searching.html

so if you would rather keep using match, you can prefix your pattern with ".* "
(i.e., anything ending in a space before your 9 or more digits), e.g.,

 >>> p = re.compile(r'.* (\d{9,}) {3,}(\d)')
 >>> print p.match(s).groups()
 ('123456789', '7')

where s is still

 >>> s
 '001 Abc D Efg 123456789   7 100 09/05/2002 20:23:23'

BTW, you did want to accept extra digits beyond 9 in the first number and
*no* extra digits in the second, single-digit number, right? E.g.,

 >>> s='001 Abc D Efg 1234567890   70 100 09/05/2002 20:23:23'
                         incl--^    ^--not incl
 >>> p = re.compile(r'(\d{9,}) {3,}(\d)')
 >>> print p.search(s).groups()
 ('1234567890', '7')
            ^    ^--single digit guaranteed irresp of next
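
If what you actually wanted was *exactly* 9 digits followed by *exactly*
one digit, one possible variation (only a sketch, assuming Python 3, with
STRICT and extract_strict as made-up names) is to use \d{9} and add
lookarounds so that neither number may be part of a longer run of digits:

```python
import re

# (?<!\d) : the 9 digits must not be preceded by another digit
# \d{9}   : exactly 9 digits, since the next thing must be a space
# (?!\d)  : the single digit must not be followed by another digit
STRICT = re.compile(r'(?<!\d)(\d{9}) {3,}(\d)(?!\d)')

def extract_strict(line):
    """Hypothetical helper: return the two fields as strings, or None."""
    m = STRICT.search(line)
    return m.groups() if m else None

# Accepted: 9 digits, then a lone digit.
print(extract_strict('001 Abc D Efg 123456789   7 100 09/05/2002 20:23:23'))
# Rejected: 10 digits and a 2-digit second field.
print(extract_strict('001 Abc D Efg 1234567890   70 100 09/05/2002 20:23:23'))
```

The first call should give ('123456789', '7') and the second None.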

Regards,
Bengt Richter