a re question

Mon Sep 9 20:38:52 EDT 2002

Rajarshi Guha wrote:

>Hi,
>  I have a file with lines of the format:
>
>001 Abc D Efg 123456789   7 100 09/05/2002 20:23:23
>001 Xya FGh   143557789   7 100 09/05/2002 20:23:23
>
>I am trying to extract the 9 digit field and the single digit field
>immediately after that.
>
Regex is great but on the surface it appears to be overkill for your 
application.

I would like to suggest some alternatives not using regex.

(A)  If all the fields are fixed width (up to and including the fields 
of interest, but not necessarily the ones following), then you can 
extract subfields by simple indexing into the string.

E.g., assuming a single space or TAB for a separator and that variable 
'line' contains one of the above data lines, then something like

    line[14:23]

would extract the larger numeric field, and line[26] the single digit 
after it.  (I may have counted wrong -- you may have to debug that 
fragment before using it.)
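
As a rough check of the slicing idea, here is a minimal sketch run against the two sample lines from the question.  The helper name extract_fixed is made up for illustration, and the offsets assume every line is laid out exactly like the samples:

```python
# Sample lines copied from the original question.
lines = [
    "001 Abc D Efg 123456789   7 100 09/05/2002 20:23:23",
    "001 Xya FGh   143557789   7 100 09/05/2002 20:23:23",
]

def extract_fixed(line):
    """Return (nine_digit_field, single_digit_field) by fixed column offsets."""
    # Columns 14..22 hold the nine-digit field; column 26 holds the single digit.
    # These offsets are only valid if the layout matches the samples above.
    return line[14:23], line[26:27]

for line in lines:
    print(extract_fixed(line))
```

If the real file pads the name columns differently, the offsets would need adjusting.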

(B) If the fields are variable width (as your regex suggests) BUT always 
separated by spaces or tabs, you can simply split the line into fields:

    fields = line.split()

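and then, for the first sample line,

    fields[4] and fields[5]

would contain the non-whitespace contents of your desired numeric 
fields.  The second sample line has one fewer name field, so counting 
from the end (fields[-5] and fields[-4]) is safer if that varies.  The 
split method (also available as a function in the string module) takes 
an optional argument to specify separators (e.g., commas) other than 
whitespace.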
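
A small sketch of the split approach on the sample data follows.  With these samples the name part splits into a varying number of fields, so this version counts from the end of the list; that assumes the four trailing fields (the single digit, 100, date, time) are always present.  The name extract_split is hypothetical:

```python
# Sample lines copied from the original question.
lines = [
    "001 Abc D Efg 123456789   7 100 09/05/2002 20:23:23",
    "001 Xya FGh   143557789   7 100 09/05/2002 20:23:23",
]

def extract_split(line):
    """Return (nine_digit_field, single_digit_field) using whitespace splitting."""
    # split() with no argument splits on any run of spaces or tabs.
    fields = line.split()
    # Count from the end, since the number of name fields differs per line.
    return fields[-5], fields[-4]

for line in lines:
    print(extract_split(line))

# With an explicit separator, e.g. comma-separated data:
csv_fields = "001,Abc,123456789,7".split(",")
print(csv_fields)
```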

I expect these alternatives would be faster than regex, though I have 
not measured to make sure.
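
For anyone who wants to measure, something like the following sketch with the standard timeit module would do.  The regex here is my own guess at a suitable pattern, not the one from the original question, and the function names are made up; timings will vary by machine and Python version, so no numbers are claimed:

```python
import re
import timeit

line = "001 Abc D Efg 123456789   7 100 09/05/2002 20:23:23"

# Hypothetical pattern: nine digits, whitespace, then one digit,
# each delimited by whitespace.
pattern = re.compile(r"\s(\d{9})\s+(\d)\s")

def by_slice():
    # Fixed column offsets, valid only for the sample layout.
    return line[14:23], line[26:27]

def by_split():
    # Whitespace split, counting fields from the end.
    fields = line.split()
    return fields[-5], fields[-4]

def by_regex():
    # Precompiled regex search returning the two captured groups.
    return pattern.search(line).groups()

for fn in (by_slice, by_split, by_regex):
    seconds = timeit.timeit(fn, number=100000)
    print(f"{fn.__name__}: {seconds:.3f}s for 100000 calls")
```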

If I'm mistaken and the fields are all run together, without whitespace 
separators, then you're stuck with regex.  However, your existing 
expressions would then likely need more work to handle that case.

"There's more than one way to do it!"

Regards

--jb

-- 
James J. Besemer		503-280-0838 voice
2727 NE Skidmore St.		503-280-0375 fax
Portland, Oregon 97211-6557	mailto:jb at cascade-sys.com
				http://cascade-sys.com