Multiline regex help

Thu Mar 3 15:26:37 EST 2005

Have a look at "martel", part of biopython. The world of bioinformatics is 
filled with files with structure like this.

http://www.biopython.org/docs/api/public/Martel-module.html
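
For the multi-record case asked about in the quoted message below, the re approach from the thread can probably be stretched a little further before reaching for a parser framework. What follows is only a minimal sketch in present-day Python, not something from the original thread: it assumes every record contains all three RelevantInfo fields in order, and the names RECORD, collect_scores and collect_from_files are made up for illustration. The one real change to the quoted pattern is the non-greedy .*?, since a greedy .* would run from the first RelevantInfo1 to the last RelevantInfo3 and swallow the records in between.

```python
import re

# Same pattern as in the quoted reply, but with non-greedy .*? so that each
# match stays within a single record.
RECORD = re.compile(r"""^RelevantInfo1\n([^\n]*)
                        .*?
                        ^RelevantInfo2\n([^\n]*)
                        .*?
                        ^RelevantInfo3\n([^\n]*)""",
                    re.DOTALL | re.MULTILINE | re.VERBOSE)


def collect_scores(text, scores=None):
    """Add every record in text to scores as {info1: {info3: info2}}."""
    if scores is None:
        scores = {}
    for info1, info2, info3 in RECORD.findall(text):
        scores.setdefault(info1, {})[info3] = info2
    return scores


def collect_from_files(paths):
    """Hypothetical helper: read each file whole and merge the results."""
    scores = {}
    for path in paths:
        with open(path) as handle:
            collect_scores(handle.read(), scores)
    return scores


# Shortened version of the sample input from the quoted message.
sample = """\
Gibberish
53
RelevantInfo1
10/10/04
NothingImportant
RelevantInfo2
22
BlahBlah
RelevantInfo3
23
Hubris

SecondSetofGarbage
RelevantInfo1
10/10/04
HoHum
RelevantInfo2
33
RelevantInfo3
44
sdfsdf
RelevantInfo1
10/11/04
Stuff
RelevantInfo2
45
ExcitingIsntIt
RelevantInfo3
60
Lalala
"""

scores = collect_scores(sample)
print(scores)
```

Each file is read whole here rather than line by line, because the pattern has to see several lines at once. If a record can be missing one of the fields, the lazy match may run on into the next record, so the input would need checking first.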

James

On Thursday 03 March 2005 12:03 pm, Yatima wrote:
> On Thu, 03 Mar 2005 09:54:02 -0700, Steven Bethard <steven.bethard at gmail.com> wrote:
> > A possible solution, using the re module:
> >
> > py> s = """\
> > ... Gibberish
> > ... 53
> > ... MoreGarbage
> > ... 12
> > ... RelevantInfo1
> > ... 10/10/04
> > ... NothingImportant
> > ... ThisDoesNotMatter
> > ... 44
> > ... RelevantInfo2
> > ... 22
> > ... BlahBlah
> > ... 343
> > ... RelevantInfo3
> > ... 23
> > ... Hubris
> > ... Crap
> > ... 34
> > ... """
> > py> import re
> > py> m = re.compile(r"""^RelevantInfo1\n([^\n]*)
> > ...                    .*
> > ...                    ^RelevantInfo2\n([^\n]*)
> > ...                    .*
> > ...                    ^RelevantInfo3\n([^\n]*)""",
> > ...                re.DOTALL | re.MULTILINE | re.VERBOSE)
> > py> score = {}
> > py> for info1, info2, info3 in m.findall(s):
> > ...     score.setdefault(info1, {})[info3] = info2
> > ...
> > py> score
> > {'10/10/04': {'23': '22'}}
> >
> > Note that I use DOTALL to allow .* to cross line boundaries, MULTILINE
> > to have ^ apply at the start of each line, and VERBOSE to allow me to
> > write the re in a more readable form.
> >
> > If I didn't get your dict update quite right, hopefully you can see how
> > to fix it!
>
> Thanks! That was very helpful. Unfortunately, I wasn't completely clear
> when describing the problem. Is there any way to extract multiple scores
> from the same file and from multiple files? (I will probably use the
> "fileinput" module to deal with multiple files.) So, if I've got, say:
>
> Gibberish
> 53
> MoreGarbage
> 12
> RelevantInfo1
> 10/10/04
> NothingImportant
> ThisDoesNotMatter
> 44
> RelevantInfo2
> 22
> BlahBlah
> 343
> RelevantInfo3
> 23
> Hubris
> Crap
> 34
>
> SecondSetofGarbage
> 2423
> YouGetThePicture
> 342342
> RelevantInfo1
> 10/10/04
> HoHum
> 343
> MoreStuffNotNeeded
> 232
> RelevantInfo2
> 33
> RelevantInfo3
> 44
> sdfsdf
> RelevantInfo1
> 10/11/04
> InsertBoringFillerHere
> 43234
> Stuff
> MoreStuff
> RelevantInfo2
> 45
> ExcitingIsntIt
> 324234
> RelevantInfo3
> 60
> Lalala
>
> Sorry for the long and painful example input. Notice that the first two
> "RelevantInfo1" fields have the same info but that the RelevantInfo2 and
> RelevantInfo3 fields have different info. Also, there will be cases where
> RelevantInfo3 might be the same with a different RelevantInfo2. What I'm
> hoping for is something along the lines of being able to organize it like
> so (don't worry about the format of the output -- I'll deal with that
> later; "RelevantInfo" shortened to "Info" for readability):
>
>             Info1[0]                    Info1[1]                    Info1[2]    ...
> Info3[0]    Info2[Info1[0],Info3[0]]    Info2[Info1[1],Info3[0]]    ...
> Info3[1]    Info2[Info1[0],Info3[1]]    ...
> Info3[2]    Info2[Info1[0],Info3[2]]    ...
> ...
>
> I don't really care if it's a list, dictionary, array etc.
>
> Thanks again for your help. The multiline option in the re module is very
> useful.
>
> Take care.
>
> --
> Clarke's Conclusion:
> 	Never let your sense of morals interfere with doing the right thing.
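
On the layout requested above (Info1 values across the top, Info3 values down the side, Info2 in the cells): once the scores sit in a nested dict of the form {info1: {info3: info2}}, the pivot can be done in a few lines. This is a rough sketch under that assumption; to_table is a made-up name, the sample dict is hand-written from the example input, and the keys are sorted as plain strings, which will not put dates or numbers in their natural order.

```python
def to_table(scores):
    """Pivot {info1: {info3: info2}} into a header row plus one row per Info3."""
    columns = sorted(scores)  # Info1 values become the column headers
    # Every Info3 value seen under any Info1 becomes a row label.
    row_keys = sorted({info3 for inner in scores.values() for info3 in inner})
    header = [""] + columns
    # Cells with no Info2 for that (Info1, Info3) pair are left empty.
    rows = [[key] + [scores[col].get(key, "") for col in columns]
            for key in row_keys]
    return [header] + rows


# Hand-written from the example input in the quoted message.
scores = {"10/10/04": {"23": "22", "44": "33"}, "10/11/04": {"60": "45"}}

table = to_table(scores)
for row in table:
    print("\t".join(row))
```

A list of rows is used because it is easy to print or hand to the csv module later; a dict keyed on (Info1, Info3) tuples would do just as well if only lookups are needed.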

-- 
James Stroud, Ph.D.
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095