Multiline regex help

Steven Bethard steven.bethard at gmail.com
Thu Mar 3 15:45:31 EST 2005


Yatima wrote:
> On Thu, 03 Mar 2005 09:54:02 -0700, Steven Bethard <steven.bethard at gmail.com> wrote:
> 
>>A possible solution, using the re module:
>>
>>py> s = """\
>>... Gibberish
>>... 53
>>... MoreGarbage
>>... 12
>>... RelevantInfo1
>>... 10/10/04
>>... NothingImportant
>>... ThisDoesNotMatter
>>... 44
>>... RelevantInfo2
>>... 22
>>... BlahBlah
>>... 343
>>... RelevantInfo3
>>... 23
>>... Hubris
>>... Crap
>>... 34
>>... """
>>py> import re
>>py> m = re.compile(r"""^RelevantInfo1\n([^\n]*)
>>...                    .*
>>...                    ^RelevantInfo2\n([^\n]*)
>>...                    .*
>>...                    ^RelevantInfo3\n([^\n]*)""",
>>...                re.DOTALL | re.MULTILINE | re.VERBOSE)
>>py> score = {}
>>py> for info1, info2, info3 in m.findall(s):
>>...     score.setdefault(info1, {})[info3] = info2
>>...
>>py> score
>>{'10/10/04': {'23': '22'}}
>>
>>Note that I use DOTALL to allow .* to cross line boundaries, MULTILINE 
>>to have ^ apply at the start of each line, and VERBOSE to allow me to 
>>write the re in a more readable form.
>>
>>If I didn't get your dict update quite right, hopefully you can see how 
>>to fix it!
> 
> 
> Thanks! That was very helpful. Unfortunately, I wasn't completely clear when
> describing the problem. Is there any way to extract multiple scores from the
> same file and from multiple files?

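I think if you use the non-greedy .*? instead of the greedy .*, you'll 
get this behavior.  With the greedy .*, each gap swallows as much text 
as it can, so findall returns a single match running from the first 
RelevantInfo1 to the last RelevantInfo3.  With .*?, each gap stops at 
the nearest following RelevantInfo2 or RelevantInfo3 line, so you get 
one tuple per record.  For example: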

py> s = """\
... Gibberish
... 53
... MoreGarbage
[snip a whole bunch of stuff]
... RelevantInfo3
... 60
... Lalala
... """
py> import re
py> m = re.compile(r"""^RelevantInfo1\n([^\n]*)
...                    .*?
...                    ^RelevantInfo2\n([^\n]*)
...                    .*?
...                    ^RelevantInfo3\n([^\n]*)""",
...                re.DOTALL | re.MULTILINE | re.VERBOSE)
py> score = {}
py> for info1, info2, info3 in m.findall(s):
...     score.setdefault(info1, {})[info3] = info2
...
py> score
{'10/10/04': {'44': '33', '23': '22'}, '10/11/04': {'60': '45'}}
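
Since the input above is snipped, here is a self-contained sketch that 
puts the greedy and non-greedy patterns side by side.  The SAMPLE text 
and the build() helper are invented for illustration; they are not the 
original poster's data, just three made-up records chosen to give a 
dict of the same shape as the one shown above.

```python
import re

# Invented sample data: three records, two of which share a date.
SAMPLE = """\
Gibberish
53
RelevantInfo1
10/10/04
NothingImportant
RelevantInfo2
22
BlahBlah
RelevantInfo3
23
Hubris
RelevantInfo1
10/10/04
RelevantInfo2
33
RelevantInfo3
44
Crap
RelevantInfo1
10/11/04
RelevantInfo2
45
RelevantInfo3
60
Lalala
"""

FLAGS = re.DOTALL | re.MULTILINE | re.VERBOSE


def build(gap):
    # Hypothetical helper: gap is the filler between fields,
    # either ".*" (greedy) or ".*?" (non-greedy).
    return re.compile(r"""^RelevantInfo1\n([^\n]*)
                          %s
                          ^RelevantInfo2\n([^\n]*)
                          %s
                          ^RelevantInfo3\n([^\n]*)""" % (gap, gap), FLAGS)


greedy = build(".*").findall(SAMPLE)
lazy = build(".*?").findall(SAMPLE)

# Build the nested dict from the non-greedy matches, as in the session above.
score = {}
for info1, info2, info3 in lazy:
    score.setdefault(info1, {})[info3] = info2

print(greedy)
print(lazy)
print(score)
```

With this sample, the greedy pattern yields a single tuple that pairs 
the first date with the last two values, while the non-greedy pattern 
yields one tuple for each of the three records.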

If you might have multiple info2 values for the same (info1, info3) 
pair, you can try something like:

py> score = {}
py> for info1, info2, info3 in m.findall(s):
...     score.setdefault(info1, {}).setdefault(info3, []).append(info2)
...
py> score
{'10/10/04': {'44': ['33'], '23': ['22']}, '10/11/04': {'60': ['45']}}
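
For the "multiple files" part of the question, one possible approach 
is to keep a single score dict and feed it the matches from each file 
in turn.  This is only a sketch: the collect_scores() name and the 
assumption that each file is small enough to read into memory at once 
are mine, not something stated in the thread.

```python
import re

RECORD = re.compile(r"""^RelevantInfo1\n([^\n]*)
                        .*?
                        ^RelevantInfo2\n([^\n]*)
                        .*?
                        ^RelevantInfo3\n([^\n]*)""",
                    re.DOTALL | re.MULTILINE | re.VERBOSE)


def collect_scores(paths):
    """Merge the records found in every file in paths into one nested dict.

    Hypothetical helper: assumes each file fits in memory.
    """
    score = {}
    for path in paths:
        with open(path) as f:
            text = f.read()
        # Same update as above: keep every info2 seen for an (info1, info3) pair.
        for info1, info2, info3 in RECORD.findall(text):
            score.setdefault(info1, {}).setdefault(info3, []).append(info2)
    return score
```

You would call it with a list of file names, for instance the result 
of glob.glob() on whatever pattern matches your data files.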

HTH,

STeVe


