[regexp] Where's the error in this ini-file reading regexp?

Wed Nov 13 18:36:59 EST 2002

On Wed, 13 Nov 2002 20:05:15 +0100, "F. GEIGER" <fgeiger at datec.at> wrote:

>Dear all,
>
>I have to parse a string which contains data in ini-file-format, i.e.:
>
>   s = \
>'''
>[Section 1]
>Key11=Value11
>Key12=Value12
>Key13=Value13
>[Section 2]
>Key21=Value21
>Key22=Value22
>Key23=Value23
>'''
>
>I decided to try a solution using re (yes, ConfigParser or simple string
>splitting would be another way to do it), because the structure really is
>regular: Sections, which contain key/value pairs.
>
>I'm sure there's only a tiny step left to succeed.
>
>I tried:
>
>rex = re.compile(r"(\s*(\[.+\])(\s*((.+)=(.+)))+?)",
>re.MULTILINE|re.IGNORECASE)
>L = rex.findall(s)
>
>where s is the string already shown above.
>
>What I get is:
>
>[('\n[Section 1]\nKey11=Value11', '[Section 1]', '\nKey11=Value11', \
>'Key11=Value11', 'Key11', 'Value11'), ('\n[Section 2]\nKey21=Value21', \
>'[Section 2]', '\nKey21=Value21', 'Key21=Value21', 'Key21', 'Value21')]
>
>So L[0][1] contains the string '[Section 1]', L[0][4] the string 'Key11',
>L[0][5] the string 'Value11'.
>Section 2 is also there, in L[1].
>
>So what I have is both sections, but only one (i.e. the first one) key/value
>pair for each of those two sections.
>
>When I remove the last '?' in ...(.+=.+))+?)" then the last key/value pair
>instead of the first one is the result. So this must have to do with
>greediness/non-greediness. What I wonder is, why do I not get all three
>key/value pairs? How can I tell the re engine "gimme *all* groups having
>key=value form", when a '+' delivers only the last, and '+?' only the first
>of those pairs?
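
On why only one pair comes back: as far as I know, a capturing group inside a
repetition is matched once per iteration, but the match object keeps only the
text from the last iteration ('+') or, with '+?', from the single iteration the
engine settled for. A minimal sketch of that behaviour, in Python 3 syntax; the
trimmed-down pattern and the names rex, m, last_pair and pairs are mine, just
for illustration:

```python
import re

s = "[Section 1]\nKey11=Value11\nKey12=Value12\nKey13=Value13\n"

# One section header followed by one or more key=value lines, in a single match.
rex = re.compile(r"(\[.+\])(?:\s*(.+)=(.+))+")
m = rex.search(s)

# The whole section is matched, but groups 2 and 3 hold only the last iteration.
last_pair = (m.group(2), m.group(3))

# To recover every pair, scan the matched text a second time.
pairs = re.findall(r"^(.+?)=(.+)$", m.group(0), re.MULTILINE)
```

So the outer match does cover all three lines; it is only the group contents
that get overwritten on each pass through the repetition.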

Assuming you want to do the whole thing with an re, maybe this would be more useful?
(the "if 1:" was so I could copy/paste without reindenting)

 >>> if 1:
 ...    s = \
 ... '''
 ... [Section 1]
 ... Key11=Value11
 ... Key12=Value12
 ... Key13=Value13
 ... [Section 2]
 ... Key21=Value21
 ... Key22=Value22
 ... Key23=Value23
 ... '''
 ...
 >>> import re
 >>> rx = re.compile(r"^\s*(\[.+\]\s*$)|^\s*(.+?)=(.+)$", re.MULTILINE|re.IGNORECASE)
 >>> rx.findall(s)
 [('[Section 1]', '', ''), ('', 'Key11', 'Value11'), ('', 'Key12', 'Value12'),
  ('', 'Key13', 'Value13'), ('[Section 2]', '', ''), ('', 'Key21', 'Value21'),
  ('', 'Key22', 'Value22'), ('', 'Key23', 'Value23')]

or, easier to see:

 >>> for t in rx.findall(s): print t
 ...
 ('[Section 1]', '', '')
 ('', 'Key11', 'Value11')
 ('', 'Key12', 'Value12')
 ('', 'Key13', 'Value13')
 ('[Section 2]', '', '')
 ('', 'Key21', 'Value21')
 ('', 'Key22', 'Value22')
 ('', 'Key23', 'Value23')

This should be easy to process sequentially: a non-empty t[0] marks a new section,
and otherwise t[1] and t[2] hold the key and the value. BTW, making the key group
(.+?) non-greedy means it stops at the first '=', so any later '=' ends up in the
value. As written, spaces at the end of a key and at the beginning of a value are
kept; if you don't want them, use \s*=\s* where = is now. Similarly, to trim possible
trailing blanks from a value, make the value group non-greedy as well and use \s*$
in place of $ (I kept your leading blank trim).
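
The sequential processing could be sketched roughly as below, in Python 3
syntax. The function name parse_ini and the nested-dict result are my own
choices, not anything from the original post. The sketch also departs from the
pattern above in three ways, all of them assumptions on my part: it captures
the section name without the brackets, it uses [ \t]* rather than \s* so that
the whitespace trimming cannot run across a line break, and it silently skips
any key=value line that appears before the first section.

```python
import re

rx = re.compile(
    r"^[ \t]*\[(.+)\][ \t]*$"                   # section header, name in group 1
    r"|^[ \t]*(.+?)[ \t]*=[ \t]*(.*?)[ \t]*$",  # key in group 2, value in group 3
    re.MULTILINE,
)

def parse_ini(text):
    """Collect {section: {key: value}} from ini-style text (hypothetical helper)."""
    sections = {}
    current = None
    for m in rx.finditer(text):
        name, key, value = m.groups()
        if name is not None:
            # A header line starts (or reopens) a section.
            current = sections.setdefault(name, {})
        elif current is not None:
            # A key=value line belongs to the most recent section.
            current[key] = value
    return sections

s = """
[Section 1]
Key11=Value11
Key12=Value12
Key13=Value13
[Section 2]
Key21=Value21
Key22=Value22
Key23=Value23
"""

config = parse_ini(s)
```

With finditer the unused groups come back as None rather than '', which makes
the "is this a section line?" test a little more explicit than checking for an
empty string.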

Regards,
Bengt Richter