[regexp] Where's the error in this ini-file reading regexp?

F. GEIGER fgeiger at datec.at
Thu Nov 14 02:35:00 EST 2002


>  >>> rx = re.compile(r"^\s*(\[.+\]\s*$)|^\s*(.+?)=(.+)$",
>  ...                 re.MULTILINE|re.IGNORECASE)

Perfect, that's what I'm after. So the secret lies in that '|', right? Why
is that? No, that's the wrong question. What I should ask is: how can I learn
more about this topic more easily? How did you learn it? Are you on Linux,
where you are more likely to use re on a daily basis?
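
A minimal sketch of what the '|' does, using the pattern quoted above on a
shortened input string (the string and the name `matches` are illustrative
assumptions, not from the original post). Each match comes from exactly one
of the two alternatives, and findall() reports the groups that belong to the
other alternative as empty strings:

```python
import re

# Same pattern as in the quoted reply: a section-header alternative and a
# key=value alternative, joined by '|'.
rx = re.compile(r"^\s*(\[.+\]\s*$)|^\s*(.+?)=(.+)$",
                re.MULTILINE | re.IGNORECASE)

# Shortened, hypothetical input: one header line and one key/value line.
s = "[Section 1]\nKey11=Value11\n"

# The header line matches the first alternative (group 1 filled in); the
# key/value line matches the second one (groups 2 and 3 filled in).
matches = rx.findall(s)
for t in matches:
    print(t)
```

So the regexp never has to describe a whole section at once. It only has to
recognise one line at a time, and findall() does the repetition.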

Anyway, thanks a lot for your post!

Kind regards
Franz



"Bengt Richter" <bokr at oz.net> wrote in message
news:aqunmr$iap$0 at 216.39.172.122...
> On Wed, 13 Nov 2002 20:05:15 +0100, "F. GEIGER" <fgeiger at datec.at> wrote:
>
> >Dear all,
> >
> >I have to parse a string which contains data in ini-file-format, i.e.:
> >
> >   s = \
> >'''
> >[Section 1]
> >Key11=Value11
> >Key12=Value12
> >Key13=Value13
> >[Section 2]
> >Key21=Value21
> >Key22=Value22
> >Key23=Value23
> >'''
> >
> >I decided to try a solution using re (yes, ConfigParser or simple string
> >splitting would be another way to do it), because the structure really is
> >regular: sections, which contain key/value pairs.
> >
> >I'm sure there's only a tiny step left to succeed.
> >
> >I tried:
> >
> >rex = re.compile(r"(\s*(\[.+\])(\s*((.+)=(.+)))+?)",
> >re.MULTILINE|re.IGNORECASE)
> >L = rex.findall(s)
> >
> >where s is the string already shown above.
> >
> >What I get is:
> >
> >[('\n[Section 1]\nKey11=Value11', '[Section 1]', '\nKey11=Value11', \
> >'Key11=Value11', 'Key11', 'Value11'), ('\n[Section 2]\nKey21=Value21', \
> >'[Section 2]', '\nKey21=Value21', 'Key21=Value21', 'Key21', 'Value21')]
> >
> >So L[0][1] contains the string '[Section 1]', L[0][4] the string 'Key11',
> >and L[0][5] the string 'Value11'.
> >Section 2 is also there, in L[1].
> >
> >So what I have is both sections, but only one key/value pair (the first
> >one) for each of those two sections.
> >
> >When I remove the last '?' in ...((.+)=(.+)))+?)", the result is the last
> >key/value pair instead of the first one. So this must have to do with
> >greediness/non-greediness. What I wonder is: why do I not get all three
> >key/value pairs? How can I tell the re engine "gimme *all* groups having
> >key=value form", when a '+' delivers only the last and '+?' only the first
> >of those pairs?
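
On the question of why only one pair comes back: a capturing group inside a
repetition keeps only the text of its final iteration, so no single match can
hand back all three pairs through the same group. The sketch below shows that
behaviour and one possible workaround, a two-step approach that captures each
section body once and then runs a second findall() over it. The input string,
the patterns, and the names `last_pair` and `sections` are illustrative
assumptions (Python 3 syntax), not something from the original post:

```python
import re

# Hypothetical, shortened input with two sections.
s = "[Section 1]\nKey11=Value11\nKey12=Value12\n[Section 2]\nKey21=Value21\n"

# A repeated capturing group remembers only its final iteration, so with a
# greedy '+' groups 2 and 3 end up holding the last pair of the first section.
m = re.search(r"(\[.+\])(?:\s*(.+)=(.+))+", s)
last_pair = (m.group(2), m.group(3))

# Two-step alternative: group 2 captures the whole run of key=value lines,
# and a second findall() then extracts every pair from that run.
sections = {}
for sec in re.finditer(r"\[(.+)\]((?:\s*.+=.+)+)", s):
    sections[sec.group(1)] = re.findall(r"(.+)=(.+)", sec.group(2))

print(last_pair)
print(sections)
```

The second step is what supplies the "gimme all" part: findall() returns one
tuple per match instead of overwriting a group.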
>
> Assuming you want to do the whole thing with an re, maybe this would be more
> useful? (The "if 1:" was so I could copy/paste without reindenting.)
>
>  >>> if 1:
>  ...    s = \
>  ... '''
>  ... [Section 1]
>  ... Key11=Value11
>  ... Key12=Value12
>  ... Key13=Value13
>  ... [Section 2]
>  ... Key21=Value21
>  ... Key22=Value22
>  ... Key23=Value23
>  ... '''
>  ...
>  >>> import re
>  >>> rx = re.compile(r"^\s*(\[.+\]\s*$)|^\s*(.+?)=(.+)$",
>  ...                 re.MULTILINE|re.IGNORECASE)
>  >>> rx.findall(s)
>  [('[Section 1]', '', ''), ('', 'Key11', 'Value11'), ('', 'Key12', 'Value12'),
>  ('', 'Key13', 'Value13'), ('[Section 2]', '', ''), ('', 'Key21', 'Value21'),
>  ('', 'Key22', 'Value22'), ('', 'Key23', 'Value23')]
>
> or, easier to see:
>
>  >>> for t in rx.findall(s): print t
>  ...
>  ('[Section 1]', '', '')
>  ('', 'Key11', 'Value11')
>  ('', 'Key12', 'Value12')
>  ('', 'Key13', 'Value13')
>  ('[Section 2]', '', '')
>  ('', 'Key21', 'Value21')
>  ('', 'Key22', 'Value22')
>  ('', 'Key23', 'Value23')
>
> which should be easy to process sequentially, checking for a new section
> name in t[0], and otherwise using t[1] and t[2]. BTW, t[1] is made non-greedy
> so that the key stops at the first '='. You might want to check whether
> spaces should be included at the end of a key and the beginning of a value;
> if not, you might want \s*=\s* where = is now. Similarly, use \s*$ in place
> of $ if you want to trim possible trailing blanks (I kept your leading-blank
> trim).
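
The sequential processing described above could be sketched roughly as
follows. This is only one way to do it, in Python 3 syntax: `parse_ini` and
`LINE_RX` are hypothetical names, and the pattern is a variation on the
quoted one, not the original. It moves the brackets outside group 1 so the
bare section name is captured, and it uses [ \t]* where the text suggests
\s*. That last change is an assumption made on purpose: \s also matches
newlines, so with \s*=\s* a key with an empty value could swallow the
following line as its value.

```python
import re

# One line per match: either a section header or a key/value pair.
LINE_RX = re.compile(
    r"^[ \t]*\[(.+)\][ \t]*$"                   # section header -> group 1
    r"|^[ \t]*(.+?)[ \t]*=[ \t]*(.+?)[ \t]*$",  # key = value -> groups 2, 3
    re.MULTILINE,
)

def parse_ini(text):
    """Return {section: {key: value}} built from the findall() tuples."""
    result = {}
    current = None
    for section, key, value in LINE_RX.findall(text):
        if section:
            # A non-empty first element starts a new section.
            current = result.setdefault(section, {})
        elif current is not None:
            # Otherwise it is a pair belonging to the current section;
            # pairs that appear before any header are ignored.
            current[key] = value
    return result

# Hypothetical input with blanks around '=' and at the end of a line.
s = "\n[Section 1]\nKey11 = Value11\nKey12=Value12  \n[Section 2]\nKey21=a=b\n"
parsed = parse_ini(s)
print(parsed)
```

Because the key group is non-greedy, a value that itself contains '=' (as in
the last line of the sample input) stays in one piece.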
>
>
>
>
> Regards,
> Bengt Richter




