Regular Expressions: Can't quite figure this problem out

Robert Dailey rcdailey at gmail.com
Tue Sep 25 11:44:35 EDT 2007


Fortunately, I don't have any XML that complex; however, you make a good
point.

On 9/25/07, Paul McGuire <ptmcg at austin.rr.com> wrote:
>
> On Sep 24, 11:23 pm, "Gabriel Genellina" <gagsl-... at yahoo.com.ar>
> wrote:
> > On Mon, 24 Sep 2007 23:51:57 -0300, Robert Dailey <rcdai... at gmail.com>
> > wrote:
> >
> > > What I meant was that it's not an option because I'm trying to learn
> > > regular expressions. RE is just as built in as anything else.
> >
> > Ok, let's analyze what you want. You have for instance this text:
> > "<action></action>"
> > which should become
> > "<action/>"
> >
> > You have to match:
> > (opening angle bracket)(any word)(closing angle bracket)(opening angle
> > bracket)(slash)(same word as before)(closing angle bracket)
> >
> > This translates rather directly into this regular expression:
> >
> > r"<(\w+)></\1>"
> >
> > where \w+ means "one or more alphanumeric characters or _", and being
> > surrounded in () creates a group (group number one), which is
> > back-referenced as \1 to express "same word as before".
> > The matched text should be replaced by (opening <)(the word
> > found)(slash)(closing >), that is: r"<\1/>"
> > Using the sub function in module re:
> >
> > py> import re
> > py> source = """
> > ... <root></root>
> > ... <root/>
> > ... <root><frame type="image"><action></action></frame></root>
> > ... <root><frame type="image"><action/></frame></root>
> > ... """
> > py> print(re.sub(r"<(\w+)></\1>", r"<\1/>", source))
> >
> > <root/>
> > <root/>
> > <root><frame type="image"><action/></frame></root>
> > <root><frame type="image"><action/></frame></root>
> >
> > Now, a more complex example, involving tags with attributes:
> > <frame type="image"></frame>  -->  <frame type="image" />
> >
> > You have to match:
> > (opening angle bracket)(any word)(any sequence of words,spaces,other
> > symbols,but NOT a closing angle bracket)(closing angle bracket)(opening
> > angle bracket)(slash)(same word as before)(closing angle bracket)
> >
> > r"<(\w+)([^>]*)></\1>"
> >
> > [^>] means "anything but a >", the * means "may occur many times, maybe
> > zero", and it's enclosed in () to create group 2.
> >
> > py> source = """
> > ... <root></root>
> > ... <root><frame type="image"></frame></root>
> > ... """
> > py> print(re.sub(r"<(\w+)([^>]*)></\1>", r"<\1\2 />", source))
> >
> > <root />
> > <root><frame type="image" /></root>
> >
> > Next step would be to allow whitespace wherever it is legal to appear -
> > left as an exercise to the reader. Hint: use \s*
> >
> > --
> > Gabriel Genellina
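
One possible answer to the whitespace exercise, as a minimal sketch rather
than Gabriel's own solution. It is written for Python 3. The names
EMPTY_ELEMENT and collapse_empty are made up for illustration, and the
pattern assumes that an element containing only whitespace may be treated
as empty, which does change the document's text content.

```python
import re

# Assumption: whitespace-only content counts as "empty" here.
# \s* is allowed before the first ">", between the tags, and before the
# final ">". Group 2 is lazy so it does not swallow trailing whitespace.
EMPTY_ELEMENT = re.compile(r"<(\w+)([^>]*?)\s*>\s*</\1\s*>")

def collapse_empty(text):
    """Rewrite <tag ...></tag> as <tag ... />, tolerating stray whitespace."""
    return EMPTY_ELEMENT.sub(r"<\1\2 />", text)

print(collapse_empty('<frame type="image" >\n  </frame >'))
print(collapse_empty("<root><action></action ></root>"))
```

Like the patterns above, this handles one level per pass: an outer element
that only becomes empty after its child is collapsed is left as it is.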
>
> And let's hope the OP doesn't have to parse anything truly nasty like:
>
> <esolang:language name="Python" interpreter_prompt=">>>"></esolang:language>
>
> -- Paul
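
Paul's case can be checked with a short sketch, written for Python 3.
Two assumptions: the "esolang:" prefix is dropped, because the standard
library parser rejects a namespace prefix that is never declared, and
xml.etree.ElementTree stands in for "a real XML parser". Nobody in the
thread named a specific one.

```python
import re
import xml.etree.ElementTree as ET

# Simplified form of Paul's example (namespace prefix removed, see above).
nasty = '<language name="Python" interpreter_prompt=">>>"></language>'

# Gabriel's second pattern: [^>]* stops at the first ">" inside the
# attribute value, so the closing tag is never reached and nothing matches.
pattern = re.compile(r"<(\w+)([^>]*)></\1>")
regex_result = pattern.sub(r"<\1\2 />", nasty)
print(regex_result == nasty)  # True: the text comes back unchanged

# A parser knows where the attribute value ends.
element = ET.fromstring(nasty)
print(element.attrib["interpreter_prompt"])

# Serializing an element with no content writes it in the short form.
serialized = ET.tostring(element, encoding="unicode")
print(serialized)
```

The regex does not corrupt the input here; it simply fails to collapse the
element, which is the quieter kind of wrong answer.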