Regular Expressions: Can't quite figure this problem out

Robert Dailey rcdailey at gmail.com
Tue Sep 25 10:50:44 EDT 2007


Awesome description. This was more than helpful. I'm really grateful that
you took the time to outline that for me. I really understand it now.
However, as I mentioned on the lxml mailing list, I'm starting to lean more
towards regular expressions being a very LAST resort for solving problems
like this. In my specific case, I have a better choice, which is the etree
parser. It does all of this for me (as you so kindly stated before). I hope
this is the correct attitude to have. Being a C++ developer, I normally
don't admire unmanageable and unreadable code (this is especially true of
regular expressions). They're very useful, but again I believe they should be
a last resort.
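
For illustration, here is a minimal sketch of the parser route. It assumes
the standard library's xml.etree.ElementTree rather than lxml's etree, and
the helper name collapse_with_parser is made up for this example. Parsing and
re-serializing is enough here, because the serializer (with its default
settings in Python 3.4 and later) writes an element with no text and no
children in the short form.

```python
import xml.etree.ElementTree as ET


def collapse_with_parser(xml_text):
    """Parse the document and serialize it again.

    Hypothetical helper, not part of any library. With default settings,
    ElementTree writes an element that has no text and no children in the
    short form, with a space before the slash.
    """
    root = ET.fromstring(xml_text)
    # encoding="unicode" asks for a str instead of bytes
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(collapse_with_parser(
        '<root><frame type="image"><action></action></frame></root>'))
```

Note that an element containing only whitespace still has text as far as the
parser is concerned, so it is left in the long form.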

Thanks again for your help.

On 9/24/07, Gabriel Genellina <gagsl-py2 at yahoo.com.ar> wrote:
>
> On Mon, 24 Sep 2007 23:51:57 -0300, Robert Dailey <rcdailey at gmail.com>
> wrote:
>
> > What I meant was that it's not an option because I'm trying to learn
> > regular
> > expressions. RE is just as built in as anything else.
>
> Ok, let's analyze what you want. You have for instance this text:
> "<action></action>"
> which should become
> "<action/>"
>
> You have to match:
> (opening angle bracket)(any word)(closing angle bracket)(opening angle
> bracket)(slash)(same word as before)(closing angle bracket)
>
> This translates rather directly into this regular expression:
>
> r"<(\w+)></\1>"
>
> where \w+ means "one or more alphanumeric characters or _", and being
> surrounded in () creates a group (group number one), which is
> back-referenced as \1 to express "same word as before".
> The matched text should be replaced by (opening <)(the word
> found)(slash)(closing >), that is: r"<\1/>"
> Using the sub function in module re:
>
> py> import re
> py> source = """
> ... <root></root>
> ... <root/>
> ... <root><frame type="image"><action></action></frame></root>
> ... <root><frame type="image"><action/></frame></root>
> ... """
> py> print(re.sub(r"<(\w+)></\1>", r"<\1/>", source))
>
> <root/>
> <root/>
> <root><frame type="image"><action/></frame></root>
> <root><frame type="image"><action/></frame></root>
>
> Now, a more complex example, involving tags with attributes:
> <frame type="image"></frame>  -->  <frame type="image" />
>
> You have to match:
> (opening angle bracket)(any word)(any sequence of words, spaces, other
> symbols, but NOT a closing angle bracket)(closing angle bracket)(opening
> angle bracket)(slash)(same word as before)(closing angle bracket)
>
> r"<(\w+)([^>]*)></\1>"
>
> [^>] means "anything but a >", the * means "may occur many times, maybe
> zero", and it's enclosed in () to create group 2.
>
> py> source = """
> ... <root></root>
> ... <root><frame type="image"></frame></root>
> ... """
> py> print(re.sub(r"<(\w+)([^>]*)></\1>", r"<\1\2 />", source))
>
> <root />
> <root><frame type="image" /></root>
>
> Next step would be to allow whitespace wherever it is legal to appear -
> left as an exercise to the reader. Hint: use \s*
>
> --
> Gabriel Genellina
>
> --
> http://mail.python.org/mailman/listinfo/python-list
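
One possible answer to the whitespace exercise at the end of the quoted
message, as a sketch rather than a complete solution. It assumes that
whitespace may appear before the closing > of either tag, and that an element
containing only whitespace counts as empty. That second assumption changes
the document's content, so drop the middle \s* if it is not wanted. The names
EMPTY_ELEMENT and collapse_empty are made up for this example.

```python
import re

# <name attrs, optional space, > optional space </name, optional space, >
# Group 2 is lazy so that trailing whitespace before ">" is not captured.
EMPTY_ELEMENT = re.compile(r"<(\w+)([^>]*?)\s*>\s*</\1\s*>")


def collapse_empty(text):
    """Rewrite empty open/close tag pairs in the short form (one pass only)."""
    return EMPTY_ELEMENT.sub(r"<\1\2 />", text)


if __name__ == "__main__":
    print(collapse_empty('<root><frame type="image" >\n</frame ></root>'))
```

Like the patterns in the quoted message, this makes a single pass and assumes
well-formed input. It does not try to handle comments or CDATA sections.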