Regular Expressions: Can't quite figure this problem out

Gabriel Genellina gagsl-py2 at yahoo.com.ar
Tue Sep 25 00:23:30 EDT 2007


On Mon, 24 Sep 2007 23:51:57 -0300, Robert Dailey <rcdailey at gmail.com>
wrote:

> What I meant was that it's not an option because I'm trying to learn
> regular expressions. RE is just as built in as anything else.

OK, let's analyze what you want. You have, for instance, this text:
"<action></action>"
which should become
"<action/>"

You have to match:
(opening angle bracket)(any word)(closing angle bracket)(opening angle  
bracket)(slash)(same word as before)(closing angle bracket)

This translates rather directly into this regular expression:

r"<(\w+)></\1>"

where \w+ means "one or more alphanumeric characters or _", and
surrounding it with () creates a group (group number one), which is
back-referenced as \1 to express "same word as before".
The matched text should be replaced by (opening <)(the word
found)(slash)(closing >), that is: r"<\1/>"
Using the sub function in module re:

py> import re
py> source = """
... <root></root>
... <root/>
... <root><frame type="image"><action></action></frame></root>
... <root><frame type="image"><action/></frame></root>
... """
py> print(re.sub(r"<(\w+)></\1>", r"<\1/>", source))

<root/>
<root/>
<root><frame type="image"><action/></frame></root>
<root><frame type="image"><action/></frame></root>
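
To see what the back-reference is doing, a minimal sketch like the following could check which strings the pattern accepts. The names pattern, samples and results are only illustrative. Only a closing tag that repeats the word captured by group 1 should match:

```python
import re

# Same pattern as above: group 1 captures the tag name, \1 requires it again.
pattern = re.compile(r"<(\w+)></\1>")

samples = ["<action></action>", "<action></frame>", "<action> </action>"]

results = {}
for text in samples:
    match = pattern.search(text)
    # match.group(1) is the captured tag name when the pattern matches
    results[text] = match.group(1) if match else None
    print("%-20s -> %r" % (text, results[text]))
```

The second sample fails because the closing word differs, and the third because the pattern allows nothing at all between the two tags.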

Now, a more complex example, involving tags with attributes:
<frame type="image"></frame>  -->  <frame type="image" />

You have to match:
(opening angle bracket)(any word)(any sequence of words, spaces, other
symbols, but NOT a closing angle bracket)(closing angle bracket)(opening
angle bracket)(slash)(same word as before)(closing angle bracket)

r"<(\w+)([^>]*)></\1>"

[^>] means "anything but a >", the * means "may occur many times, maybe  
zero", and it's enclosed in () to create group 2.

py> source = """
... <root></root>
... <root><frame type="image"></frame></root>
... """
py> print(re.sub(r"<(\w+)([^>]*)></\1>", r"<\1\2 />", source))

<root />
<root><frame type="image" /></root>
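
If it helps to see what each group actually captured, a small sketch along these lines prints both groups for the frame example. The variable names are just for illustration:

```python
import re

# Group 1 is the tag name; group 2 is whatever follows it up to the next ">".
pattern = re.compile(r"<(\w+)([^>]*)></\1>")

match = pattern.search('<root><frame type="image"></frame></root>')
tag_name = match.group(1)    # the word right after "<"
attributes = match.group(2)  # attribute text, leading space included
print(repr(tag_name), repr(attributes))
```

Note that group 2 keeps the space that separates the tag name from the first attribute, which is why the replacement r"<\1\2 />" needs no extra space between \1 and \2.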

The next step would be to allow whitespace wherever it may legally appear;
that is left as an exercise for the reader. Hint: use \s*
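
One possible reading of that hint, as a sketch only: it assumes whitespace should be tolerated just before the closing ">" of either tag, and that whitespace between the two tags counts as content and is left alone. The helper name collapse_empty is made up for this example.

```python
import re

# Assumption: whitespace is tolerated just before the ">" of either tag.
# Whitespace *between* the two tags is treated as content and not collapsed.
# [^>]*? is non-greedy so that trailing spaces go to \s* and not to group 2.
pattern = re.compile(r"<(\w+)([^>]*?)\s*></\1\s*>")

def collapse_empty(text):
    """Hypothetical helper: rewrite <x ...></x> as <x ... />."""
    return pattern.sub(r"<\1\2 />", text)

print(collapse_empty('<frame type="image" ></frame >'))
print(collapse_empty("<action  ></action>"))
print(collapse_empty("<a> </a>"))
```

Whether "<a> </a>" should also be collapsed depends on what the data means; if it should, adding \s* between the ">" and the "</" would be the obvious variation.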

-- 
Gabriel Genellina



