f*cking re module

Tue Jul 5 13:06:28 EDT 2005

Your elaboration on what problem you are actually trying to solve gave
me some additional insights into your question.  It looks like you are
writing a Python-HTML templating system, by embedding Python within
HTML using <python>...</python> tags.

As many may have already guessed, I worked up a pyparsing treatment of
your problem.  As part of the implementation, I reinterpreted your
transformations slightly.  You said:

>>>I want to replace the <python> with " ",  </python>
>>>with "\n" and every thing that's not between the two
>>>python tags must begin with "\nprint \"\"\"" and
>>>end with "\"\"\"\n"

If this were an HTML page with <python> tags, it might look like:

<some HTML>
<python>
x = 1
</python>
<some more HTML>

The corresponding CGI python code would then read:
print """<some HTML>\n"""
x = 1
print """<some more HTML>\n"""

So we can reinterpret your transformation as:
1. From start of file to first <python> tag,
   enclose in print """<leading stuff>\n"""
2. From <python> tag to </python> tag, emit contents unchanged
3. From </python> tag to next <python> tag,
   enclose in print """<stuff between tags>\n"""
4. From last </python> tag to end of file,
   enclose in print """<ending stuff>\n"""

Or more formally:
<beginning of file>  -> 'print r"""'
<python> -> '"""\n'
</python> -> 'print r"""'
<end of file> -> '"""\n'
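
As a rough illustration of the mapping above (and not the pyparsing
approach that follows), the four rules can be sketched with the standard
re module.  The names transform, TAG, and the four constants are made up
for this sketch.  The generated text uses Python 2 print syntax to match
the rest of this post, and it is only built here, not executed.  The
sketch assumes the HTML does not itself contain triple quotes.

```python
import re

# Replacement strings taken from the mapping above.
STRING_START = 'print r"""'   # <beginning of file>
PYTHON_START = '"""\n'        # <python>
PYTHON_END = 'print r"""'     # </python>
STRING_END = '"""\n'          # <end of file>

# Case-insensitive match for either tag; group 1 is "/" for a closing tag.
TAG = re.compile(r"<(/?)python>", re.IGNORECASE)

def transform(html):
    """Apply the four rewrite rules and return the generated source text."""
    def swap(match):
        # A closing tag reopens the print string; an opening tag closes it.
        return PYTHON_END if match.group(1) else PYTHON_START
    return STRING_START + TAG.sub(swap, html) + STRING_END

sample = "<p>before</p>\n<python>\nx = 1\n</python>\n<p>after</p>\n"
generated = transform(sample)
print(generated)
```

Unlike the grammar-based version, this handles the start and end of the
input by plain concatenation, so no positional matching is involved.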

Now that we have this defined, we can consider adding some standard
imports to the <beginning of file> transformation, such as "import
sys", etc.

Here is a working implementation.  The grammar itself is only about 10
lines of code, mostly in defining the replacement transforms.  The last
18 lines are the test case itself, printing the transformed string, and
then exec'ing the transformed string.

========================
# Take HTML that has <python> </python> tags interspersed, with python code
# between the <python> tags.  Convert to running python cgi program.

# replace <python> with '"""\n' and </python> with '\nprint r"""\n'
# also put header text ending in 'print r"""' at the beginning and '"""\n' at the end

from pyparsing import CaselessLiteral, ParseException, StringEnd, StringStart, replaceWith

class OnlyOnce(object):
    # Wrap a parse action so that it runs at most once.
    def __init__(self, methodCall):
        self.callable = methodCall
        self.called = False
    def __call__(self,s,l,t):
        if not self.called:
            self.called = True
            return self.callable(s,l,t)
        # Already ran once: fail the match so scanning moves on.
        raise ParseException(s,l,"")

stringStartText = """import sys
print "Content-Type: text/html\\n"
print r\"\"\""""
stringEndText = '"""\n'
startPythonText = '"""\n'
endPythonText = '\nprint r"""\n'

# define grammar
pythonStart = CaselessLiteral("<python>")
pythonEnd = CaselessLiteral("</python>")
sStart = StringStart()
sEnd = StringEnd()

sStart.setParseAction( OnlyOnce( replaceWith(stringStartText) ) )
sEnd.setParseAction( replaceWith(stringEndText) )
pythonStart.setParseAction( replaceWith(startPythonText) )
pythonEnd.setParseAction( replaceWith(endPythonText) )

xform = sStart | sEnd | pythonStart | pythonEnd

# run test case
htmlWithPython = r"""<HTML>
<HEAD>
<TITLE>Sample Page Created from Python</TITLE>
</HEAD>
<BODY>
<H1>Sample Page Created from Python</H1>
<python>
for i in range(10):
    print "This is line %d<br>" % i
</python>
</BODY>
</HTML>
"""

generatedPythonCode = xform.transformString( htmlWithPython )
print generatedPythonCode
print
exec(generatedPythonCode)
========================
Here is the output:
import sys
print "Content-Type: text/html\n"
print r"""<HTML>
<HEAD>
<TITLE>Sample Page Created from Python</TITLE>
</HEAD>
<BODY>
<H1>Sample Page Created from Python</H1>
"""

for i in range(10):
    print "This is line %d<br>" % i

print r"""

</BODY>
</HTML>
"""

Content-Type: text/html

<HTML>
<HEAD>
<TITLE>Sample Page Created from Python</TITLE>
</HEAD>
<BODY>
<H1>Sample Page Created from Python</H1>

This is line 0<br>
This is line 1<br>
This is line 2<br>
This is line 3<br>
This is line 4<br>
This is line 5<br>
This is line 6<br>
This is line 7<br>
This is line 8<br>
This is line 9<br>

</BODY>
</HTML>
========================

This exercise was interesting to me in that it uncovered some
unexpected behavior in pyparsing when matching on positional tokens (in
this case StringStart and StringEnd).  I learned that:
1. Since StringStart does not advance the parsing position in the
string, it is necessary to ensure that the parse action runs only
once, and raises a ParseException on subsequent calls.  The little
class OnlyOnce takes care of this (I will probably fold OnlyOnce into
the next point release of pyparsing).
2. StringEnd is not well matched during scanString or transformString
if there is no trailing whitespace at the end of the input.  Even a
trailing \n is sufficient.  My first example of test data ended with the
closing </HTML> tag, with no newline after it, and
scanString/transformString failed to match.  Once I added a newline
after the </HTML> tag, scanString could find the StringEnd.  This
is not a terrible workaround, but it's another loose end to tie up in
the next release.
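
The workaround in point 2 amounts to padding the input before handing it
to transformString.  A minimal sketch, assuming a helper named
ensure_trailing_newline (a made-up name, not part of pyparsing):

```python
def ensure_trailing_newline(text):
    """Return text unchanged if it ends with a newline, else append one."""
    if text.endswith("\n"):
        return text
    return text + "\n"

html = "<HTML>\n</HTML>"   # no newline after the closing tag
padded = ensure_trailing_newline(html)
print(repr(padded))
```

The padded string would then be passed to xform.transformString in
place of the raw input.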

Enjoy!
-- Paul