parsing in python

Wed Jun 9 13:36:22 EDT 2004

"Peter Sprenger" <sprenger at moving-bytes.de> wrote in message
news:ca6ep3$8ni$01$1 at news.t-online.com...
> Hello,
>
> I hope somebody can help me with my problem. I am writing Zope python
> scripts that will do parsing on text for dynamic webpages: I am getting
> a text from an oracle database that contains different tags that have to
> be converted to a HTML expression. E.g. "<pic#>" ( # is an integer
> number) has to be converted to <img src="..."> where the image data
> comes also from a database table.
> Since strings are immutable, is there an effective way to parse such
> texts in Python? In the process of finding and converting the embedded
> tags I also would like to make a word wrap on the generated HTML output
> to increase the readability of the generated HTML source.
> Can I write an efficient parser in Python or should I extend Python with
> a C routine that will do this task in O(n)?
>
> Regards
>
> Peter Sprenger

Peter -

Not sure how this holds up to "high-performance" requirements, but this
should work as a prototype until you need something better.  (Requires
download of latest pyparsing 1.2beta3, at http://pyparsing.sourceforge.net
.)  Note that this grammar is tolerant of upper or lowercase PIC, plus
inclusion of whitespace between tokens and tag attributes within the
<pic###> tag.

BTW, I'll be the first one to admit that this is a lot wordier (and very
possibly slower) than something like re.sub().  But it is *very* productive
from a programming standpoint, and implicitly takes care of nuisance issues
like unexpected whitespace.  It is also simple from a maintenance
standpoint: adding support for caseless matching on 'pic', or for additional
tag attributes, was very straightforward.  I'm just not good enough with re's
to make similar changes in as short a time, or in as readable a
style.
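
For comparison, here is a rough sketch of what the re.sub() route mentioned above might look like. It is not from the original post: the pattern, the function names, and the lookup table are illustrative assumptions, and it handles only the cases in the test data (caseless "pic", optional whitespace, optional trailing attributes).

```python
import re

# Hypothetical lookup table mirroring the one in the example
# (pic number -> image file name); real data would come from the database.
image_files = {22: "flower.jpg", 17: "house.jpg", 38: "dog.jpg"}

# "<", optional whitespace, "pic" in any case, optional whitespace,
# the pic number, then anything up to the closing ">".
PIC_TAG = re.compile(r"<\s*pic\s*(\d+)([^>]*)>", re.IGNORECASE)

def pic_to_img(match):
    """Build the <img> tag for one matched <pic###> tag."""
    img_file = image_files.get(int(match.group(1)), "default.jpg")
    # group(2) keeps any leading whitespace, so attributes stay separated.
    return '<img src="%s"%s>' % (img_file, match.group(2))

def convert_pics(text):
    """Replace every <pic###> tag in text with an <img src=...> tag."""
    return PIC_TAG.sub(pic_to_img, text)
```

Passing a function rather than a string as the replacement is what lets the pic number be looked up per match. The cost is that all of the whitespace and case handling has to be spelled out in the pattern itself.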

(While we are talking about performance, I'll also mention that
transformString() does not use string concatenation to construct its output.
As the input is processed, the transformed text fragments and intervening
original text are accumulated into a list; at the end, the list is converted
to a string using "".join(). )
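
The accumulate-then-join idea described above can be sketched in a few lines. This is only an illustration of the general technique, using the standard re module to find matches; it is not pyparsing's actual implementation, and the function and variable names are made up for the example.

```python
import re

def transform(text, pattern, convert):
    """Return text with each match of pattern replaced by convert(match)."""
    pieces = []      # fragments collected in order, joined once at the end
    last_end = 0     # end position of the previous match
    for match in pattern.finditer(text):
        pieces.append(text[last_end:match.start()])  # untouched text before the match
        pieces.append(convert(match))                # transformed fragment
        last_end = match.end()
    pieces.append(text[last_end:])                   # untouched text after the last match
    return "".join(pieces)                           # a single join, no repeated concatenation

digits = re.compile(r"\d+")
result = transform("a1b22c", digits, lambda m: "<%s>" % m.group(0))
```

Because the strings are never modified in place, immutability is not an obstacle here: each fragment is sliced or built once, and the final output is assembled in one pass by join().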

-- Paul

===================
from pyparsing import CharsNotIn,Word,Literal,Optional,CaselessLiteral

testdata = """
<HTML>
<BODY>
<pic38>
<pic22 align="left">
< PIC 17 >

< pic99 >
</BODY></HTML>
"""

# Define parse action to convert <pic###> tags to <img src=...> tags
def convertPicNumToImgSrc(src,loc,toks):
    imgFile = imageFiles.get( toks.picnum, "default.jpg" )
    retstring = '<img src="%s"%s>' % (imgFile, toks.picAttribs)
    return retstring

# Define grammar for matching text pattern - don't forget that there might
# be HTML tag attributes included in the <pic###> tag
# Return parse results as:
#    picnum  - the numeric part of the <pic###> tag, converted to an integer
#    picAttribs - optional HTML attributes that might be defined in the
#                 <pic###> tag
#
integer = Word("0123456789").setParseAction( lambda s,l,t: int(t[0]) )
picTagDefn = ( Literal("<") +
               CaselessLiteral("pic") +
               integer.setResultsName("picnum") +
               Optional( CharsNotIn(">") ).setResultsName("picAttribs") +
               ">").setParseAction( convertPicNumToImgSrc )

# Set up lookup table of pic #'s to image file names
# (in reality, these would be read from database table)
imageFiles = {
    22 : "flower.jpg",
    17 : "house.jpg",
    38 : "dog.jpg",
    }

# Run transformString
print(picTagDefn.transformString(testdata))

===================
output:

<HTML>
<BODY>
<img src="dog.jpg">
<img src="flower.jpg" align="left">
<img src="house.jpg" >

<img src="default.jpg" >
</BODY></HTML>