parsing in python
Duncan Booth
me at privacy.net
Wed Jun 9 05:03:09 EDT 2004
Peter Sprenger <sprenger at moving-bytes.de> wrote in
news:ca6ep3$8ni$01$1 at news.t-online.com:
> I hope somebody can help me with my problem. I am writing Zope python
> scripts that will do parsing on text for dynamic webpages: I am getting
> a text from an oracle database that contains different tags that have to
> be converted to a HTML expression. E.g. "<pic#>" ( # is an integer
> number) has to be converted to <img src="..."> where the image data
> comes also from a database table.
> Since strings are immutable, is there an effective way to parse such
> texts in Python? In the process of finding and converting the embedded
> tags I also would like to make a word wrap on the generated HTML output
> to increase the readability of the generated HTML source.
> Can I write an efficient parser in Python or should I extend Python with
> a C routine that will do this task in O(n)?
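You do realise that O(n) says nothing useful about how fast it will run? It
only describes how the running time grows with the size of the input, not
how long any particular run takes.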
Answering your other questions, yes, there are lots of effective ways to
parse text strings in Python. Were I in your position, I wouldn't even
consider C until I had demonstrated that the most obvious and clean
solution wasn't fast enough.
You don't really describe your data in sufficient detail, so I can only
give general suggestions:
You could use a regular expression replace to convert <pic#> tags into the
appropriate image tag.
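As a rough sketch of the regular expression approach (the IMAGE_URLS dictionary and the function names are made up for illustration; in your case the lookup would presumably go to the database), re.sub accepts a function as the replacement, so each tag can be looked up as it is found:

```python
import re

# Hypothetical lookup table standing in for the database query.
IMAGE_URLS = {1: "/images/logo.png", 2: "/images/photo.jpg"}

# Matches "<pic" followed by one or more digits and ">".
PIC_TAG = re.compile(r"<pic(\d+)>")

def replace_pic(match):
    """Return an <img> tag for a known picture number, else the tag unchanged."""
    url = IMAGE_URLS.get(int(match.group(1)))
    if url is None:
        return match.group(0)  # unknown number: leave the original text alone
    return '<img src="%s">' % url

def convert(text):
    """Replace every <pic#> tag in text in a single pass."""
    return PIC_TAG.sub(replace_pic, text)
```

Calling convert("see <pic1> here") would give the text with the tag swapped for the matching img element. Unknown picture numbers are left untouched here, which is only one possible policy.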
You could use sgmllib to parse the data.
You could use one of Python's many XML parsers to parse the data (provided
it is valid XML, which it may not be).
You could use the split method on strings to split the data on '<'. Each
string (other than the first) then begins with a potential tag which you
can match with the startswith method or a regular expression.
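The split idea might look something like this. It is only a sketch: IMAGE_URLS and convert_by_split are invented names, and the dictionary again stands in for the database. The pieces are collected in a list and joined once at the end, so immutable strings are not a problem:

```python
import re

# Hypothetical lookup table standing in for the database query.
IMAGE_URLS = {1: "/images/logo.png"}

# After splitting on '<', a pic tag shows up as "pic<digits>>" at the
# start of a piece.
TAG_START = re.compile(r"pic(\d+)>")

def convert_by_split(text):
    pieces = text.split("<")
    out = [pieces[0]]  # the text before the first '<' cannot start with a tag
    for piece in pieces[1:]:
        m = TAG_START.match(piece)
        url = IMAGE_URLS.get(int(m.group(1))) if m else None
        if url is not None:
            out.append('<img src="%s">' % url)
            out.append(piece[m.end():])  # the text following the tag
        else:
            out.append("<" + piece)  # not a known tag: put the '<' back
    return "".join(out)
```

Anything that is not a recognised pic tag, including ordinary HTML, passes through unchanged.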
You could replace '<' with '%(' and '>' with ')s' then use the % operator
to process all the replacements using a class with a custom __getitem__
method.
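A minimal sketch of the % operator trick follows. TagLookup and convert_by_format are hypothetical names, and the code assumes every '<' and '>' in the text belongs to a complete tag; a stray '<' would make the format string invalid. Literal '%' characters have to be doubled before the tags are rewritten:

```python
class TagLookup:
    """Mapping-like object whose keys are the text found between '<' and '>'."""

    def __init__(self, image_urls):
        # image_urls is a hypothetical dict standing in for the database.
        self.image_urls = image_urls

    def __getitem__(self, key):
        if key.startswith("pic") and key[3:].isdigit():
            url = self.image_urls.get(int(key[3:]))
            if url is not None:
                return '<img src="%s">' % url
        return "<%s>" % key  # anything else is put back as it was

def convert_by_format(text, lookup):
    # Escape literal '%' first, then turn each <name> into %(name)s.
    template = text.replace("%", "%%").replace("<", "%(").replace(">", ")s")
    return template % lookup
```

For example, convert_by_format("<pic1> and <b>bold</b>", TagLookup({1: "/images/logo.png"})) replaces the pic tag and leaves the b tags alone.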
If you want to word wrap and pretty print the HTML, then that is better
done as a separate pass. Just get a general purpose HTML pretty printer
(e.g. mxTidy) and call it. That way you can easily turn it off for
production use if you really are concerned about speed.
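To illustrate keeping the formatting as a separate, switchable pass, here is a small sketch. It uses the standard textwrap module purely as a stand-in, not mxTidy: textwrap knows nothing about HTML, so it may change whitespace inside attribute values or pre blocks, and a real pretty printer would be the better choice. The render function and its pretty flag are invented for the example:

```python
import textwrap

def render(html, pretty=False, width=72):
    """Return html unchanged, or word wrapped when pretty is true."""
    if not pretty:
        return html  # production path: no extra work
    # Stand-in for a real HTML pretty printer. Long words and hyphenated
    # words are kept whole so that URLs are not split.
    return textwrap.fill(html, width=width,
                         break_long_words=False, break_on_hyphens=False)
```

The conversion code never needs to know whether wrapping is enabled, which is what makes it easy to turn off.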