How to make regexes faster? (Python v. OmniMark)

Van Gale cgale1 at cox.net
Sat Apr 20 09:45:43 EDT 2002


"Donn Cave" wrote:
> By the way, I'll second Johannes Stiehler's recommendation of
> MxTextTools.  Definitely appropriate for SGML parsing, and much
> better than regexps for extensive parsing in my opinion - not
> just in terms of speed, but I suspect a more powerful way to
> describe text patterns than regexps.

I agree that mxTextTools are excellent but sometimes, like now, I can't keep
my yapper shut when people talk about parsing SGML with regexps :)  If SGML
could be parsed that way then XML never would have been created.  You cannot
parse SGML without completely parsing and supporting all the features in the
DTD that accompanies an SGML document (a DTD is mandatory).  For example, in
SGML tags don't have to be enclosed in '<' and '>'.  You can redefine those
delimiters (strictly speaking in the SGML declaration, not the DTD itself).  If
you are sending or receiving SGML documents from systems outside your control,
you had better use a big-bucks tool like OmniMark,
otherwise you'll end up with egg on your face when those documents don't
validate.  If all the systems are under your control and running your own
programs, then by all means save some money and use regular expressions
tailored to your documents, just don't call them valid SGML ... call them
XML :)
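
To make the delimiter point concrete, here is a minimal Python sketch.  The
names TAG_RE and tag_names and both sample fragments are made up for
illustration, and the '[' / ']' delimiters are just an assumed redefinition.
It is not a real SGML parser, only a naive regex that hard-codes '<' and '>'
and therefore finds nothing once the delimiters change:

```python
import re

# Naive tag matcher that hard-codes the default '<' and '>' delimiters.
# (Hypothetical pattern for illustration, not a real SGML tokenizer.)
TAG_RE = re.compile(r"</?([A-Za-z][A-Za-z0-9.-]*)[^<>]*>")


def tag_names(text):
    """Return the element names the naive regex finds in text."""
    return [m.group(1) for m in TAG_RE.finditer(text)]


# Hypothetical fragment using the default delimiters.
default_doc = "<memo><to>Donn</to><body>Hello</body></memo>"

# The same fragment, assuming the delimiters were redefined to '[' and ']'.
redefined_doc = "[memo][to]Donn[/to][body]Hello[/body][/memo]"

print(tag_names(default_doc))    # start and end tags are both reported
print(tag_names(redefined_doc))  # the regex sees no tags at all
```

A regex tailored to the second form would work just as well, but only for
documents you already know use those delimiters, which is the whole point.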

Van





