Regular Expressions

Mon Feb 12 13:01:46 EST 2007

On Mon, 12 Feb 2007 07:20:11 -0300, deviantbunnylord at gmail.com
<deviantbunnylord at gmail.com> wrote:

> The source of HTMLParser and xmllib use regular expressions for
> parsing out the data. htmllib calls sgmllib at the beginning of its
> code--sgmllib starts off with a bunch of regular expressions used to
> parse data. So the only real difference there I see is that someone
> saved me the work of writing them ;0). I haven't looked at the source
> for Beautiful Soup, though I have the sneaking suspicion that most
> processing of html/xml is all based on regexes.

You can build a parser for SGML/HTML/XML documents using regexps AND
Python code. You can't do that with regexps only.
For example, suppose you work hard to build a correct regexp for matching
an opening <a> tag, and it extracts this from the document: "<a href='xxx'>".
Is it actually an <a> tag? Maybe. But the text could be inside a comment.
Or in a CDATA section. Or inside JavaScript code. Or...
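
To make that concrete, here is a small sketch. The pattern A_TAG and the
sample document are made up for illustration (they are not taken from
HTMLParser or sgmllib), and the pattern is deliberately naive. It reports
four "tags", but only the last one is really an <a> element:

```python
import re

# Hypothetical, deliberately naive pattern for an opening <a> tag.
A_TAG = re.compile(r"<a\s+[^>]*>")

# Made-up sample: three look-alikes hidden in a comment, a script and a
# CDATA section, followed by one genuine link.
doc = """<!-- <a href='commented'> -->
<script>var s = "<a href='in_script'>";</script>
<![CDATA[ <a href='in_cdata'> ]]>
<a href='real'>link</a>"""

matches = A_TAG.findall(doc)
for text in matches:
    print(text)
```

The regexp has no notion of where in the document it is, so it cannot tell
the three false positives from the real tag.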
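A regexp is good for recognizing tokens, and that can be used to build a
parser. But regular expressions alone can't parse this kind of document,
simply because its grammar is not regular: elements nest to arbitrary
depth, and a regular expression cannot keep count of matching open/close pairs.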
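
A rough sketch of that division of labour, under simplifying assumptions
(XML-like input, case-sensitive tag names, no comments or CDATA): the
regexp TAG only recognizes tag tokens, and ordinary Python code keeps the
stack that tracks nesting. The names TAG and is_well_nested are invented
for this example.

```python
import re

# Token recognizer (assumed, simplified): optional "/", a tag name,
# anything up to ">", and an optional "/" just before the ">".
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*?(/?)>")

def is_well_nested(doc):
    """Return True if every opening tag is closed in the right order."""
    stack = []  # names of the elements that are currently open
    for m in TAG.finditer(doc):
        closing, name, self_closing = m.groups()
        if self_closing:
            continue  # <br/> opens and closes at once
        if closing:
            # A closing tag must match the most recently opened element.
            if not stack or stack.pop() != name:
                return False
        else:
            stack.append(name)
    return not stack  # anything still open means an unclosed element

print(is_well_nested("<div><p>hi</p></div>"))
print(is_well_nested("<div><p>hi</div></p>"))
```

The stack is the part a regexp cannot provide; the regexp is only the
tokenizer.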
(Python's re engine is stronger than "mathematical" regular expressions, in
the sense that it can handle things like backreferences (?P=...) and
lookahead (?=...), but even so it can't handle HTML.)
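
For reference, a short illustration of those two extensions. The patterns
ATTR and NAME_BEFORE_EQ and the sample strings are made up for this
example; they are handy for picking tokens apart, but they do not add the
ability to follow nesting.

```python
import re

# Backreference: (?P=quote) must match the same quote character that
# opened the attribute value, so the apostrophe inside is not a terminator.
ATTR = re.compile(r"href=(?P<quote>['\"])(?P<url>.*?)(?P=quote)")
m = ATTR.search("""<a href="it's.html">""")
url = m.group("url")
print(url)

# Lookahead: (?==) requires an "=" right after the word without consuming it.
NAME_BEFORE_EQ = re.compile(r"\w+(?==)")
names = NAME_BEFORE_EQ.findall("<img src='a.png' alt='pic'>")
print(names)
```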

-- 
Gabriel Genellina