Code that ought to run fast, but can't due to Python limitations.

Tue Jul 7 02:31:26 EDT 2009

Steven D'Aprano wrote:
> On Sun, 05 Jul 2009 10:12:54 +0200, Hendrik van Rooyen wrote:
> 
>> Python is not C.
> 
> John Nagle is an old hand at Python. He's perfectly aware of this, and 
> I'm sure he's not trying to program C in Python.
> 
> I'm not entirely sure *what* he is doing, and hopefully he'll speak up 
> and say, but whatever the problem is it's not going to be as simple as 
> that.

     I didn't write this code; I'm just using it.  As I said in the
original posting, it's from "http://code.google.com/p/html5lib".
It's from an effort to write a clean HTML 5 parser in Python for
general-purpose use.  HTML 5 parsing is well-defined even for the awful
cases on which older browsers disagree, but it is quite complicated.
The Python implementation here is intended partly as a reference
implementation, so browser writers have something to compare with.

     I have a small web crawler robust enough to parse
real-world HTML, which can be appallingly bad.  I currently use
an extra-robust version of BeautifulSoup, and even that sometimes
blows up.  So I'm very interested in a new Python parser that supposedly
handles bad HTML the same way browsers do.  But if it's slower
than BeautifulSoup, there's a problem.

					John Nagle