Code that ought to run fast, but can't due to Python limitations.

John Nagle nagle at animats.com
Sat Jul 4 13:33:15 EDT 2009


    As an example of code that really needs to run fast, but is
speed-limited by Python's limitations, see "tokenizer.py" in

	http://code.google.com/p/html5lib/

This is a parser for HTML 5, a piece of code that will be needed
in many places and will process large amounts of data. It's written
entirely in Python.  Take a look at how much work has to be performed
per character.

This is a good test for Python implementation bottlenecks.  Run
that tokenizer on HTML, and see where the time goes.

("It should be written in C" is not an acceptable answer.)

Python doesn't have a "switch" or "case" statement, and when
you need a state machine with many states, that makes for painful,
slow code.  There's a comment in the code that it would be useful
to run a few billion lines of HTML through an instrumented version
of the parser to decide in which order the IF statements should be
executed.  You shouldn't have to do that.

Yes, I've read PEP 3103.  The big problem is the difficulty of figuring
out what's a constant and what might change.  If all the cases are constants,
case statements are easy.  But in Python, the compiler can't tell.

Parsers have many named compile-time constants.  Python doesn't support
named compile-time constants, and this is one of the places where we
have to pay the bill for that limitation.

Something to think about when you need three more racks of servers
because the HTML parser is slow.

				John Nagle



More information about the Python-list mailing list