Code that ought to run fast, but can't due to Python limitations.

Sun Jul 5 10:25:31 EDT 2009

2009/7/5 Hendrik van Rooyen <mail at microcorp.co.za>:
> I cannot see how you could avoid a python function call - even if he
> bites the bullet and implements my laborious scheme, he would still
> have to fetch the next character to test against, inside the current state.
>
> So if it is the function calls that is slowing him down, I cannot
> imagine a solution using less than one per character, in which
> case he is screwed no matter what he does.

A simple solution may be to read the whole input HTML file into a
string. This potentially requires lots of memory, but I suspect that
by far the most common use case for this parser is to build a DOM (or
DOM-like) tree of the whole document. This tree usually requires much
more memory than the HTML source itself.

So, if the code duplication is acceptable, I suggest keeping this
implementation for cases where the input is extremely big *AND* the
whole program, not just the parser itself, will work on it in a
streaming fashion.

Then write a simpler and faster parser for the more common case when
the data is not huge *OR* the user will keep the whole document in
memory anyway (e.g. in a tree).
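
To make the idea concrete, here is a rough sketch of what working on an
in-memory string buys you: the search for the next interesting character
is done by str.find in C, so the Python-level work is per tag rather than
per character. The name split_tags is made up for illustration, and this
is nowhere near a real HTML tokenizer (it ignores comments, scripts,
CDATA and '>' inside attribute values).

```python
def split_tags(html):
    """Yield ('text', s) and ('tag', s) pieces from an in-memory string.

    Hypothetical helper for illustration only, not the original parser.
    """
    pos = 0
    end = len(html)
    while pos < end:
        # str.find scans in C, so there is no Python call per character.
        lt = html.find("<", pos)
        if lt == -1:
            # No more tags: everything left is text.
            yield ("text", html[pos:])
            return
        if lt > pos:
            yield ("text", html[pos:lt])
        gt = html.find(">", lt + 1)
        if gt == -1:
            # Unterminated tag: treat the remainder as text.
            yield ("text", html[lt:])
            return
        yield ("tag", html[lt:gt + 1])
        pos = gt + 1


if __name__ == "__main__":
    for kind, piece in split_tags("<p>Hi <b>x</b></p>"):
        print(kind, repr(piece))
```

Whether this is actually faster for the real parser is something only
measurement can tell.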

Also: profile, profile a lot. HTML pages are very strange beasts and
the bottlenecks may be in innocent-looking places!
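
For the profiling part, a minimal sketch using the standard cProfile and
pstats modules could look like the following. The helper name
profile_call and the choice of the ten most expensive entries sorted by
cumulative time are my own assumptions; pass it whatever parse function
and test document you actually have.

```python
import cProfile
import io
import pstats


def profile_call(func, *args):
    """Run func(*args) under cProfile; return (result, report text).

    Hypothetical helper for illustration only.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Show the ten entries with the highest cumulative time.
    stats.sort_stats("cumulative").print_stats(10)
    return result, out.getvalue()


if __name__ == "__main__":
    # Stand-in workload; replace with the real parser and a real page.
    result, report = profile_call(sorted, list(range(100000, 0, -1)))
    print(report)
```

Run it on several real-world pages, since the hot spots can differ a
lot from one document to the next.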

-- 
Lino Mastrodomenico