Code that ought to run fast, but can't due to Python limitations.

"Martin v. Löwis" martin at v.loewis.de
Sun Jul 5 15:23:50 EDT 2009


> This is a good test for Python implementation bottlenecks.  Run
> that tokenizer on HTML, and see where the time goes.

I looked at it with cProfile, and the top function that comes up
for a larger document (52k) is
...validator.HTMLConformanceChecker.__iter__.
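
For anyone who wants to repeat this kind of measurement, here is a minimal sketch of a cProfile run in Python 3. The tokenize function is a hypothetical stand-in for the real tokenizer and validator, and profile_call is a helper name invented for this example; only the cProfile and pstats calls are standard library.

```python
import cProfile
import io
import pstats


def tokenize(text):
    # Hypothetical stand-in for the real tokenizer/validator run.
    return [word.capitalize() for word in text.split()]


def profile_call(func, *args):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time and keep the ten most expensive entries.
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


result, report = profile_call(tokenize, "<p>hello</p> " * 1000)
print(report)
```

Sorting by cumulative time puts dispatching functions such as an __iter__ method near the top, since they include the time spent in everything they call.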

This method dispatches to the various validation routines, and it
computes the method names from the input over and over again, doing
lots of redundant string concatenations. It also capitalizes the
element names, even though the spelling in the original document is
probably not capitalized (but either all upper case or all lower case).
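
The costly pattern looks roughly like the sketch below. This is an illustration only, not the library's actual code: the class name, the token layout (a dict with "type" and "name" keys) and the validate... method names are assumptions made for the example.

```python
class NaiveChecker:
    """Hypothetical checker that rebuilds method names for every token."""

    def validateStartTag(self, token):
        return "generic start tag"

    def validateStartTagTable(self, token):
        return "table start tag"

    def check(self, token):
        type_ = token["type"]
        # Capitalized again for every token, whatever the input spelling.
        name = token["name"].capitalize()
        # Two string concatenations and up to two getattr calls per token.
        method = getattr(self, "validate" + type_ + name, None)
        if method is None:
            # No type/name routine exists: fall back to the type-only one.
            method = getattr(self, "validate" + type_)
        return method(token)


checker = NaiveChecker()
print(checker.check({"type": "StartTag", "name": "table"}))
print(checker.check({"type": "StartTag", "name": "p"}))
```

Every token pays for the capitalization, the concatenations and the attribute lookups, even when the same element has been seen thousands of times before.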

In my patch below, I create a dictionary of bound methods, indexed
by (syntax) type and name, following the logic of falling back to
just type-based validation if no type/name routine exists. However,
in order to reduce the number of dictionary lookups, it will also
cache type/name pairs (both in the original spelling, and the
capitalized spelling), so that subsequent occurrences of the same
element will hit the method cache.
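
A minimal sketch of the caching idea described above follows. It is not the attached patch: the class, the token layout (a dict with "type" and "name" keys), the validate... method names and the slow_lookups counter are all assumptions made for illustration.

```python
class CachedChecker:
    """Hypothetical checker that caches bound methods per (type, name)."""

    def __init__(self):
        self._cache = {}        # (type, name) -> bound method
        self.slow_lookups = 0   # illustration only: counts cache misses

    def validateStartTag(self, token):
        return "generic start tag"

    def validateStartTagTable(self, token):
        return "table start tag"

    def _resolve(self, type_, name):
        self.slow_lookups += 1
        capitalized = name.capitalize()
        # Another spelling of this element may already be resolved.
        method = self._cache.get((type_, capitalized))
        if method is None:
            method = getattr(self, "validate" + type_ + capitalized, None)
        if method is None:
            # No type/name routine exists: fall back to the type-only one.
            method = getattr(self, "validate" + type_)
        # Cache both the capitalized and the original spelling.
        self._cache[type_, capitalized] = method
        self._cache[type_, name] = method
        return method

    def check(self, token):
        key = (token["type"], token["name"])
        try:
            method = self._cache[key]
        except KeyError:
            method = self._resolve(*key)
        return method(token)


checker = CachedChecker()
for name in ("table", "TABLE", "p", "table"):
    print(name, checker.check({"type": "StartTag", "name": name}))
```

After the first occurrence of a given spelling, each token costs one dictionary lookup and no string work. The cache holds bound methods of the instance itself, which creates a reference cycle; that is acceptable in a sketch but worth knowing about.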

With this simple optimization, I get a 20% speedup on my test
case. My document contains no attributes; the same changes should
be made to the attribute validation routines.

I don't think this has anything to do with the case statement.

Regards,
Martin

-------------- next part --------------
A non-text attachment was scrubbed...
Name: methodlookup.diff
Type: text/x-patch
Size: 2837 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/python-list/attachments/20090705/1dcb8735/attachment-0001.bin>

