Code that ought to run fast, but can't due to Python limitations.

Sat Jul 4 23:37:16 EDT 2009

In article <4a501a5e$0$1640$742ec2ed at news.sonic.net>,
John Nagle  <nagle at animats.com> wrote:
>
>    Here's some actual code, from "tokenizer.py".  This is called once
>for each character in an HTML document, when in "data" state (outside
>a tag).  It's straightforward code, but look at all those
>dictionary lookups.
>
>     def dataState(self):
>         data = self.stream.char()
>
>         # Keep a charbuffer to handle the escapeFlag
>         if self.contentModelFlag in\
>           (contentModelFlags["CDATA"], contentModelFlags["RCDATA"]):
>             if len(self.lastFourChars) == 4:
>                 self.lastFourChars.pop(0)
>             self.lastFourChars.append(data)
>
>         # The rest of the logic
>         if data == "&" and self.contentModelFlag in\
>           (contentModelFlags["PCDATA"], contentModelFlags["RCDATA"]) and not\
>           self.escapeFlag:
>             self.state = self.states["entityData"]
>         elif data == "-" and self.contentModelFlag in\
>           (contentModelFlags["CDATA"], contentModelFlags["RCDATA"]) and not\
>           self.escapeFlag and "".join(self.lastFourChars) == "<!--":
>             self.escapeFlag = True
>             self.tokenQueue.append({"type": "Characters", "data":data})
>         elif (data == "<" and (self.contentModelFlag == contentModelFlags["PCDATA"]
>                                or (self.contentModelFlag in
>                                    (contentModelFlags["CDATA"],
>                                     contentModelFlags["RCDATA"]) and
>                                    self.escapeFlag == False))):
>             self.state = self.states["tagOpen"]
>         elif data == ">" and self.contentModelFlag in\
>           (contentModelFlags["CDATA"], contentModelFlags["RCDATA"]) and\
>           self.escapeFlag and "".join(self.lastFourChars)[1:] == "-->":
>             self.escapeFlag = False
>             self.tokenQueue.append({"type": "Characters", "data":data})
>         elif data == EOF:
>             # Tokenization ends.
>             return False
>         elif data in spaceCharacters:
>             # Directly after emitting a token you switch back to the "data
>             # state". At that point spaceCharacters are important so they are
>             # emitted separately.
>             self.tokenQueue.append({"type": "SpaceCharacters", "data":
>               data + self.stream.charsUntil(spaceCharacters, True)})
>             # No need to update lastFourChars here, since the first space will
>             # have already broken any <!-- or --> sequences
>         else:
>             chars = self.stream.charsUntil(("&", "<", ">", "-"))
>             self.tokenQueue.append({"type": "Characters", "data":
>               data + chars})
>             self.lastFourChars += chars[-4:]
>             self.lastFourChars = self.lastFourChars[-4:]
>         return True

Every single "self." is a dictionary lookup.  Were you referring to
those?  If not, I don't see your point.  If yes, well, that's kind of the
whole point of using Python.  You do pay a performance penalty.  You can
optimize out some lookups by binding the results to local names once, but
you need to switch to C for some kinds of computationally intensive
algorithms.  In this case, you can probably get
a large boost out of Psyco or Cython or Pyrex.
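
To make "optimize out some lookups" concrete, here is a minimal sketch of
the usual trick: do the attribute and dictionary lookups once and keep the
results in local variables.  The names below (CONTENT_MODEL_FLAGS,
SlowCounter, FastCounter) are made up for illustration and are not taken
from the real tokenizer.py; the sketch also assumes the flag does not change
while the loop runs, which is not true of the tokenizer as posted.

```python
# Hypothetical stand-in for the tokenizer's module-level flag table.
CONTENT_MODEL_FLAGS = {"PCDATA": 0, "RCDATA": 1, "CDATA": 2}


class SlowCounter(object):
    """Counts '&' characters, repeating every lookup on each character."""

    def __init__(self, text):
        self.text = text
        self.contentModelFlag = CONTENT_MODEL_FLAGS["RCDATA"]

    def run(self):
        count = 0
        for ch in self.text:
            # Each pass redoes one attribute lookup on self, two global
            # lookups of CONTENT_MODEL_FLAGS, and two dict subscriptions.
            if ch == "&" and self.contentModelFlag in (
                    CONTENT_MODEL_FLAGS["PCDATA"],
                    CONTENT_MODEL_FLAGS["RCDATA"]):
                count += 1
        return count


class FastCounter(SlowCounter):
    """Same result, but the lookups are hoisted out of the loop."""

    def run(self):
        # Assumption: contentModelFlag stays fixed for the whole loop,
        # so the membership test can be evaluated once up front.
        flag_matches = self.contentModelFlag in (
            CONTENT_MODEL_FLAGS["PCDATA"],
            CONTENT_MODEL_FLAGS["RCDATA"])
        if not flag_matches:
            return 0
        text = self.text  # bind the attribute to a local name once
        count = 0
        for ch in text:
            if ch == "&":
                count += 1
        return count
```

Both classes return the same count for the same input; how much the second
one gains depends on the interpreter and the data, so time it with the
timeit module before relying on it.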
-- 
Aahz (aahz at pythoncraft.com)           <*>         http://www.pythoncraft.com/

"as long as we like the same operating system, things are cool." --piranha