Code that ought to run fast, but can't due to Python limitations.

Sat Jul 4 23:15:11 EDT 2009

Paul Rubin wrote:
> John Nagle <nagle at animats.com> writes:
>>     A dictionary lookup (actually, several of them) for every
>> input character is rather expensive. Tokenizers usually index into
>> a table of character classes, then use the character class index in
>> a switch statement.
> 
> Maybe you could use a regexp (and then have -two- problems...) to
> find the token boundaries, then a dict to identify the actual token.
> Tables of character classes seem a bit less attractive in the Unicode
> era than in the old days.

    I want to see a regular expression that expresses the HTML 5 token
parsing rules, including all the explicitly specified error handling.

    Here's some actual code, from "tokenizer.py".  This is called once
for each character in an HTML document, when in "data" state (outside
a tag).  It's straightforward code, but look at all those
dictionary lookups.

     def dataState(self):
         data = self.stream.char()

         # Keep a charbuffer to handle the escapeFlag
         if self.contentModelFlag in\
           (contentModelFlags["CDATA"], contentModelFlags["RCDATA"]):
             if len(self.lastFourChars) == 4:
                 self.lastFourChars.pop(0)
             self.lastFourChars.append(data)

         # The rest of the logic
         if data == "&" and self.contentModelFlag in\
           (contentModelFlags["PCDATA"], contentModelFlags["RCDATA"]) and not\
           self.escapeFlag:
             self.state = self.states["entityData"]
         elif data == "-" and self.contentModelFlag in\
           (contentModelFlags["CDATA"], contentModelFlags["RCDATA"]) and not\
           self.escapeFlag and "".join(self.lastFourChars) == "<!--":
             self.escapeFlag = True
             self.tokenQueue.append({"type": "Characters", "data":data})
         elif (data == "<" and (self.contentModelFlag == contentModelFlags["PCDATA"]
                                or (self.contentModelFlag in
                                    (contentModelFlags["CDATA"],
                                     contentModelFlags["RCDATA"]) and
                                    self.escapeFlag == False))):
             self.state = self.states["tagOpen"]
         elif data == ">" and self.contentModelFlag in\
           (contentModelFlags["CDATA"], contentModelFlags["RCDATA"]) and\
           self.escapeFlag and "".join(self.lastFourChars)[1:] == "-->":
             self.escapeFlag = False
             self.tokenQueue.append({"type": "Characters", "data":data})
         elif data == EOF:
             # Tokenization ends.
             return False
         elif data in spaceCharacters:
             # Directly after emitting a token you switch back to the "data
             # state". At that point spaceCharacters are important so they are
             # emitted separately.
             self.tokenQueue.append({"type": "SpaceCharacters", "data":
               data + self.stream.charsUntil(spaceCharacters, True)})
             # No need to update lastFourChars here, since the first space will
             # have already broken any <!-- or --> sequences
         else:
             chars = self.stream.charsUntil(("&", "<", ">", "-"))
             self.tokenQueue.append({"type": "Characters", "data":
               data + chars})
             self.lastFourChars += chars[-4:]
             self.lastFourChars = self.lastFourChars[-4:]
         return True


					John Nagle