Frankenstring

Mon Jul 18 15:23:20 EDT 2005

Peter Otten wrote:

> I hope you'll let us know how much faster your
> final approach turns out to be

OK, here's a short report on the current state. Such code as there is can
be found at <http://svn.thomas-lotze.de/PyASDF/pyasdf/_frankenstring.c>,
with a Python mock-up in the same directory.

Thinking about it (Andreas, thank you for the reminder :o)), doing
character-by-character scanning in Python is stupid, both in terms of
speed and, given some more search capabilities than str currently has,
elegance.

So what I have done so far (apart from working my way into writing
extensions in C) is to give the evolving FrankenString some search
methods. They find the first occurrence in the string of any character
out of a set of characters given as a string, or of any character not in
such a set. This has nothing to do yet with iterators and seeking/telling.

Just letting C do the "while data[index] not in whitespace: index += 1"
part speeds up my PDF tokenizer by a factor between 3 and 4. I have
never compared that directly to using regular expressions, though... As
a bonus, even with this minor addition the Python code looks a little
cleaner already:

        c = data[cursor]

        while c in whitespace:
            # Whitespace tokens.
            cursor += 1

            if c == '%':
                # We're just inside a comment, read beyond EOL.
                while data[cursor] not in "\r\n":
                    cursor += 1
                cursor += 1

            c = data[cursor]

becomes

        cursor = data.skipany(whitespace, start)
        c = data[cursor]

        while c == '%':
            # Whitespace tokens: comments till EOL and whitespace.
            cursor = data.skipother("\r\n", cursor)
            cursor = data.skipany(whitespace, cursor)
            c = data[cursor]

(removing '%' from the whitespace string, in case you wonder).
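
For readers without access to the repository, here is a minimal
pure-Python sketch of what the two methods appear to do, inferred only
from the usage above. It is not the code from _frankenstring.c: the
free-function form, the behaviour at the end of the data (returning
len(data) instead of raising), the helper skip_whitespace_and_comments
and the default whitespace set are all assumptions made for illustration.

```python
# Sketch only: semantics inferred from the usage in the post,
# not taken from the actual C extension.

def skipany(data, chars, start=0):
    """Skip a run of characters that are in chars.

    Returns the index of the first character at or after start that is
    NOT in chars, or len(data) if there is none (assumed behaviour).
    """
    index = start
    while index < len(data) and data[index] in chars:
        index += 1
    return index


def skipother(data, chars, start=0):
    """Skip a run of characters that are not in chars.

    Returns the index of the first character at or after start that IS
    in chars, or len(data) if there is none (assumed behaviour).
    """
    index = start
    while index < len(data) and data[index] not in chars:
        index += 1
    return index


def skip_whitespace_and_comments(data, start=0, whitespace=" \t\r\n\f\0"):
    """Hypothetical helper mirroring the rewritten loop from the post."""
    cursor = skipany(data, whitespace, start)
    while cursor < len(data) and data[cursor] == "%":
        # A comment runs up to the end of the line; the EOL characters
        # themselves count as whitespace and are skipped afterwards.
        cursor = skipother(data, "\r\n", cursor)
        cursor = skipany(data, whitespace, cursor)
    return cursor
```

Called on "  % comment\r\n  /Name", the helper returns the index of the
"/" that starts the next token.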

The next thing to do is make FrankenString behave. Right now there's too
much copying of string content going on every time a FrankenString is
initialized; I'd like it to share string content with other
FrankenStrings or strs much like cStringIO does. I hope it's just a
matter of learning from cStringIO. To justify the "franken" part of the
name some more, I am considering mixing in yet another ingredient and
making the thing behave like a buffer, in that it should be possible to
make a FrankenString from only part of a string without copying data.
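
To illustrate the "part of a string without copying" idea: in
present-day Python the built-in memoryview gives this kind of behaviour
for bytes objects. The snippet below is only an analogy for the goal
described above, not how FrankenString is or will be implemented, and
the sample data is made up.

```python
# Analogy only: memoryview slices share the underlying bytes object
# instead of copying it.
data = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"

view = memoryview(data)
part = view[9:16]        # a window onto data; no bytes are copied here

# The slice still refers to the original object ...
shares_storage = part.obj is data

# ... and further slicing keeps sharing it.
tail = part[4:]

# Copies happen only when explicitly requested.
part_bytes = bytes(part)
tail_bytes = bytes(tail)
```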

After that, the thing about seeking and telling iterators over
characters or search results comes in. I don't think it will make much
difference in performance now that the stupid character searching has
been done in C, but it'll hopefully make for more elegant Python code.

-- 
Thomas