Data structure recommendation?

bearophileHUGS at lycos.com bearophileHUGS at lycos.com
Tue Apr 8 12:54:56 EDT 2008


Jochen Schulz:
> This solution may be more than you actually need, but I implemented two
> metric space indexes which would do what you want (and I wanted to plug
> it anyway :)):

Please plug such good things. It seems the Python community is an
endless source of interesting modules I didn't know about. Your
(single) module looks very nice. I'll take a better look later.

Few notes:
Use xrange instead of range, for Psyco they are often the same, but if
you don't have Psyco the range may be slower and usually requires more
memory.
Java is faster than Python, but if you have some care and you know how
things are implemented in Python, you can often write ""fast"" Python
code.

In the levenshtein function you have this:
    for i in range(1, m+1):
        prev, cur = cur, [i] + [0]*n
You can speed up that levenshtein, using nearly constant (heap)
memory, avoiding that re-creation of prev and cur (see below for D
code you can back-translate to Python).

You can translate this module to D (I may even do it myself, or I may
help you do it), this has several advantages:
- If you are careful it may become (quite) faster than your Java
version.
- The resulting code may be probably as long as the Python one, or not
too much longer (but it may have or not have some extra limitations. D
is flexible but it's a statically typed language, unlike Python);
- Coding in D may be similar enough to Java coding for you, quite
differently from coding it in C;
- You can then use Pyd (http://pyd.dsource.org ) to create in a very
simple way a single compiled module with the functions/classes that
can be called by python (no intermediate python needed). Using Pyd is
quite simpler than using C + Swig or writing a C module for Python.
Later you can release the D code plus the already compiled module for
Win too.
- Disadvantages: you don't know D, you don't know how to use Pyd, and
most people out there may prefer to maintain/modify C code, even if
it's 3/4 times longer than the D code.
- The following is to show you an example of D code (that uses my D
libs for few bits, like that range(), the min(), but it's not too much
difficult to avoid using them) that you may compare to the Python one.
Note that this code is generic (it's a function template), and it
takes as input any array, that means Unicode Strings too (utf8, 16 bit
too, 32 bit too), it runs about 105-125 times faster than a quite
similar Python version (Psyco is able to speed up this Python code
about 20-25 times, so this is 4-5 times faster than Psyco):

int editDistance(T)(T[] s1, T[] s2) {
    if (len(s1) > len(s2)) { auto sa = s1; s1 = s2; s2 = sa; }
    auto r1 = range(len(s2) + 1);
    auto r2 = new int[len(r1)];
    foreach (i, c1; s1) {
        r2[0] = i + 1;
        foreach (j, c2; s2)
            r2[j+1] = c1 == c2 ? r1[j] : min(r2[j], r1[j], r1[j+1]) +
1;
        auto ra = r1; r1 = r2; r2 = ra; // swap lines
    }
    return r1[$ - 1];
}

If you have questions feel free to ask :-)

Bye,
bearophile



More information about the Python-list mailing list