Fastest database solution

Fri Feb 6 15:09:41 EST 2009

On Fri, Feb 6, 2009 at 5:19 AM, M.-A. Lemburg <mal at egenix.com> wrote:
> On 2009-02-06 09:10, Curt Hash wrote:
>> I'm writing a small application for detecting source code plagiarism that
>> currently relies on a database to store lines of code.
>>
>> The application has two primary functions: adding a new file to the database
>> and comparing a file to those that are already stored in the database.
>>
>> I started out using sqlite3, but was not satisfied with the performance
>> results. I then tried using psycopg2 with a local postgresql server, and the
>> performance got even worse. My simple benchmarks show that sqlite3 is on
>> average 3.5 times faster at inserting a file, and on average less than a
>> tenth of a second slower than psycopg2 at matching a file.
>>
>> I expected postgresql to be a lot faster ... is there some peculiarity in
>> psycopg2 that could be causing slowdown? Are these performance results
>> typical? Any suggestions on what to try from here? I don't think my
>> code/queries are inherently slow, but I'm not a DBA or a very accomplished
>> Python developer, so I could be wrong.
>>
>> Any advice is appreciated.
>
> In general, if you do bulk inserts into a large table, you should consider
> turning off indexing on the table and recreating/updating the indexes in one
> go afterwards.
>
> But regardless of this detail, I think you should consider a
> filesystem-based approach. This is going to be a lot faster than using a
> database to store the source code line by line. You can still use
> a database for the administration and indexing of the data, e.g.
> by storing a hash of each line in the database.
>
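
The bulk-insert advice quoted above could look roughly like the following with sqlite3. This is only a sketch under assumptions: the table `line_hashes`, the index `idx_line_hash`, and the helpers `line_hash` and `bulk_add` are hypothetical names, and MD5 over the whitespace-stripped line is an arbitrary choice of hash, not something taken from the original application.

```python
import hashlib
import sqlite3


def line_hash(line):
    # Hypothetical helper: hash of the whitespace-stripped line (MD5 is an assumption).
    return hashlib.md5(line.strip().encode("utf-8")).hexdigest()


def bulk_add(conn, file_id, lines):
    """Insert all lines of one file, then rebuild the hash index in one go."""
    rows = [(file_id, n, line_hash(text)) for n, text in enumerate(lines, 1)]
    with conn:  # commits once on exit, so the inserts share a single transaction
        # Drop the index so it is not updated row by row during the insert.
        conn.execute("DROP INDEX IF EXISTS idx_line_hash")
        conn.executemany(
            "INSERT INTO line_hashes (file_id, line_no, hash) VALUES (?, ?, ?)",
            rows,
        )
        # Recreate the index once, after all rows are in place.
        conn.execute("CREATE INDEX idx_line_hash ON line_hashes (hash)")
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE line_hashes (file_id INTEGER, line_no INTEGER, hash TEXT)"
)
inserted = bulk_add(conn, 1, ["import os", "print(os.getcwd())"])
```

Whether dropping and recreating the index pays off depends on how large the table already is compared with the batch being added, so it is worth benchmarking both ways.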

I can see how reconstructing source code from individual lines in the
database would be much slower than a filesystem-based approach.
However, what is of particular importance is that the matching itself
be fast. While the original lines of code are stored in the database,
I am performing matching based on only hashes. Would storing the
original code in the same table as the hash cause significant slowdown
if I am querying by hash only?
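
One way to explore this question is to index the hash column and ask SQLite which access path it picks. The sketch below assumes a hypothetical table `lines` that holds both the hash and the original source text, plus an index `idx_lines_hash` that also carries `file_id` and `line_no`, so that a hash-only lookup can be answered from the index alone. The names and the MD5 choice are assumptions for illustration.

```python
import hashlib
import sqlite3


def line_hash(line):
    # Hypothetical helper: hash of the whitespace-stripped line (MD5 is an assumption).
    return hashlib.md5(line.strip().encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lines (file_id INTEGER, line_no INTEGER, hash TEXT, source TEXT)"
)
# The index contains every column the matching query needs, so the wide
# `source` column does not have to be read while matching.
conn.execute("CREATE INDEX idx_lines_hash ON lines (hash, file_id, line_no)")

stored = ["def main():", "    return 0"]
with conn:
    conn.executemany(
        "INSERT INTO lines VALUES (?, ?, ?, ?)",
        [(1, n, line_hash(s), s) for n, s in enumerate(stored, 1)],
    )

query = "SELECT file_id, line_no FROM lines WHERE hash = ?"
matches = conn.execute(query, (line_hash("return 0"),)).fetchall()

# The last column of each EXPLAIN QUERY PLAN row describes the access path.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query, ("x",))]
print(matches, plan)
```

If the plan mentions the index, the lookup does not scan the source text, although the extra column still makes the table itself larger on disk.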

I think I may try this approach anyway, just to make retrieving the
original source code after finding a match faster, but I am still
primarily concerned with the speed of the hash lookups.
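
A minimal sketch of that hybrid layout, assuming the source files are kept on disk and the database holds only line hashes and the path to each file. The table `line_index`, the helpers `store_file` and `fetch_line`, the `.src` suffix, and both hash choices are hypothetical and only serve the illustration.

```python
import hashlib
import os
import sqlite3
import tempfile


def store_file(conn, store_dir, source_text):
    """Write the file to disk and index its line hashes in the database."""
    # Name the stored copy after a digest of its content (an assumption).
    digest = hashlib.sha1(source_text.encode("utf-8")).hexdigest()
    path = os.path.join(store_dir, digest + ".src")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(source_text)
    rows = [
        (path, n, hashlib.md5(line.strip().encode("utf-8")).hexdigest())
        for n, line in enumerate(source_text.splitlines(), 1)
    ]
    with conn:
        conn.executemany("INSERT INTO line_index VALUES (?, ?, ?)", rows)
    return path


def fetch_line(path, line_no):
    # Read the original text back from disk only after a match is found.
    with open(path, encoding="utf-8") as fh:
        return fh.read().splitlines()[line_no - 1]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE line_index (path TEXT, line_no INTEGER, hash TEXT)")
conn.execute("CREATE INDEX idx_line_index_hash ON line_index (hash)")

store_dir = tempfile.mkdtemp()
store_file(conn, store_dir, "x = 1\ny = x + 1\n")

probe = hashlib.md5(b"y = x + 1").hexdigest()
path, line_no = conn.execute(
    "SELECT path, line_no FROM line_index WHERE hash = ?", (probe,)
).fetchone()
recovered = fetch_line(path, line_no)
```

Here the hash lookup touches only the small index table, and the filesystem is consulted just once per match to recover the original line.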

> --
> Marc-Andre Lemburg
> eGenix.com