Most efficient way to build very large dictionaries

python at bdurham.com python at bdurham.com
Wed Dec 24 04:21:30 EST 2008


Hi Roger,

> The performance improvements you are seeing with Python over Oracle are exactly the same range people see with SQLite over Oracle. One common usage reported on the SQLite mailing list is people copying data out of Oracle and running their analysis in SQLite because of the performance advantages.

I wasn't aware that SQLite was so fast compared to Oracle. That's great
news. I will definitely take a look at SQLite for my current data
analysis project.

<background googling on Python and SQLite> ... hey, you're the author of
APSW! :)

For those following this thread, see APSW:
http://code.google.com/p/apsw/

> The pragmas tune things like cache sizes. The SQLite default is 2MB, relying on the operating system for caching beyond that. Bumping up that kind of size was my suggestion :-)

I missed that nuance - a side effect of emailing at 4am :)

Thanks again for your help Roger!

Regards,
Malcolm


----- Original message -----
From: "Roger Binns" <rogerb at rogerbinns.com>
To: python-list at python.org
Date: Wed, 24 Dec 2008 00:50:49 -0800
Subject: Re: Most efficient way to build very large dictionaries

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

python at bdurham.com wrote:
> Thank you for your suggestion about looking at SQLite. I haven't
> compared the performance of SQLite to Python dictionaries, but I'm
> skeptical that SQLite would be faster than in-memory Python dictionaries
> for the type of analysis I'm doing.

I'd recommend at least trying a test just to see.  As an example SQLite
uses indices which will be faster than Python dicts for some set
operations.  (And if you aren't careful, your various Python based
optimizations will end up duplicating what SQLite does internally anyway
:-)

> Prior to my use of Python, my
> customer used a very expensive Oracle system to analyze their log files.
> My simple Python scripts are 4-20x faster than the Oracle PL/SQL they
> are replacing - and run on much cheaper hardware.

SQLite is not like Oracle or any similar database system.  It does not
operate over the network or similar connection. It is a library in your
process that has an optimized disk storage format (single file) and a
SQL parser that generates bytecode for a special purpose virtual machine
in pretty much the same way CPython operates.  The performance
improvements you are seeing with Python over Oracle are exactly the same
range people see with SQLite over Oracle.  One common usage reported on
the SQLite mailing list is people copying data out of Oracle and running
their analysis in SQLite because of the performance advantages.

> Note: Memory is currently not a concern for me so I don't need SQLite's
> ability to work with data sets larger than my physical memory.

The pragmas tune things like cache sizes.  The SQLite default is 2MB,
relying on the operating system for caching beyond that.  Bumping up
that kind of size was my suggestion :-)

Roger

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEARECAAYFAklR9+UACgkQmOOfHg372QSMbwCdGS5S2/96fWW8knjfBVqReAfV
AEwAn2Yc+L9BEZgT69OjwtyqxLtifVpU
=mPfy
-----END PGP SIGNATURE-----

--
http://mail.python.org/mailman/listinfo/python-list



More information about the Python-list mailing list