[TriZPUG] PyDataStep project -- A super convenient way to work with databases in python -- Open call for additional use cases

Nathan Rice nathan.alexander.rice at gmail.com
Wed Apr 28 17:41:26 CEST 2010


I mentioned this briefly at the PUG meeting last night. To recap:

PyDataStep attempts to remove the guesswork of getting your data into and
out of a database, including schema generation, building relations,
querying/filtering, etc.  It also covers taking queried/filtered datasets
and converting them back to common file and Python object formats,
providing complete round-trip support.

Currently, support is very good for standard tabular data that fits the
collection-of-records pattern.  There are about 200 test cases spread
across 30 different test functions.  The code is also moderately well
documented, though as with any project there is a bit of lag between
documentation and source.

I'd like to get this library to the point that it can automatically generate
acceptable schemas for 90-95% of the most common data arrangements, and make
it easy to hook into for the other 5-10%, so that it's useful unless you're
doing something completely unique.  I'd also like to polish the facade I
present over SQLAlchemy to the point that it's pretty rare you need to grab
the table's Session object and work with it directly.  For the things I do,
PyDataStep is the fastest, easiest way to work with arbitrary data using SQL
while keeping flexibility.

In order to continue developing the architecture and facade to meet the
stated goals, I need input from others on the sorts of models they often
generate, what they do with those models, and the sorts of queries they use.
It would also be helpful if people who use a database regularly for a
variety of tasks would try the library out and report what feels intuitive
and what feels clunky -- that would also help root out new test cases I need
to write :)  To make this as painless as possible, I'll happily answer any
and all questions regarding the library and provide step-by-step
instructions.  If you're interested but some generic functionality is rough
or missing, let me know.

You can get the development branch of PyDataStep via Launchpad, using
Bazaar: "bzr branch lp:~nathan-alexander-rice/pydatastep/dev".  I intend to
build eggs and put it on PyPI in the near future, but I want to shake a few
more changes out of the API before I do so.

As an example, assume you have a spreadsheet with information on all the
people or computers in your business (not entirely uncommon).  As of right
now, if you wanted to load this data into a database and query for up to 10
computers running Linux on a particular subnet, with a given service and at
least around 8 GB of RAM or a 3 GHz processor (assuming the data set was
*moderately* well formed), you would do the following:

import pydatastep.structures.table as table
import pydatastep.database.loaders.file as file_
import pydatastep.database.utilities as util


connector = util.DatabaseConnector(your_db_config_file)
computer_table = table.Table(file_.DelimitedFileLoader, your_input_file,
                             connector=connector)
results = computer_table.filter('os == "linux"',
                                'subnet like "aaa.bbb.ccc.%"',
                                'ram >= 8000000000 or processor_speed >= 3000000000',
                                'service like "%yourservice%"')[:10]

The util import and the construction of the connector could be avoided; I
intend to add support for locating a .pydatastep configuration file on the
filesystem.  That leaves two imports and two lines of code to load a file
into a database and perform a complex query on it.
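
As a rough illustration, such a lookup might walk upward from the working
directory toward the filesystem root.  This is just a sketch of one possible
approach, not code from the library (find_pydatastep_config is a made-up
name):

import os

def find_pydatastep_config(start=None):
    # Hypothetical: return the nearest .pydatastep file at or above the
    # starting directory, or None if there isn't one.
    path = os.path.abspath(start or os.getcwd())
    while True:
        candidate = os.path.join(path, ".pydatastep")
        if os.path.isfile(candidate):
            return candidate
        parent = os.path.dirname(path)
        if parent == path:  # hit the filesystem root without finding one
            return None
        path = parent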

If you wanted to commit the data to the database you just do:

computer_table.save()

If you had another dataset with information about the assigned employee for
each workstation, you could load it in much the same fashion.  Assuming you
loaded it into users_table, you could then derive a joined table via:

computer_table.merge(users_table)

and if you wanted to find the IP of John Smith's Linux workstation, you
could do the following:

computer_table.filter('os == "linux"', 'first_name == "john"',
                      'last_name == "smith"')

Graph structures are also supported.  Currently only networkx graphs can be
imported directly, but since networkx supports a large number of file
formats, that's not much of a limitation.  Say, for instance, you had a
FOAF-style social graph built as a networkx object, and you wanted to
persist the information in a database and then find everyone who was a
friend of either jack or jill but not of bob.  You could do the following:

import pydatastep.structures.graph as graph

connector = util.DatabaseConnector(your_db_config_file)
# connector keyword assumed to work as in table.Table above
friend_graph = graph.Graph(graph=your_graph_object, connector=connector)
connections = friend_graph.edges('start in ["jack", "jill"] or end in ["jack", "jill"]',
                                 'start != "bob"', 'end != "bob"')
friends_of_jack_or_jill = set()
for connection in connections:
    friends_of_jack_or_jill.update([connection.start, connection.end])
friends_of_jack_or_jill.difference_update(["jack", "jill"])  # drop jack and jill themselves
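
For reference, your_graph_object above is just an ordinary networkx graph;
it could be built with plain networkx calls like these (the names are
made-up sample data):

import networkx as nx

your_graph_object = nx.Graph()  # undirected friendship graph
your_graph_object.add_edges_from([
    ("jack", "jill"),
    ("jack", "alice"),
    ("jill", "dave"),
    ("bob", "jack"),
])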

Support for attaching units to columns is also present.  If you had a table
with cars and one of the columns was in miles per gallon, you could indicate
that like so:

car_table.set_units("fuel_use", "mile / gallon")

and if you wanted to convert it to meters per litre, you could do the
following:

car_table.set_units("fuel_use", "meter / litre")

and it would automatically rescale the values so they stay accurate, while
updating the metadata for that column in a metadata table, so that the next
time you load car_table it knows the units of "fuel_use" you set
previously.
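
For reference, the rescaling boils down to multiplying each value by a fixed
conversion factor; here is the arithmetic for mile/gallon to meter/litre
using standard definitions (illustrative only, not pydatastep code):

MILES_TO_METERS = 1609.344            # exact, by definition
US_GALLONS_TO_LITRES = 3.785411784    # exact, by definition

def mpg_to_meters_per_litre(mpg):
    # 1 mile/gallon = 1609.344 m / 3.785411784 L, about 425.14 m/L
    return mpg * MILES_TO_METERS / US_GALLONS_TO_LITRES

mpg_to_meters_per_litre(30)  # about 12754 m/L for a 30 mpg car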

car_table.set_units("fuel_use", None) # None is the default value so
car_table.set_units("fuel_use") works too.

Thanks for your time and consideration,

Nathan Rice