data structure suggestion (native python datatypes or sqlite; compound select)

Vlastimil Brom vlastimil.brom at gmail.com
Thu Sep 16 18:11:25 EDT 2010


Hi all,
I'd like to ask for suggestions regarding suitable datastracture for
storing textual metadata along with a plain text string.
The input format I have is simply text with custom tags like <tag_name
tag_value>; I'd prefer to have this human readable format the original
data source.
For the application, the tags are extracted and stored in a custom
datastructure along with the  plain text.The queries should either
return the tagset for a given text position (index) or reversely the
text indices for a given tag-value combination..
(I posted some more detailed remarks earlier, as I was beginning with this topic
http://mail.python.org/pipermail/python-list/2007-December/1130275.html
http://mail.python.org/pipermail/python-list/2008-May/1141958.html

Meanwhile I managed to get working code using python native
datastructures (nested defaultdicts and sets); however after some time
I am now planning to adapt this code for a web-based program and am
considering alternative datastructures, for simplicity I started with
sqlite (which should actually suffice for the data volume and traffic
in question).

I put together some code, which works as expected, but I suspect
somehow, that there must be better ways of doing it.

Two things I am not quite clear about are using the placeholders for
the data identifiers and "chaining" the SELECT parameters.

I Couldn't find a way to use "?" placeholder for table or column
names, hence I ended up using string interpolation for them and
placeholders for the data values, like.
curs.execute('SELECT * FROM "%s" WHERE "%s"==?' % (text_name,
index_col), (text_index,))
is there a better way or is it not supposed to supply these
identifiers programatically?

For getting the matching text indices given the tags, tag_values
combination I ended up with a clumsy query:

combined_query_list = ['SELECT "%s" FROM "%s" WHERE "%s"==?' %
(index_col, text_name, tag) for tag in tags]
sql_query = " INTERSECT ".join(combined_query_list)
curs.execute(sql_query, tag_values)

which produces e.g.:
SELECT "ind" FROM "n" WHERE "KC"==? INTERSECT SELECT "ind" FROM "n"
WHERE "VN"==?

or alternatively:

select_begin = 'SELECT "%s" FROM "%s" WHERE ' % (index_col, text_name)
where_subquery = " AND ".join('"%s"==?' % (tag,) for tag in tags)
sql_query = select_begin + where_subquery

with the resulting query string like:
SELECT "ind" FROM "n" WHERE "KC"==? AND "VN"==? ('12', '1')

(BTW, are these queries equivalent, as the outputs suggest, or are
there some distinctions to be aware of?)

Anyway, I can't really believe, this would be the expected way ...

(I only marginally looked into sqlalchemy, which might simplify this a
bit, is this true? - Currently I only use the standard lib, depending
on the available server setup (python 2.6)).

Thanks in advance for any suggestions or pointers on both the sql
usage as well as the general datatype  question.

regards,
   Vlastimil Brom



More information about the Python-list mailing list