data structure suggestion (native python datatypes or sqlite; compound select)

Wed Sep 22 20:28:24 EDT 2010

2010/9/19 Dennis Lee Bieber <wlfraed at ix.netcom.com>:
> On Sat, 18 Sep 2010 23:00:25 +0200, Vlastimil Brom
> <vlastimil.brom at gmail.com> declaimed the following in
> gmane.comp.python.general:
>
>> Thank you very much for detailed hints, I see, I should have mention
>> the specification with my initial post...
>> It is true, that nested tags of the same name aren't supported, but
>> tags may appear anywhere in the text and aren't terminated with
>> newline. The tag-value association is valid from the tag position
>> until the next tag replacing the value or closing tag (like </b>) or
>
>        Okay... Then put the start/end fields back into the database <G>
>
>        Also, if the data can span lines, using a foreign key to a table
> having the lines is meaningless... I'd probably generate a plain text
> file (one that does not contain any tags), and record the start/end
> positions for the tags based as a pure character count from start of
> file. Optional if you strip new-lines from this file so it is just a
> long line of text...
>
>        Parsing may need to be recursive so you can determine the end points
> for outer tags and generate the full record (type, start, end,
> supplement) {I didn't list ID, though my standard practice is to always
> have an auto-increment ID field in a table -- it simplifies later
> updates}.
>
>        Do the tags form a hierarchy? That is, does a "higher level" tag
> force a closure of all unclosed lower levels?
>
>> I'll have a closer look on joins in sql and maybe redesign the data
>> structure - now the tags data are copied for each text position with
>> some tag change - in order to simplify queries; with multiple tables
>> it could be more efficient to store the tags separately and look it up
>> individually (probably using bisect (unless there is an SQL equivalent
>> ?)
>
>        I'm not sure of how you mean "bisect" but I do suggest that you look
> up a lesson on "database normalization" (emphasis on first, second, and
> third normal forms; the others are rather esoteric).
>
>        What I'm pretty sure you do NOT want to do is create a table for
> each tag TYPE. The type is just another data value, as in my sample
> tables. That makes queries much simpler since you don't run into the
> problem of having to generate queries where you change the schema
> entities ("schema" is the technical term for the layout of the database
> -- the fixed items rather than the data; end users should never directly
> enter schema items in an application [you can present a menu of schema
> items and let the user pick one, but the selection is never used as-is,
> it has to be restricted to a list you generate, and returns values from
> your list; look up "sql injection"])
>
>        With the actual text in a plain file, the joins I was using are not
> needed -- you'd use the start/end results to seek/read the file as a
> separate step.
>
>        However, you might have some complex queries that use subselects if
> you need to retrieve a passage inside a passage (poor phrasing).
>
>        That is, a subselect that specifies you want the start/end range of
> a tag with a particular attribute, and you use the start/end as >, <
> comparisons to find a different tag. Off the top of my head (hence may
> not be valid SQL)
>
> select t1.type, t1.supplement, t1.start, t1.end from tags as t1
>        inner join
>                (select start, end from tags
>                        where type = "some type"
>                        and supplement = "some value") as t2
>        where t1.start >= t2.start and t1.end <= t2.end
>
> --
>        Wulfraed                 Dennis Lee Bieber         AF6VN
>        wlfraed at ix.netcom.com    HTTP://wlfraed.home.netcom.com/
>
> --
> http://mail.python.org/mailman/listinfo/python-list
>

Thanks all for the pointers and recommendations!
After some tests I eventually found, that I am (for now) probably
better off with nested (default)dict and set data;
while querying tags for a given index was very elegant even with a
very simple database design, I wasn't able to get the other query
(text interval(s) for given combination of tag-values) with a
comparably straightforward solution. It seems, that in order to handle
tag combinations of arbitrary length either programmatically created
queries or maybe recursion or some kind of postprocessing database
data using interval arithmetic would be needed (or probably some
database features, I am not getting :)
In any case the initial custom datastructure seems to meet the
requirements after all.

Thanks again,
          Vlastimil Brom