format for storing textual data (for an edition) - formatting and additional info

Vlastimil Brom vlastimil.brom at gmail.com
Wed Dec 12 18:12:09 EST 2007


Hi all,
First, I apologize for this rather long post without a concrete
programming question. I'd like to ask about tools, methods, or concepts
available in the Python environment that I am probably missing, so I
thought I should describe my task in some detail, as well as the
approaches I've tried so far.

I am preparing a digital edition of a set of texts (probably with a
wxPython GUI in the final form, not web-based (yet)); however, the most
important decision just now seems to be how to store the textual data.
I need to keep track of various additional parameters for specific
portions of text (e.g. text source, chapter, verse number, folio of the
manuscript, some alternative numberings etc.).
At first I thought XML would be suitable, as there are some standards for
presenting old texts this way; but trying this out I encountered
several problems:

I didn't find any suitable GUI widget capable of displaying XML
graphically (formatted) as well as exporting it, e.g. as RTF.
More importantly, the texts I have cannot easily be treated in the
hierarchical manner required for XML, as the portions of text being
described overlap (e.g. chapter vs. folio boundaries).
An option I tried with XML was to split the text into the smallest
relatively "unproblematic" units and repeat most of the attributes for each
element, but in my opinion the usual benefits of XML structures are
lost this way and the redundancy of such a format is hardly acceptable.
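
To illustrate the two points above, here is a minimal sketch using the
standard library's xml.etree.ElementTree. The element and attribute names
(text, chapter, folio, seg, n) and the sample text are made up for this
example and are not taken from any real edition or standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: chapter 1 starts on folio 1r and ends on folio 1v,
# so the chapter and folio elements overlap instead of nesting.
overlapping = (
    "<text>"
    "<folio n='1r'><chapter n='1'>aaa </folio>"
    "<folio n='1v'>bbb</chapter> ccc</folio>"
    "</text>"
)

# The workaround: split into the smallest non-overlapping units and
# repeat the attributes on every unit.
split_units = (
    "<text>"
    "<seg chapter='1' folio='1r'>aaa </seg>"
    "<seg chapter='1' folio='1v'>bbb </seg>"
    "<seg chapter='2' folio='1v'>ccc</seg>"
    "</text>"
)


def is_well_formed(xml_string):
    """Return True if the string parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_string)
    except ET.ParseError:
        return False
    return True


# Each unit carries the full (redundant) set of attributes.
units = [(dict(seg.attrib), seg.text) for seg in ET.fromstring(split_units)]
```

The overlapping version is rejected by the parser, while the split version
parses but has to repeat the chapter and folio values on every segment.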

Moreover, the display isn't addressed this way.
I would also like to have a search function for this set of texts; the
additional data should be preserved in the results.

For now I ended up with a kind of pseudo-markup similar to XML, where the
tags aren't structured; they only mark the changing parameters while
allowing overlapping (the tag set is more or less hardcoded, and nesting
of tags with the same name is not allowed).
Currently the implementation is surely far from ideal: it is based on
regexp parsing of the tags and storing their values along with the text
index in a dict. After writing the raw text to the widgets (wx.TextCtrl),
the styling can be applied according to the data in the dict, and the
additional information for any given position can be retrieved (e.g.
numbering; currently for display, in future the synchronisation of
multiple interrelated texts should be possible).
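
A minimal sketch of this kind of approach, not the actual implementation:
the tag syntax <name=value>, the function names parse_markup and params_at,
and the sample text are all assumptions made up for illustration.

```python
import re

# Assumed tag syntax: <name=value> marks a change of one parameter.
TAG_RE = re.compile(r"<(\w+)=([^>]*)>")


def parse_markup(source):
    """Strip the tags and return (plain_text, changes).

    changes maps an index in the plain text to a dict of the
    parameters that change at that index.
    """
    parts = []
    changes = {}
    plain_len = 0
    last_end = 0
    for match in TAG_RE.finditer(source):
        chunk = source[last_end:match.start()]
        parts.append(chunk)
        plain_len += len(chunk)
        changes.setdefault(plain_len, {})[match.group(1)] = match.group(2)
        last_end = match.end()
    parts.append(source[last_end:])
    return "".join(parts), changes


def params_at(changes, position):
    """Replay all changes up to position and return the values in force."""
    state = {}
    for index in sorted(changes):
        if index > position:
            break
        state.update(changes[index])
    return state


source = ("<folio=1r><chapter=1>In the beginning <folio=1v>was the word. "
          "<chapter=2>Next chapter.")
plain, changes = parse_markup(source)
```

Here plain holds the text without tags (what would be written to the
widget), and params_at(changes, i) gives the chapter and folio valid at
character i, regardless of how the chapter and folio ranges overlap.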
Basically this concept works for me somehow, including the text search,
which can benefit directly from the regular expression support, but I feel
this is not quite as effective or straightforward as I would like.
In particular, I would like to have a more general solution rather than an
ad hoc format, which isn't very versatile.
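
As a rough sketch of how such a search could keep the additional data in
its results: the plain text and the index-to-parameters dict below are
made-up sample data, and the names search and params_at are hypothetical.

```python
import re

# Made-up sample data: plain text plus the parameter changes by text index.
plain = "In the beginning was the word. Next chapter."
changes = {
    0: {"folio": "1r", "chapter": "1"},
    17: {"folio": "1v"},
    31: {"chapter": "2"},
}


def params_at(changes, position):
    """Replay all changes up to position and return the values in force."""
    state = {}
    for index in sorted(changes):
        if index > position:
            break
        state.update(changes[index])
    return state


def search(pattern, plain, changes):
    """Return (start, matched_text, parameters) for every regex match."""
    return [
        (m.start(), m.group(0), params_at(changes, m.start()))
        for m in re.finditer(pattern, plain)
    ]


# Find words starting with "w" together with their chapter and folio.
hits = search(r"\bw\w+", plain, changes)
```

Each hit carries the parameters valid at the start of the match, so a
result list can show chapter and folio next to the matched text.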

I was wondering if someone would have any suggestions for dealing with such
tasks. Am I maybe mistaken in my assumptions (unsuitability of XML
...) or am I missing some tools or techniques usable for this?

Any hints are much appreciated,

regards,
  Vlasta