[Tutor] How best to structure a plain text data file for use in program(s) and later updating with new data?

Wed Oct 8 16:56:02 CEST 2014

About two years ago I wrote my most ambitious program to date, a
hodge-podge collection of proprietary scripting, perl and shell files
that collectively total about 20k lines of code. Amazingly it actually
works and has saved my colleagues and me much time and effort. At the
time I created this mess, I was playing "guess the correct proprietary
syntax to do something" and "hunt and peck perl" games and squeezing
this programming work into brief snippets of time away from what I am
actually paid to do. I did not give much thought to design at the time
and knew I would regret it later. That day has come! So now, in my
current few snippets of time, I wish to redesign this program from
scratch and make it much, much easier to maintain the code and
update the data tables, which change from time to time. And now that I
have some version of python available on all of our current Solaris 10
systems (python versions 2.4.4 and 2.6.4), it seems like a fine time
to (finally!) do some serious python learning.

Right now I have separated my data into their own files. Previously I
had integrated the data with my source code files (Horrors!).
Currently, a snippet from one of these data files is:

NUMBER_FX:ONE; DATA_SOURCE:Timmerman; RELEASE_DATE:(11-2012);

SERIAL_ROI:Chiasm; TEST_VOLUME:< 0.2 cc; VOLUME_MAX_GY:8.0;
MAX_PT_DOSE_GY:10.0; MAX_MEAN_DOSE: ;
SERIAL_ROI:Optic_Nerve_R; TEST_VOLUME:< 0.2 cc; VOLUME_MAX_GY:8.0;
MAX_PT_DOSE_GY:10.0; MAX_MEAN_DOSE: ;
SERIAL_ROI:Optic_Nerve_L; TEST_VOLUME:< 0.2 cc; VOLUME_MAX_GY:8.0;
MAX_PT_DOSE_GY:10.0; MAX_MEAN_DOSE: ;

[...]

PARALLEL_ROI:Lungs_Bilateral; CRITICAL_VOLUME_CC:1500.0;
CRITICAL_VOLUME_DOSE_MAX_GY:7.0; V8GY: ; V20GY: ; MAX_MEAN_DOSE: ;
PARALLEL_ROI:Lungs_Bilateral; CRITICAL_VOLUME_CC:1000.0;
CRITICAL_VOLUME_DOSE_MAX_GY:7.6; V8GY:< 37.0%; V20GY: ;
MAX_MEAN_DOSE: ;
PARALLEL_ROI:Liver; CRITICAL_VOLUME_CC:700.0;
CRITICAL_VOLUME_DOSE_MAX_GY:11.0; V8GY: ; V20GY: ; MAX_MEAN_DOSE: ;
PARALLEL_ROI:Renal_Cortex_Bilateral; CRITICAL_VOLUME_CC:200.0;
CRITICAL_VOLUME_DOSE_MAX_GY:9.5; V8GY: ; V20GY: ; MAX_MEAN_DOSE: ;
[EOF]

I just noticed that copying from my data file into my Google email
resulted in all extra spaces being condensed into a single space. I do
not know why this has just happened. Note that there are no tab
characters. The [...] indicates omitted lines of serial tissue data
and [EOF] just notes the end-of-file.

I am far from ready to write any code at this point. I am trying to
organize my data files so that they will be easy for the programs
that process the data to use, and also easy to update every time
these data values get improved upon. For the latter, I
envision writing a second program to enable anyone to update the data
tables when we are given new values. But until that second program
gets written, the data files would have to be opened and edited
manually, which is why I have included labels in all-caps ending in a
colon. This is so whoever edits the file will know what is being
edited. So, basically the actual data fields fall between ":" and ";".
String representations of numbers will need to get converted to floats
by the program. Some fields containing numbers are of a form like
"< 0.2 cc". These will get copied as is into a GUI display, while the
"0.2" will be used in a computation and/or comparison. Also notice
that in each data file there are two distinct groupings of records--one
for serial tissue (SERIAL_ROI:) and one for parallel tissue
(PARALLEL_ROI:). The fields used are different for each grouping.
Also, notice that some
fields will have no values, but in other data files they will have
values. And finally, the header line at the top of the file identifies
the number of fractions (FX) for which the data is to be used, as well
as the source of the data and the date on which that source released
it.
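
To make the layout described above concrete, here is a minimal python
sketch of one way it might be read. It is only a sketch under
assumptions: the function names (parse_fields, parse_data,
first_number) and the 'KIND' and 'ROI' dictionary keys are invented
for this example, the sample text is a shortened copy of the snippet
above, and it has not been tried on python 2.4.4 or 2.6.4, although it
tries to stick to basic constructs.

```python
import re

# Finds the first non-negative decimal number in a string such as
# "< 0.2 cc" or "< 37.0%". Negative numbers are not handled.
NUMBER_RE = re.compile(r'\d+(?:\.\d+)?')

# Labels assumed to mark the start of a new record.
RECORD_START_LABELS = ('SERIAL_ROI', 'PARALLEL_ROI')


def parse_fields(text):
    """Return a list of (label, value) pairs from 'LABEL:value;' text."""
    pairs = []
    for chunk in text.split(';'):
        chunk = chunk.strip()
        if ':' not in chunk:
            continue  # blank chunks, or markers without a label
        label, value = chunk.split(':', 1)
        # Collapse runs of whitespace (including line wraps) in the value.
        pairs.append((label.strip(), ' '.join(value.split())))
    return pairs


def parse_data(text):
    """Return (header, records); each is built from plain dictionaries."""
    header = {}
    records = []
    current = header  # fields seen before the first record are header fields
    for label, value in parse_fields(text):
        if label in RECORD_START_LABELS:
            # 'KIND' and 'ROI' are key names made up for this sketch.
            current = {'KIND': label, 'ROI': value}
            records.append(current)
        else:
            current[label] = value
    return header, records


def first_number(value):
    """Return the first number in value as a float, or None if absent."""
    match = NUMBER_RE.search(value)
    if match is None:
        return None
    return float(match.group())


SAMPLE = """\
NUMBER_FX:ONE; DATA_SOURCE:Timmerman; RELEASE_DATE:(11-2012);

SERIAL_ROI:Chiasm; TEST_VOLUME:< 0.2 cc; VOLUME_MAX_GY:8.0;
MAX_PT_DOSE_GY:10.0; MAX_MEAN_DOSE: ;
PARALLEL_ROI:Lungs_Bilateral; CRITICAL_VOLUME_CC:1000.0;
CRITICAL_VOLUME_DOSE_MAX_GY:7.6; V8GY:< 37.0%; V20GY: ; MAX_MEAN_DOSE:
;
"""

header, records = parse_data(SAMPLE)
print(header['DATA_SOURCE'])
for record in records:
    print(record['KIND'] + ' ' + record['ROI'])
```

Because the text is split on ";" rather than read line by line, it
does not matter where a record happens to be wrapped, and an empty
field such as "MAX_MEAN_DOSE: ;" simply comes back as an empty string.
The original string ("< 0.2 cc") stays available for display, while
first_number() gives the float for computation or comparison.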

Finally the questions! Will I easily be able to use python to parse
this data as currently structured, or do I need to restructure this? I
am not at the point where I am aware of what possibilities python
offers to handle these data files. Also, my efforts to search the 'net
did not turn up anything that really clicked for me as the way to go.
I could not seem to come up with a search string that would bring up
what I was really interested in: What are the best practices for
organizing plain text data?

Thanks!

-- 
boB