XML Considered Harmful

dn PythonList at DancesWithMice.info
Tue Sep 28 19:49:05 EDT 2021


On 29/09/2021 06.53, Michael F. Stemper wrote:
> On 28/09/2021 10.53, Stefan Ram wrote:
>> "Michael F. Stemper" <michael.stemper at gmail.com> writes:
>>> Well, I could continue to hard-code the data into one of the test
>>> programs
>>
>>    One can employ a gradual path from a program with hardcoded
>>    data to an entity sharable by different programs.
>>
>>    When I am hurried to rush to a working program, I often
>>    end up with code that contains configuration data spread
>>    (interspersed) all over the code. For example:
> 
>>    1st step: give a name to all the config data:
> 
>>    2nd: move all config data to the top of the source code,
>>    directly after all the import statements:
> 
>>    3rd: move all config data to a separate "config.py" module:
>>
>> import ...
>> import config
>> ...
>>
>> ...
>> open( config.project_directory + "data.txt" )
>> ...
>>
>>> but that would mean that every time that I wanted to look
>>> at a different scenario, I'd need to modify a program.
>>
>>    Now you just have to modify "config.py" - clearly separated
>>    from the (rest of the) "program".
> 
> Well, that doesn't really address what format to store the data
> in. I was going to write a module that would read data from an
> XML file:
> 
> import EDXML
> gens = EDXML.GeneratorsFromXML( "gendata1.xml" )
> fuels = EDXML.FuelsFromXML( "fueldata3.xml" )
> 
> (Of course, I'd really get the file names from command-line arguments.)
> 
> Then I read a web page that suggested use of XML was a poor idea,
> so I posted here asking for a clarification and alternate suggestions.
> 
> One suggestion was that I use YAML, in which case, I'd write:
> 
> import EDfromYAML
> gens = EDfromYAML.GeneratorsFromYAML( "gendata1.yaml" )
> fuels = EDfromYAML.FuelsFromYAML( "fueldata3.yaml" )
> 
>>> And when I discover anomalous behavior, I'd need to copy the
>>> hard-coded data into another program.
>>
>>    Now you just have to import "config.py" from the other program.
> 
> This sounds like a suggestion that I hard-code the data into a
> module. I suppose that I could have half-a-dozen modules with
> different data sets and ln them as required:
> 
> $ rm GenData.py* FuelData.py*
> $ ln gendata1.py GenData.py
> $ ln fueldata3.py FuelData.py
> 
> It seems to me that a more thorough separation of code and data
> might be useful.


Dear Michael,

May I suggest that you are right - and that he is right!
(which is a polite way of saying, also, that both are wrong. Oops!)
(with any and all due apologies)


There are likely cross-purposes here.


I am interpreting various clues, from throughout the thread (from when
the snowflakes were still falling!) that you and I were trained
way-back: to first consider the problem, state the requirements
("hypothesis" in Scientific Method), and work our way to a solution
on-paper. Only when we had a complete 'working solution', did we step up
to the machine (quite possibly a Card Punch, cf a 'computer') and
implement.

Also, that we thought in terms of a clear distinction between
"program[me]" and "data" - and the compiler and link[age]-editor
software technology of the time maintained such.


Whereas 'today', many follow the sequence of "Test-Driven Development"
(er, um, often omitting the initial test) of attempting some idea as
code, reviewing the result, and then "re-factoring" (improving), in a
circular progression - until it not only works, but works well.

This requires similar "stepwise decomposition" to what we learned, but
differs when it comes to code-composition. This approach is more likely
to accumulate a solution 'bottom-up' and component-wise, rather than
creating an entire (and close-to-perfect) solution first and as a whole.


Let's consider the Python REPL. Opening a terminal and starting the
Python interpreter gives us the opportunity to write short "snippets"
of code and see the results immediately. This is VERY handy for ensuring
that an idea is correct, or to learn exactly how a particular construct
works. Thus, we can 'test' before we write any actual code (and can
copy-paste the successful 'prototype' into our IDE/editor!).
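
For example, a tiny (hypothetical) check of how some generator data
might behave as a dict, before committing to any particular file-format:

>>> parameters = { "63" : "8.513", "105" : "8.907" }
>>> parameters[ "63" ]
'8.513'
>>> float( parameters[ "63" ] )
8.513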

We didn't enjoy such luxury back in the good/bad old days. Young people
today - they just don't know how lucky they are!
(cue other 'grumpy old man' mutterings)


Other points to consider: 'terminals' (cf mainframes), interpreted
languages, and 'immediacy'. These have all brought "opportunities" and
thus "change" to the way developers (can) work and think! (which is why
I outlined what I think of as 'our training' and thus 'our thinking
process' when it comes to software design, above)

Another 'tectonic shift' is that in the old days 'computer time' was
hugely expensive and thus had to be optimised. Whereas these days (even
in retirement) programming-time has become the more expensive component
as computers (or compute-time in cloud-speak) have become cheaper - and
thus we reveal one of THE major attractive attributes of the Python
programming language!


Accordingly (and now any apologies may be due to our colleague - who
was amplifying/making a similar point to my earlier contribution):
if we decompose the wider-problem into (only) the aspects of collecting
the data, we can assume/estimate/document/refer to that, as a Python
function:

def fetch_operating_parameters():
    """Docstring!"""
    pass


(yes, under TDD we would first write a test to call the function and
test its results, but for brevity (hah!) I'll ignore that and stick with
the dev.philosophy point)
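
That said, for anyone who does want the 'test first', a minimal sketch
- using the standard-library unittest, and assuming the stub lives in a
hypothetical "parameters.py" module - might be:

import unittest
from parameters import fetch_operating_parameters   # hypothetical module name

class TestFetchOperatingParameters(unittest.TestCase):
    def test_returns_a_tuple(self):
        # fails ("red") until the stub is fleshed-out - which is the point of TDD
        self.assertIsInstance(fetch_operating_parameters(), tuple)

if __name__ == "__main__":
    unittest.main()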

Decomposing further, we decide there's a need to pull-in characteristics
of generators, fuel, etc. So, then we can similarly expect to need, and
thus declare, a bunch more functions - with the expectation that they
will probably be called from 'fetch_operating_parameters()'. (because
that was our decomposition hierarchy)
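
A sketch of such declarations (the fuel-related name is purely
illustrative) could be:

def fetch_generator_parameters():
    """Characteristics of each generator."""
    ...

def fetch_fuel_parameters():
    """Characteristics of each fuel."""
    ...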


Now, let's return to the program[me] cf data contention. This can also
be slapped-together 'now', and refined/improved 'later'. So, our first
'sub' input function could be:

def fetch_generator_parameters() -> tuple[dict, ...]:
    """Another docstring."""
    skunk_creek_1 = {
        "IHRcurve_name" : "normal",
        "63" : "8.513",
        "105" : "8.907",
        # etc - remaining data-points elided
        }
    ...
    return skunk_creek_1, ...


Accordingly, if we 'rinse-and-repeat' for each type of input parameter
and flesh-out the coding of the overall input-construct
(fetch_operating_parameters() ) we will be able to at least start
meaningful work on the ensuing "process" and "output" decompositions of
the whole.
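
By way of illustration only, the 'top' input-function might then be
little more than (assuming the sub-functions sketched earlier):

def fetch_operating_parameters():
    """Collect every category of input-data, ready for the Process stage."""
    generators = fetch_generator_parameters()
    fuels = fetch_fuel_parameters()
    return generators, fuels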

(indeed, reverting to the Input-Process-Output overview, if you prefer
to stick with the way we were taught, there's no issue with starting at
'the far end' by writing an output routine and feeding it 'expected
results' as arguments (which you have first calculated on-paper) to
ensure it works, and continuing to work 'backwards' through 'Process' to
'Input'. Whatever 'works' for you!)
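
A minimal sketch of that 'far end' idea - the routine-name and the
figure are purely illustrative:

def report_total_cost(total_cost):
    """Output: present the figure which 'Process' will eventually compute."""
    print(f"Total fuel cost: {total_cost:.2f}")

report_total_cost(1234.56)   # an 'expected result', first calculated on-paper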


Note that this is a Python-code solution to the original post about
'getting data in there'. It is undeniably 'quick-and-dirty', but it is
working, and working 'now'! Secondly, because the total-system only
'sees' a function, you may come back 'later' and improve the
code-within, eg by implementing a JSON-file interface, one for XML, one
for YAML, or whatever your heart desires - and you can have the
entire system up-and-running before you get to the stage of 'how can I
make this [quick-and-dirty code] better?'.
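
As a sketch of one such 'later' improvement - assuming a hypothetical
"gendata1.json" file containing a list of generator-records - the
function's innards change, but its signature (and thus its callers)
need not:

import json

def fetch_generator_parameters() -> tuple[dict, ...]:
    """Same signature as before; the data now lives in an external file."""
    with open("gendata1.json") as data_file:
        return tuple(json.load(data_file))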

(with an alternate possible-conclusion)

Here's where "skill" starts to 'count'. If sufficient forethought went
into constructing the (sub-)function's "signature", changing the code
within the function will not result in any 'ripple' of
consequent-changes throughout the entire system! Thus, as long as
'whatever you decide to do' (initially, and during any
improvements/changes) returns a tuple of dict-s (my example only), you
can keep (learning, experimenting, and) improving the function without
other-cost!

(further reading: the Single Responsibility Principle)


So, compared with our mouldy-old (initial) training, today's approach
seems bumbling: it spends time producing a first-attempt which (must)
then be improved at a further cost of time (and, watching folk work, I
regularly have to 'bite my tongue' rather than say something that might
generate philosophical conflict). However, when combined with TDD,
whereby each sub-component is known to be working before it is
incorporated into any larger component of the (and eventually the whole)
solution, we actually find a practical and workable alternate-approach
to the business of coding!


Yes, things are not as cut-and-dried as the attempted-description(s)
here. It certainly pays to sit down and think about the problem first -
but 'they' don't keep drilling down to 'our' level of detail, before
making (some, unit) implementation. Indeed, as this thread shows, as
long as we have an idea of the inputs required by Process, we don't need
to detail the processes; we can attack the sub-problem of Input quite
separately. Yes, it is a good idea to enact a 'design' step at each
level of decomposition (rushing past which is too frequently a problem
exhibited - at least by some of my colleagues).

Having (some) working-code also enables learning - and in this case (but
not in all cases) that is a side-benefit. Once some 'learning' or
implementation has been achieved, you may well feel it appropriate to
improve the code - even to trial some other 'new' technique. At which
point, another consideration arises (or should!): do I do it now, or do
I make a ToDo note to come back to it later?

(see also "Technical Debt", but please consider that the fleshing-out
the rest of the solution (and 'learnings' from those steps) may
(eventually) realise just as many, or even more, of the benefits of 'our
approach' of producing a cohesive overall-design first! Possibly even
more than the benefits we intended in 'our' approach(?).


Unfortunately, it is a difficult adjustment to make (as related), and
there are undoubtedly stories of how the 'fools rush in where angels
fear to tread' approach is but a road to disaster and waste. The 'trick'
is to "cherry pick" from today's affordances and modify our
training/habits and experience to take the best advantages from both...


Hope this helps to explain why you may have misunderstood some
contributions 'here', or felt like arguing-back. Taking a step back or a
'wider' view, as has been attempted here, may show the implicit and
intended value of (many) contributions.


I'll leave you with a quote from Donald Knuth (of The Art of Computer
Programming fame), from his 1974 paper "Structured Programming with go
to Statements": “Premature optimization is the root of all evil.” So,
maybe early-coding/prototyping and later "optimisation" isn't all bad!
--
Regards,
=dn

