Newline (NuBe Question)

Mon Nov 27 02:49:07 EST 2023

Avi,

On 11/27/2023 4:15 PM, avi.e.gross at gmail.com wrote:
> Dave,
> 
> Back on a hopefully more serious note, I want to make a bit of an analogy
> with what happens when you save data in a format like a .CSV file.
> 
> Often you have a choice of including a header line giving names to the
> resulting columns, or not.
> 
> If you read in the data to some structure, often to some variation I would
> loosely call a data.frame or perhaps something like a matrix, then without
> headers you have to specify what you want positionally or create your own
> names for columns to use. If names are already there, your program can
> manipulate things by using the names and if they are well chosen, with no
> studs among them, the resulting code can be quite readable. More
> importantly, if the data being read changes and includes additional columns
> or in a different order, your original program may run fine as long as the
> names of the columns you care about remain the same.
> 
> Positional programs can be positioned to fail in quite subtle ways if the
> positions no longer apply.

Must admit to avoiding .csv files, if possible, and working directly 
with the .xls? original (cf expecting the user to export the .csv - and 
NOT change the worksheet thereafter).

However, have recently been using the .csv format (as described) as a 
placeholder or introduction to formatting data for an RDBMS.
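
To make the quoted positional-versus-named point concrete, here is a minimal sketch using the standard csv module. The two 'exports' and both helper functions are invented for illustration; the second export has an extra column and a different column order.

```python
import csv
import io

# Hypothetical data: the same records exported twice, the second time
# with an extra column and the columns in a different order.
EXPORT_V1 = "name,gpa\nAda,3.9\nBob,3.1\n"
EXPORT_V2 = "student_id,gpa,name\n17,3.9,Ada\n23,3.1,Bob\n"

def names_by_position(text):
    """Assume the name is always the first column (fragile)."""
    rows = csv.reader(io.StringIO(text))
    next(rows)  # skip the header line
    return [row[0] for row in rows]

def names_by_header(text):
    """Look the column up by its header name (survives reordering)."""
    return [row["name"] for row in csv.DictReader(io.StringIO(text))]

positional_v1 = names_by_position(EXPORT_V1)
positional_v2 = names_by_position(EXPORT_V2)   # silently returns the ids
named_v1 = names_by_header(EXPORT_V1)
named_v2 = names_by_header(EXPORT_V2)
```

The positional version does not crash on the second export, it just returns the wrong column, which is the 'subtle' failure mentioned.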

In a tabular structure, the expectation is that every field (column/row 
intersection) will contain a value. In the RDBMS-world, if the value is 
not-known then it will be recorded as NULL (equivalent of Python's None).

Accordingly, two points:
1 the special case of missing/unavailable data can be handled with ease,
2 most 'connector' interfaces will give the choice of retrieving data 
into a tuple or a dictionary (where the keys are the column-names). The 
latter easing data-identification issues (as described) both in terms of 
improving over relational-positioning and name-continuity (or column 
changes/expansions).
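
Both points can be sketched with the standard library's sqlite3 module, used here only as a stand-in for 'a connector' (other connectors offer similar choices under different names). The table and its columns are made up for the example.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE student (first_name TEXT, last_name TEXT, gpa REAL)"
)
# The GPA is not-known, so it is recorded as NULL.
connection.execute("INSERT INTO student VALUES ('Ada', 'Lovelace', NULL)")

# Default: each row arrives as a plain tuple, addressed by position.
as_tuple = connection.execute("SELECT * FROM student").fetchone()

# With sqlite3.Row, the same row can also be addressed by column-name.
connection.row_factory = sqlite3.Row
as_row = connection.execute("SELECT * FROM student").fetchone()
as_dict = dict(as_row)

connection.close()
```

In both forms the NULL arrives in Python as None, so as_tuple[2] and as_dict["gpa"] are the same missing value, but only the second says what is missing.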

The point about data 'appearing' without headings should be considered 
carefully. The phrase "create your own names for columns" only vaguely 
addresses the problem. If someone else has created/provided the data, 
then we need to know the exact design (schema = rules). What is the 
characteristic of each component? Not only column-names, but also what 
is the metric (eg the infamous confusion between feet and meters)...

> As I see it, many situations where some aspects are variable are not ideal
> for naming. A dictionary is an example that is useful when you have no idea
> how many items with unknown keys may be present. You can iterate over the
> names that are there, or use techniques that detect and deal with keys from
> your list that are not present. Not using names/keys here might involve a
> longer list with lots of empty slots to designate missing items. This
> clearly is not great when the data present is sparse or when the number of
> items is not known in advance or cannot be maintained in the right order.

Agreed, and this is the draw-back incurred by folk who wish to take 
advantage of the (possibly) schema-less NoSQL DBs. The DB enjoys 
flexibility, but the downstream-coder has to contort and flex to cope.

In this case, JSON files are an easy place-holder/intro for NoSQL DBs - 
in fact, Python dicts and MongoDB go hand-in-glove.
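
A small sketch of that 'contort and flex', using only the json module. The documents are invented; the point is merely that no record is obliged to carry every key.

```python
import json

# Hypothetical documents, as a schema-less store might hand them over:
# there is no guarantee that every record carries every key.
RAW = (
    '[{"name": "Ada", "gpa": 3.9},'
    ' {"name": "Bob"},'
    ' {"name": "Cy", "email": "cy@example.com"}]'
)

records = json.loads(RAW)   # a list of ordinary Python dicts

# The downstream code has to decide what a missing key means.
gpas = [record.get("gpa") for record in records]   # None where absent

# ...and has to discover the 'schema' by looking at the data.
all_keys = sorted({key for record in records for key in record})
```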

The next issue raised is sparseness. In a table, the assumption is that 
all fields, or at least most of them, will be filled with values. 
However, sparse data would make such a table very 'expensive' in terms of 
storage-space (efficiency).

Accordingly, there are other ways of doing things. All of these involve 
labeling each data-item (thus, the data expressed as a table needs to be 
at least 50% empty to justify the structural change).
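
One way to read the 50% figure, as a rough sketch: if every stored item needs a label as well as its value, each filled cell costs about two 'slots' instead of one. That accounting is a crude assumption (real storage costs differ), and the table below is invented.

```python
# Hypothetical 4 x 5 table in which only three cells hold a value.
ROWS, COLUMNS = 4, 5
dense = [[None] * COLUMNS for _ in range(ROWS)]
dense[0][1] = 7.5
dense[2][4] = 1.0
dense[3][0] = 2.25

# Labelled alternative: keep only the filled cells, keyed by (row, column).
sparse = {
    (r, c): value
    for r, row in enumerate(dense)
    for c, value in enumerate(row)
    if value is not None
}

dense_slots = ROWS * COLUMNS      # one slot per cell, filled or not
sparse_items = 2 * len(sparse)    # crude count: one label plus one value each
worth_switching = sparse_items < dense_slots

# A missing cell is simply an absent key.
missing = sparse.get((1, 1))
```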

In this case, one might consider a tree-type of structure - and if we 
have to continue the pattern, we might look at a Network Database 
methodology (as distinct from a DB on a network!)

> There are many other situations with assorted tradeoffs and to insist on
> using lists/tuples exclusively would be silly but at the same time, if you
> are using a list to hold the real and imaginary parts of a complex number,
> or the X/Y[/Z] coordinates of a point where the order is almost universally
> accepted, then maybe it is not worth using a data structure more complex or
> derived as the use may be obvious.

No argument (in case anyone thought I might...)

See @Peter's earlier advice.

Much of the consideration (apart from mutable/immutable) is likely to be 
ease of coding. Getting down 'into the weeds' is probably pointless 
unless questions are being asked about (execution-time) performance...

Isn't the word "obvious" where this discussion started? Whereas "studs" 
might be an "obvious" abbreviation for "students" to some, it is not to 
others (quite aside from the abbreviation being unnecessary in this 
day-and-age).

Curiously, whereas I DO happen to think of a point as ( x, y, ) or 
( x, y, z, ) and thus quite happily interpret ( 1, 2, 3, ) as a location 
in 3D space, I had a trainee bring a 'problem' on this exact assumption:-

He had two positions ( x1, y1, ) and ( x2, y2, ) and was computing the 
vector between them ( x2 - x1, y2 - y1 ), accordingly:

def compute_distance( x1, x2, y1, y2, ):
     # NB parameter order: both x values first, then both y values
     return ( x2 - x1, y2 - y1, )

Trouble is, the function-call was:

result = compute_distance( x1, y1, x2, y2, )

In other words, the function's signature was consistent with the 
calculation. Whereas, the function-call was consistent with the way the 
data had 'arrived'. Oops!
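
A runnable reconstruction of the mismatch, with sample coordinates of my own choosing. Note that nothing fails: both calls return a perfectly plausible-looking tuple.

```python
def compute_distance(x1, x2, y1, y2):
    # The signature groups the x values first, then the y values.
    return (x2 - x1, y2 - y1)

start = (1, 2)   # ( x1, y1, )
end = (4, 6)     # ( x2, y2, )

# Call written in the order the data 'arrived': x1, y1, x2, y2.
wrong = compute_distance(start[0], start[1], end[0], end[1])

# Call written to match the signature: x1, x2, y1, y2.
right = compute_distance(start[0], end[0], start[1], end[1])
```

With these values the first call quietly computes ( y1 - x1, y2 - x2 ), which is why the error was hard to spot.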

As soon as a (data)class Point( x, y, ) was created, the function's 
signature became:

def compute_distance( starting_point:Point, ending_point:Point, ):
     return ( ending_point.x - starting_point.x,
              ending_point.y - starting_point.y, )

and the function-call became congruent, naturally.
(in fact, the function was moved into the dataclass to become a method 
which simplified the signature and call(s) )
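
A sketch of how the method version might look. The method name vector_to is my invention (the trainee's actual name is not recorded here), as is the choice to return the result as another Point.

```python
from dataclasses import dataclass

@dataclass
class Point:
    x: float
    y: float

    def vector_to(self, other: "Point") -> "Point":
        # Hypothetical method name: displacement from self to other.
        return Point(other.x - self.x, other.y - self.y)

starting_point = Point(1, 2)
ending_point = Point(4, 6)
displacement = starting_point.vector_to(ending_point)
```

There is no longer any way to interleave the x and y values by accident: each argument is a whole Point.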

Thus, what was "obvious" to the same guy's brain when he was writing the 
function, and what seemed "obvious" when the function was being used, 
were materially (and catastrophically) different!

So, even though we (two) might think in terms of "universally", we 
are/were wrong!

Thus, a DESIGNED data-type helps to avoid errors, and even when the 
data-usage seems "obvious", offers advantage!

Once again, am tempted to suggest that the saving of:

point = ( 1, 2, )

over:

from dataclasses import dataclass

@dataclass
class Point():
     x:float
     y:float

is about as easily justified as preferring "studs" over the complete 
word "students".

YMMV!
(excepting Code Review expectations)

* will an AI-Assistant code this for us, and thus remove any 'amount of 
typing' complaint?

> I do recall odd methods sometimes used way back when I programmed in C/C++
> or similar languages when some method was used to declare small constants
> like:
> 
> #define FIRSTNAME 1
> #define LASTNAME 2
> 
> Or concepts like "const GPA = 3"
> 
> And so on, so code asking for student_record[LASTNAME] would be a tad more
> readable and if the order of entries somehow were different, just redefine
> the constant.

I've been known to do this in Python too! This example is congruent with 
what was mentioned (elsewhere/earlier): that LASTNAME is considerably 
more meaningful than 2.

Programming principles include advice that all 'magic constants' should 
be hoisted to the top of the code (along with import-statements). Aren't 
those positional indices 'magic constants'?
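
In Python that might look like the following sketch. The record and the zero-based index values are assumptions for the example (the quoted C constants start at 1).

```python
# Positional indices hoisted to the top of the module as named constants.
FIRSTNAME = 0
LASTNAME = 1
GPA = 2

student_record = ("Ada", "Lovelace", 3.9)   # hypothetical record

last_name = student_record[LASTNAME]   # cf student_record[1]
gpa = student_record[GPA]              # cf student_record[2]
```

If the order of the fields changes, only the three constants need to be redefined.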

> In some sense, some of the data structures we are discussing, under the
> hood, actually may do something very similar as they remap the name to a
> small integer offset. Others may do much more or be slower but often add
> value in other ways. A full-blown class may not just encapsulate the names
> of components of an object but verify the validity of the contents or do
> logging or any number of other things. Using a list or tuple does nothing
> else.

Not in Python: dictionary keys must be hashable values - for that reason.

Argh! The docs (https://docs.python.org/3/tutorial/datastructures.html) 
don't say that - or don't say it any more. Did it change when key-order 
became guaranteed, or do I mis-remember?

Those docs say "immutable" - but whilst "hashable" and "immutable" have 
related meanings, they are not exactly the same in effect.

Alternatively, the wiki (https://wiki.python.org/moin/DictionaryKeys) does 
say "hashable"!
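
A quick demonstration that the two words are not interchangeable. The Tag class is a throw-away invented for the purpose.

```python
# A tuple is immutable, yet it is only hashable if everything inside it is.
immutable_but_unhashable = (1, [2, 3])
try:
    {immutable_but_unhashable: "value"}
    tuple_accepted = True
except TypeError:
    tuple_accepted = False

# An instance of an ordinary class is mutable, yet hashable by default
# (the default hash is based on identity, not on content).
class Tag:
    def __init__(self, label):
        self.label = label

tag = Tag("before")
lookup = {tag: "value"}
tag.label = "after"          # mutate the object being used as a key
still_found = lookup[tag]    # the entry is still reachable
```

So "immutable" is neither necessary nor sufficient; "hashable" is the actual requirement.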

> So if you need nothing else, they are often suitable and sometimes even
> preferable.

Yes, (make a conscious choice to) use the best tool for the job - but 
don't let bias cloud your judgement, don't take the ideas of the MD's 
nephew as 'Gospel', and DO design the way forward...

--
Regards =dn