[Tutor] looking but not finding

Alan Gauld alan.gauld at yahoo.co.uk
Wed Jul 12 19:49:59 EDT 2023


This is late at night so I won't reply to all the points
but I'll pick up a few of the easier ones! :-)


On 12/07/2023 16:26, o1bigtenor wrote:
> What I'm finding is that the computing world uses terms in a different way than
> does the sensor world than does the storage world than does the rest of us.

Sadly true. I come from an electronic/telecomms background but
most of my career was in software engineering. SE is a very new
field and doesn't have the well established vocabulary of more
traditional engineering fields. Often one term can have
multiple meanings depending on context. And other times multiple
terms are used for the exact same thing.



>> from structured files(shelves, JSON, YAML, XML, etc to
>> databases both SQL and NoSQL based.
> 
> This is where I start getting 'lost'. I have no idea which of the three
> listed formats will work well

I actually listed 4! :-)
shelve is a Python-specific format that will take arbitrary Python
data structures and save them to a file that can then be accessed like a
dictionary. It's very easy to use and works well if your data can be
found via a simple key. For more general use it has issues.
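To make that concrete, here is a minimal sketch of shelve in use.
The filename and the reading fields are made up for illustration:

```python
import shelve

# Store a (hypothetical) sensor reading keyed by station name.
with shelve.open("readings.db") as db:
    db["station1"] = {"temp": 21.5, "humidity": 0.48}

# Reopen later and access it like a dictionary.
with shelve.open("readings.db") as db:
    print(db["station1"]["temp"])  # 21.5
```

Note that the key must be a string; that is the "simple key"
limitation mentioned above.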

JSON, YAML, XML are all text based files using structured text
to define the nature of the data. JSON looks a lot like a Python
dictionary format. XML is good for complex data but is correspondingly
complex to work with. YAML is somewhere in the middle.
For your purposes JSON is probably a good choice.
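As a quick illustration of why JSON maps so naturally onto Python,
here is a round trip through the standard library json module
(the field names are invented for the example):

```python
import json

# A hypothetical sensor reading as a Python dict.
reading = {"station": "north-field", "sensor": "soil-temp",
           "value": 18.2, "timestamp": "2023-07-12T16:26:00"}

text = json.dumps(reading)    # Python dict -> JSON text
restored = json.loads(text)   # JSON text -> Python dict
print(restored["value"])      # 18.2
```

The JSON text can be written to a file, sent over a network, or
read by programs in other languages.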

Wikipedia describes all of them in much more detail.


> The first point with the data is to store it. At the time of its collection
> the data will also be used to trigger other events so that will be (I think
> that is the proper place) be done in the capturing/organizing/supervising
> program (in python).

That raises a whole bunch of other questions such as is the data
stateful? Does the processing of sample Sn depend on the values
of Sn-1, Sn-2... If so you need to store a "window" of data in
memory and then write out the oldest item to storage as it expires.

If not you can probably just read, process, store.
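A sketch of the "window" idea, assuming the processing only needs
the last few samples (here a made-up moving average over 3 samples,
with a list standing in for real storage):

```python
from collections import deque

WINDOW = 3      # how many past samples the processing needs (assumed)
window = deque()
stored = []     # stand-in for the real storage back end

def handle(sample):
    """Add a sample, process the current window, expire the oldest."""
    window.append(sample)
    result = sum(window) / len(window)   # placeholder processing
    if len(window) > WINDOW:
        stored.append(window.popleft())  # oldest item written to storage
    return result

for s in [1.0, 2.0, 3.0, 4.0, 5.0]:
    handle(s)
print(stored)  # [1.0, 2.0]
```

The stateless case collapses to just the read/process/store body
with no deque at all.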

> honestly cannot tell what makes data
> 'irregular'

It just means that different data items have different attributes.
Some fields may be optional and different readings will contain
different numbers of fields.
Or the same fields may contain different types of data, for example
one sensor reports on/off while another gives a magnitude and
another gives a status message. That would be irregular data
and SQL databases don't like it much (although there are
techniques to get round it).

Regular data just means every reading looks the same so you can
define a table with a set number of columns each of a known type.
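Side by side, with invented sensor names, the difference looks
like this:

```python
# Irregular: same stream, but each reading has a different shape.
irregular = [
    {"sensor": "pump1", "state": "on"},                # on/off only
    {"sensor": "tank1", "level": 0.82, "unit": "m"},   # a magnitude
    {"sensor": "gate3", "status": "stuck open"},       # free-text message
]

# Regular: every reading has the same fields of the same types,
# so it maps directly onto a table of (name, timestamp, value).
regular = [
    ("pump1", "2023-07-12T16:26:00", 21.5),
    ("tank1", "2023-07-12T16:26:00", 19.8),
]
```

The irregular readings fit a JSON file (or NoSQL store) easily;
the regular ones fit a SQL table with three columns.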

> Now if I only knew what regular and/or irregular data was.
> Have been considering using Postgresql as a storage engine.
> AFAIK it has the horsepower to deal with serious large amounts of data.

It will do the job but so would something much lighter like SQLite.
The really critical factor is how much parallel access you need.
SQLite is great if only a single program is reading and writing
the data. Especially if there is only a single thread of execution.
The server based databases like Postgres come into their own if
you have multiple clients accessing the database at once. They can
perform tricks like record level locking (rather than table level
or even database level) during writes.
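For the single-program case, SQLite is in the standard library
and needs no server at all. A minimal sketch (the table and
column names are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect("sensors.db")  # creates the file if needed
conn.execute("""CREATE TABLE IF NOT EXISTS readings (
                    station TEXT, sensor TEXT, ts TEXT, value REAL)""")
conn.execute("INSERT INTO readings VALUES (?, ?, ?, ?)",
             ("north", "soil-temp", "2023-07-12T16:26:00", 18.2))
conn.commit()

for row in conn.execute("SELECT sensor, value FROM readings"):
    print(row)
conn.close()
```

Moving to Postgres later mostly means swapping the connect call
for a driver such as psycopg and adjusting the SQL dialect; the
table design carries over.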

> Its not serious huge data amounts but its definitely non-
> trivial.

Modern storage solutions have made even Gigabytes of data almost
trivial. Big volumes would be more of an issue if using flat files
because they need to be read from disk and that takes time for big
volumes - even using SSDs.

> Concurrent - - - yes, there are multiple sensors per station and
> multiple stations and hopefully they're all working (problems
> with the system if they're not!)

You can deal with a relatively small number of sensors in a single
threaded Python application, just by polling each one in sequence.
But if the read/process/store time gets bigger then that will limit
how many cycles you can perform in your 0.5 second window. One option
is to have multiple threads, each reading a sensor (or sensor group).
Another is to split the processing out to a separate program that
reads the recorded data from storage and processes it before issuing
updates to the sensors as needed.
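The one-thread-per-sensor option can be sketched like this.
Everything here is invented for illustration: the read_sensor
stub stands in for real hardware I/O, and a lock protects the
shared results dict:

```python
import threading
import time

def read_sensor(name):
    # Stand-in for real hardware I/O; returns a fake value.
    return 42.0

results = {}
lock = threading.Lock()

def poll(name, interval, cycles=3):
    """One thread per sensor (or group), polling on its own clock."""
    for _ in range(cycles):
        value = read_sensor(name)
        with lock:                 # threads share the results dict
            results[name] = value
        time.sleep(interval)

threads = [threading.Thread(target=poll, args=(n, 0.05))
           for n in ("s1", "s2")]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # {'s1': 42.0, 's2': 42.0}
```

In a real system the interval would be tuned to the 0.5 second
window and the stub replaced by the actual sensor read.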

This takes us into the thorny world of systems architecture and
concurrency management both of which are complex and depend on
detailed analysis of requirements. Probably at a level beyond
the scope of this list! (Although that was my day job before
I retired!)

> There is a small amount of analysis being done as the data is stored
> but that's to drive the system (ie at point x you stop this, at point
> y you do this, point y does this for z time then a happens etc - - - I
> do not consider that heavy duty analysis. That happens on a different
> system (storage happens first to the local governing system. 

OK that sounds like the second of my options above. That's a
perfectly viable approach.

> a round of data (one cycle) has completed that data is shipped in a burst
> to the long term storage system.

That again is viable but does run the risk of losing all data for the
cycle if anything breaks. But if the data are interdependent that may
not matter and may even be a good thing.

> of information - - - or that's the goal anyway. Maybe One system will store the
> info and another will do analysis and then return that analysis to the storing
> system for accumulation - - - also not decided.)

That's fine, but you need very careful analysis to determine the
architecture, and it will need to be based on throughput
calculations etc. (This is where the old software engineering
meme about only considering performance after you find there
is a problem does not apply. The cost of refactoring a
whole architecture is very high!)

> what I should be using for what you termed 'structured files'.

The more I read your posts the more I think you should go with a
database, and probably a server based one, because I think you'll
wind up with several independent programs reading/writing data concurrently.

I'm still not 100% sure if it's a SQL or NoSQL solution,
but my gut says SQL will be adequate.

-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos
