Suggestions on mechanism or existing code - maintain persistence of file download history

Thu Jan 30 19:01:28 EST 2020

On 30/01/20 9:35 PM, R.Wieser wrote:
>> MRAB's scheme does have the disadvantages to me that Chris has pointed
>> out.
> Nothing that can't be countered by keeping copies of the last X number of
> to-be-downloaded-URLs files.

That's a good idea, but how would the automated system 'know' to give up
on the current file and utilise generation n-1? Unable to open the file 
or ???
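
One possible answer, as a minimal sketch only: treat "cannot open" and
"cannot parse" alike, and fall back one generation at a time. The JSON
list-of-strings format and the name load_latest_valid are my own
assumptions for illustration, not anything the OP has described.

```python
import json
from pathlib import Path


def load_latest_valid(paths):
    """Return (path, urls) for the first generation that opens and parses.

    `paths` is ordered newest first. A generation is rejected if it cannot
    be opened, is not valid JSON, or is not a list of strings.
    (Hypothetical file format: a JSON list of URL strings.)
    """
    for path in paths:
        try:
            data = json.loads(Path(path).read_text(encoding="utf-8"))
        except (OSError, ValueError):
            continue  # unreadable or truncated: try the older generation
        if isinstance(data, list) and all(isinstance(u, str) for u in data):
            return path, data
    raise RuntimeError("no usable generation found")
```

Note that this only catches damage which breaks the format. A file that
parses but holds stale or wrong content would still be accepted.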

> As for rewriting every time, you will /have/ to write something for every
> action (and flush the file!), if you think you should be able to ctrl-c (or
> worse) out of the program.

Which is the nub of the problem!

Using ctrl+c is a VERY BAD idea. Depending upon the sophistication of 
the solution/existing code, surely there is another way...

Even closing/pulling-out the networking connection to cause an exception 
within Python, would enable management of a more 'clean' and 'data safe' 
shutdown!
(see also 'sledgehammer to crack a nut')
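
By way of illustration, one 'other way' is to let ctrl+c merely request
a stop, so that the download in progress finishes and is recorded before
the program exits. This is a sketch under my own assumptions: the names
StopRequested, process_all and install are hypothetical, and download
and record stand in for whatever the OP's code actually does.

```python
import signal


class StopRequested:
    """Records that a shutdown was requested, without interrupting work."""

    def __init__(self):
        self.flag = False

    def __call__(self, signum, frame):
        # Called by Python when the signal arrives: only set the flag.
        self.flag = True


def install(stop):
    """Route ctrl+c (SIGINT) to the flag instead of KeyboardInterrupt."""
    signal.signal(signal.SIGINT, stop)


def process_all(urls, download, record, stop):
    """Download each URL in turn, stopping cleanly between items."""
    done = []
    for url in urls:
        if stop.flag:
            break  # leave the remaining URLs for the next run
        download(url)
        record(url)  # history is only written for completed items
        done.append(url)
    return done
```

The stop is only noticed between items, so a very long download would
delay the exit. Whether that is acceptable depends on the spec.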

Why do you need to abandon the process mid-way?

> But, you could opt to write this session's successfully downloaded URLs to a
> separate file, and only merge that with the original one on program start.
> That together with an integrity check of the separate file (eventually on a
> line-by-line (URL) basis) should make the original file's corruption rather
> unlikely.
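
For concreteness, the scheme quoted above might look something like the
following. It is a sketch only: the per-line checksum format, the file
layout and the function names are all assumptions of mine, not anything
R.Wieser or the OP specified.

```python
import hashlib
import os


def _digest(url):
    # Short per-line checksum (assumed format: "<digest> <url>").
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]


def append_url(session_path, url):
    """Append one completed URL with its checksum, and force it to disk."""
    with open(session_path, "a", encoding="utf-8") as fh:
        fh.write(f"{_digest(url)} {url}\n")
        fh.flush()
        os.fsync(fh.fileno())


def read_valid(session_path):
    """Return URLs whose checksum matches; skip torn or damaged lines."""
    urls = []
    try:
        with open(session_path, encoding="utf-8", errors="replace") as fh:
            for line in fh:
                digest, sep, url = line.rstrip("\n").partition(" ")
                if sep and _digest(url) == digest:
                    urls.append(url)
    except FileNotFoundError:
        pass  # no session file: nothing to merge
    return urls


def merge_on_start(history_path, session_path):
    """Fold verified session URLs into the history, then drop the session."""
    try:
        with open(history_path, encoding="utf-8") as fh:
            history = [line.rstrip("\n") for line in fh if line.strip()]
    except FileNotFoundError:
        history = []
    seen = set(history)
    for url in read_valid(session_path):
        if url not in seen:
            history.append(url)
            seen.add(url)
    tmp = history_path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        fh.writelines(u + "\n" for u in history)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp, history_path)  # swap in the new history in one step
    if os.path.exists(session_path):
        os.remove(session_path)
    return history
```

The original history file is never edited in place. It is only replaced
by a complete new copy, which is where the "rather unlikely" comes from.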

What is the OP's definition of "unlikely" or "acceptable risk"?
If RDBMS == "unnecessary complexity", then (presumably) 'concern' will 
be commensurately low, and much of the discussion to-date, moot?

I've not worked on 'downloads' (which I take to mean data files, eg 
forms from the tax office - guess what task I'm procrastinating over?) 
but have automated the downloading of web page content/headers. There 
are so many reasons why such won't work first-time, when they should 
every time; that it may be quite difficult to detect 'corruption' (as 
distinct from so many of these other issues that may arise)...

> A database /sounds/ good, but what happens when you ctrl-c out of a
> non-atomic operation ?   How do you fix that ?    IOW: Databases can be
> corrupted for pretty-much the same reason as for a simple datafile (but with
> much worse consequences).

[apologies for personal comment]
I, (with my skill-set, tool-set, collection of utilities, ... - see 
earlier mention of "bias") reach for an RDBMS more quickly than many*. 
Mea culpa or 'more power to [my] right arm'?

The DB suggestion (posted earlier) involved only a single table, to 
which fields would be added/populated during processing as a record of 
progress/status. Thus, replacing the single file that the OP 
(originally) outlined as fitting his/her needs, with a single DB-table.

Accordingly, there is no non-atomic transaction in the proposal - UPDATE 
is atomic in most (competent) RDBMS.
(again, in my ignorance of that project, please don't (anyone) think I'm 
including/excluding SQLite)
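
To make the single-table idea concrete, here is a minimal sketch. I use
the sqlite3 module purely because it ships with Python, which is my
choice for illustration and not a recommendation either way. The table
name, column names and status values are likewise assumptions.

```python
import sqlite3


def open_history(path):
    """Open (or create) the single progress table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS downloads ("
        " url TEXT PRIMARY KEY,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    conn.commit()
    return conn


def add_urls(conn, urls):
    """Queue URLs; ones already in the table are left untouched."""
    with conn:  # commits on success, rolls back on an exception
        conn.executemany(
            "INSERT OR IGNORE INTO downloads (url) VALUES (?)",
            [(u,) for u in urls],
        )


def mark_done(conn, url):
    """Record one completed download as a single UPDATE."""
    with conn:
        conn.execute(
            "UPDATE downloads SET status = 'done' WHERE url = ?", (url,)
        )


def pending(conn):
    """Return the URLs still to be fetched, in insertion order."""
    rows = conn.execute(
        "SELECT url FROM downloads WHERE status = 'pending' ORDER BY rowid"
    )
    return [r[0] for r in rows]
```

Each completed download is one UPDATE inside one transaction, so an
interruption should leave a row either 'pending' or 'done', never
half-written.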

Contrarily, if the 'single table idea' is hardly a "database" by most 
definitions, why bother? The answer lies in the very mechanisms to 
combat corruptions and interruptions being discussed! As a 
fundamentally-lazy person, I'd rather leave the RDBMS-coders to wrestle 
with such complexities 'for me'. Then, I can 'stand on the shoulders' of 
such 'giants', by driving their (competently working) 'black box'...
(YMMV!)

Now, it transpires, the OP possesses DB skills. So, (s)he is in a
position to make the go/no-go decision which suits the actual spec. Yahoo!
(not TM)

> Also think of the old adage: "I had a problem, and then I thought I could
> use X.  Now I have two problems..." - with X traditionally being "regular
> expressions".   In other words: do KISS (keep it ....)

Good point! (I'm not a great fan of RegEx-es either)
- reduce/avoid complexity, "simple is better than complex"! (Python: 
import this)

Surely though, it is only appropriate to dive into the concerns and 
complexities of DB accuracy and "consistency", if we do likewise with 
file systems?

The rationale of my 'laziness' argument 'for' using an RDBMS, also 
applies to plain-vanilla file systems. Do I want to deal with the 
complexities of managing files and corruptions, in that arena?
(you could easily guess the answer to that!)

Do you?
(the answer may be quite different - but no matter, I'm not going to say
you are "wrong", as long as in making such a decision (files or DB) we
compare 'like with like' - in fact, before that: as long as the client's
spec says that we need to be worrying about such detail; otherwise
YAGNI applies!)

> By the way: The "just write the URLs in a folder" method is not at all a bad
> one.   /Very/ easy to maintain, resilient (especially when you consider the
> self-repairing capabilities of some filesystems) and the polar opposite of a
> "customer lock-in". :-)

+1
Be aware that formation rules for URLs are not congruent with OS FS rules!
(such concerns don't apply if the URLs are data within a file/table)
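
If the folder method is chosen, one way around the mismatch is to encode
each URL before using it as a file name. A sketch only: the name
url_to_filename and the 200-character limit are my assumptions, and
actual limits vary by file system.

```python
import hashlib
from urllib.parse import quote


def url_to_filename(url, max_len=200):
    """Map a URL to a name with no path separators or reserved characters.

    Percent-encodes everything except letters, digits and "_.-~". If the
    result is too long for a file name, falls back to a SHA-256 hex
    digest, which can no longer be decoded back to the URL.
    """
    name = quote(url, safe="")
    if len(name) > max_len:
        name = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return name
```

The short form can be reversed with urllib.parse.unquote. Note that on
case-insensitive file systems two URLs differing only in letter case
would still collide.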

* I was astonished to discover (a show-of-hands poll at some conference or
other) that 'the average applications programmer' dislikes SQL/RDBMS and 
would rather have 'someone else' handle that side of things. Most of 
those ascribed their attitude to not having been able to 'get [their] 
heads around SQL' - which left me baffled because I 'just see it'. 
However, my mental processes have been queried (more than once)! Upon 
reflection, this 'discovery' made me happy - found me another niche to 
occupy...
-- 
Regards =dn