python-noob - which container is appropriate for later exporting into mySql + matplotlib ?

Sun Apr 14 09:22:03 EDT 2013

On Sun, Apr 14, 2013 at 9:17 PM, rusi <rustompmody at gmail.com> wrote:
> On Apr 14, 12:56 pm, Steven D'Aprano <steve
> +comp.lang.pyt... at pearwood.info> wrote:
>> I've given my view on
>> application developers -- specifically, Firefox -- using a not-quite ACID
>> database in a way that is fragile, can cause data loss,
>
> FUD
> Are you saying that flat-files dont lose data?

If they do, a human being can easily open them up and see what's
inside. Suppose bookmarks are stored like this:

r"""Some-Browser-Name web bookmarks file - edit with care
url: http://www.google.com/
title: Search engine
icon: whatever-format-you-want-to-use

url: http://www.duckduckgo.com/
title: Another search engine

url: http://www.python.org/

url: ftp://192.168.0.12/
title: My FTP Server
desc: Photos are in photos/, videos are in videos/
 Everything else is in other/
user: root
pass: secret
"""

The parsing of this file is pretty simple. Blank line marks end of
entry; indented line continues the previous attribute (like RFC822),
everything else is "attribute: value". (You might even be able to
abuse an RFC822 parser/compositor for the job.) The whole file has to
be read and rewritten for any edits, so it's unsuited to gigabytes of
content; but we're talking about *web browser bookmarks* here. I know
some people have a lot of them, but hardly gigs and gigs. And if you
think they will, then all you need to do is have multiple files, e.g.
one for each folder in the bookmark tree.
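
To make the parsing rules above concrete, here is a minimal sketch in Python. The function name parse_bookmarks is made up for illustration, and two details are my own assumptions rather than part of the format as described: the first line of the file is treated as a header and skipped, and continuation lines are joined to the previous value with a single space.

```python
SAMPLE = r"""Some-Browser-Name web bookmarks file - edit with care
url: http://www.google.com/
title: Search engine
icon: whatever-format-you-want-to-use

url: http://www.duckduckgo.com/
title: Another search engine

url: http://www.python.org/

url: ftp://192.168.0.12/
title: My FTP Server
desc: Photos are in photos/, videos are in videos/
 Everything else is in other/
user: root
pass: secret
"""


def parse_bookmarks(text):
    """Return a list of dicts, one per blank-line-delimited entry."""
    entries = []
    current = {}
    last_key = None
    # Assumption: line 1 is a free-form header, so skip it.
    for line in text.splitlines()[1:]:
        if not line.strip():
            # Blank line marks the end of the current entry.
            if current:
                entries.append(current)
            current, last_key = {}, None
        elif line[0] in " \t":
            # Indented line continues the previous attribute.
            if last_key is None:
                raise ValueError("continuation with nothing to continue: %r" % line)
            current[last_key] += " " + line.strip()
        else:
            # Everything else is "attribute: value"; split on the first colon
            # only, so URLs in the value survive intact.
            key, sep, value = line.partition(":")
            if not sep:
                raise ValueError("not an attribute line: %r" % line)
            last_key = key.strip()
            current[last_key] = value.strip()
    if current:
        entries.append(current)
    return entries


entries = parse_bookmarks(SAMPLE)
```

Attributes that are left out (title, icon and so on) are simply missing from the dict, so the caller can apply defaults with entry.get().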

Now suppose it gets damaged somehow. Firstly, that's a lot less likely
with a simple file format and a "write to temp file, then move temp
file over main file" setup; but mainly, it's very easy to
resynchronize - maybe there'll be one bookmark (or a group of
bookmarks) that gets flagged as corrupted, but everything after that
can be parsed just fine - as soon as you get to a blank line, you
start parsing again. Very simple. Well suited to a simple task. (Note,
however, that the uber-simple concept I've posited here would have the
same concurrency problems that Firefox has. At the very least, it'd rely
on some sort of filesystem-level lock when it starts rewriting the
file. But this is approximately similar to running two instances of a
text editor and trying to work with the same file.)
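
The resynchronization idea can be sketched like this. It is only an illustration: parse_tolerant and parse_entry are hypothetical names, the header line is assumed to have been stripped already, and "corrupted" here just means a chunk that doesn't follow the attribute/continuation rules.

```python
def parse_entry(lines):
    """Parse one blank-line-delimited chunk; raise ValueError if malformed."""
    entry = {}
    last_key = None
    for line in lines:
        if line[0] in " \t":
            # Continuation of the previous attribute.
            if last_key is None:
                raise ValueError("dangling continuation")
            entry[last_key] += " " + line.strip()
        else:
            key, sep, value = line.partition(":")
            if not sep or not key.strip():
                raise ValueError("not an attribute line")
            last_key = key.strip()
            entry[last_key] = value.strip()
    return entry


def parse_tolerant(text):
    """Return (good_entries, bad_chunks).

    A damaged line spoils only its own chunk; parsing starts afresh
    at the next blank line.
    """
    good, bad = [], []
    chunk = []
    # The trailing "" acts as a final blank line to flush the last chunk.
    for line in text.splitlines() + [""]:
        if line.strip():
            chunk.append(line)
        elif chunk:
            try:
                good.append(parse_entry(chunk))
            except ValueError:
                bad.append("\n".join(chunk))
            chunk = []
    return good, bad


damaged = (
    "url: http://www.google.com/\n"
    "title: Search engine\n"
    "\n"
    "\x00\x00garbage\n"
    "title: lost\n"
    "\n"
    "url: http://www.python.org/\n"
)
good, bad = parse_tolerant(damaged)
```

The middle chunk ends up in bad, where it could be shown to the user or written to a side file, and the entries on either side of it are unaffected.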

> From a programmer's POV if 10 lines of flat-file munging are reduced
> to two lines of SQL its a reduction of 10 to 2.

The complexity exists in a variety of places. The two lines of SQL
hide a morass of potential complexity; so would a massive regex. The
database file itself is way harder for external tools to manage. And all of it
can be buggy. With a simple flat-file system, chances are you can turn
it into a nested list structure and a dict for indexing (or possibly a
collections.OrderedDict), and then you have the same reduction - it's
just simple in-memory operations, possibly followed by a save() call.
All the options available will do that, whether flat-file or database.
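
As a rough illustration of the "list plus dict for indexing" idea, assuming the entries have already been parsed into dicts and that the URL is a usable key (my assumption; duplicate URLs would need something smarter):

```python
from collections import OrderedDict

# Assumed result of parsing: one dict per bookmark, in file order.
bookmarks = [
    {"url": "http://www.google.com/", "title": "Search engine"},
    {"url": "http://www.duckduckgo.com/", "title": "Another search engine"},
    {"url": "http://www.python.org/"},
]

# Index by URL while remembering the original order.
index = OrderedDict((b["url"], b) for b in bookmarks)

# The "two lines of SQL" equivalents are plain in-memory operations:
index["http://www.python.org/"]["title"] = "Python"   # update
del index["http://www.duckduckgo.com/"]               # delete

# What would be handed to a (hypothetical) save() call afterwards.
remaining = list(index.values())
```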

>> I don't see what the Python devs have to do with it. They don't use
>> Sqlite for Python's internals, and the fact that there is a module for
>> sqlite doesn't mean squat. There's a module for parsing Sun AU audio
>> files, that doesn't mean the Python devs recommend that they are the best
>> solution to your audio processing and multimedia needs.
>
> Python made a choice to include AU file support when Sun existed and
> looked more respectable than MS. Today the support continues to exist
> probably for backward compatibility reasons.  "The code's already
> written. Why remove it?"
> Sure but it has its costs -- memory footprint, sources-size etc --
> which are deemed negligible enough to not bother.

Actually, this is one place where I disagree with the current decision
of the Python core devs: I think bindings for other popular databases
(most notably PostgreSQL, and probably MySQL since it's so widely
used) ought to be included in core, rather than being shoved off to
PyPI. Databasing is so important to today's world that it would really
help if people had all the options right there in core, if only so
they're more findable (if you're browsing docs.python.org, you won't
know that psycopg is available). Currently the policy seems to be "we
don't include the server so why should we include the client"; I
disagree, I think the client would stand nicely on its own. (Does
Python have a DNS server module? DNS client? I haven't dug deep, but
I'm pretty sure I can do name lookups in Python, yet running a DNS
server is sufficiently arcane that it can, quite rightly, be pushed
off to PyPI.) But this is minor, and tangential to this discussion.

> Faulty generalization fallacy:
> http://en.wikipedia.org/wiki/Faulty_generalization
> Because some code in firefox is bad, every choice of firefox is bad?

It's a matter of windows into the philosophy, rather than specific
examples. Requiring nine files to do a "Hello World" extension
suggests a large corpus of mandatory boilerplate; imagine, for
instance, that my example bookmarks file structure had demanded
_every_ attribute be provided for _every_ bookmark, instead of
permitting the defaults. That would demonstrate overkill in design,
and the sort of person who would produce that is probably unable to
simplify code for the same reasons.

> As for Durability, if you randomly turn off your machine when your
> program is running, yes you may lose the results of your program. You
> may lose much else!
>
> IOW if you are alone on your machine, all discussion of ACID is moot

No, no, a thousand times no! If I am doing financial transactions,
even if I'm alone on my machine, I will demand full ACID compliance.
Randomly turning off the machine is a simulation of the myriad
possible failures - incoming power failure (or UPS failure, if you
have one), power supply goes boom, motherboard gets fried, operating
system encounters a hard failure condition, cleaning lady unplugs the
server to put her vacuum cleaner onto the UPS... anything. The point
of ACID compliance is that you might lose the results of *this run* of
the program, but nothing more; and if any other program has been told
"That's committed", then it really has been. Without some such
guarantee, you might lose *all the data you have stored*, because
something got corrupted. Partial guarantees of acidity are
insufficient; imagine if a power failure during ALTER TABLE could result
in your whole database being unreadable.
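
A small sketch of the "lose this run, but nothing more" property, using Python's sqlite3 module purely because it ships with Python. It uses an in-memory database, so it demonstrates only atomicity (the A), not durability across a real power cut; the table and the simulated failure are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])
conn.commit()

try:
    # "with conn" commits if the block succeeds and rolls back if it raises.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        # Simulated failure between the two halves of the transfer.
        raise RuntimeError("power supply goes boom")
        conn.execute(
            "UPDATE accounts SET balance = balance + 50 WHERE name = 'bob'")
except RuntimeError:
    pass

# The half-finished transfer is gone; previously committed data is intact.
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```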

With the setup I described above, everything works beautifully if the
OS guarantees an atomic rename (mv) operation. Even if it doesn't, you can
probably figure out what's going on by inspecting the file state; for
instance, you can assume that a non-empty main file should be kept
(discarding the temporary), but if the main file is empty or absent
AND the temporary is readable and parseable, use the temporary. (This
assumes that a fresh install creates a non-empty file, otherwise
there's ambiguity at initial file creation which would need to be
resolved. But you get the idea.)
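
Here is a sketch of that save-and-recover scheme. The names save and choose_file and the ".tmp" suffix are my own inventions; os.replace (Python 3.3+) is used for the rename step, and the is_parseable callback stands in for whatever parser the application has.

```python
import os
import tempfile


def save(path, text):
    """Write to a temp file, then move it over the main file."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        f.write(text)
        f.flush()
        os.fsync(f.fileno())   # push the data to disk before renaming
    os.replace(tmp, path)      # atomic on POSIX filesystems


def choose_file(path, is_parseable):
    """Pick which file to load at startup, following the rule above."""
    tmp = path + ".tmp"
    # A non-empty main file wins; any leftover temporary is ignored.
    if os.path.exists(path) and os.path.getsize(path) > 0:
        return path
    # Main file empty or absent: fall back to a readable, parseable temp.
    try:
        with open(tmp, encoding="utf-8") as f:
            if is_parseable(f.read()):
                return tmp
    except (OSError, UnicodeDecodeError):
        pass
    return None


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "bookmarks.txt")

    save(path, "url: http://www.python.org/\n")
    normal_choice_ok = choose_file(path, lambda t: True) == path

    # Simulate a crash that left the main file empty and a complete temp.
    open(path, "w").close()
    with open(path + ".tmp", "w", encoding="utf-8") as f:
        f.write("url: http://www.python.org/\n")
    recovered_from_tmp = choose_file(path, lambda t: True) == path + ".tmp"
```

This still does nothing about two writers at once; it only covers the crash-in-the-middle case.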

Of course, that uber-simple option does require a full file rewrite
for every edit. But like I said, it's designed for simplicity, not
concurrent writing.

ChrisA