NoSQL Movement?

Jonathan Gardner jgardner at jonathangardner.net
Mon Mar 15 00:57:58 EDT 2010


On Sun, Mar 14, 2010 at 6:55 AM, D'Arcy J.M. Cain <darcy at druid.net> wrote:
> On Sat, 13 Mar 2010 23:42:31 -0800
> Jonathan Gardner <jgardner at jonathangardner.net> wrote:
>> On Fri, Mar 12, 2010 at 11:23 AM, Paul Rubin <no.email at nospam.invalid> wrote:
>> > "D'Arcy J.M. Cain" <darcy at druid.net> writes:
>> >> Just curious, what database were you using that wouldn't keep up with
>> >> you?  I use PostgreSQL and would never consider going back to flat
>> >> files.
>> >
>> > Try making a file with a billion or so names and addresses, then
>> > compare the speed of inserting that many rows into a postgres table
>> > against the speed of copying the file.
>
> That's a straw man argument.  Copying an already built database to
> another copy of the database won't be significantly longer than copying
> an already built file.  In fact, it's the same operation.
>

I don't understand what you're trying to get at.

Each bit of data follows a particular path through the system, and each
has its own requirements for availability and consistency. No, relational
DBs don't have the same performance characteristics as other data systems,
because they do different things.

If you have data that fits a particular style well, then I suggest
using that style to manage that data.

Let's say I had data that needs to hang around for a little while and
then disappear into the archives. Let's say you hardly ever do random
access on this data because you always work with it serially or in
large batches. This is exactly like the recipient data for an email
campaign.

>> Also consider how much work it is to partition data from flat files
>> versus PostgreSQL tables.
>
> Another straw man.  I'm sure you can come up with many contrived
> examples to show one particular operation faster than another.
> Benchmark writers (bad ones) do it all the time.  I'm saying that in
> normal, real-world situations where you are collecting billions of data
> points and need to actually use the data, a properly designed
> database running on a good database engine will generally be better than
> using flat files.
>

You're thinking in the general case. Yes, an RDBMS does wonderful things
in the general case. However, in very specific circumstances, an RDBMS
does a whole lot worse.

Think of the work involved in sharding an RDBMS instance. You need to
implement two-phase commit correctly, on top of everything the database
already does. I haven't run into a multi-master replication system that
is trivial. When you find one, let me know, because I'm sure there are
caveats and corner cases that make it really hard to get right.

Compare this to simply distributing flat files to one of many
machines. It's a whole lot easier to manage and easier to understand,
explain, and implement.
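
To make that concrete, here is roughly what I mean, as a Python sketch.
The hostnames are made up; the point is that a hash of the key
deterministically picks the one machine that holds a record, so there is
nothing to coordinate and no two-phase commit:

import hashlib

# Hypothetical shard machines -- in practice the list comes from config.
SHARDS = ["data01.example.com", "data02.example.com", "data03.example.com"]

def shard_for(key):
    """Deterministically map a record key to one shard machine."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# shard_for("someone@example.com") always lands on the same machine, so
# reads and writes for that key go to exactly one place.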

You should use the right tool for the job. Sometimes the data doesn't
fit the profile of an RDBMS, or the RDBMS's overhead makes managing the
data more difficult than it needs to be. In those cases, it makes a
whole lot of sense to try something else.

>> >> The only thing I can think of that might make flat files faster is
>> >> that flat files are buffered whereas PG guarantees that your
>> >> information is written to disk before returning
>> >
>> > Don't forget all the shadow page operations and the index operations,
>> > and that a lot of these operations require reading as well as writing
>> > remote parts of the disk, so buffering doesn't help avoid every disk
>> > seek.
>
> Not sure what a "shadow page operation" is but index operations are
> only needed if you have to have fast access to read back the data.  If
> it doesn't matter how long it takes to read the data back then don't
> index it.  I have a hard time believing that anyone would want to save
> billions of data points and not care how fast they can read selected
> parts back or organize the data though.
>

I don't care how the recipients for the email campaign were indexed. I
don't need an index because I don't do random accesses. I simply need
the list of people I am going to send the email campaign to, properly
filtered and de-duped, of course. This doesn't have to happen within
the database. There are wonderful tools like "sort" and "uniq" to do
this work for me, far faster than an RDBMS can do it. In fact, I don't
think you can come up with a faster solution than "sort" and "uniq".
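
For example, something like this (filenames made up) is the entire
de-dupe step. GNU sort does an external merge sort, so it handles files
far bigger than RAM, and -u folds the uniq pass into the same run:

import subprocess

# "sort -u" sorts and de-dupes in one external merge sort pass;
# -o writes the result to a file without needing a shell pipeline.
subprocess.check_call(
    ["sort", "-u", "-o", "recipients.deduped.txt", "recipients.txt"]
)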

>> Plus the fact that your other DB operations slow down under the load.
>
> Not with the database engines that I use.  Sure, speed and load are
> connected whether you use databases or flat files but a proper database
> will scale up quite well.
>

I know for a fact that "sort" and "uniq" are far faster at this job than
any RDBMS. The reason is obvious: they do exactly one thing, reading and
writing flat files sequentially, with none of the parsing, transaction,
and index overhead a database pays on every row.

-- 
Jonathan Gardner
jgardner at jonathangardner.net


