NoSQL Movement?

Thu Mar 4 14:15:42 EST 2010

mk <mrkafk at gmail.com> wrote:

> Duncan Booth wrote:
> 
>> If you look at some of the uses of bigtable you may begin to
>> understand the tradeoffs that are made with sql. When you use
>> bigtable you have records with fields, and you have indices, but
>> there are limitations on the kinds of queries you can perform: in
>> particular you cannot do joins, but more subtly there is no guarantee
>> that the index is up to date (so you might miss recent updates or
>> even get data back from a query when the data no longer matches the
>> query). 
> 
> Hmm, I do understand that bigtable is used outside of traditional 
> 'enterprisey' contexts, but suppose you did want to do an equivalent
> of join; is it at all practical or even possible?
> 
> I guess when you're forced to use denormalized data, you have to 
> simultaneously update equivalent columns across many tables yourself, 
> right? Or is there some machinery to assist in that?

Or you avoid having to do that sort of update at all.

There are many applications which simply wouldn't be a good fit for 
bigtable. My point was that to make best use of bigtable you may have to 
make different decisions when designing the software.

> 
>> By sacrificing some of SQL's power, Google get big benefits: namely 
>> updating data is a much more localised operation. Instead of an update 
>> having to lock the indices while they are updated, updates to
>> different records can happen simultaneously possibly on servers on
>> the opposite sides of the world. You can have many, many servers all
>> using the same data although they may not have identical or
>> completely consistent views of that data.
> 
> And you still have the global view of the table spread across, say, 2 
> servers, one located in Australia, second in US?
> 
More likely spread across a few thousand servers. The data migrates round 
the servers as required. As I understand it, records are organised in 
groups: when you create a record you can either make it a root record or 
give it a parent record. So, for example, you might make all the data 
associated with a specific user live with the user record as a parent (or 
ancestor). When you access any of that data, all of it is copied onto a 
server near the application, because all records under a common root are 
always stored together.
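
To make the root/parent idea concrete, here is a minimal sketch in plain 
Python. It is a toy model, not the App Engine API: the names make_key, 
root_of and ToyStore are made up for illustration, and a key is assumed to 
be nothing more than a path of (kind, name) pairs with the root first.

```python
# Toy model only: keys are tuples of (kind, name) pairs, root first.
from collections import defaultdict


def make_key(kind, name, parent=None):
    """Return a key path; a record with no parent is a root record."""
    return (parent or ()) + ((kind, name),)


def root_of(key):
    """A group is identified by the first element of the key path."""
    return key[:1]


class ToyStore:
    def __init__(self):
        # One bucket per root: records under a common root live together.
        self.groups = defaultdict(dict)

    def put(self, key, record):
        self.groups[root_of(key)][key] = record

    def group(self, key):
        """Return every record stored alongside `key`."""
        return self.groups[root_of(key)]


store = ToyStore()
alice = make_key("User", "alice")
store.put(alice, {"email": "alice@example.com"})
store.put(make_key("Tweet", "t1", parent=alice), {"text": "hello"})
store.put(make_key("Tweet", "t2", parent=alice), {"text": "again"})
store.put(make_key("User", "bob"), {"email": "bob@example.com"})

print(len(store.group(alice)))  # 3
```

Because the group is derived from the key itself, fetching anything under 
alice tells you which bucket to move, and bob's data is never touched.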

>> Bigtable impacts on how you store the data: for example you need to 
>> avoid reducing data to normal form (no joins!); it's much better and 
>> cheaper just to store all the data you need directly in each record. 
>> Also aggregate values need to be at least partly pre-computed and
>> stored in the database.
> 
> So you basically end up with a few big tables or just one big table
> really? 

One. Did I mention that bigtable doesn't require you to have the same 
columns in every record? The main use of bigtable (outside of Google's 
internal use) is Google App Engine and that apparently uses one table.

Not one table per application, one table total. It's a big table.
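
As a rough illustration of the quoted point about storing what you need in 
each record and pre-computing aggregates, here is a toy sketch in plain 
Python. A dict stands in for the one big table; post_tweet and the column 
names (author_name, tweet_count) are invented for the example and it 
ignores concurrency entirely.

```python
# Toy model only: a dict stands in for the single table. Records of
# different kinds share it and need not have the same columns.
table = {}


def post_tweet(user_id, tweet_id, text):
    user = table.setdefault(
        ("User", user_id), {"name": user_id, "tweet_count": 0}
    )
    # Copy the author's name into the tweet so that displaying a tweet
    # needs no join against the user record.
    table[("Tweet", tweet_id)] = {
        "text": text,
        "author_id": user_id,
        "author_name": user["name"],
    }
    # Maintain the aggregate at write time instead of counting on read.
    user["tweet_count"] += 1


post_tweet("alice", 1, "hello")
post_tweet("alice", 2, "again")

print(table[("User", "alice")]["tweet_count"])  # 2
```

The cost is that a change to the user's name would have to be copied into 
every tweet, which is the update burden mk asked about above.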

> 
> Suppose on top of 'tweets' table you have 'dweebs' table, and tweets
> and dweebs sometimes do interact. How would you find such interacting
> pairs? Would you say "give me some tweets" to tweets table, extract
> all the dweeb_id keys from tweets and then retrieve all dweebs from
> dweebs table? 

If it is one tweet to many dweebs?

  Dweeb.all().filter("tweet =", tweet.key())
or:
  GqlQuery("SELECT * FROM Dweeb WHERE tweet = :tweet", tweet=tweet)

or just make the tweet the ancestor of all its dweebs.

Columns may be scalars or lists, so if it is some tweets to many dweebs you 
can do basically the same thing.

  Dweeb.all().filter("tweets =", tweet.key())

but if there are too many tweets in the list that could be a problem.

If you want dweebs for several tweets you could filter with "tweet IN" and 
the list of tweets, or do a separate query for each (not much difference: 
as I understand it, the IN operator just expands into several queries 
internally anyway).
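
A toy sketch of what that expansion amounts to, in plain Python rather than 
the datastore API. The helpers query_eq and query_in are hypothetical, and 
the sketch assumes an equality filter on a list column matches when any 
element of the list is equal:

```python
# Toy model only: records are dicts held in a dict keyed by record key.
dweebs = {
    "d1": {"tweets": ["t1", "t2"]},
    "d2": {"tweets": ["t2"]},
    "d3": {"tweets": ["t3"]},
}


def query_eq(records, prop, value):
    """Equality filter; a list column matches if any element is equal."""
    found = []
    for key, rec in records.items():
        stored = rec.get(prop)
        values = stored if isinstance(stored, list) else [stored]
        if value in values:
            found.append(key)
    return found


def query_in(records, prop, values):
    """Run one equality query per value and merge, dropping duplicates."""
    seen, merged = set(), []
    for value in values:
        for key in query_eq(records, prop, value):
            if key not in seen:
                seen.add(key)
                merged.append(key)
    return merged


print(query_eq(dweebs, "tweets", "t2"))          # ['d1', 'd2']
print(query_in(dweebs, "tweets", ["t1", "t2"]))  # ['d1', 'd2']
```

Either way the cost grows with the number of values in the list, which is 
why a long list of tweets could become a problem.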