How would you design a scalable solution?

Jon Clements joncle at googlemail.com
Wed Oct 28 02:20:20 EDT 2009


On 27 Oct, 17:10, Bryan <bryanv... at gmail.com> wrote:
> I'm designing a system and wanted to get some feedback on a potential
> performance problem down the road while it is still cheap to fix.
>
> The system is similar to an accounting system: it tracks "Things"
> which move between different "Buckets", and it answers these
> questions:
> - How many Things are in this Bucket today?
> - Move Things from Bucket1 to Bucket2...
> - Now how many Things are in each Bucket?
>
> So each time a Thing is moved between Buckets, I imagine a DB row like
> this:
>  | id | thingId | toBucket | fromBucket | qty |
>
> Then to find how many Things are in a certain Bucket:
> 1. Start with initial qty in the bucket at the beginning of time
> 2. *Add* all qty moved *to* the bucket since beginning of time
> 3. *Subtract* all qty moved *from* the bucket since beginning of time
>
> Simple system to "account" for Things.
>
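
[Just to make the arithmetic concrete, here is a minimal sqlite3 sketch
of that row layout and the three-step balance query. The table and
column names simply mirror the layout quoted above; the initial-quantity
argument and the sample values are assumptions for illustration.]

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE transfers (
        id         INTEGER PRIMARY KEY,
        thingId    INTEGER NOT NULL,
        toBucket   INTEGER NOT NULL,
        fromBucket INTEGER NOT NULL,
        qty        INTEGER NOT NULL
    )
""")

def bucket_qty(bucket, initial_qty=0):
    # 1. start with the initial qty, 2. add everything moved *to* the
    # bucket, 3. subtract everything moved *from* the bucket.
    moved_in, = conn.execute(
        "SELECT COALESCE(SUM(qty), 0) FROM transfers WHERE toBucket = ?",
        (bucket,)).fetchone()
    moved_out, = conn.execute(
        "SELECT COALESCE(SUM(qty), 0) FROM transfers WHERE fromBucket = ?",
        (bucket,)).fetchone()
    return initial_qty + moved_in - moved_out

# Move 5 of Thing 1 from Bucket 1 to Bucket 2, then check Bucket 2.
conn.execute("INSERT INTO transfers (thingId, toBucket, fromBucket, qty) "
             "VALUES (?, ?, ?, ?)", (1, 2, 1, 5))
print(bucket_qty(2))   # prints 5
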
> My problem is this: the design will inherently get slower as time
> goes on.  As the number of rows recording transfers between buckets
> increases, the query to see how many Things are in a Bucket will get
> slower.  I experience this with gnucash (which I love).  I don't do
> "closing entries" at the end of the year, so each account carries
> every transaction I have ever made, and I can see it getting slower.
> It's still fast enough for me, so I'm not going to do anything about
> it, but I have to wonder how big companies with thousands of
> transactions a day handle this.
>
> One solution would be to make a "closing entry" at certain points in
> time, so old info could be archived.  Each bucket would start the new
> period with a balance of Things equal to what it was at the moment we
> "closed".
>
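
[That "closing entry" idea can be sketched as a snapshot table: store
each bucket's balance as of a cutoff, then only sum transfers made
after it. This assumes the transfer rows also carry a timestamp column,
which the row layout above doesn't show; all names here are
illustrative, not from the original post.]

import datetime
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transfers (
        id INTEGER PRIMARY KEY, thingId INTEGER, toBucket INTEGER,
        fromBucket INTEGER, qty INTEGER, tstamp TEXT NOT NULL
    );
    CREATE TABLE closing_balances (
        bucket INTEGER PRIMARY KEY, closed_at TEXT NOT NULL,
        balance INTEGER NOT NULL
    );
""")

def bucket_qty(bucket):
    # Balance at the last closing entry, plus net movement since then.
    row = conn.execute(
        "SELECT closed_at, balance FROM closing_balances WHERE bucket = ?",
        (bucket,)).fetchone()
    closed_at, balance = row if row else ("", 0)  # "" sorts before any date
    moved_in, = conn.execute(
        "SELECT COALESCE(SUM(qty), 0) FROM transfers "
        "WHERE toBucket = ? AND tstamp > ?", (bucket, closed_at)).fetchone()
    moved_out, = conn.execute(
        "SELECT COALESCE(SUM(qty), 0) FROM transfers "
        "WHERE fromBucket = ? AND tstamp > ?", (bucket, closed_at)).fetchone()
    return balance + moved_in - moved_out

def close_books(bucket):
    # Snapshot the bucket's current balance; transfer rows older than the
    # snapshot could then be archived without changing future answers.
    # (Ignores concurrent writers; a real closing entry would need a
    # transaction or an exclusive lock.)
    now = datetime.datetime.now().isoformat()
    conn.execute("INSERT OR REPLACE INTO closing_balances VALUES (?, ?, ?)",
                 (bucket, now, bucket_qty(bucket)))
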
> How else can I keep a record of every transaction without having the
> speed of the question "How many Things are in Bucket x?" depend on
> looking at every transaction record ever made?

As well as what Jonathan has said, the following might be worth
researching:

() "Table Partitioning" - a specific example for postgres is here:
http://www.postgresql.org/docs/8.1/static/ddl-partitioning.html.
(Although other DB's support similar techniques)
() Shared Nothing Architecture (aka. 'Sharding') with appropriate
'buckets' which is massively scalable, but I'm guessing overkill :)
http://en.wikipedia.org/wiki/Shared_nothing_architecture - IIRC
SQLAlchemy has a basic implementation of this in its examples.
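
[For the partitioning idea, a rough sketch of what the linked Postgres
8.1 docs describe (child tables via INHERITS plus CHECK constraints),
driven from Python with psycopg2. The connection string, the yearly
split, and the added tstamp column are assumptions, and routing inserts
to the right child (normally a rule or trigger, per the docs) is
omitted.]

import psycopg2

conn = psycopg2.connect("dbname=things")   # hypothetical DSN
cur = conn.cursor()
cur.execute("""
    CREATE TABLE transfers (
        id serial, thingId integer, toBucket integer,
        fromBucket integer, qty integer, tstamp date NOT NULL
    )
""")
# One child table per year; with constraint_exclusion enabled, a query
# that filters on tstamp only scans the matching child tables.
for year in (2008, 2009, 2010):
    cur.execute("""
        CREATE TABLE transfers_%(y)d (
            CHECK (tstamp >= DATE '%(y)d-01-01'
               AND tstamp <  DATE '%(n)d-01-01')
        ) INHERITS (transfers)
    """ % {'y': year, 'n': year + 1})
conn.commit()

[And a toy, database-agnostic illustration of the shared-nothing idea:
each shard is an independent database and a routing function decides
which shard owns a bucket. The shard count and hash routing are
assumptions; the SQLAlchemy sharding example mentioned above is the
more realistic reference.]

import sqlite3

NUM_SHARDS = 4
shards = [sqlite3.connect(":memory:") for _ in range(NUM_SHARDS)]
for db in shards:
    db.execute("CREATE TABLE transfers (thingId INTEGER, toBucket INTEGER, "
               "fromBucket INTEGER, qty INTEGER)")

def shard_for(bucket):
    # All rows touching a given bucket live on one shard (routed by hash),
    # so "how many Things are in Bucket x" only ever queries one database.
    return shards[bucket % NUM_SHARDS]

# Caveat: a transfer whose toBucket and fromBucket land on different shards
# needs a row on each side (or a cross-shard transaction), which is a big
# part of why this is probably overkill here.
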

Cheers,

Jon.


