[portland] Tokyo Talk Slides

Michael Schurter michael at susens-schurter.com
Wed Mar 11 18:34:45 CET 2009


Come and get 'em:
http://michael.susens-schurter.com/blog/2009/03/11/tokyo-cabinet-pytyrant-talk/

Someone asked me about data integrity last night, and I told a long
story about TCP backoff algorithm issues.  However, I forgot the
punchline (aka solution).  Here's a better explanation:

Let's say we have packets PA, PB, and PC:

PA - sent at 10:00:00 am from application node: NA
PB - sent at 10:00:01 am from application node: NB
PC - sent at 10:01:00 am from application node: NC

Unfortunately Mr. Sysadmin was doing a massive rsync while those
packets were trying to make their way from the application nodes to
the database (Tokyo Tyrant) server.  "Luckily" the C in TCP stands for
Control[1], so instead of losing data, TCP's congestion control makes
the senders back off for a second and try again later.

Now if only 1 connection were being used between the application nodes
and the database server, the operating system would ensure all TCP
packets are processed in the order they were sent, regardless of the
order in which they were received[2].  Unfortunately we have 3 nodes,
and therefore 3 separate TCP connections.  No spiffy guaranteed
ordering for us.

So here's the order in which the database server receives the packets
after our nodes back off because the rsync backup is saturating its
NIC: PB, PC, PA

All 3 have the same key (say, a user's session key), so PA's data ends
up being the data written last.  When you read from this key again,
you'd expect to get PC's value, but instead you get PA's.  Hilarity
ensues.  And by hilarity, I mean user data is seemingly randomly lost
and users see very strange behavior in their browsers.
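
To make the failure concrete, here is a toy replay of that arrival
order against a plain last-write-wins store.  This is only an
illustration: the dict stands in for Tokyo Tyrant, and the names
`packets`, `replay`, and the "value from ..." strings are made up for
the example.

```python
# Toy model: a key/value store where whichever write arrives last wins.
packets = {
    "PA": ("10:00:00", "value from NA"),
    "PB": ("10:00:01", "value from NB"),
    "PC": ("10:01:00", "value from NC"),
}

# Arrival order at the database server, as described above.
arrival_order = ["PB", "PC", "PA"]


def replay(order, key="session"):
    """Apply the writes in the given order and return the final store."""
    store = {}
    for name in order:
        _sent_at, value = packets[name]
        store[key] = value  # unconditional overwrite, no ordering check
    return store


# The oldest packet arrived last, so its value is what a reader sees.
print(replay(arrival_order)["session"])
```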

The Solution: a Lua extension to automatically timestamp when each key
was written.  However, this takes cooperation from the client side as
well.

The client writes a timestamp as the first X digits of the *value* for
every key it puts (sends to Tokyo Tyrant).  The Lua extension reads
this timestamp and saves it in a field named "timestamp.$key" (where
$key is the key being saved).
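
A rough sketch of what the client side of that convention might look
like.  The width (16 digits), the microsecond resolution, and the
helper names `encode_value` / `decode_value` are my assumptions for
illustration; the scheme above only requires a fixed number of leading
digits.  Note that timestamps from different nodes are only comparable
to the extent that the nodes' clocks agree.

```python
import time

# Hypothetical width: 16 digits holds microseconds since the epoch.
TS_DIGITS = 16


def encode_value(payload, now=None):
    """Prefix the payload with a zero-padded, fixed-width timestamp."""
    if now is None:
        now = time.time()
    stamp = int(now * 1_000_000)  # microseconds since the epoch
    return "%0*d%s" % (TS_DIGITS, stamp, payload)


def decode_value(stored):
    """Split a stored value back into (timestamp, payload)."""
    return int(stored[:TS_DIGITS]), stored[TS_DIGITS:]


value = encode_value("session data", now=1236792885.0)
print(value)
print(decode_value(value))
```

A fixed width keeps parsing trivial on the server side: the extension
can slice off the first X characters without needing a delimiter.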

The trick is that if the saved timestamp for that key is *newer* than
the timestamp on the data that just came in, the Lua extension returns
an error and does *not* save the data (because it's old).  In practice
the client just silently drops the error, because if newer data has
already been sent, there's really nothing it needs to do.

If the timestamp for incoming data is *newer* than the saved
timestamp, the Lua extension updates both the timestamp key and the
actual key we're trying to safely store.  And that's what happens
99.999999999999% of the time.
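
Here is the accept/reject rule modeled in Python rather than Lua, with
a dict standing in for the Tokyo Cabinet database.  It is a sketch of
the logic, not the actual extension: `TS_DIGITS`, `StaleWrite`,
`guarded_put`, and `client_put` are hypothetical names, and two details
are my assumptions: a write with an *equal* timestamp is accepted, and
the value is stored with its timestamp prefix left in place.

```python
TS_DIGITS = 16  # hypothetical prefix width (the "X digits")


class StaleWrite(Exception):
    """Incoming data is older than what is already stored for the key."""


def guarded_put(db, key, value):
    """Store value under key unless a newer timestamp is already saved."""
    incoming = int(value[:TS_DIGITS])
    saved = db.get("timestamp." + key)
    if saved is not None and int(saved) > incoming:
        raise StaleWrite(key)  # saved data is newer: reject, don't save
    # Otherwise update both the timestamp key and the actual key.
    db["timestamp." + key] = str(incoming)
    db[key] = value


def client_put(db, key, value):
    """Client side: silently drop the error for stale writes."""
    try:
        guarded_put(db, key, value)
    except StaleWrite:
        pass  # newer data already got through; nothing left to do


def stamped(ts, payload):
    return "%0*d%s" % (TS_DIGITS, ts, payload)


# Replay the PB, PC, PA arrival order; PA is the oldest write.
db = {}
client_put(db, "session", stamped(2, "from NB"))
client_put(db, "session", stamped(60, "from NC"))
client_put(db, "session", stamped(1, "from NA"))
print(db["session"][TS_DIGITS:])
```

With the guard in place the late-arriving PA is rejected, so a read
returns PC's value regardless of arrival order.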

It's worth mentioning the Lua extension is *very* fast since it's
running right on top of the local Tokyo Cabinet database.  So saving 2
key/value pairs instead of 1 does not in fact halve your performance,
since the bottleneck is between PyTyrant and Tokyo Tyrant.

Lessons learned:

1.  Saturating a network connection can cause very very strange things
to happen.
2.  All of TCP's fancy congestion control and ordering algorithms are
only beneficial if you pipe everything through 1 connection.

Hope that makes sense!

[1] http://en.wikipedia.org/wiki/Transmission_Control_Protocol#Congestion_control
[2] http://en.wikipedia.org/wiki/Transmission_Control_Protocol#Ordered_data_transfer.2C_retransmission_of_lost_packets_and_discarding_duplicate_packets

