Bad news
Theory and consequences
Pick any two:
- Consistency
- Availability
- Partition tolerance
What does that even mean?
- C: linearizability (~ local behavior)
- A: all active nodes answer every query
- P: resistance to failures
Can't sacrifice partition tolerance
- Partition tolerance is failure tolerance
- Networks, nodes fail all the time
- Latency happens; indistinguishable
- P(no failures) < 1 - P(one node works)N
CA (!P)
"half of the time it doesn't actually work"
CP example: Zookeeper
Consistent ops that sometimes fail (!A)
AP example: Cassandra
Inconsistent ops (!C) that (usually) succeed
Informally
Let's look at a 5 node cluster
What happens when stuff fails?
- I still want to read/write!
- Idea: (N+1)/2 quorum
Failures are indistinguishable
Too many failures, no quorum
Simple quorum isn't enough
Not far-fetched
Real failures are:
Back to reality
- CAP's C is linearizability
- CAP's A is any op on any node
- These are very strong guarantees!
Trade-offs
Availability |
Consistency |
Performance |
Ease of reasoning |
Scalability |
Transactionality |
No universally correct choice!
Example of creative sacrifice: etcd
- Normally: consistency all the way
- Option of doing inconsistent reads
- Maybe get some stale data
- … but still works under partial failure
One process, one register
Two processes, one register
This is how we expect stuff to work
We are a spoiled bunch
Writes don't replicate instantly
Writes can get reordered
All sorts of stuff can happen
- Multiple registers
- More semantics
- More nodes
- More failure modes
Reasoning about the system
- What can and can't happen?
- What can happen: consistency model
Theoretical consistency models
Serializability
- ∃ serial execution with the same result
- Some serial execution: fairly weak
- No restrictions on which one
Example: serializability being weak
Precondition: x = 0
- x ← 0
- x ← 1
- x ← 2
Example: serializability being strong
Precondition: x = y = 0
- y ← 2, assuming y = 1
- x ← 1, assuming x = 0
- y ← x, assuming x = 1
Linearizability
All operations appear to happen instantly
Strong serializability
Linearizable & serializable
Your computer is distributed
Models in "centralized" systems
- SQL databases
- Clojure reftypes
Twisted vs threads
In terms of concurrency models
- Twisted: strongly serializable
- Event loop with 1 reactor thread
- Serializable: reactor finds the ordering
- Linearizable: callbacks run by themselves
- Threads: no defined model
- Unladen Swallow tried to figure it out
- Nothing fancy; whatever your CPU gives you
- Probably okay (heap + GIL)
- Correct use of locks?
Global clock
- Everyone sees the same clock
- Access instant, uncertainty 0
- Can compare different timestamps
- Mental model: wallclock
Local clocks
- Each clock is kinda reliable
- Can't compare with other timestamps
- Mental model: stopwatch
Can't have a global clock
- Can pretend they almost exist
- Going to be wrong often
Example: Google Spanner
- GPS & atomic clocks
- "Atomic clocks […] drift significantly"
- "uncertainty […] generally <10ms"
Can't have no clock
- Need time for failure detection
- FLP result says nothing works
Timestamps are often a proxy
- Actually care about progression, partial order
- Timestamps don't have to match real-world time
Example timeline
Sequential numbers vs real timestamps?
Lamport clocks
(informally)
keep a version number of what you've seen
Vector clocks
(informally)
keep version numbers of what you've seen other nodes see
Good news
Stuff you can rely on
Consensus protocols
Getting computers to agree on things
Examples
- ZAB (Zookeeper)
- Paxos* (Chubby)
- Raft (
etcd
)
Recipes
On top of consensus protocols:
- Locks
- Barriers
- Set partitioning
- …
Set partitioning
{1, 2, 3, 4, 5}
CRDTs
Conflict-free replicated data type
Problem
- Read, compute, write back
- Concurrency: multiple results
- Conflicts!
Solutions?
- Last write wins? Most writes lose :-(
- Coordination? Expensive! :-(
I want highly available data stores…
.. but I don't want nonsense data
Idea!
- Describe what you want
- Describe conflict resolution
Specializations
The C in CRDT can mean:
- Commutative (CmRDT)
- Convergent (CvRDT)
Commutative RDTs
- Broadcast operations
- Merge operation:
- Commutative: f(x, y) = f(y, x)
- Associative: f(f(x, y), z) = f(x, f(y, z))
- Not idempotent f(x, y) != f(f(x, y), y)
Example: integers
- +1, -2, +3, +5, -4: +3
- Always get same answer:
- As long as I see all ops once
- Duplicate an op, get wrong answer
- Order doesn't matter, though
Convergent RDTs
- Broadcast (sometimes partial) states
- Merge operation has many properties:
- Commutative: f(x, y) = f(y, x)
- Associative: f(f(x, y), z) = f(x, f(y, z))
- Idempotent: f(x, y) = f(f(x, y), y)
- Informally: apply lots until done
Simple CvRDT conflict resolution
Complex CvRDT conflict resolution
It's okay if you see writes more than once!
CRDTs in practice: usually CvRDT
Solve local problem once
vs
Solve distributed problem constantly
Examples
- Counters (G, PN)
- Sets (G, 2P, LWW, PN, OR)
- Maps (sets of (k, v) tuples)
- Graphs (using multiple sets)
- Registers (LWW, MV)
- Sequences (continuous, RGA)
Using CRDTs
- Designing them is tricky
- Using them is fairly easy
Riak <3
Flags, registers, counters, sets, maps