DistKV overview
Re-thinking key-value storage
The CAP conundrum
Suppose you have a campus with several buildings, or a network of sites connected by long-distance links.
Suppose further that you want some sort of configuration management / storage for that.
Suppose also that you can't tolerate one side being read-only during a network outage.
So a centralized system won't work. Neither will anything that requires a consensus protocol: during a split, the side without a quorum degrades to read-only. You could use site-local storage and mirror it read-only to the others, but what's a "site"? Any single system or cluster of systems can be disconnected while somebody reconfigures a switch.
The key insight to get around this problem is that while in theory any entry in our system can be changed from anywhere, most won't be. The thermometer located at X will talk to the computer closest to it, and so will the thermometers at Y and Z. The value of X won't ever be changed from Y or Z.
Thus, instead of building site-local stores and mirroring them, DistKV is based on distributed data structures with site-local semantics.
DistKV does this by attaching a small change record to each data set. It typically won't be large: we assume that if system A changes a record, it knows when it last did so, and thus its previous change can be elided. Changes are numbered sequentially, so everybody knows when a record has been skipped.
In other words, if we imagine a counter that jumps randomly between systems A, B and C (where "A/13" denotes A's 13th local change), we might find:
    1 < A/13
    2 < B/10 < A/13
    3 < B/14 < A/13
    4 < C/25 < B/14 < A/13
    5 < B/16 < C/25 < A/13
    6 < A/14 < B/16 < C/25
Thus if C misses record #5 and sees #6, it immediately knows that B/16 (and maybe B/15) are missing. Go ask.
If #5 arrives late, it's stale.
The chain obviously needs to be long enough to be unambiguous.
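To make the gap detection concrete, here is a minimal Python sketch. The names (Link, find_gaps) and data shapes are invented for illustration; they are not DistKV's actual API.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Link:
        """One link of a change chain: which node changed the record, at which tick."""
        node: str
        tick: int
        prev: Optional["Link"] = None

    def find_gaps(chain, seen):
        """Given an incoming chain and the highest tick we've seen per node,
        return (node, first_missing, last_missing) ranges to go ask for."""
        gaps = []
        # The head link is the change being delivered, so only ticks before it
        # can be gaps.
        have = seen.get(chain.node, 0)
        if chain.tick > have + 1:
            gaps.append((chain.node, have + 1, chain.tick - 1))
        link = chain.prev
        while link is not None:
            have = seen.get(link.node, 0)
            if link.tick > have:
                gaps.append((link.node, have + 1, link.tick))
            link = link.prev
        return gaps

    # Example from the text: C has seen A/13, B/14 and its own C/25; record #5
    # was lost and #6 (A/14 < B/16 < C/25) arrives.
    seen_by_C = {"A": 13, "B": 14, "C": 25}
    msg6 = Link("A", 14, Link("B", 16, Link("C", 25)))
    print(find_gaps(msg6, seen_by_C))   # -> [('B', 15, 16)]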
Recovery
In a network of 100 instead of 3 systems, when message #6 arrives, not all of them should ask for the missing record(s) immediately. A related problem: when a network split is healed, we shouldn't have to wait for #7 to discover that we're missing some data.
DistKV solves this by using an Actor model. Each node sends a token every N+X seconds, where X is randomized as described below. The token consists of the node's global and local change counters and a list of the senders of the last few tokens.
The length of that list depends on how many network splits you can possibly have, i.e. the longest path between two network switches, plus two.
X depends on the potential sender's current position in that list. Nodes that aren't on the list send somewhat later, except that with some small probability (which is larger while the list isn't full) their X is somewhat smaller.
Any sender which receives a token before sending its own will not send in this time slot. Collisions can still happen, of course, but that doesn't matter as long as (a) delivery is either reliable or doesn't happen, and (b) resolving conflicts is deterministic. DistKV (or rather, the AsyncActor package) assures (b); (a) is the underlying network's problem.
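The timing rules can be sketched roughly as follows. The constants and probabilities are arbitrary placeholders; the real logic lives in the asyncactor package and differs in detail.

    import random

    N = 10.0     # base interval between tokens, in seconds (placeholder value)
    SLOT = 1.0   # width of one position's time slot (placeholder value)

    def send_delay(node, recent_senders, list_is_full):
        """Delay before this node would send its own token."""
        if node in recent_senders:
            # Nodes already on the list get a slot based on their position.
            x = recent_senders.index(node) * SLOT + random.uniform(0, SLOT)
        else:
            # Nodes not on the list wait until after all listed slots, except that
            # with a small probability (larger while the list still has room) they
            # pick an earlier slot so they can get onto the list.
            jump = 0.2 if not list_is_full else 0.05
            if random.random() < jump:
                x = random.uniform(0, max(SLOT, SLOT * len(recent_senders)))
            else:
                x = SLOT * (len(recent_senders) + 1) + random.uniform(0, SLOT)
        return N + x

    # A node that hears somebody else's token before its own delay expires
    # simply skips sending in this round.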
Each token carries the same kind of change record history as above, so we can reliably tell whether an incoming message belongs to the previous round or indicates a healed network split.
When a split recovers we have two (or more) old actors. Each of them sends its list of known nodes and their maximum change counters. Recipients of these messages thus know which changes they're missing and can broadcast a request for them; one of the nodes that has the missing data will then send it.
Each of these messages is time-slotted so that the next actor on the list(s) can take over if the one(s) in front of it doesn't respond.
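As an illustration of that recovery exchange, here is a hedged sketch which assumes each side's announcement boils down to a mapping of node name to maximum change counter; the actual message layout is not specified in this document.

    def changes_to_request(ours, theirs):
        """Map node -> (first, last) tick range that the peer announced and we lack."""
        wanted = {}
        for node, their_max in theirs.items():
            our_max = ours.get(node, 0)
            if their_max > our_max:
                wanted[node] = (our_max + 1, their_max)
        return wanted

    # Example: after the split heals, the other side announces B/20 and D/7,
    # while we only ever saw A/14, B/16 and C/25.
    print(changes_to_request({"A": 14, "B": 16, "C": 25}, {"B": 20, "D": 7}))
    # -> {'B': (17, 20), 'D': (1, 7)}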
This model has three flaws.
Entries may be superseded
In the above example, if a new node sends a change:
    7 < D/33 < A/14 < B/16
then the record for C/25 has dropped off the end of the chain. Similarly, A/13 and B/10 have been dropped; the difference is that in the latter cases the node that dropped the entry superseded it itself. In both cases there's a missing change, which might cause a node E, re-joining the network after a split, to think that it's missing some data. To fix this, DistKV maintains a list of known-superseded entries; updates to that list are broadcast whenever required.
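A short sketch of how such a list might be consulted before asking for "missing" data; the function name and data shapes are invented for illustration.

    def still_missing(gaps, superseded):
        """Per node, the ticks we actually need to ask for, given a set of
        (node, tick) pairs that are known to be superseded."""
        result = {}
        for node, first, last in gaps:
            ticks = [t for t in range(first, last + 1) if (node, t) not in superseded]
            if ticks:
                result[node] = ticks
        return result

    # Example: node E believes B/9..B/12 might be missing, but B/10 is on the
    # known-superseded list, so only B/9, B/11 and B/12 get requested.
    print(still_missing([("B", 9, 12)], {("B", 10), ("A", 13)}))
    # -> {'B': [9, 11, 12]}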
Deletion
This model prevents you from ever completely deleting an entry and its associated history – you can't know that everybody has seen the deletion request, and you can't remove an entry whose changes are all on the "known" list, because another node's missing data may surface later. To fix this, DistKV maintains separate "Deleter" actors. For resiliency, the list of Deleters must contain:
* each node that holds permanent storage
* one node in each network segment that contains more than one DistKV server
If the list of Deleters is complete, DistKV knows that any entry whose deletion has been seen by all nodes on the Deleter list can safely be removed from storage.
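The purge condition itself fits in a few lines; this is an illustration of the rule above, not DistKV's actual code.

    def may_purge(deletion_seen_by, deleters):
        """True once every node on the Deleter list has seen this entry's deletion."""
        return bool(deleters) and deleters <= deletion_seen_by

    # Example: with Deleters {"A", "B", "D"}, a tombstone seen only by A and B
    # must be kept; once D has seen it too, it may be purged.
    print(may_purge({"A", "B"}, {"A", "B", "D"}))       # False
    print(may_purge({"A", "B", "D"}, {"A", "B", "D"}))  # True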
Network reliability
If the network is unreliable, i.e. exhibits any packet loss other than 0% or 100%, the Actor protocol breaks down.
Currently: DistKV uses MQTT and crosses its fingers that this doesn't happen. The transport is TCP, and Mosquitto servers are reliable enough in practice.
Future plan: to fix this, DistKV will rely on two separate methods for message distribution: Serf and MQTT.
Serf implements a UDP-based network-flooding algorithm that works around packet loss very reliably. However, when a network breach heals it may retransmit some, all, or none of the messages that have not yet reached the other side. It also delays forwarded messages by 100 ms even if message consolidation is turned off, which is not acceptable in some situations.
In contrast, MQTT is very fast, but it may lose messages when a connection between two servers breaks and needs to be re-established. Also, there's no way to connect servers redundantly (at least not when using the Mosquitto server).
DistKV works around MQTT's possible data loss problem by using Serf as a backup transport.
DistKV works around Serf's breach recovery problem by ignoring any message that's too old. How to determine that is TBD.
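Since the staleness criterion is still open, the following is purely illustrative: one conceivable way to merge the two transports is to track a per-sender counter, drop duplicates, and ignore anything older than a fixed cut-off. The class and threshold are invented for this sketch.

    import time

    class DualTransportFilter:
        """Accept each (sender, tick) only once, whichever transport delivers it
        first, and drop replays older than max_age seconds."""

        def __init__(self, max_age=30.0):
            self.latest = {}          # sender -> highest tick accepted so far
            self.max_age = max_age    # staleness cut-off; the real criterion is TBD

        def accept(self, sender, tick, sent_at):
            if time.time() - sent_at > self.max_age:
                return False          # too old: probably a replay after a healed breach
            if tick <= self.latest.get(sender, 0):
                return False          # duplicate already seen via the other transport
            self.latest[sender] = tick
            return True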
Wire data format
Here DistKV is quite opinionated. Everything on the wire (or on disk) uses MsgPack.
MsgPack is a compact, binary, self-describing, extensible protocol which is self-delimiting. Thus no length prefixes (other than for simple strings, binary blobs, and protocol extensions, of course), no escaping of "special" bytes like quotes or backslashes (ever), and trivial embedding of one message in another. MsgPack distinguishes UTF-8 strings from binary blobs and integers from floating-point numbers. While the protocol is limited to 64-bit integers, it's trivial to code an extension for larger numbers. Its protocol isn't textual, but a human debugger can still decode the bytestream easily enough.
JSON, in contrast, can do none of the above.
Protobuf is not self-delimiting either, and since it requires a predefined schema it cannot carry arbitrary data structures; thus it is not a universal solution.
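To illustrate the properties claimed for MsgPack above, here is a small round-trip using the Python msgpack package; the message content is made up.

    import msgpack

    msg = {"path": ["home", "x", "temp"], "value": 21.5, "tick": 14, "raw": b"\x00\x01"}
    wire = msgpack.packb(msg, use_bin_type=True)   # compact binary on the wire

    # The stream is self-delimiting: two messages can simply be concatenated.
    unpacker = msgpack.Unpacker(raw=False)
    unpacker.feed(wire + wire)
    for decoded in unpacker:
        # str stays str, b"\x00\x01" stays bytes, 21.5 stays float, 14 stays int
        print(decoded)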