
So I cannot tell if this is aiming to be CA or AP. Having beaten my head against the CAP wall for a while, how does it deal with partitions?


The design doc https://docs.google.com/document/d/11k2EmhLGSbViBvi6_zFEiKzu... says, "TBD: how to avoid partitions? Need to work out a simulation of the protocol to tune the behavior and see empirically how well it works"

.. so they've gone for CA and forgotten about P.


You're taking a sentence out of context and giving it the most uncharitable possible interpretation. They certainly haven't "forgotten" about network partitions: if you actually read the design document instead of ctrl-F'ing for the word "partition", you'll see they discuss the mechanisms they use to ensure sequential consistency. The software is not yet at the point of being testable AFAIK, but clearly the intent is to build a CP system.

The section you're quoting is discussing a separate gossip protocol that is used to lazily propagate node state information. It does not affect the consistency of actual data replicas.


I'll admit to aggressive use of ctrl-F, but only because I too felt they should be a lot clearer about what the intended properties are.

From the intro page:

"Cockroach is a distributed key/value datastore which supports ACID transactional semantics and versioned values as first-class features. The primary design goal is global consistency and survivability, hence the name. Cockroach aims to tolerate disk, machine, rack, and even datacenter failures with minimal latency disruption and no manual intervention."

If we read 'survivability' as 'availability', then that would suggest they've gone for CA. Closer inspection, though, reveals that their architecture seems to be made of shards, each of which is maintained with Raft/Paxos. An evaluation of Raft by the Cambridge Computer Laboratory is here: http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-857.pdf

That report makes two points relevant to this discussion. One is that a hard definition of C, A, and P can be difficult, and that it's possible to achieve all three almost all of the time under real-world conditions. The other is from the conclusion:

"In particular, a [Raft] cluster can be rendered useless by a frequently disconnected node or a node with an asymmetric partition"


Here's how I understand it: if you claim that your database cannot be taken down by component failure, you necessarily have to consider network partitions as well as individual node failures. If you are node A talking to node B, and node B stops responding, you cannot distinguish between a node failure and a network failure. In order to claim high availability you must build an AP system.

Let's reduce this case to a multi-master setup where a client can connect to any node and write to it. If a node fails outright and a client tries to connect, no big deal: the client chooses a different node; the failed node eventually comes back online, catches up, and then opens its listening socket again, saying "OK, write to me!"

However, if a partition happens, and client X writes to node A, client Y writes to node B, and then the two nodes cannot agree on the correct data, then you lose consistency. You can of course say that no node can be written to if other nodes are offline, which means the system is not highly available.
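
To make that split-brain case concrete, here's a toy Go sketch (entirely my own illustration, nothing to do with Cockroach's actual code): two multi-master nodes that can't reach each other both accept a write for the same key, and once the partition heals the replicas disagree.

    package main

    import "fmt"

    type node struct {
        name string
        data map[string]string
    }

    func (n *node) write(key, val string) { n.data[key] = val }

    func main() {
        a := &node{"A", map[string]string{}}
        b := &node{"B", map[string]string{}}

        // Partition: A and B can no longer replicate to each other.
        a.write("balance", "100") // client X writes to node A
        b.write("balance", "250") // client Y writes to node B

        // When the partition heals, the replicas disagree: consistency is gone.
        fmt.Println(a.data["balance"], "vs", b.data["balance"]) // 100 vs 250
    }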

So their stated goal: "The primary design goal is global consistency and survivability..." either implies that high availability is not something they are going for, or that they are shooting for something that is not theoretically possible.

Of course all of the above is just my understanding, not necessarily fact, so please correct me if I'm wrong.


> In order to claim high availability you must build an AP system.

I strongly disagree. The "A" in CAP describes a system such that any single non-failing node can always make progress. That would be a nice property to have, but it's much stricter than is required for a real system.

If a distributed database is resilient to failure of a minority of nodes, it still makes sense to describe it as high-availability. And that is exactly what a consensus algorithm like Paxos or Raft gives you.
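
To put numbers on that: with majority quorums, a group of n replicas keeps making progress as long as n/2 + 1 of them can still reach each other. A rough sketch of the arithmetic (mine, not from the design doc):

    package main

    import "fmt"

    // quorum returns the minimum number of nodes that must be reachable
    // for a majority-based consensus group of size n to make progress.
    func quorum(n int) int { return n/2 + 1 }

    func available(n, reachable int) bool { return reachable >= quorum(n) }

    func main() {
        fmt.Println(available(5, 3)) // true: a 5-node cluster tolerates 2 failed nodes
        fmt.Println(available(5, 2)) // false: the minority side cannot commit writes
    }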


Georeplication that can't handle partitions? Sign me up! I'll have two (or more!)!


Usually when building a distributed database, "TBD: how to avoid partitions?" is not what you want to see except on the initial whiteboard discussion before a single line of code is written. But what do I know, I am not an Ex-Googler.


First off, you can't have CA in a distributed system, but I'm guessing that was a typo and you meant CP or AP. Given that it is based on Spanner, and claims to have something close to linearizability, it is CP.


From what I understand, in real-world distributed systems you will always have P (despite what the name suggests, it covers failures in general, not just network partitions). The question is whether they'll choose A over C when a partition happens: say you have 3 nodes and 2 go down, so you've lost quorum. Will you keep the surviving replica available and replicate lazily, or make it unavailable until a second replica comes back up? I'm guessing it'll be tunable/configurable in this case.
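
If it is tunable, I'd imagine the knob looks roughly like this (pure speculation on my part, not an actual Cockroach API): on quorum loss, either reject writes to stay consistent, or accept them locally and replicate lazily once peers return.

    package main

    import (
        "errors"
        "fmt"
    )

    type replica struct {
        preferAvailability bool     // hypothetical knob: favor A over C on quorum loss
        pending            []string // writes accepted without quorum, to reconcile later
    }

    func (r *replica) write(val string, reachable, total int) error {
        if reachable >= total/2+1 {
            return nil // normal case: quorum reached, commit through consensus
        }
        if r.preferAvailability {
            r.pending = append(r.pending, val) // AP-ish: accept now, replicate lazily later
            return nil
        }
        return errors.New("no quorum: write rejected") // CP-ish: refuse rather than diverge
    }

    func main() {
        r := replica{preferAvailability: false}
        fmt.Println(r.write("x=1", 1, 3)) // no quorum: write rejected
    }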



