[dupe] Ex-Googlers CockroachDB: A Scalable, Geo-Replicated, Transactional Datastore (github.com/cockroachdb)
98 points by ethanjones on Nov 6, 2014 | 66 comments


Are there distributed data stores like this that are also resilient to intentional sabotage?

I've been looking recently at long-term digital preservation systems -- tools designed to archive large amounts of data for decades. This is the Library of Alexandria problem -- how do we preserve all this data we're generating against once-in-a-century disasters?

So this 2005 paper lists thirteen different threats to long-term archives: Media Failure; Hardware Failure; Software Failure; Communication Errors; Failure of Network Services; Media & Hardware Obsolescence; Software Obsolescence; Operator Error; Natural Disaster; External Attack; Internal Attack; Economic Failure; Organizational Failure.[1]

Fault-tolerant distributed data stores are exciting, because they solve a bunch of those problems off the bat -- media failure, hardware failure, communication errors, failure of network services, hardware obsolescence, and natural disaster.

They also help to address software failure, software obsolescence, and economic failure, because archival projects are always strapped for resources and it's great to rely on tools that exist for totally distinct, commercially-valuable reasons.

But that still leaves operator error, external attack, and internal attack -- burning down the Library.

Hence my original question: are there distributed data stores that can be configured to resist intentional destruction of data?

[1] http://www.dlib.org/dlib/november05/rosenthal/11rosenthal.ht...


The problem with internal/external attacks is that we (as a society) don't really want to prevent them. The reason is simple: child porn. To date, the Bitcoin block chain (and related ideas) is the only data store that is 100% resistant to attacks (i.e. changing history), but luckily it cannot handle amounts of data large enough to be viable for child porn (or most other forms of media). Tor, on the other hand, gets a bad rep precisely because it doesn't prevent it (despite its numerous other, beneficial uses).

The core of the issue is that humans view different information differently (child porn vs. Mona Lisa), whereas for computers, bits are bits and numbers are numbers. As long as child porn remains illegal and socially unacceptable, we'll want to enable attacks on data, i.e. for someone (usually internal operators) to be able to delete some kind of information, corrupt it or at least track it. Of course, this necessarily means that all information stored in the same data-store will be vulnerable.


You're conflating the archival properties of the medium with the decision about what to save. Oil paint on canvas is durable. It doesn't mean that a museum needs to retain every piece of crap that anyone paints.


The problem is that removal of content because it's crap/immoral versus operator destruction is not a meaningful distinction, from a software perspective.

So it would probably need to be append-only to prevent people from burning it down, which would necessarily mean that, once content is included, it cannot be modified or removed.


Journaling or storing incremental backups (perhaps offline?) of validated/verified checkpoints may address this, although it sounds like something you wouldn't be happy with since it's not a 'built-in' feature but an additional backup & maintenance process that a system administrator would need to implement.

I guess you're asking whether there exists a distributed, fault-tolerant data store with a form of version control (similar to git/cvs/perforce) as part of the native feature set.
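
Something like this, perhaps: a minimal sketch of that journaling-plus-verified-checkpoints idea. The file name and record format are made up purely for illustration, and a real setup would still need rotation, offline copies, and a separate verification pass.

    // Minimal sketch (not from any existing system): an append-only journal
    // where each checkpoint record carries a SHA-256 checksum, so a later
    // verification pass can detect tampering or corruption before restoring.
    package main

    import (
        "crypto/sha256"
        "encoding/hex"
        "fmt"
        "os"
    )

    // appendCheckpoint writes one checksummed record to the journal file.
    // The file is opened append-only, so existing records are never rewritten.
    func appendCheckpoint(path string, payload []byte) error {
        f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
        if err != nil {
            return err
        }
        defer f.Close()
        sum := sha256.Sum256(payload)
        // Record format: "<hex checksum> <payload length>\n<payload>\n".
        _, err = fmt.Fprintf(f, "%s %d\n%s\n", hex.EncodeToString(sum[:]), len(payload), payload)
        return err
    }

    func main() {
        if err := appendCheckpoint("journal.log", []byte("checkpoint-0001: state snapshot")); err != nil {
            fmt.Fprintln(os.Stderr, err)
        }
    }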




LOCKSS is definitely a giant in this field, and David Rosenthal (who wrote the paper I linked as well) is great.

But LOCKSS occupies a small niche. My hope is really that at some point a commercially-focused project with a ton of engineering effort and battle testing behind it will displace a lot of what LOCKSS has had to do manually. Seems like that might happen as web services get more and more distributed and fault-tolerant.


> are there distributed data stores that can be configured to resist intentional destruction of data?

Well, Git has checksums on everything.
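
For what it's worth, the object IDs are the checksums: a blob's ID is the SHA-1 of a small "blob <size>" header followed by the content, so any tampering changes the ID. A tiny sketch in Go (my own illustration, not git's source):

    // Sketch of git-style content addressing: hash the header plus the content.
    package main

    import (
        "crypto/sha1"
        "fmt"
    )

    func gitBlobID(content []byte) string {
        h := sha1.New()
        // git hashes "blob <size>\x00" followed by the raw bytes.
        fmt.Fprintf(h, "blob %d\x00", len(content))
        h.Write(content)
        return fmt.Sprintf("%x", h.Sum(nil))
    }

    func main() {
        // Should match `echo -n "hello" | git hash-object --stdin`.
        fmt.Println(gitBlobID([]byte("hello")))
    }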


Why the 'Ex-Googlers' in the title? Is it like a seal of approval or something?


I think that refers to the Wired article http://www.wired.com/2014/07/cockroachdb/

With the tag being a sign of pedigree. "A scalable DB from people who worked at a place with huge scale DBs".

Whether it is very convincing or not is another matter, but I am sure it gets more clicks/attention.


Yes, it's called social proof. If you're observant you'll notice that people use it everywhere. Of course, social proof isn't a guarantee of anything.


Nah, it's an honor-by-association fallacy. Social proof is a behavioral thing, not false reasoning.


Yeah, I feel a bit uncomfortable about using the brand of a previous employer so prominently to promote a different project.


Haven't you noticed the pedigree obsession of some in our field?


In this context it adds more signal than noise. There are zillions of open source projects, but when you need to use them in production, only a very small subset of that universe is ready.


Does it? I don't think 'ex-Googlers' adds anything of value here.


It indicates that the project is worth a closer look rather than waiting or dismissing it outright.


Why? Being an ex-googler could just as easily imply they were incompetent and fired.


So, you're assuming that half of all ex-googlers were fired? Otherwise, there aren't equal probabilities.


> It indicates that the project is worth a closer look rather than waiting or dismissing it outright.

I'm curious, could you explain why? Google is an incredibly large company with many developers who never even touch its data systems, so to me 'ex-Googler' really doesn't mean anything beyond that they're probably senior developers, considering how rigorous (and, honestly, somewhat old hat) the interview process is. But that doesn't change my view of the project at all.


You can take a look at the project author profiles.


I was kind of put off by this as well. Does where they worked in the past have anything to do with how useful the project is now or how talented the people are? What if they're doing it on their own because Google thought the project was useless? What if they were fired from Google, or only worked at Google for a month? How does the founders' past involvement with Google mark this project as any more interesting than any other random "Show HN" project?


It is a pretty strange title. Some might interpret it as meaning they started this project after being fired from Google.


That's kind of how I read it.


I don't think it matters how you try to promote an open source project. The code will ultimately be the determining factor in whether the project can sustain a following. And everyone has access to the code to make that determination independently.


That seems rather naive to me. No one is scouring GitHub looking for quality code and bringing unrecognized projects to light.


My point is not that unfound projects will necessarily become found, no matter how good they are. Instead, the idea is that no matter how a project is promoted, it ultimately only has its codebase to back its claims.


A requirement of a successful project/product is visibility. "Ex-Googlers" does not imply much more than "used to work there", but if it drives more attention to them, it's good MarCom.


This was based on an internal Google technology, so the "Ex-Googlers" label is actually quite relevant.


Name-dropping has been common since the first salesman.


So I cannot tell if this is aiming to be CA or AP. Having beaten my head against the CAP wall for a while, how does it deal with partitions?


The design doc https://docs.google.com/document/d/11k2EmhLGSbViBvi6_zFEiKzu... says, "TBD: how to avoid partitions? Need to work out a simulation of the protocol to tune the behavior and see empirically how well it works"

.. so they've gone for CA and forgotten about P.


You're taking a sentence out of context and giving it the most uncharitable possible interpretation. They certainly haven't "forgotten" about network partitions because if you actually read the design document, instead of ctrl-F'ing for the word "partition", they talk about the mechanisms they use to ensure sequential consistency. The software is not yet at the point of being testable AFAIK, but clearly the intent is to build a CP system.

The section you're quoting is discussing a separate gossip protocol that is used to lazily propagate node state information. It does not affect the consistency of actual data replicas.


I'll admit to aggressive use of ctrl-F, but only because I too felt they should be a lot clearer about what the intended properties are.

From the intro page:

"Cockroach is a distributed key/value datastore which supports ACID transactional semantics and versioned values as first-class features. The primary design goal is global consistency and survivability, hence the name. Cockroach aims to tolerate disk, machine, rack, and even datacenter failures with minimal latency disruption and no manual intervention."

If we read 'survivability' as 'availability', then that would suggest they've gone for CA. Although closer inspection reveals that their architecture seems to be made of shards, each of which is maintained with Raft/Paxos. An evaluation of this by the Cambridge Computer Laboratory is here: http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-857.pdf

That report makes two points relevant to this discussion. One is that a hard definition of C, A, and P can be difficult, and that it's possible to achieve all three almost all of the time in real conditions. The other is from the conclusion:

"In particular, a [Raft] cluster can be rendered useless by a frequently disconnected node or a node with an asymmetric partition"


Here's how I understand it: if you claim that your database cannot be taken down by component failure, you have to necessarily consider network partitions as well as individual node failure. If you are node A talking to node B, and node B stops responding you cannot distinguish between a node failure and a network failure. In order to claim high availability you must build an AP system.

Let's reduce this case to a multi-master setup where a client can connect to any node and write to it. If a node fails outright and a client tries to connect, no big deal: the client chooses a different node, the failed node eventually comes back online later, catches up, then says "OK, write to me!" opening a listening socket.

However, if a partition happens, and client X writes to node A, client Y writes to node B, and then the two nodes cannot agree on the correct data, then you lose consistency. You can of course say that no node can be written to if other nodes are offline, which means the system is not highly available.

So their stated goal: "The primary design goal is global consistency and survivability..." either implies that high availability is not something they are going for, or that they are shooting for something that is not theoretically possible.

Of course all of the above is just my understanding, not necessarily fact, so please correct me if I'm wrong.


> In order to claim high availability you must build an AP system.

I strongly disagree. The "A" in CAP describes a system such that any single non-failing node can always make progress. That would be a nice property to have, but it's much stricter than is required for a real system.

If a distributed database is resilient to failure of a minority of nodes, it still makes sense to describe it as high-availability. And that is exactly what a consensus algorithm like Paxos or Raft gives you.
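
To make that concrete, here's a toy sketch (my own illustration, not CockroachDB code) of the majority-quorum rule a Raft/Paxos replica group uses to decide whether it can still commit writes:

    // A consensus group of N replicas keeps accepting writes only while a
    // majority is reachable; that is the sense in which such a store is
    // "highly available" without being AP in the strict CAP sense.
    package main

    import "fmt"

    func hasQuorum(total, reachable int) bool {
        return reachable >= total/2+1
    }

    func main() {
        for _, reachable := range []int{3, 2, 1} {
            fmt.Printf("3 replicas, %d reachable: can commit = %v\n",
                reachable, hasQuorum(3, reachable))
        }
    }

So with 3 replicas you survive the loss of any single one, with 5 you survive any two, and so on.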


Georeplication that can't handle partitions? Sign me up! I'll have two (or more!)!


Usually when building a distributed database, "TBD: how to avoid partitions?" is not what you want to see except on the initial whiteboard discussion before a single line of code is written. But what do I know, I am not an Ex-Googler.


First off, you can't have CA in a distributed system, but I'm guessing that was a typo and you meant CP or AP. Given that it is based on Spanner, and claims to have something close to linearizability, it is CP.


From what I understand, in real-world distributed systems you will always have P (despite what the name suggests, it covers failures in general, not just network partitions). The question is whether they'll choose A over C when a partial P happens (say you have 3 nodes and 2 go down: you've lost quorum, so will you keep the remaining replica available and replicate lazily, or make it unavailable until a second replica comes back up?). I'm guessing it'll be tunable/configurable in this case.


Here in the Southeast, we call them "Palmetto Bugs". Maybe a rename to PalmettoDB? :)


Palmetto bugs are generally the larger, flying "American Cockroach." "German Cockroaches" are smaller and don't fly, but are generally the ones that cause infestations. I grew up in the Northeast and German Cockroaches are all I ever saw up there. I've been down in Atlanta for years now and I only see Palmetto bugs here.

Unless this database can fly, I'd say they've chosen the right name. ;)


South Carolina? :P


Comparing to Facebook's TAO approach from yesterday's post, I like FB's small-set-of-APIs approach better. But that mainly focuses on handling a social graph with objects and associations.

On the other hand, almost all my backend-related features can be easily abstracted to those APIs.


> Comparing to Facebook's TAO approach from yesterday's post

Link? I see no relevant-looking "Facebook" or "TAO" in my recent RSS entries :(


I think he is talking about this post: https://news.ycombinator.com/item?id=8557408


Many things that I don't agree with: https://github.com/cockroachdb/cockroach/

MySQL: Weak consistency

Cassandra: No availability or weak consistency with datacenter failure


MySQL provides strong consistency when run on a single machine, but that breaks down when you handle failover using asynchronous replication between DCs.


To add to this: "Postgres: Limited scalability". I wish someone had told us that before we ticked over to 31TB... five months ago.


I hope Ceph can take cues from this and do geo-replication at scale.


Are there any details on how the distributed joins are achieved? (Sorry if that detail is in the design doc; my access to Google Drive is blocked from work.)


It's really strange to see networked databases go through this iterative design fad. It was file transfers back in the 90s/00s; everyone had their own distributed, decentralized file transfer solution. Little-known fact: Gentoo's Portage almost became an internet-wide distributed, decentralized public file system. Thank god they abandoned that idea. Can you imagine trying to debug a file transfer error just to get an mp3 player to install on your machine? I can't wait until databases go back to being mainframes.


> Little known fact, Gentoo's Portage almost became an internet-wide distributed decentralized public file system.

Am I the only person really sad that this didn't happen? After using apt over Tahoe-LAFS (over I2P - KillYourTV's PPA is on clearnet and I2P), I wanted this to be the default behavior for apt.


awful name!


I think the sentiment was "hard to kill" and/or "won't go extinct", due to its resiliency. But I agree, they should have picked something else.


Good point about the likely reason.

Hydra could have been another good choice :)

http://en.wikipedia.org/wiki/Hydra

See the first entry on the above Wikipedia page, about the many-headed serpent.


I think the name's great. A refreshing change from all the synthetic cutesy crap. Bubblegumlydb?


Happy to see it's written in Golang.


As a Go developer I was happy as well.

On the other hand, this question immediately popped into my mind: what kind of overhead does the GC incur, and how does it affect processes like a database where low latency is desired?


> What kind of overhead does the GC incur, and how does it affect processes like a database where low latency is desired?

The eventual goal (golang v1.5 IIRC) is to have 40ms out of every 50ms available for actual processing. This is the kind of 'soft real time' that should provide good responsiveness for most clients most of the time.
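
If you want to see what those pauses look like in practice, here's a rough toy program (nothing to do with CockroachDB; the allocation sizes and counts are arbitrary) that allocates in a loop and then reads the stop-the-world pause durations the runtime records:

    // Allocate garbage, then inspect the runtime's recorded GC pauses.
    package main

    import (
        "fmt"
        "runtime"
    )

    func main() {
        var sink [][]byte
        for i := 0; i < 1000; i++ {
            sink = append(sink, make([]byte, 1<<20)) // ~1 MB per iteration
            if len(sink) > 100 {
                sink = sink[1:] // drop older allocations so the GC has work to do
            }
        }
        var m runtime.MemStats
        runtime.ReadMemStats(&m)
        // PauseNs is a circular buffer of recent stop-the-world pause durations.
        last := m.PauseNs[(m.NumGC+255)%256]
        fmt.Printf("GC cycles: %d, last pause: %d ns, total pause: %d ns\n",
            m.NumGC, last, m.PauseTotalNs)
    }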


If I had to guess, I would say network and disk I/O will be the bottlenecks. Disk I/O less so, since it's a distributed system and SSDs help. I imagine if anyone can solve GC issues, though, I would bet on Google :)


Their chosen solutions to the GC issues are in this roadmap. [0]

[0]: https://docs.google.com/document/d/16Y4IsnNRCN43Mx0NZc5YXZLo...


Awful name!!! Why 'Cockroach'?


Awful, awful name... couldn't folks come up with anything better than 'cockroach'?


the name feels sick



