These guys are lucky they didn't try Cassandra. That's really Mongo's problem: i...

dindresto · on May 11, 2013

"These guys are lucky they didn't try Cassandra." Could you explain that further? I'm currently using Cassandra for a new project, so this sentence caught my eye.

advisory5739f2 · on May 11, 2013

Just like with Mongo or any NoSQL/non-traditional solution, you have to understand how the trade-offs and capabilities of the database relate to what you're using the database for. You also have to design your data storage with these trade-offs in mind.

For example, joins. There are no joins in Mongo or Cassandra, and anything working around joins is simply not going to be as fast as a traditional database's join. If you need to do joins all the time, you will be in pain. So the answer is to denormalize (i.e. duplicate) your data, such that joins are not necessary for frequent operations.

In particular, with Cassandra, while it's great at many things, such as write speed and write availability, you have to be very careful with your data design to get the results that you need. And you have to be cognizant of the querying that you need to do.

Cassandra has really weak in-store aggregation and filtering: there is almost no in-store aggregation, and there is no filtering other than by a prefix of a column or a key (prefixed subset). So if your column names are made out of composite parts A:B:C, you can scan for A:* or A:B:* (or A:[some value of B to some other value of B]), but you can't do *:B:* or *:B:C.
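To make the prefix restriction concrete, here is a rough sketch in plain Python. It is not Cassandra client code: the sorted list of tuples is just a hypothetical in-memory stand-in for one row's ordered composite column names, and `prefix_scan` and `match_middle` are made-up helper names.

```python
from bisect import bisect_left

# Hypothetical stand-in for one row's composite column names, kept sorted.
columns = sorted([
    ("a1", "b1", "c1"),
    ("a1", "b1", "c2"),
    ("a1", "b2", "c1"),
    ("a2", "b1", "c1"),
])

def prefix_scan(cols, prefix):
    """A:* or A:B:* style query: seek to the prefix, read until it stops matching."""
    start = bisect_left(cols, prefix)  # a shorter tuple sorts before its extensions
    out = []
    for col in cols[start:]:
        if col[:len(prefix)] != prefix:
            break  # sorted order guarantees no later matches
        out.append(col)
    return out

def match_middle(cols, b):
    """*:B:* style query: matches are not contiguous, so every column is inspected."""
    return [col for col in cols if col[1] == b]

a1_b1 = prefix_scan(columns, ("a1", "b1"))
all_b1 = match_middle(columns, "b1")
```

The prefix query touches one contiguous slice of the ordering, while the middle-component query has to look at everything, which is roughly why the store only offers the former.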

The advanced trick is to use ordered rows, which are strongly discouraged (because you can shoot yourself in the foot with a key distribution hotspot) but which give you another axis of prefixed subset filtering. But only one more axis.

Sorting? Cassandra doesn't sort. Cassandra project leadership thinks that sorting should be done in the client. If you want to filter a subset of keys in the shape of A:B:C, e.g. get all keys of a certain value of A and sort B:C, you have to do the sorting yourself. If you wanted to do a top-N report, you have to retrieve all that data to your client and then sort.
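As a sketch of what "sort in the client" means for a top-N report: assume the rows have already been pulled over the wire as (key, value) pairs. The `fetched` list and the `top_n` helper are hypothetical; only `heapq.nlargest` is a real standard-library call.

```python
import heapq

def top_n(rows, n):
    """Pick the n rows with the largest value from everything the client fetched."""
    return heapq.nlargest(n, rows, key=lambda kv: kv[1])

# Hypothetical result set: every candidate row must be transferred before ranking.
fetched = [("x", 5), ("y", 9), ("z", 1), ("w", 7)]
best_two = top_n(fetched, 2)
```

The ranking itself is cheap; the cost is that all candidate rows have to reach the client first, however small N is.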

The only sorting in Cassandra is the hierarchical column (and optionally key) ordering. So if you want to have quick top-N reporting functionality on values A and B from an A:B data tuple, you end up maintaining two indices (i.e. precomputing query results). One such index has columns that start with A and another starts with B.
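A minimal sketch of the two-index idea, again in plain Python rather than against a real Cassandra client. `DualIndex` and its methods are hypothetical names, and each sorted list stands in for a row whose columns are ordered by A or by B respectively.

```python
from bisect import insort

class DualIndex:
    """Every write goes to two orderings so either one can answer top-N by a slice."""

    def __init__(self):
        self.by_a = []  # (a, b) tuples, ordered by a first
        self.by_b = []  # (b, a) tuples, ordered by b first

    def write(self, a, b):
        # One logical write becomes two mutations: the price of precomputing queries.
        insort(self.by_a, (a, b))
        insort(self.by_b, (b, a))

    def top_by_a(self, n):
        return self.by_a[-n:][::-1] if n > 0 else []

    def top_by_b(self, n):
        return [(a, b) for b, a in self.by_b[-n:][::-1]] if n > 0 else []

idx = DualIndex()
for a, b in [(1, 30), (2, 10), (3, 20)]:
    idx.write(a, b)
```

Reads become a slice off the end of an ordered structure, at the cost of doubling the writes and keeping both copies consistent yourself.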

But then the indexing support is particularly weak. Secondary indexing is only done on values, so if you want to index portions of your keys, that's not natively supported. Also, only in Cassandra 1.2 is indexing finally "write-only," instead of "read-then-write." (Write-only performance is much faster.)

There are no triggers, so you can't write custom indices where you can atomically perform "read-then-write" operations to maintain an index. Instead, you have to write all such custom indexing logic yourself and take a hit for the transmission of all the indexing mutations over the network wire. This hurts particularly badly when you have a cluster distributed over geographical regions (i.e. slow/expensive links).

Cassandra does have the ability to count the number of columns, but only in one row (w/ only the same prefixed subset filtering available). Counting columns in multiple rows is not available, even if these rows are co-located on the same node.

Map-reduce is available, but it is not suitable for frequent queries (not meant to be run quickly, just like map-reduce in Mongo is not something you want to be hitting very frequently).

So, of course, whether these are issues for you depends entirely on your data design. There are many things that Cassandra does well and certain data shapes for which it is just diesel. It's quite ops-friendly: rolling full-uptime upgrades are reliable and are a key priority for the Cassandra team.

So Cassandra is even more specialized in terms of its uses than Mongo. If the original author of the presentation tried to use Cassandra for the same kind of data he used for Mongo, he probably would've written an even more scathing article.

pkolaczk · on May 12, 2013

Cassandra and many other NoSQLs are designed primarily for OLTP workloads, not OLAP. OLTP is almost exclusively "find me something by primary key" (see the TPC-C benchmark used for RDBMSes). Sorting huge amounts of data, top-n queries, skyline queries, aggregation, joining huge data sets, and complex filtering belong to the analytics world, not OLTP. Unless your whole database is very tiny, let's say 10MB, those operations are pretty expensive even in RDBMSes. That's why those features are deliberately not included in Cassandra. In contrast, MongoDB took a different route: they include some of those features and then seriously underdeliver on many of them.