
I enjoyed the post. Good links to a lot of relevant, recent stories & events.

Not the article's fault, but it cites the "ClickHouse Cost-Efficiency in Action: Analyzing 500 Billion Rows on an Intel NUC" article that was published January 1. It's a week old, & I feel like I'm never going to get away from it. It seems like a great, fun, interesting premise, but the authors took what is a challenging, huge dataset and, under the guise of making the data look "realistic," drained all the entropy out of it, & then claimed they were 10-100x faster.
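To make the "drained the entropy" point concrete: low-cardinality columns compress dramatically better than high-entropy ones, so the same row count can mean far fewer bytes actually scanned. A toy sketch (not the article's data; values are made up) using zlib:

```python
import random
import zlib

random.seed(0)
n = 100_000

# High-entropy column: values spread across a wide range (hypothetical).
high_entropy = bytes(random.randrange(256) for _ in range(n))

# Low-cardinality column: only a handful of distinct values,
# like a "realistic" sensor reading that barely varies.
low_cardinality = bytes(random.choice([1, 2, 3]) for _ in range(n))

# Same number of "rows", very different compressed footprint.
print(len(zlib.compress(high_entropy)))    # compresses poorly
print(len(zlib.compress(low_cardinality))) # compresses far smaller
```

Both columns hold 100,000 values, but the low-cardinality one shrinks to a small fraction of the size, which is the gap between "same number of rows" and "same amount of data."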

Well, yes, maybe, for some workloads. The changes they made might, in some circumstances, be "realistic" for certain IoT use cases.

But I feel like I'm going to see this article come up again, and again, and again. And each time, I'll have these frustrations about how, while they may still be running queries on the same number of rows, they are running queries on many orders of magnitude less data. It's a fun read, & genuinely useful (in some circumstances) tech, but I don't expect to see this nuance showing up. I'm already weary, seeing this ClickHouse article again.



The difference between rowstores (Scylla/Cassandra) and columnstores (ClickHouse) comes down to the physical layout of the data, combined with batch/vectorized processing and other techniques.

There will always be a 1-2 order-of-magnitude increase in performance regardless of the data. They also used the same number of rows, just with lower cardinality in the measurements, which would make an insignificant speed difference.
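The layout difference is easy to picture. A minimal sketch (made-up field names, not either database's actual storage format): a rowstore keeps each record's fields together, so aggregating one field means touching every record, while a columnstore keeps each field contiguous, so the same aggregate scans one tight array.

```python
n = 1_000

# Rowstore layout: whole records stored together (hypothetical schema).
rows = [{"ts": i, "sensor": i % 10, "value": float(i % 100)} for i in range(n)]

# Columnstore layout: each field stored as its own contiguous array.
cols = {
    "ts": list(range(n)),
    "sensor": [i % 10 for i in range(n)],
    "value": [float(i % 100) for i in range(n)],
}

# Same aggregate, two layouts:
row_sum = sum(r["value"] for r in rows)  # walks every full record
col_sum = sum(cols["value"])             # scans one contiguous column
```

Both produce the same answer, but the columnar scan reads only the bytes it needs and is what makes vectorized/batch execution over a single column cheap, independent of how compressible the data is.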


Stop reading the articles; benchmark it yourself, then write about it.



