Big Prometheus: Thanos, Cortex, M3DB and VictoriaMetrics at Scale

cfors · on Jan 9, 2020

Before anybody thinks that they need something like this at work, I have seen single node HA Prometheus set ups work at one of the largest CDN's in the country for metrics.

Reddit's own Kubernetes infrastructure team uses single node (pod) Promethei as well. [0]

If you look all of the components that are required to run Thanos [1], the operational complexity is incredibly high. I know its a shiny tool, that is super cool but please make sure you have an actual need for some of these before devoting resources to them.

[0] https://www.reddit.com/r/kubernetes/comments/ebxrkp/we_are_t...

[1] https://improbable.io/blog/thanos-prometheus-at-scale

jrv · on Jan 9, 2020

Prometheus author here (Julius).

Generally agreed that you can get far with a single Prometheus server (or many independent vanilla Prometheus servers, potentially also using Prometheus's own federation). But I still recommend Thanos as an extension to a lot of people. I like Thanos because it's so easy to deploy alongside an existing Prometheus installation, while itself being mostly stateless (long-term state is kept in object storage), and it gives people:

- a global view over multiple Prometheus servers - deduplicated view over servers in an HA pair - durable long-term storage for little cost

The Thanos architecture diagrams (especially the one in their README.md) can look a bit intimidating, but I find it sooo easy to get started with in practice, since you don't even need to deploy all of the components to begin with it. I usually tell people to just drop in a Thanos sidecar next to each of their Prometheus servers, so they will get backups of all their Prometheus server data (for those who are interested in long-term data retention). And then later, they can add the Querier component for an integrated view over multiple servers. And then later they can deploy the Store gateway to also integrate back long-term data into that view. And then at some point, the compactor...

Without being a Thanos expert, it took me ~15 minutes to deploy all those components (+ Minio for object storage) in front of a training audience that wanted to know more about Thanos (while reading Thanos + Minio docs). Of course a proper production deployment always takes way more time, but still I like how conceptually simple it is to integrate Thanos with Prometheus.

gnrl · on Jan 9, 2020

Do you have by any chance recorded this?

jrv · on Jan 13, 2020

No, sorry, it was a private commercial training.

chupasaurus · on Jan 9, 2020

Reddit writes out the metrics to other solution [0], so there is no difference from the setups in the article.

[0] https://www.reddit.com/r/kubernetes/comments/ebxrkp/we_are_t...

halbritt · on Jan 9, 2020

Yet Prometheus runs out of steam pretty quickly if one has a need to aggregate metrics from multiple regions or retain metrics for a lengthy period of time.

a10c · on Jan 9, 2020

I think you're conflating the scraping of time-series data with it's storage.

cfors · on Jan 9, 2020

Nope, to be clear I am talking about Prometheus' own TSDB.

ibspoof · on Jan 9, 2020

When my team agreed to use Prometheus from the client side we looked at Thanos, Cortex, and M3DB, but none of them gave us the flexibility and comfort of adoption for a small team providing a service to 10s of internal groups. We have many private internal DCs and needed metrics to be stored in the cloud, pulling data to the cloud seemed awkward and required access rights we couldn't get.

We ended up using Postgres 10 w/ TimeScaleDB and their Prometheus plugin with a simple emulated push gateway that converts a prom formatted http post to a postgres batch insert. Postgres is 3 nodes monitored with Patroni.

Working great for us and handling 1000+ metrics a second with ease and we get SQL for both real-time metrics for monitoring and analytics for business needs. We are using about 10-15% of our systems giving us room to grow.

jrv · on Jan 9, 2020

> We have many private internal DCs and needed metrics to be stored in the cloud, pulling data to the cloud seemed awkward and required access rights we couldn't get.

You mean having a Prometheus server run in the cloud, but then pulling from on-prem things from the cloud? Not sure how either Cortex or Thanos would require that, as you'd still run on-prem Prometheus servers for them, but then collected data is pushed to the cloud in the end. But maybe I'm misunderstanding what you mean here.

> Working great for us and handling 1000+ metrics a second

Curious about this - I would expect any system to be able to do that easily, as that's a tiny, tiny amount. A single big Prometheus server can do roughly 1000x that (I think someone once managed to do 1M samples/second ingested).

akulkarni · on Jan 9, 2020

TimescaleDB co-founder here. Glad to hear it is working well for you. If you have any feedback on how we can make it work even better, would love to here it - ajay (at) timescale.com

For others reading this: TimescaleDB can easily scale to millions of metrics a second [0][1], but to be honest we've found that few folks need that level of scale.

But people do care about whether or not they can trust something new. If you're already comfortable with Postgres then you're already comfortable with TimescaleDB.

Also! Other solutions are great too. E.g., I really like the Thanos model with object storage. I don't think there is a "best" option out there - it really just depends on what you want to optimize for your use case.

[0] https://blog.timescale.com/blog/building-a-distributed-time-...

[1] https://news.ycombinator.com/item?id=20760324

valyala · on Jan 9, 2020

Ingestion rate of 1K metrics per second can be handled by any solution on low-end hardware. I believe even ancient Raspberry PI would feel comfortable with such a load :)

VictoriaMetrics scales to 50M+ metrics per second on a single server [1], [2].

[1] https://mobile.twitter.com/MetricsVictoria/status/1209116702...

[2] https://mobile.twitter.com/MetricsVictoria/status/1209186575...

valyala · on Jan 9, 2020

VictoriaMetrics author here. I like the post, since it is cleanly written and it isn't biased to certain solution. I'd recommend readers trying all the mentioned solutions - Thanos, Cortex, M3DB and VictoriaMetrics and then choosing the solution that fits them the best.

Each solution has its own weak and strong points. The main selling points for VictoriaMetrics are:

* Operation simplicity. This is especially true for a single-node version, which is represented by a single self-contained binary without any dependencies. It is configured by a few command-line flags, while the rest of configs have sane defaults, so they shouldn't be touched in most cases.

* Low resource usage (CPU, RAM, disk space and iops, network bandwidth).

* High performance.

See also an interesting talk from PromCon 2019, where all these solutions are compared by Adidas monitoring team [1].

[1] https://promcon.io/2019-munich/talks/remote-write-storage-wa...

raisingtable · on Jan 10, 2020

So here is my case. I'm running multiple Prometheus HA pairs to cover different teams. At the moment, I'm using Thanos and VictoriaMetrics in parallel to test them out.

Thanos was the first I set up as VM wasn't open-sourced yet. It wasn't hard to setup and had it running in about a day together with Minio as an S3 backend. To this day it's running without a problem, apart from an alarm every now and then that the Store or Compactor couldn't get something done. But I didn't look too much into it since everything graphing-wise seems to work. Upgrades are also easy and I love the global querier option. I sometimes see people having OOMs on a rather "large" servers on Slack, but Thanos team is suppose to be working on optimizing memory usage and it's getting better and better.

After the last PromCon, I also configured VictoriaMetrics. Installation was as simple as it can be, way simpler than Thanos, but I'm using a single node version. It works really good for the last 3 months. Resource usage is a lot lower than on Thanos.

Both solutions have their own Slack channels with developers and users there, so it is easy to get help and resolve issues.

In the end, I think I'll go with VM in my case, since it has less moving parts, doesn't need S3 backend (we are on-prem and don't have a production S3 storage) and lower resource usage. It can also ingest InfluxDB metrics, which is a massive bonus for me, since NOC team is using a solution that can only send metrics to InfluxDB (snmpcollector).

netingle · on Jan 9, 2020

Cortex author here (Tom Wilkie). Great post that honestly highlights the differences between these systems - thank you!

The biggest take home here - and the first thing the post mentions - is the a single HA pair of Prometheus servers is enough for 80-90% of people. TLDR you probably don’t need Cortex (or Thanos, etc)...

...unless you run multiple, segregated networks (regions). Then something like Thanos (or Cortex) is useful - not for a the scale argument, but because you need a way to “federate” queries and get that global view. IMO!

chucky_z · on Jan 9, 2020

Isn't this the whole point of the federate endpoint? That you just run a central Prometheus pair to federate metrics at low resolution from a ton of places?

I only care about high resolution metrics for alerts. Otherwise I can just take a handful of them at 5m intervals, but from a lot of places.

jrv · on Jan 9, 2020

There's different tradeoffs... Prometheus's own federation is a pretty simple scrape-time federation - a Prometheus server pulls over the most recent samples of a subset of another Prometheus server's metrics on an ongoing basis. Thanos does query-time federation rather than actually collecting and persisting data for all "federated" servers in a central place (other than the e.g. S3 bucket for long-term data). So with Prometheus federation you have to choose pretty carefully which aggregated stats you'd like to pull into some higher-up Prometheus layer, and then you only have access to those (in that server). Thanos allows you to query over the data in multiple Prometheus servers at once, in all their detail.

And I think Cortex is mostly useful for people who want to run a big centralized, multi-tenant service in their org to keep all the global view and long-term data. (most people tend to use Thanos)

rektide · on Jan 9, 2020

I enjoyed the post. Good links to a lot of relevant, recent stories & events.

Not the article's fault, but it cites the "ClickHouse Cost-Efficiency in Action: Analyzing 500 Billion Rows on an Intel NUC" article that was published January 1. It's a week old, & I kind of feel like I'm never going to get away with it. It seems like a great, fun, interesting premise, but the authors took what is a challenging, huge data-set, and, under the guise of making the data look "realistic" they drained all the entropy out of the dataset, & then claimed they were 10-100x faster.

Well, yes, maybe for some workloads maybe. Maybe the changes they made might in some circumstances be "realistic" for some IoT use cases, maybe.

But I feel like I'm going to see this article come up again, and again, and again. And each time, I'll have these frustrations, about how while they may still be running queries on the same number of rows, they are running queries on many orders of magnitude less data. It's a fun read, & genuinely useful- in some circumstances- tech, but I don't expect to see this nuance showing up. I'm already weary, seeing this Clickhouse article again.

manigandham · on Jan 9, 2020

The difference between rowstores (Scylla/Cassandra) and columnstores (Clickhouse) comes down to the physical layout of data with batch/vectorized processing and other techniques.

There will always be a 1-2 magnitude increase in performance regardless of the data. They also used the same number of rows, except with smaller cardinality in measurements which would make an insignificant speed difference.

chupasaurus · on Jan 9, 2020

Stop reading the articles, benchmark it yourself then write about it.

freeseacher · on Jan 9, 2020

Let me explain my experience with tsdb selection. in 2018 we understood that we need something for long term storage. Selection was between thanos, elasticsearch and m3db.

M3db looks promising but after reading issues and docs i found i have to test it like database not like a drop in solution. for example that topic https://groups.google.com/forum/#!topic/m3db/6iG2NL7hJ7A And cortex and thanos announced that tweet https://twitter.com/fredbrancz/status/1043060822988259333

Elasticsearch got disqualification because of no remote_read support. So i stopped looking for anything for at least half year just updated retention policy in Prometheus to 150d.

Also VictoriaMetrics was banned because of no source code. Also https://github.com/akumuli/Akumuli was banned because of nobody hear about it. :(

Than after some time VictoriaMetrics appears to be open source and there was no issues with rate function and useless extrapolation.

So i test it on a small setup at about 7k metrics per second on single server. And it was amazing. Than 14k/s and 20k/s Previously i have the volume for Prometheus data and it was about 30 gigs on smallest install to 70 gigs on largest Moving from storing 30days in Prometheus to 90 days in vm was the huge benefit. On every of three instances with 7, 14 and 20k metrics per second i can extend retention from 3 to 5x on the same volume. With same dashboards. With same alerts. Just added remote read and remote write.

Than i decide to take it on a real life web scale production. So i started from 11 servers f2-micro on gcp. * 3 storage * 2 insert nodes * 2 select nodes * 2 promxy * 1 grafana * 1 selfmon prometheus

Got lots of expected ooms on 60k per second. Than i move to n1-standart-1 for storage and insert. It can handle at about 650k per second insert load for several weeks without ooms or any unaxpected behaviors. That was real life data from prometheus-operator from one of our rc clusters. node-exporters, application metrics and kubemetrics.

Tuning it to n1-highmem-2 for storage nodes so get enough room for background merges and so on.

Also i copy my Prometheus rules from prod to promxy (at about 200 in sum). That makes some noise for read. So i got at about 70 reads per second and pretty 90% cpu utilization on promxy servers. But almost no additional cpu load on vm servers. So i just bump all numbers in queries from seconds i moved to minutes, minutes to hours and hours to days in every query that have offset or rate or increase. That add some load to vm but not that much i expected.

In summary i'm amazed with simplicity of scheme i got. Performance is also great. My dashboards looks the same in Prometheus and VictoriaMetrics.

Oh. by the way i have some experience asking questions in issue tracker of Prometheus and VictoriaMetrics. And honestly prefer Aliaksander style of answering - long and with good under the hood info.