I have followed various snippets of community folklore advice on this topic over the years (many of them suggesting that the proper number of connections should relate somehow to the number of CPU cores). It was, therefore, refreshing to see one of the core developers, Bruce Momjian, publish modern advice[0] based on empirical observations he made while working on real client problems through EnterpriseDB. Spoiler: on modern hardware, without an over-aggressive shared_buffers setting[1], up to 250 direct connections to the DB can be made without detrimental impact.
> OLTP databases typically cannot process more than 5 x CPU cores without having to time-slice among CPUs
I think this is entirely wrong in reality. In most read-mostly OLTP workloads, client<->server latency and application processing time mean that at that connection count the machine will not be utilized sufficiently - that matches both my experience and my measurements.
5x may be around the peak throughput when a benchmark client runs on the same system and connects via Unix socket or localhost TCP. But even just benchmarking over a fast local network (10GbE, <10m cables, one switch) moves peak throughput quite a bit higher than that.
Few real-world scenarios have clients sending queries back-to-back without any gaps in between. They have to process the results, e.g. run them through some template engine or shuffle the data over to the web server.
I think for OLTP - and many other workloads - any guidance based on CPU cores is going to be so far off for 90% of workloads that it is more misleading than useful. The request patterns, and thus the bottlenecks, vary far too much.
For write-heavy OLTP workloads on medium-to-high-latency storage (cheaper SSDs, all network block stores, spinning disks), a larger number of connections can actually be good from a cost perspective: thanks to group commit, fewer IOPS are needed than if the same workload were split over two instances.
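To make that concrete, here is a toy back-of-the-envelope model (not a measurement - the commit rate and flush latency below are made-up numbers) of how splitting the same commit load over two instances reduces group-commit batching and therefore costs more fsync IOPS:

```python
# Toy model of group commit: transactions that try to commit while a WAL flush
# is already in flight get folded into the next fsync, so fsync IOPS is roughly
# commit_rate / batch_size rather than one fsync per commit.
# All numbers here are made up for illustration.

def wal_fsyncs_per_second(commit_rate: float, flush_latency_s: float) -> float:
    """Approximate WAL fsyncs/s, assuming commits arriving during a flush are batched."""
    batch_size = max(1.0, commit_rate * flush_latency_s)  # commits piggybacking per flush
    return commit_rate / batch_size

commits_per_s = 2000      # total write load
flush_latency = 0.005     # 5 ms per fsync on slow-ish network storage

one_instance = wal_fsyncs_per_second(commits_per_s, flush_latency)
two_instances = 2 * wal_fsyncs_per_second(commits_per_s / 2, flush_latency)

print(f"one instance:  ~{one_instance:.0f} fsyncs/s")    # ~200
print(f"two instances: ~{two_instances:.0f} fsyncs/s")   # ~400
```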
It's common to put a connection pooler such as pgbouncer in front of postgres, usually in the same datacenter/cloud region, sometimes on the same machine.
In that case, the behavior you describe will happen between the clients and the pooler, then the pooler will mux the clients' queries onto a smaller number of connections to the database.
In many configs, once all the pgbouncer -> postgres connections are used, any more incoming queries will be queued. Each time pgbouncer gets a result from postgres, it pulls another query off the queue, so if queuing is sustained you'll see back-to-back requests on the pgbouncer -> postgres connections without gaps in between.
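For illustration, here is a minimal application-side sketch of that mux/queue behavior in Python with psycopg2 (the DSN and pool size are placeholders); pgbouncer does the equivalent at the wire-protocol level rather than inside the application:

```python
# A fixed set of server connections; callers block (queue) when all are busy.
import queue
import psycopg2

DSN = "host=127.0.0.1 dbname=app user=app"   # placeholder connection string
POOL_SIZE = 20                               # analogue of pgbouncer's default_pool_size

pool = queue.Queue()
for _ in range(POOL_SIZE):
    conn = psycopg2.connect(DSN)
    conn.autocommit = True    # keep pooled connections out of open transactions
    pool.put(conn)

def run_query(sql, params=None):
    conn = pool.get()         # blocks (queues) when every server connection is in use
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall()
    finally:
        pool.put(conn)        # the next queued caller reuses it back-to-back
```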
I agree there's no general CPU core guidance here. The gold standard is to load test your app representatively, safely beyond your target load - or to destruction if nobody can tell you what the target load is.
Pooling has a huge effect in this regard when using statement-level pooling, which isn't suitable for all that many workloads. With transaction-level pooling the effect is still significant, but latency / processing gaps start to matter again. And even transaction pooling has a lot of issues - the lack of proper prepared statement support being the biggest.
That last issue really could be solved, but to my knowledge none of the common poolers have done so.
Imo the medium-to-long-term solution here is to separate connections from processes / threads. Neither deals well with many realistic and important workloads (the overhead being smaller, but still significant, with threads). It's a huge project with lots of prerequisites though (AIO, threading, non-recursive query execution)...
> I thought process forking was a fundamental aspect of Postgres' architecture. Is it correct to read the above as effectively rearchitecting Postgres?
Is it a small change? Certainly not. But it'd also not touch the majority of the code.
There have been prototypes for just going to threading by liberally sprinkling thread-local markers around - it works, but is fairly ugly. Working towards that more incrementally is quite doable, though.
I agree. I normally recommend multiplying the number of cores by a 2-5x factor, and then dividing the result by a subjective "% utilization of the connection", which is necessarily a value lower than 1 and therefore increases max_connections.
This % of utilization accounts for factors like the session not sending queries back-to-back, io_wait on the server (which in turn depends on the speed of your I/O subsystem), etc.
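As a made-up example of that rule of thumb (the core count, factor, and utilization below are purely illustrative):

```python
# The rule of thumb above: cores * (2-5x factor), divided by the estimated
# fraction of time each connection is actually busy on the server.
def suggested_max_connections(cores: int, factor: float, utilization: float) -> int:
    """factor ~ 2-5, utilization in (0, 1] - the busier each connection, the fewer you need."""
    return int(cores * factor / utilization)

# e.g. 16 cores, 3x factor, connections busy ~25% of the time (waiting on the
# application / network / io_wait the rest of the time):
print(suggested_max_connections(cores=16, factor=3, utilization=0.25))   # -> 192
```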
However, I'd also advise against looking only at the TPS peak as the optimal point. That peak is indeed typically reached at high connection counts - Postgres is reasonably good at managing large numbers of connections in terms of throughput - but latency may go through the roof.
So to find the real optimum, latency needs to be considered at least as much as TPS, IMO (or, obviously, both together).
I think these measurements include the page table entries in the kernel? It's my fuzzy understanding that page table entries are 64-bit on amd64, so if you have 64GB of shared buffers (which is quite a lot) you'd have 128MB of overhead per process just to map that with 4KB pages. Which is to say the overhead scales with the shared_buffers size and inversely with the page size.
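The back-of-the-envelope arithmetic behind that 128MB, assuming 8-byte leaf PTEs and ignoring the upper page-table levels:

```python
# One 8-byte PTE per mapped page, per process, so the page-table overhead
# shrinks as the page size grows (which is what huge pages buy you).
GiB = 1024 ** 3

def page_table_bytes(mapped_bytes: int, page_size: int, pte_size: int = 8) -> int:
    return (mapped_bytes // page_size) * pte_size

print(page_table_bytes(64 * GiB, 4 * 1024) / (1024 ** 2))        # 4KB pages     -> 128.0 MiB
print(page_table_bytes(64 * GiB, 2 * 1024 ** 2) / (1024 ** 2))   # 2MB huge pages -> 0.25 MiB
```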
It is not a constant factor; saying it is about 2MB is misleading.
It's possible I'm missing something special about postgres, but that's how shared memory between processes works on Unix systems in general.
MySQL uses threads, which are much lighter weight because they don't need their own page table entries, and because fewer TLB flushes are needed when switching contexts between them. It's one of the architectural decisions they got right, I think. Postgres went with processes so it's easier to kill a connection cleanly, without leaving things in a bad state. I think there are other ways of solving that problem though.
The < 2MB figure is with huge pages (2MB ones in this case) enabled. I used 16GB of shared_buffers. The measurement at the end shows how much of a difference in memory usage that makes: The page tables use 6.5MB per process without huge pages, 132KB with.
So sure, you can have shared buffers set so large that it's more than 2MB per process - but there is a fair bit of headroom to 2MB. So I don't think it's particularly misleading.
I think threads are the right call, and that we should move to threads at some point. Partially due to the overhead you mention, but more importantly because there are a lot of features that are much harder with separate processes. We have a fair bit of redundant code due to it: one version working with plain pointers, one with pointers that differ between processes, to deal with dynamically allocated shared memory for things like parallel query.
It's a decent ballpark estimate, but it's definitely not a constant factor. People with really huge database deployments (like machines with 4TB of RAM) must be crazy if they're not aggressively using pgbouncer to keep the number of connections down.
I'm actually quite heartened to hear that the attitude among Postgres developers is changing. Back in the day whenever people would bring up threads vs processes they got shouted down because processes are clearly superior and threads were "just implemented as processes on linux anyway". That's going to be one hell of a job to change now though.
> must be crazy if they're not aggressively using pgbouncer to keep the number of connections down.
With the limitations around PG poolers that's not always that easy :(
I've worked on a number of very large Postgres instances - backend memory usage usually wasn't a major issue for the very large ones (schema-sharded ones are the exception, where it is a large issue). The TLB miss ratio, however, was a major bottleneck on them, even with huge pages (perhaps even because of huge pages, since the TLB for huge pages used to be so small).
> I'm actually quite heartened to hear that the attitude among Postgres developers is changing. Back in the day whenever people would bring up threads vs processes they got shouted down because processes are clearly superior and threads were "just implemented as processes on linux anyway". [...]
>
> To Tom Lane: I told you so :)
The project is bigger than Tom Lane ;)
I think it's been pretty clear that threads have more advantages than disadvantages (which are substantial - hello mmap_sem) for quite a while. But that doesn't necessarily mean that we should have changed it a couple years back - as you say, it's not a small change, and it'll cause some disruption. There arguably were (and perhaps are) more crucial issues.
> That's going to be one hell of a job to change now though.
Yea. But it's doable. Personally I think the PG-internal changes are not the hardest part - that's having to deal with all the extensions out there: both ensuring that they are adapted, and managing the pain of the resulting API evolution.
> a) You can't hold 5000 open connections out of which 4900 are idle without performance impact.
Author here. I agree. In fact this blog post only exists because a post analyzing the various bottlenecks around connection scalability in Postgres was getting too long, so I split this part out... That post has numbers showing the slowdowns:
https://techcommunity.microsoft.com/t5/azure-database-for-po...
There is a significant set of improvements in Postgres' development branch addressing what is - in my opinion - the major source of the performance impact of lots of idle connections.
> b) Opening a new connection takes way longer than connecting through pgBouncer running on a different(!) box.
Hm. How much longer is "way longer"? And how high is shared_buffers on that instance? In tests it's pretty easy to reach a few thousand connection establishments/second. That's not great, but also not terrible.
If the instance is otherwise very busy (e.g. due to the above issue), it can make connecting very slow, due to contention on very important locks.
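A crude sketch of such a measurement (single client, placeholder DSN; real numbers depend heavily on TLS, the authentication method, and how busy the server is):

```python
# Measure raw connection establishments per second from one client.
import time
import psycopg2

DSN = "host=127.0.0.1 dbname=postgres user=postgres"   # placeholder
N = 500

start = time.monotonic()
for _ in range(N):
    conn = psycopg2.connect(DSN)
    conn.close()
elapsed = time.monotonic() - start

print(f"{N / elapsed:.0f} connection establishments/second (single client)")
```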
I’ve personally seen pgbouncer do thousands of connection establishments per second without breaking a sweat, while also handling immediate queries on many of those connections, all under a tight single-core constraint. That was caused by a thundering herd, and I don’t plan to repeat it, but unless your benchmark was also on a single-core machine, I’d say that’s “significant”; PGBouncer really has no overhead on connection startup, while simply copying the shared buffers table into a new process represents a real overhead for postgres proper.
A TCP connection is really lightweight, when stripped down enough. Thousands of connection establishments per second per core is not actually gonna break a sweat.
I really didn't want to give the impression that I think poolers are useless - I don't, I think they're often crucial. Sorry if I did so.
> while simply copying the shared buffers table into a new process represents a real overhead for postgres proper.
You mean the page table being large due to the large shared memory allocation? With huge pages that's not that large anymore (132KB for 16GB of s_b). My impression is that with huge pages the problem becomes primarily the locking around the huge pages when a lot of processes constantly fork and exit, rather than the actual amount of copying.
I didn’t get that impression at all. I just want to make clear the orders of magnitude we’re talking about. I do 100:1 multiplexing with pgbouncer: there are a few hundred connections open to our Postgres database, from application servers that have tens of thousands of connections open, all multiplexed through pgbouncer.
You’re definitely right that hugepages make the allocation pretty small, but fork() still has to walk the page table to copy it; I’ve seen that be a real bottleneck in apps. Then again, I might be misremembering; the problems could simply have come from having a huge number of 4k pages to walk. What I’m trying to say is: a fork() call can be relatively cheap, but it is rarely as cheap as an accept().
The biggest limiting factor in my experience is actually work_mem. You can reduce it quite a bit from the default (which I was remembering as 10MB/connection, though apparently the default is 4MB; it might be the AWS RDS defaults that I had in my head). Even halved to 2MB, it still represents a significant overhead: a thousand connections take two gigs of memory - obnoxious, but workable. Ten thousand? Now you’re biting significantly into the size of your working set.
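The arithmetic behind that, treating work_mem as a flat worst-case per connection (in reality it applies per sort/hash operation and only while a query is running - see the reply below):

```python
# Worst-case memory if every connection used its full work_mem at once.
MiB = 1024 ** 2

def worst_case_work_mem_gib(work_mem_bytes: int, connections: int) -> float:
    return work_mem_bytes * connections / (1024 ** 3)

print(worst_case_work_mem_gib(2 * MiB, 1_000))    # -> ~2 GiB
print(worst_case_work_mem_gib(2 * MiB, 10_000))   # -> ~19.5 GiB
```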
> A fork() call can be relatively cheap, but is rarely as cheap as an accept().
Agreed, obviously. While I personally am not that concerned with the cost of connection establishment, and much more concerned with the context switches in a thread- or process-per-connection model, the fix for the latter would likely also fix the former. I think there are a few higher-priority issues in PG (some of which are prerequisites too), but it's somewhere in the top 5.
> The biggest limiting factor in my experience is actually work_mem.
For me it hasn't been that big an issue in practice. It's transient memory usage, i.e. it's only used in backends actively processing queries, not idle ones. And in many - but not all! - cases where you have a huge number of connections, most queries are simple and use a good bit less than 2MB.
There are a fair number of issues with work_mem, don't get me wrong. But they are more about it being applied several times within more complex queries than about the simple fact of using some memory for query execution - and about it being hard to limit the number of concurrent queries rather than the number of connections.
Yet long-lived connections (processes) suffer tremendously on PostgreSQL servers with high core counts (> 100) and a decent amount of memory (> 256GB). Connection pooling is good, but you've got to recycle those connections on a regular basis.
Connect using Unix sockets. They bypass the TCP/IP stack, so the per-connection overhead is much lower.
For example, connecting to pgBouncer through a Unix socket saves about 50% CPU.
Unfortunately there are some kernel tuneables (buffer sizes) which you have to set at compile time to support massive data transfers on sockets.
Obviously this only works if you are on the same machine - and if you connect directly to Postgres this way, it will still happily spawn a process for every connection :(.
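For what it's worth, a minimal example of connecting over a Unix socket with psycopg2 - the socket directory below is the common Debian/Ubuntu default and may differ on your system (or match pgbouncer's unix_socket_dir instead); dbname/user are placeholders:

```python
# Passing a directory path as "host" makes libpq use a Unix-domain socket.
import psycopg2

conn = psycopg2.connect(
    host="/var/run/postgresql",   # socket directory, not a hostname
    port=5432,                    # selects /var/run/postgresql/.s.PGSQL.5432
    dbname="app",
    user="app",
)
with conn.cursor() as cur:
    cur.execute("SELECT 1")
    print(cur.fetchone())
conn.close()
```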
If anybody has a good idea of how to accurately subtract memory that is shared between processes... It's doable with /proc/[pid]/pagemap + /proc/kpagecount, but that's fairly expensive...
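One cheaper (if less exact) option is summing the Private_Clean/Private_Dirty fields from /proc/<pid>/smaps, which counts only pages not currently shared with any other process (Pss, also in smaps, would instead split shared pages proportionally among their mappers). A rough sketch:

```python
# Approximate a backend's non-shared memory from /proc/<pid>/smaps.
# smaps_rollup (Linux 4.14+) exposes the same totals more cheaply.
import sys

def private_kb(pid: int) -> int:
    total = 0
    with open(f"/proc/{pid}/smaps") as f:
        for line in f:
            if line.startswith(("Private_Clean:", "Private_Dirty:")):
                total += int(line.split()[1])   # values are reported in kB
    return total

if __name__ == "__main__":
    pid = int(sys.argv[1])
    print(f"pid {pid}: ~{private_kb(pid)} kB private")
```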
Practical experience with AWS RDS Postgres is that an m4.large instance chokes at around 150 connections. We are currently planning to test RDS Proxy - does anyone have experience comparing that service with running pgBouncer ourselves?
[0] https://momjian.us/main/blogs/pgblog/2020.html#April_22_2020
[1] https://momjian.us/main/blogs/pgblog/2018.html#December_7_20...