I understand that postgres is pretty decent at storing large binary data though?...

yarcob · on Sept 28, 2021

> I understand that postgres is pretty decent at storing large binary data though?

Storing large files in PostgreSQL is possible, but I would advise against it.

PostgreSQL uses a technique called TOAST to store large objects. If you store something in a bytea column, it'll first be compressed (either deflate or a new algorithm that I forgot the name). Then it will be split into 8KB chunks (page size) and stored together with all the other data.

Since most image formats are already compressed, this just wastes a lot of CPU cycles.

Also, the way storage is allocated in PostgreSQL may bite you in the ass. By default, storage is not returned to the OS. If you delete all images, the data will stay on disk and marked as reusable. Only if you do a full rewrite of the table will the free space be made available again.

Also, backups will become a pain. The easiest way to backup PostgreSQL servers is to just do a pg_dump. It works great up to a few gigabytes of data. If you have images or other large objects in the database, then pg_dump will quickly become unfeasible and you will have to find another way to do backups. Not a deal breaker, but something you should be aware.

Also, for best performance, the database should be on fast storage (eg. NVME SSD), whereas the image data typically doesn't need fast storage.

## A Practical Example:

One project I work on is an image database with 50k images. The high resolution previews are around 20GB. The PostgreSQL database with all the metadata is around 20MB.

It takes 10 seconds to do a full backup of the database. Everytime I change something in the application, I first create a dump of the database (which is just a few MB compressed). Then I do my changes, and if I make a mistake I always have a backup that I could restore in 10 seconds.

If the images were stored in the PostgreSQL database, a full backup would take much longer and would require a fast internet connection. And it would be much harder to work with.

My recommendation: Use the filesystem for storing files.

jrochkind1 · on Sept 28, 2021

Thanks for info! What about binary large objects instead of bytea?

Size/speed of backups is a good point to consider.

yarcob · on Sept 28, 2021

I've never used PostgreSQLs large objects. You have to use this awkward API to use them, they aren't stored efficiently either, and as far as I can tell it's just a legacy feature that nobody has bothered to deprecate yet.

In my opinion the only neat thing about large objects is that you can edit files with a posix-like API in a database transaction, but I can't think of a scenario where that would actually be useful.

Arnt · on Sept 28, 2021

Postgres is indeed good at that. I know from experience.

If you want to commit ten things every second all week, don't transfer so much data in a single transaction that you hold any important locks for half a minute. And if you use replication, test while replicating.

If the blobs you store are small enough to not disturb your commit rate, then pg's drawbacks are IMO smaller than those of the alternatives, particularly if you want some sort of commit that returns when both image data and something else have been committed.

jrochkind1 · on Sept 28, 2021

Thanks for info!

So you've stored images in the db in production and had it work out fine?

I don't want to commit ten things a second. It would be a pretty read-heavy write-light load. Sounds like that would avoid one path of riskiness at least.

Arnt · on Sept 28, 2021

Yes, exactly.

Many people answer a question like this: "How good is postgres at storing big blobs, relative to small things?" Postgres is indeed worse at storing large things. However, I think the most interesting question is something along these lines: "What's best, storing everything in Postgres and accepting that inefficiency for the big blobs, or implementing transactions, replication and/or backup with a part of the data in Postgres and the big blobs elsewhere?"

I'm awfully fond of having working backups, and postgres' performance problems with the blobs haven't been big enough that I've really noticed.

tinus_hn · on Sept 28, 2021

Does it still require a backup/restore cycle for upgrades? That could be an issue.

yarcob · on Sept 29, 2021

Yeah, that hasn't changed. You can also use the pg_upgrade script instead of a full backup/restore, or use logical replication if you want to upgrade the db server with zero downtime, but upgrading PostgreSQL is still somewhat annoying.