Dbdev – A database package manager for PostgreSQL trusted language extensions (database.dev)
156 points by samwillis on April 14, 2023 | 54 comments


At KotlinConf today I gave a talk on designing apps with two-tier architecture, where you implement your entire app without the web stack appearing anywhere at all. Instead you publish desktop and mobile apps that connect directly to an RDBMS like PostgreSQL via its native protocol, and use server extensions for any logic that is inconvenient to do with SQL.

This approach might seem horrifyingly outside-the-box but has a lot of advantages, and some of the problems that historically pushed us away from it have been solved in recent years.

Because it was KotlinConf, the demo uses PL/Java, which is pretty nice because there's such a healthy ecosystem of stuff based around JDBC, and because deploying JVM stuff doesn't require any sort of cross-compilation. PL/Java also supports (for now) trusted extensions using sandboxing, although of course the sandbox can just get in the way, and normally you trust your own server anyway, so this is a double-edged sword.

The demo code can be found here (it's a prototype and nobody has reviewed it yet, so be gentle):

https://github.com/hydraulic-software/bugzino

I'll write up a blog post version of the talk, but for now I wanted to mention that DBaaS providers don't actually enable this sort of design, because they like to wall off the full power of the RDBMS behind custom APIs. But in two-tier design you really lean into the database and use all of its features. So, it'd be nice if:

a. database.dev were to support PL/Java extensions.

b. Supabase were to allow direct connections, as the native DB protocol supports a lot of features that otherwise have to be sort of hacked on top of HTTP. Ultimately, HTTP is designed to fetch hypertext whereas the PG native protocol is designed to work with data, and that difference shines through in a bunch of ways.


> This approach might seem horrifyingly outside-the-box

I mean, this is how we routinely built client-server apps 20-30 years back. Everything old is new again!

One of the big reasons so many moved away from this approach was security. Giving end users direct access to the database (even if "obscured") gives me the shivers!


One of the things that's changed in recent years is that free databases like PostgreSQL have far more security features. You can grant fine-grained privileges on individual objects: queries (via views), individual table rows, stored procedures/functions, and so on.

In the demo app, I use a simple approach to security: users don't have permissions to directly access any tables. Read access is via views and could use row-level security (RLS), and write access is always via stored procedures. Stored procs can also act as queries, and they can be written in any supported extension language, so any access control and privacy policy is implementable.
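
To make that concrete, here's a minimal sketch of what the grants look like (the schema, view, function, and role names are illustrative, not from the demo):

    -- App users get no direct table access.
    REVOKE ALL ON ALL TABLES IN SCHEMA app FROM app_user;

    -- Reads go through a view (the underlying table could also use RLS).
    GRANT SELECT ON app.my_bugs TO app_user;

    -- Writes go through a function running with the definer's rights,
    -- so it can enforce whatever policy you like before touching the table.
    CREATE FUNCTION app.file_bug(p_title text, p_body text) RETURNS bigint
      LANGUAGE sql SECURITY DEFINER SET search_path = app AS $$
        INSERT INTO bugs (title, body) VALUES (p_title, p_body) RETURNING id
      $$;
    GRANT EXECUTE ON FUNCTION app.file_bug(text, text) TO app_user;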

Still, security in two-tier apps is a nuanced topic. Some things get better and other things are worse.

Some of the remaining things to solve are:

1. RDBMS engines tend to be written in C. They're good code and fairly trustworthy, but still: web servers tend to be written in memory-safe languages and databases aren't.

2. DDoS defense companies have more options for protecting web apps (sending js challenges and captchas). If you have a mobile app then you're exposing raw APIs anyway, so this is maybe not a big change.

3. You have to set resource quotas to stop malicious users from submitting expensive queries (see the sketch after this list). Of course, this is also a concern with web apps; wrapping PG in HTTP doesn't make it impossible to spam servers with expensive queries.

4. Data can be protected with ACLs pretty well, but schemas not so much, so you could leak an upcoming feature via the appearance of new object names. Web apps sometimes have the same issue with leaking info about new features in their JS, though, and the sky doesn't fall.
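
On point 3, Postgres already has per-role knobs for this. It's not a full quota system, but a rough sketch (role name illustrative):

    -- Kill any statement from this role after 2 seconds.
    ALTER ROLE app_user SET statement_timeout = '2s';

    -- Cap per-operation sort/hash memory and concurrent sessions.
    ALTER ROLE app_user SET work_mem = '8MB';
    ALTER ROLE app_user CONNECTION LIMIT 20;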

For many app types the above concerns don't matter much: any app that's internal or business-to-business, for example. If you build auth around client certs, then unauthorized users never get to connect at all.

Also! Consider that this design eliminates many kinds of security issue (the talk goes into this). For example, XSS and SQL injection are eliminated by design, and those have been two of the most common and destructive bug classes over time. Web servers also don't exactly have a track record of being unhackable, especially as wrapping SQL with HTTP handlers introduces many new places where access checks might be forgotten or incorrectly implemented. A SQL GRANT statement is transparent, and its implementation is well tested, compared to app-specific logic that might have been written by one person and never properly tested.

There are also other security-related benefits. For example, certificate expiry routinely causes massive and eye-bleedingly disruptive outages, but it's eliminated as a problem entirely by this design: code signatures are timestamped, so they don't break when the cert expires, and the DB connection can then be encrypted using a self-signed certificate that never expires. To rotate the key, you roll out a new client.


This is basically how it's done today with tools like PostgREST.


Who remembers SQL in Delphi UI :)


This approach is very appealing, but when I put my SRE hat on my first concern is observability. Has anyone implemented tracing for Postgres with a span for each procedure call?

When I put my dev hat on, I want the option of attaching a debugger and adding breakpoints. Anything there?


With PL/Java you can at least debug stored procedures, because the Java Debug Wire Protocol is a network protocol, so you can attach a debugger remotely. There are instructions for how to do this in the PL/Java docs.
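
If I remember the docs right, it's done by passing the standard JDWP agent flag through PL/Java's `pljava.vmoptions` setting, something like this (database name illustrative; check the PL/Java docs for the exact recipe):

    -- Make PL/Java's backend JVM listen for a remote debugger.
    ALTER DATABASE bugzino SET pljava.vmoptions TO
      '-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5005';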

For debugging SQL remotely, I don't know what's available, if anything. Though if you only use SQL for short things, you can just run the query and debug it interactively, I guess.

Most databases have tools for logging query timings and doing auditing. pgAudit is one for Postgres.
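
For Postgres specifically, a couple of hedged examples of what that looks like:

    -- Log every statement slower than 250 ms.
    ALTER SYSTEM SET log_min_duration_statement = '250ms';
    SELECT pg_reload_conf();

    -- pgAudit session logging (needs pgaudit in shared_preload_libraries).
    CREATE EXTENSION pgaudit;
    ALTER SYSTEM SET pgaudit.log = 'read, write';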


> This approach might seem horrifyingly outside-the-box but has a lot of advantages

You don't need to worry: there are a lot of people who are into stored procedures. If there weren't, there wouldn't be people disagreeing with them, like the author of Rails: http://web.archive.org/web/20060418215514/http://www.loudthi...

Liking stored procedures a whole lot is a slippery slope toward them taking over the rest of the middle of the stack, and that isn't a new idea.

I like the idea of having instant access to the database, and microservices are my favored approach.


I think DHH's perspective makes a lot of sense from the perspective of a web framework author. If you're going to buy into Rails, it makes sense to buy in all the way, and not try to split your app logic across two very different languages and execution environments. Naturally he wants everything to be in Rails and for the DB to be treated as dumb as possible; it is always the way of a platform to want to subsume and abstract its lower layers.

In two-tier architecture there is no web framework, however, because there is no web server to begin with. All the UI logic runs client side and is written in the same language as the stored procedures. In this design there's nothing that's competing to be the place where app logic is implemented because you can't trust the client, so all trusted logic has to be in the database. In that case you buy into the RDBMS all the way, and really lean on its capabilities. The tension is resolved in a different direction.


> This approach might seem horrifyingly outside-the-box but has a lot of advantages, and some of the problems that historically pushed us away from it have been solved in recent years.

From a development standpoint, I definitely think this seems easier! My concerns about this would be almost entirely about security. How does this approach avoid leaking credentials that people could use to access the database outside of the application? More generally, how does this approach deal with the increased attack surface for the database from exposing it to the open internet?


See my other post for a discussion of non-user security topics.

Every app user has a database user, mapped 1:1, so there are no credentials to leak: the DB credentials are the user credentials. In the demo app repository I implement open signup with email confirmation using stored procedures/a server extension. It works by creating a guest user with a well-known password, but it's locked down so that the only things it can do are run a couple of stored procedures. Those then send an email to confirm ownership of the address before creating a real DB user.
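
The shape of it, roughly (role and function names here are illustrative; the real code is in the repo):

    -- Guest can log in but touch nothing by default.
    CREATE ROLE guest LOGIN PASSWORD 'guest';
    REVOKE ALL ON ALL TABLES IN SCHEMA app FROM guest;
    REVOKE EXECUTE ON ALL FUNCTIONS IN SCHEMA app FROM PUBLIC;

    -- The only doors: request a signup, then confirm the emailed token.
    GRANT EXECUTE ON FUNCTION app.request_signup(text) TO guest;
    GRANT EXECUTE ON FUNCTION app.confirm_signup(text) TO guest;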

Open work: don't use passwords; mint SSL client certificates during signup, with the private key stored in the OS keystore. Then there are no cookies or passwords to steal! Also, OAuth integration.

The guest user / open signup is certainly a weakness for the demo, which is public on the internet (a dedicated cluster though). It's locked down as much as possible but there are probably lots of ways to screw with the server even without being granted access to anything. And there are no rate limits on signup. It's really a very basic exploration.


Connections are pretty expensive in PG; how can you do pooling if every user connects directly?


Connection poolers like PgBouncer [1] (traditional) and Supabase's Supavisor [2] (new) come to mind.

[1] https://www.pgbouncer.org/ [2] https://github.com/supabase/supavisor


Supavisor looks great. Unclear if it can be configured to use https://github.com/pgaudit/set_user though.

That said, there are quite a lot of CRUD apps in the world that don't need to support lots of simultaneous users, or where you can just add read replicas quite cheaply. Think internal apps, dashboards, etc.


I had fun with a similar concept: just access the database from the browser using nginx/TLS/lua: https://github.com/alexanderguy/pgress

If you're good with your authenticated users directly talking to the DB (which there are plenty of uses for), it's a great way to get your data into the browser.


My first job in SF was with WebLogic, and at the time their #1 product was a JDBC driver that let applets talk directly with databases proxied through a Java server. It looked something like Applet<->JDBC<->HTTP<->Server<->JDBC<->Oracle (most of the time).


Omnigres, "Postgres as a platform", is very much in the same vein: https://github.com/omnigres/omnigres


That's cool. Seems they're focused on writing apps in SQL though.


Omnigres author here.

While it's definitely possible (and perhaps even advisable) to build many applications in SQL (especially if your traditional backend mostly just slings SQL queries), Omnigres by no means limits users to that. Any other Postgres PL can handle requests.

Support for proxying requests to Postgres-managed containers is coming, too.

The core idea here is less about using SQL or any other PL, and more about increasing mechanical sympathy and decreasing the integration, devops, and network overhead of the traditional architecture.

Feel free to ask questions or stop by our Discord!


Are there any performance issues with connection handling? Not sure how/where a connection pooler fits in.


This is also discussed in the talk, which I hope will be on YouTube soon enough (it was live streamed).

DB connections are heavier than web server connections, particularly in terms of memory (with Postgres, at least; not sure about MySQL). Multiplexers like PgBouncer can address this, but then you lose a bit of the deployment simplicity. Fortunately RAM is pretty cheap nowadays; you can configure the DB to disconnect idle connections, you can use multiplexers, and you can make the client transparently reconnect when a connection is needed. So it turns into a RAM/user-latency tradeoff.
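
On the "disconnect idle connections" point, newer Postgres can do this server-side; for example (the first setting needs PG 14+):

    -- Drop sessions idle for 10 minutes (the client reconnects on demand).
    ALTER SYSTEM SET idle_session_timeout = '10min';

    -- Don't let an idle-in-transaction session pin resources forever.
    ALTER SYSTEM SET idle_in_transaction_session_timeout = '60s';
    SELECT pg_reload_conf();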

There are some benefits, though. Native DB protocols tend to stream results in binary, avoiding a JSON encoding that can inflate bandwidth requirements. You can also get rid of server-side load balancing, because good DB drivers already know how to load-balance among read-only replicas. A lot of deployment complexity in web apps is driven by the fact that browsers handle errors by just throwing up an error page and stopping, so you need to avoid even tiny amounts of downtime. When you control the client, you can do failover and recovery more gracefully, which in turn relaxes the availability constraints on the servers a bit.


It seems like this uses native Postgres security/roles, which I think would prevent a connection pooler from working (unless you held open one connection per user).

Ultimately it'd be nice if Postgres could keep a hot worker pool for connections instead of forking processes, but AFAIK there's been plenty of debate over the years and no consensus.

I think high connection churn also eats up CPU, and there's the issue that increasing connection load eats more RAM, which reduces the memory available for Postgres's cache. We had a connection leak problem, and clearing ~2600 PG connections saved something like 80-100GB of RAM.


I think you can use https://github.com/pgaudit/set_user combined with some PgBouncer configuration to solve that, but agreed that Postgres doesn't have perfect support for this scenario. It'd be better to have a proxy designed specifically for this use case that knows how to drive something like the set_user extension (and better still if that were merged upstream).
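
For reference, the per-request dance such a proxy would drive looks roughly like this (role name illustrative; `set_user` is the extension linked above):

    -- Pooled connections log in as one low-privilege pooler role,
    -- then escalate to the end user's role for a single request:
    SELECT set_user('alice');
    -- ... run alice's statements under alice's privileges ...
    SELECT reset_user();

    -- IIRC set_user can also require a token to reset, which plain
    -- SET ROLE / RESET ROLE can't enforce against a hostile client.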


Mike has called this out as a weakness of PostgreSQL, and an advantage of Microsoft SQL Server and Oracle, before.


I'd love to see plv8 added to this.

Jerry S. has done an amazing job of wrangling the insane v8 build process into something that mostly works across platforms, but installation is still not particularly simple.

Unfortunately, due to the problems with that build process, Ubuntu and other flavors dropped the prebuilt apt packages.

Making plv8 a simple install would be a huge boost to the community.


wow, thanks for the compliment! it hasn't always been easy, believe me.


That's interesting - could you describe what makes the build and install so hard? I can't think of any obvious reason why it should be harder than PL/Java. V8 is pretty widely used!


v8 has its own build chain, which involves ~5gb of downloads. that build chain doesn't work on many platforms, and requires very specific versions of prerequisites including specific glibc versions.

the build chain consists of specific compilers, specific versions of chromium parts and pieces, its own sys root, everything.

since the API for v8 changes on a regular basis (not as much as it has in the past, at least), you cannot simply change out v8 versions, as they likely will not work.

v8 is also compiled differently for each embedding application, and thus a shared object won't work correctly across multiple applications; besides, sharing one is strongly advised against by the v8 team.

nodejs has its own build process that they've been maintaining since the v8 3.1.4 days, but it is specific to node.

so, it just adds up to difficult maintenance.


Wow. That's kind of surprising. HotSpot is just autotools: ./configure && make and you're done. Plus the embedding APIs are stable.

Do you have any insight into what led them to have such a complex build? Is it just because it's a part of the Chromium tree? It doesn't seem inherent to the problem space.


they use their own build systems for everything, even skia. though skia has a bit less strict build requirements.


Thank you for the amazing work indeed!

I know that at one point the idea of using a non-v8 JavaScript engine was floating around, due in part to the V8 build complexity and the API breaking changes.

Is this still something that is up in the air?

I know that Bun decided to use JavaScriptCore as a recent example.


I have been slowly but surely working on pljs in my spare time. it is nowhere near feature parity with plv8, but I hope to get it closer with more time. currently missing:

    * query execution/cursors
    * window functions
    * sub transactions
it is based on quickjs, is very fast/easy to compile, and while in long-running functions it is slower than v8, it has a very fast startup time that negates that for shorter functions.

but, it's not ready for prime time yet.


Good news for you then! plv8 is a trusted language and works with pg_tle just fine.

EDIT: I think I misunderstood your comment. plv8 itself is a compiled extension and thus cannot be a TLE; however, TLEs can be written in plv8 once you have it installed. Sorry for the confusion!


From https://supabase.com/blog/dbdev

> dbdev fills the same role for PostgreSQL as npm for JavaScript, pip for Python and cargo for Rust in that it enables publishing libraries and applications for repeatable deployment. We'll be releasing the tooling necessary for third-parties to publish pglets to the registry once we’ve collected some community feedback and incorporate any great new ideas. Our goal is to create an open ecosystem for packaging and discovering SQL.

How does https://pgxn.org/ fit in this ecosystem?


pgxn does not (yet) support the notion of TLEs, but it certainly could with minimal changes. We're currently in discussion with the pgxn maintainers about how that would look.


TLEs = trusted language extensions?


Yes.


What's stopping me from adding an extension to this repository that adds something valuable while simultaneously exfiltrating entire databases back to me?

It's a cool idea, but I won't be the first to try it. You're placing an awful lot of trust in these modules.


Nothing? It's a classic supply chain attack.

The same goes for pypi, npm, crates.io, docker.com, github, chrome extensions, etc.

Generally it comes down to trusted authors/package names. Many people trust `requests` from PyPI. That said, even this doesn't prevent the package from being sold, or hacked, or the owner later adding things, or a third party sneaking something in via a PR. Some registries, like Docker's (and technically most git-based ones, since tags can be re-pointed), can have specific tags changed at any time, so you could later slip something malicious into version 3.0.4 or whatever.

Many package managers (mostly linux ones) include key-signing, some package managers also have built-in scanners. Preferably you proxy/cache versions you need, and have a way to audit version upgrades - but in reality it's a lot of work to do properly.

Recent example: the `pytorch` nightly on PyPI accidentally had a dependency named `torchtriton` that was, I think, name-squatted? Anyway, installing the pytorch nightly between December 25th and 30th would result in your home directory being uploaded, etc. https://www.bleepingcomputer.com/news/security/pytorch-discl...

EDIT: That said, if anyone from Supabase reads this: I'd love to see GPG signing, version hashing/integrity verification, 2FA if it's not there already, and (I know this one is a lot of extra work) any type of automated, even cursory, glance for malicious code.


> That said, if anyone from Supabase reads this: I'd love to see GPG signing, version hashing/integrity verification, 2FA if it's not there already

Yep, we've got all these on the road map, thanks!

> and (I know this one is a lot of extra work) any type of automated, even cursory, glance for malicious code.

Unfortunately a "cursory glance" is trivially sidestepped, and anything complete enough to be useful is probably impossible. Maybe you want to write an extension that automates logically replicating your data off the server, or that truncates a table after processing it. To you that's useful; to someone else that's malicious. There's just no way to know.

If you don't trust the authors, then the only way to trust a TLE is to inspect the code yourself, in the context of what "safe" means to you. The pytorch example you mentioned was very likely caught by a human; if an automated system were possible for Python (it's likely impossible), it would have caught it sooner.


Like any package or plugin for any language or framework, you are responsible for understanding what you are installing. Since TLEs do not come with any compiled code, this is relatively easy for you to verify yourself with code inspection.

We expect a lot of DBAs to say "no way" to something like TLEs, and that's fine; for similar reasons, some people don't use pip, npm, or CPAN. You make your own risk/benefit analysis.


Can I use this on AWS RDS?


RDS does have the underlying native extension (pg_tle) required to install user-defined packages from database.dev. It doesn't currently have the extension necessary for the in-database client to install packages directly from the registry.

That's great feedback, though: we'll add a new page to the website for each package that provides the snippet you can run on an RDS instance to load the extension and then run `create extension`.

We're also in talks with the RDS team and experimenting with a few options that would allow users to query the registry directly. The most promising one we've found so far is using AWS Lambda with RDS's aws_lambda extension to do the HTTP part and pass the contents back to the database. Still a WIP though!


No.

RDS has a strict list of approved extensions. (Since AWS is on the hook for the system stability, security, replication, etc.)


RDS for Postgres 14+ does come with pg_tle, but it does not come with pgsql-http, so you cannot use the dbdev installer. But you can still install TLEs; you just have to do so "by hand" by calling `pgtle.install_extension()` yourself, probably from a migration script, depending on your tooling.

See the RDS docs here: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Postg...
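
For reference, the "by hand" route looks something like this (extension name and body are illustrative):

    -- Register the TLE's source with pg_tle, then install it as usual.
    SELECT pgtle.install_extension(
      'my_ext', '1.0', 'an example trusted language extension',
      $_body_$
        CREATE FUNCTION my_ext_hello() RETURNS text
          LANGUAGE sql AS $$ SELECT 'hello from a TLE' $$;
      $_body_$);
    CREATE EXTENSION my_ext;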


Partially correct: this doesn't work with RDS out of the box, because you can't install pgsql-http (a requirement for automatic install). However, all the extensions listed are "Trusted Language Extensions for PostgreSQL", a mechanism developed by AWS to enable installing extensions written in "safe" languages. So I believe all the extensions listed could be installed manually.

https://github.com/aws/pg_tle

https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Postg...


Oh cool!


Is this like pgxn but specific to trusted languages, so presumably fewer installation issues?


yeah, exactly. pgxn also houses native extensions, some of which may be TLE-compatible. database.dev is specific to trusted language extensions.

We've reached out to David, who runs pgxn, to see how we might collaborate and allow pgxn to be used as another registry that users can install TLEs from.


How does `index_advisor` work?


index_advisor is a project that almost made the cut for LW7. We decided to launch it as a pg_tle instead, to get some feedback, before we (spoiler) probably release it in LW8.

It uses the hypopg package to very quickly generate a hypothetical index for each column referenced in the query, then re-runs the explain plan to measure the estimated "total cost".

If the estimated "total_cost" goes down, we add the index to the list of recommendations and continue testing the remaining columns.

It's currently limited to single-column indexes only, but in a later update we'll enhance it to do more comprehensive searches with multiple columns in a single index.
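
For the curious, the hypopg loop underneath looks roughly like this (table and column names illustrative):

    -- Baseline plan cost for the query under test.
    EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

    -- Create a *hypothetical* index: catalog-only, the table isn't touched.
    SELECT * FROM hypopg_create_index('CREATE INDEX ON orders (customer_id)');

    -- Re-plan; if the estimated total cost drops, recommend the index.
    EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
    SELECT hypopg_reset();  -- discard hypothetical indexes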


Cool idea for an extension!

I had a whole set of comments about concerns about doing this to a large table in prod, and statistics and such - then I read about hypopg.

What a neat package, learned something new today.


You talking with the PostGIS folks?


PostGIS is a native extension, mostly written in C. Since there is no trusted-language C, it would be hard to fit PostGIS into the TLE paradigm.

It's less of a problem for well-known and trusted extensions like PostGIS, though, because it comes pre-installed on most hosted providers (Supabase, RDS, etc.).



