I worry a lot about how AI systems are aligned toward, say, fixing mistakes like...

I worry a lot about how AI systems are aligned toward, say, fixing mistakes like this.

Its my understanding that there are three state of arts right now. For simple things, prefiltering of queries external to "blackbox AI space", which is mostly useful for denylisting queries you don't want (if query.contains("boobs") reject). You can pre-prompt the AI to avoid unaligned results, which is cheap and easy. And: retraining/biasing the AI toward the right output by feeding it more input representative of what its operators defined as Aligned.

These are all... bad? Like, unless I'm missing some other tool AI operators have, all of these are so obviously bad to me that it causes me a lot of concern when I hear about, say, Microsoft firing their AI ethics team.

Pre-filtering will always have escape hatches (the recent twitter trend of asking ChatGPT to invent its own prompt compression algorithm it can understand out of context is a great example of this). Pre-prompting is brittle, and communities have already formed around jailbreaking, so you're begging for a cold war except you, the AI operator, are limited by the total token context your AI can reasonably process given all the other limitations reality imposes on your system. And retraining is expensive, way too expensive to be done every time some little wrong thing is found. And more-over, GPT4 is trained as a layer on top of GPT3, at least to some degree, so is the old bad data actually removed from The Matrix? Or is it still there, just supplanted by the New Correct Data, and waiting for the right, infinitely unpredictable set of input tokens to be triggered again?

I mean, I'm not an expert. I hope they have more tools than this. If they don't; I think AI organizations need to have a tsunami of lawsuits like this one hit them. Either they'll be able to develop tooling that enables alignment more quickly and with greater quality control; or all general purpose AIs will converge on the same fate that Tesla's FSD seems to be headed toward. If you can't make it perfect, and you can't demonstrate improvement at a rate faster than the rate of incoming problems, people will notice and trust will be lost.