Strong second on wishing they had physically tested some of the model output. The importance of a "model that makes outputs AlphaFold thinks look like Cas" is very different from that of a "model that makes functional Cas variants".
For design tasks like in this paper, I think computational models have a big hill to climb in order to compete with physical high-throughput screening. Most of the time the goal is to get a small number of hits (<10) out of a pool of millions of candidates. At those levels, you need to work in the >99.9% precision regime to have any hope of finding significant hits after multiple-hypothesis correction. I don't think they showed anything near that accurate in the paper.
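To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It is only an illustration: the 90% sensitivity is an assumed value, the function name is made up, and it reads "precision" loosely as the model's specificity (the fraction of non-functional candidates it correctly rejects). None of these figures come from the paper.

```python
def screen_outcome(pool_size, true_hits, sensitivity, specificity):
    """Expected counts when a model filters a candidate pool.

    All inputs are illustrative assumptions, not values from the paper.
    """
    negatives = pool_size - true_hits
    true_pos = sensitivity * true_hits            # real hits the model keeps
    false_pos = (1.0 - specificity) * negatives   # duds the model lets through
    flagged = true_pos + false_pos
    # Fraction of flagged candidates that are real hits (positive predictive value).
    ppv = true_pos / flagged if flagged else 0.0
    return true_pos, false_pos, ppv


# 10 real hits hidden in a pool of one million candidates.
for spec in (0.99, 0.999, 0.99999, 0.999999):
    tp, fp, ppv = screen_outcome(1_000_000, 10, 0.9, spec)
    print(f"specificity={spec}: ~{tp:.0f} true hits, ~{fp:.0f} false positives, PPV={ppv:.4f}")
```

Under these assumptions, a model with 99.9% specificity still passes roughly 1,000 false positives alongside about 9 real hits, so fewer than 1 in 100 flagged candidates is real. The flagged set only becomes mostly real hits when the false-positive rate drops to around one in a million.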
Maybe we'll get there eventually, but the high-throughput techniques in molecular biology are also getting better at the same time.
You are correct that it is dangerous to rely on one model acting as an oracle for another, but extremely good models (say, F=ma) are used that way all the time.