It's very interesting work. People keep wondering about production use. What they don't realize (or maybe do) is that going from high-level descriptions to efficient execution on imperative GPUs means solving a series of NP-hard problems, much like ASIC synthesis. A few tools, like Cray's Chapel, do a pretty good job when given more specific information, though.

My background was high-assurance systems that are provably correct. The field found that designing something for easy analysis and verification pulled in the opposite direction from designing it for efficiency. So, the solution became to verify the high-level description (formal specifications) first. Then, verify its equivalence to a lower-level, efficient form, which was probably designed side by side with the high-level one to make that easier.
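To make the spec-then-equivalence idea concrete, here is a rough sketch in plain Python. It is only an illustration: the function names (`softmax_spec`, `softmax_impl`, `agree_on_random_inputs`) are made up for this comment, softmax is just a convenient stand-in, and randomized testing is a much weaker substitute for the formal equivalence proofs used in high-assurance work.

```python
import math
import random

def softmax_spec(xs):
    """High-level form: written to mirror the math, not to be fast or robust."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_impl(xs):
    """Lower-level form: subtracts the max for stability and reuses one buffer."""
    m = max(xs)
    out = [0.0] * len(xs)
    total = 0.0
    for i, x in enumerate(xs):
        out[i] = math.exp(x - m)
        total += out[i]
    inv = 1.0 / total
    for i in range(len(out)):
        out[i] *= inv
    return out

def agree_on_random_inputs(spec, impl, trials=200, tol=1e-9, seed=0):
    """Evidence of equivalence (not a proof): compare both forms on random inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.uniform(-5.0, 5.0) for _ in range(rng.randint(1, 16))]
        if any(abs(a - b) > tol for a, b in zip(spec(xs), impl(xs))):
            return False
    return True
```

Calling `agree_on_random_inputs(softmax_spec, softmax_impl)` checks the two forms against each other. Writing them side by side is what keeps the comparison simple.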

Having only read the abstract and example code, I see your work being most valuable as what we called an "executable specification language." Those were specs or models close to the original mathematical description that could actually run for exploration and testing. They were also, in theory, easy to modify by the researcher, who could focus on intended behavior rather than low-level code.

You are already using transformations to produce the low-level implementation. My brainstorm would be a bottom-up approach: build common operations on GPUs, in libraries like JAX, along with rules about how to integrate them. Then, synthesize combinations of them under constraints on what may be combined, with the system further specializing the components as it goes. (It will be interesting to see what you did.)
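A toy sketch of that bottom-up idea, again in plain Python and purely hypothetical: `Primitive`, `compose`, `fuse_elementwise`, and `run` are names invented here, not anything from the paper, JAX, or Chapel. It assumes each primitive declares what kind of value it consumes and produces, with one integration rule (kinds must line up) and one specialization rule (adjacent elementwise stages get fused into a single pass).

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Primitive:
    name: str
    in_kind: str       # e.g. "vector" or "scalar"
    out_kind: str
    elementwise: bool  # True if fn maps one element to one element
    fn: Callable

def compose(pipeline):
    """Integration rule: each stage's output kind must match the next stage's input kind."""
    for a, b in zip(pipeline, pipeline[1:]):
        if a.out_kind != b.in_kind:
            raise TypeError(f"cannot feed {a.name} ({a.out_kind}) into {b.name} ({b.in_kind})")
    return list(pipeline)

def fuse_elementwise(pipeline):
    """Specialization rule: merge adjacent elementwise stages into one pass."""
    fused = []
    for p in pipeline:
        if fused and fused[-1].elementwise and p.elementwise:
            prev = fused.pop()
            f, g = prev.fn, p.fn
            fused.append(Primitive(prev.name + "+" + p.name, prev.in_kind, p.out_kind,
                                   True, lambda x, f=f, g=g: g(f(x))))
        else:
            fused.append(p)
    return fused

def run(pipeline, xs):
    """Reference interpreter: apply each stage in order."""
    value = xs
    for p in pipeline:
        value = [p.fn(x) for x in value] if p.elementwise else p.fn(value)
    return value

# Three hypothetical library primitives.
scale = Primitive("scale", "vector", "vector", True, lambda x: 2.0 * x)
shift = Primitive("shift", "vector", "vector", True, lambda x: x + 1.0)
total = Primitive("sum", "vector", "scalar", False, sum)
```

Here `compose([scale, shift, total])` passes the kind check, and `fuse_elementwise` collapses `scale` and `shift` into one stage while `run` gives the same answer before and after. A real system would carry far richer constraints (shapes, memory layout, cost models), which is where the hard search problems come back in.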

I think HLS on ASICs, Cray's Chapel, and high-level libraries in machine learning show such an approach could make implementations fast enough to test theories on small models. I think someone should build on these concepts by connecting them to primitives in Chapel or JAX or something else already designed to synthesize or abstract away many low-level details itself. Then, implement pieces of common, useful models in it with pre-synthesized implementations.

Shorter version: I think it's neat work worth further exploration, which might accelerate development and validation of ML algorithms even with sub-optimal runtime performance.

(And I still haven't read the papers. This is just independent brainstorming I do first to see if our ideas are converging at all, which has its own value.)