Go read the DeepSeek R1 paper

measurablefunc · 2026-01-20T08:21:13 1768897273

Why would I do that? If you know something then quote the relevant passage & equation that says you can train code generators w/ RL on a novel language w/ little to no code to train on. More generally, don't ask random people on the internet to do work for you for free.

thorum · 2026-01-20T09:50:02 1768902602

Your other comment sounded like you were interested in learning about how AI labs are applying RL to improve programming capability. If so, the DeepSeek R1 paper is a good introduction to the topic (maybe a bit out of date at this point, but very approachable). RL training works fine for low resource languages as long as you have tooling to verify outputs and enough compute to throw at the problem.

measurablefunc · 2026-01-20T20:04:57 1768939497

So you should have no problem bringing up the exact passages & equations they use for their policies.

whimsicalism · 2026-01-20T15:39:36 1768923576

imo generally not worth it to keep going when you encounter this sort of HN archetype

whimsicalism · 2026-01-20T15:38:56 1768923536

well, that’s one way to react to being provided with interesting reading material.

measurablefunc · 2026-01-20T20:05:41 1768939541

Bring up the passage that supports your claim. I'll wait.

nl · 2026-01-21T06:03:10 1768975390

Not exactly sure what you are looking for here.

That GRPO works?

> Group Relative Policy Optimization (GRPO), a variant reinforcement learning (RL) algorithm of Proximal Policy Optimization (PPO) (Schulman et al., 2017). GRPO foregoes the critic model, instead estimating the baseline from group scores, significantly reducing training resources. By solely using a subset of English instruction tuning data, GRPO obtains a substantial improvement over the strong DeepSeekMath-Instruct, including both in-domain (GSM8K: 82.9% → 88.2%, MATH: 46.8% → 51.7%) and out-of-domain mathematical tasks (e.g., CMATH: 84.6% → 88.8%) during the reinforcement learning phase

Page 2 of https://arxiv.org/pdf/2402.03300
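
A minimal sketch of what "estimating the baseline from group scores" amounts to, assuming outcome-level rewards where each sampled completion gets a single scalar score. The function name, the `eps` term, and the choice of population standard deviation are illustrative assumptions, not code from the paper:

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sample against its own sampling group instead of a critic.

    rewards: scalar rewards for several completions of the SAME prompt.
    eps: small constant (an assumption here) to avoid dividing by zero.
    """
    baseline = mean(rewards)   # the group mean plays the role of the baseline
    spread = pstdev(rewards)   # population std of the group (assumed choice)
    return [(r - baseline) / (spread + eps) for r in rewards]


# Four completions for one prompt, scored 1.0 if a verifier accepted them.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every completion got the same score carries no signal.
flat = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

Accepted completions end up with positive advantages and rejected ones with negative advantages. When every completion in a group receives the same reward, all advantages are zero, so that group contributes nothing to the update.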

That GRPO on code works?

> Similarly, for code competition prompts, a compiler can be utilized to evaluate the model’s responses against a suite of predefined test cases, thereby generating objective feedback on correctness

Page 4 of https://arxiv.org/pdf/2501.12948

measurablefunc · 2026-01-21T07:03:38 1768979018

None of those are novel domains w/ their own novel syntax & semantic validators, not to mention the dearth of readily available sources of examples for sampling the baselines. So again, where does it say it works for a programming language with nothing but a grammar & a compiler?

nl · 2026-01-21T12:21:53 1768998113

To quote you:

> There is no RL for programming languages.

and

> Either RL works & you have evidence

This is just so completely wrong, and here is the evidence.

I think everyone in this thread is just surprised you don't seem to know this.

Haven't you seen the hundreds of job ads for people to write code for LLMs to train on?

measurablefunc · 2026-01-21T16:48:55 1769014135

You're not going to get less confused by doubling down. None of your claims are valid & this is because you haven't actually tried to do what you're suggesting. Taking a grammar & compiler & RLing will get you nowhere.