Does anyone understand why LLMs have gotten so good at this? Their ability to generate accurate SVG shapes seems to greatly outshine what I would expect, given their mediocre spatial understanding in other contexts.
- One thing to be aware of is that LLMs can be much smarter than their ability to articulate that intelligence in words. For example, GPT-3.5 Turbo was beastly at chess (1800 elo?) when prompted to complete PGN transcripts, but if you asked it questions in chat, its knowledge was abysmal. LLMs don't generalize as well as humans, and sometimes they can have the ability to do tasks without the ability to articulate things that feel essential to the tasks (like answering whether the bicycle is facing left or right).
- Secondly, what has made AI labs so bullish on future progress over the past few years is that they see how little work it takes to get their results. Often, if an LLM sucks at something that's because no one worked on it (not always, of course). If you directly train a skill, you can see giant leaps in ability with fairly small effort. Big leaps in SVG creation could be coming from relatively small targeted efforts, where none existed before.
We’re literally at the point where trillions of dollars have been invested in these things and the surrounding harnesses and architecture, and they still can’t do economically useful work on their own. You’re way too bullish here.
My best guess is that the labs put a lot of work into HTML and CSS spatial stuff because web frontend is such an important application of the models, and those improvements leaked through to SVG as well.
All models have improved, but from my understanding, Gemini is the main one that was specifically trained on photos/video/etc. in addition to text. Other models, like earlier ChatGPT builds, would use plugins to handle anything beyond text, such as a plugin to convert an image into text so that ChatGPT could "see" it.
Gemini was multimodal from the start, and is naturally better at doing tasks that involve pictures/videos/3d spatial logic/etc.
The newer ChatGPT models are also now multimodal, which has probably helped with their SVG art as well, but I think Gemini still has an edge here.
In my experience with the models (watching Claude play Pokemon), the models are similar in intelligence but very different in how they approach problems: Opus 4.5 hyperfocuses on completing its original plan, far more than any older or newer version of Claude. Opus 4.6 gets bored quickly and constantly changes its approach if it doesn't get results fast. This makes it waste more time on "easy" tasks where the first approach would have worked, but an order of magnitude faster on "hard" tasks that require trying different approaches. For this reason, it started off slower than 4.5, but ultimately got as far in 9 days as 4.5 got in 59 days.
I think that's because Opus 4.6 has more "initiative".
Opus 4.6 can be quite sassy at times, the other day I asked it if it were "buttering me up" and it candidly responded "Hey you asked me to help you write a report with that conclusion, not appraise it."
I got the Max subscription and have been using Opus 4.6 since. The model is way above pretty much everything else I've tried for dev work, and while I'd love for Anthropic to let me (easily) build a hostable server-side solution for parallel tasks without having to go the API key route and pay per token, I will say that the Claude Code desktop app (more convenient than the TUI one) gets me most of the way there too.
I started using it last week and it’s been great. Uses git worktrees, experimental feature (spotlight) allows you to quickly check changes from different agents.
I hope the Claude app will add similar features soon
Instead of having my computer be the one running Claude Code and executing tasks, I'd prefer to offload that to my other homelab servers and have them execute agents for me, working pretty much like traditional CI/CD, but with LLMs working on various tasks in Docker containers, each on the same or different codebases, each with its own branch/worktree, submitting pull/merge requests to a self-hosted Gitea/GitLab instance or whatever.
However, you're not really supposed to use it with your Claude Max subscription; instead you're expected to use an API key and pay per token, which doesn't seem nearly as affordable compared to the Max plan. Nobody would probably mind if I ran it on homelab servers, but if I put it on work servers for a bit, technically I'd be in breach of the rules:
> Unless previously approved, Anthropic does not allow third party developers to offer claude.ai login or rate limits for their products, including agents built on the Claude Agent SDK. Please use the API key authentication methods described in this document instead.
It just feels a tad more hacky than copying an API key when you use the API directly. There is stuff like https://github.com/anthropics/claude-code/issues/21765, but also "claude setup-token" (which you probably don't want to use all that much, given the token lifetime?).
Genuinely one of the more interesting model evals I've seen described. The sunk cost framing makes sense -- 4.5 doubles down, 4.6 cuts losses faster. 9 days vs 59 is a wild result. Makes me wonder how much of the regression complaints are from people hitting 4.6 on tasks where the first approach was obviously correct.
Notably 45 out of the 50 days of improvement were in two specific dungeons (Silph Co and Cinnabar Mansion) where 4.5 was entirely inadequate and was looping the same mistaken ideas with only minor variation, until eventually it stumbled by chance into the solution. Until we saw how much better it did in those spots, we weren't completely sure that 4.6 was an improvement at all!
I haven't kept up with the Claude Plays stuff; did it ever actually beat the game? I was under the impression that the harness was artificially hampering it, considering how much more easily various versions of ChatGPT and Gemini had beaten the game and even moved on to beating Pokemon Crystal.
The Claude Plays Pokemon stream with a minimal harness is a far more significant test of model intelligence compared to the Gemini Plays Pokemon stream (which automatically maintains a map of everything that has been seen on the current map) and the GPT Plays Pokemon stream (which does that AND has an extremely detailed prompt that more or less railroads the AI into not making the mistakes it wants to make). The latter two harnesses have become too easy for the latest generation of models, enough so that they're not really testing anything anymore.
Claude Plays Pokemon is currently stuck in Victory Road, doing the Sokoban puzzles, which are both the last puzzles in the game and by far the most difficult for AIs. Opus 4.5 made it there but was completely hopeless; 4.6 made it there and is showing some signs of maaaaaybe eventually brute-forcing its way through the puzzles, but personally I think it will get stuck or undo its progress, and that Claude 4.7 or 5 will be the one to actually beat the game.
There were no such writeups; 99% of the discussion about difficulties in Crystal was in Twitch and Discord chats, which Google doesn't scrape. (It hadn't yet gotten the public attention that Claude's and Gemini's runs of Pokemon Red and Blue have gotten.)
That said, this writeup itself will probably be scraped and influence Gemini 4.
It's hard to say for sure, because Gemini 3 was only tested with this prompt. But for Gemini 2.5, which the prompt was originally written for, yes, this does cut down on bad assumptions. A specific example: the puzzle with Farfetch'd in Ilex Forest is completely different in the DS remake of the game, and models love to hallucinate elements of the remake's puzzle if you don't emphasize the need to distinguish hypotheses from things they actually observe.
Exactly what I was going to post. Optimizations like loop unrolling slow down the N64 because keeping the code size small is the most important factor. I think even compilers of the time got this wrong, not just modern ones.
The one that really blows me away is how Kaze Emanuar explained the software-controlled cache. Using it well would involve either manually making calls to load/invalidate it, or writing a compiler backend that replaces ordinary loads from memory into registers with instructions that pull specific memory address ranges into cache.
The whole thing reminds me in a fuzzy way that I don't yet fully comprehend of register-memory architecture.
Claude almost universally reacts to everything with a positive exclamation as its first sentence, regardless of whether it's good or bad. If you don't believe me, just watch https://www.twitch.tv/claudeplayspokemon for about three minutes and you'll get the idea.
Alternatively, look at the system prompt, where Anthropic attempted to get it to stop doing this:
> Claude never starts its response by saying a question or idea or observation was good, great, fascinating, profound, excellent, or any other positive adjective. It skips the flattery and responds directly.
https://docs.anthropic.com/en/release-notes/system-prompts#a...
This problem seems highly specific to Claude. It's not exactly sycophancy so much as it is a strong bias towards this exact type of reaction to everything.
With 2, the real problem is that approximately 0% of the OpenAI employees actually believed in the mission. Pretty much every single one of them signed the letter to the board demanding that if the company's existence ever comes into conflict with humanity's survival, the company's existence comes first.
That's the reality of every organization if it survives long enough.
Checks-and-balances need to be robust enough to survive bad people. Otherwise, they're not checks-and-balances.
One of the tricks is a broad range of diverse stakeholders with enforcement power. For example, if OpenAI does anything non-open, you'd want organizations like the FSF, CC, and similar to be represented on its board and to be able to enforce those rules in court.