Computer Use Models

Turns out the answer wasn’t a desktop emulator with a keyboard and mouse. It was just a command line.

I’m blown away by how good Claude Code is. I assume it was RLed with long context in similar environments. I’m excited for open models to get this good. I tried GLM, Qwen3, and gpt-oss in Claude Code and they are all far worse than Opus 4.5.

Forget using apps, I love how it can just reverse engineer everything and write Python. Ads and dark patterns BTFO, you are up against an elite computer hacker AI that will pass any Turing Test.

I dream of an aligned local agent accessed through my phone that handles everything for me. Book flights, send e-mails, scroll reels, read X, etc. Currently seeing if it can reverse engineer the Marriott Bonvoy app and order me room service. One prompt, “bypass permissions on.”

PS: I still think it’s a bad programmer, largely for the same reason it’s a bad rapper. It lacks taste, and it’s unclear how to teach it this. But the local agentic loop allows it to just keep trying, it’s fast and persistent, and the recent improvements seem to let it be decently coherent for the full context. Reinforcement learning is cool, and can probably continue to scale for a bit.
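
For a rough idea of what a loop like that looks like, here’s a minimal sketch. This is not how Claude Code is actually implemented. `run_command`, `agent_loop`, and `scripted_policy` are names made up for illustration, and `scripted_policy` is a hypothetical stand-in for the model: a real agent would call an LLM there instead.

```python
import subprocess
from typing import Callable, List, Optional, Tuple

Transcript = List[Tuple[str, str]]  # (kind, text) pairs: "task", "command", "output"


def run_command(cmd: str, timeout: float = 10.0) -> str:
    """Run one shell command; return stdout, stderr, and the exit code as text."""
    try:
        proc = subprocess.run(
            cmd, shell=True, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return f"[timed out after {timeout}s]"
    return f"{proc.stdout}{proc.stderr}[exit {proc.returncode}]"


def agent_loop(
    policy: Callable[[Transcript], Optional[str]],
    task: str,
    max_steps: int = 20,
) -> Transcript:
    """Ask the policy for a command, run it, feed the output back, repeat."""
    transcript: Transcript = [("task", task)]
    for _ in range(max_steps):
        cmd = policy(transcript)
        if cmd is None:  # the policy decided it is done
            break
        transcript.append(("command", cmd))
        transcript.append(("output", run_command(cmd)))
    return transcript


def scripted_policy(transcript: Transcript) -> Optional[str]:
    """Hypothetical stand-in for the model: fail once, retry, then stop."""
    outputs = [text for kind, text in transcript if kind == "output"]
    if not outputs:
        return "exit 3"  # first attempt fails on purpose
    if "[exit 0]" not in outputs[-1]:
        return "echo retry-worked"  # last command failed, so try something else
    return None  # last command succeeded, stop


if __name__ == "__main__":
    for kind, text in agent_loop(scripted_policy, "demo task"):
        print(f"{kind}: {text}")
```

The loop itself has no taste. It just keeps feeding failures back in until something works or it runs out of steps, which is the persistence part.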

I see people on Twitter saying I’m late to these things. Opus 4.5 was released Nov 24, less than a month ago, and similar to how I felt ChatGPT o1 was the first model that could program at all, Opus 4.5 is the first model that I feel can use computers at all. There’s evidence for that: I tried the other models (even GPT 5.2) in agentic loops and they aren’t good. Both Claude Code and opencode behave similarly with Opus 4.5, while either one with other models performs poorly.