I love doing personal side project code reviews with Claude Code, because it doesn't beat around the bush with criticism.
I recently compared how a few models reviewed a data processor class I wrote for a side project, one with quite horrible temporal coupling:
Gemini - rates it a 7/10, gives some small bits of feedback, etc.
Claude - brutal dismemberment of how awful the naming conventions, structure, coupling, etc. are; provides examples of how this will mess me up in the future and gives a few citations to Python documentation I should re-read.
ChatGPT - you're a beautiful developer who can never do anything wrong, the best developer that's ever existed, and this class is the most perfect class I've ever seen.
This is exactly what got me to actually pay. I had a side project with an architecture I thought was good. Fed it into Claude and ChatGPT. ChatGPT made small suggestions but overall thought it was good. Claude shit all over it, and after validating its suggestions, I realized Claude was what I needed.
I haven't looked back. I just use Claude at home and ChatGPT at work (no Claude). ChatGPT at work is much worse than Claude in my experience.
I feel like this anecdote represents the differing incentives / philosophies of each group rather well.
I've noticed ChatGPT is rather high in its praise regardless of how valuable the input is, Gemini is less placating but still largely influenced by the perspective of the prompter, and Claude feels the most "honest", but humans are rather poor at judging this sort of thing.
Does anyone know if "sycophancy" has documented benchmarks the models are compared against? Maybe it's subjective and hard to measure, but given the issues with GPT-4o, this seems like a good thing to measure model to model, both to track an individual company's changes over time and to compare across companies.
This felt like a sane and useful case until you mentioned the bank account access.
I just don't see a reason to allow OpenClaw to make purchases for you; it doesn't feel like something an LLM should have access to. What happens if you accidentally add a compromised skill?
Or it buys you running shoes, but a prompt injection routes the purchase through a fake website?
Everything else can be limited, but the buying process is already quite streamlined; it doesn't take me more than 2 minutes to go through a Shopify checkout.
Are you really buying things so frequently that the risk of having a bot purchase them for you is worth it?
I think that's what turns this post from a sane bullish case into an incredibly risky one.
I'd probably use OpenClaw for some of what you're doing (safe read-only tasks, message writing, compiling notes, looking into grocery shopping), but I'd personally add stricter limits if I were you.
What if... that whole post was written by AI, with the express intent of sanding down our natural instincts for security, making it easier for malskill devs to take advantage?
Then it's done a horrible job! All I could think was: surely you're not making the efficiency gain you claim.
It's similar to back when Notion second-brain templates became popular; there was a point at which you went, surely it's just going to be a full-time job to manage this one complicated template?
You could give it access to a limited budget and review its spending periodically. Then it can make annoying mistakes but it's not going to drain your bank account or anything.
I almost feel like the effort required to set up a separate limited-budget bank account and review its spending periodically is equivalent to just clicking through the checkout myself.
But I may be a lazy engineer; I definitely go by the "do it once, don't automate; do it twice, automate" approach.
But don't you want the agents to book vacations and do the shopping for you!!?!
Though it would be nice if "deep research" could do the hard work of separating signal from noise when it comes to finding good-quality products. Unfortunately that requires being extremely skeptical of everything written on the web and actively trying to suss out the ownership and supply chains involved, which isn't something agents can do unguided at the moment.
Yeah - and honestly, I'll probably use an agent like OpenClaw to do stuff like finding flights and villas, finding good timing between dates X and Y, formatting it all as a table, and validating the websites.
Then, I can commit to the checkout process, which isn't that much labour.
For a text-based business simulator, I'd make the text far easier to read. I'm finding it a little too fast, with a lot of eye strain. There are a couple of techniques, including making sure that your text isn't completely black.
I'd look a little more into design strategies: smoother scrolling for text, better typography, colors that are easier to read, and more focus on the content itself.
Especially if you expect someone to spend 20 minutes reading an article. Just take a bit of a refresher on techniques for web readability!
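To make that concrete, here's a rough sketch of the kind of tweaks I mean, assuming the game renders its prose into a container element (the "story" id and the specific values are just illustrative, not taken from your code):

    // Rough sketch: soften the reading experience for long passages of game text.
    // Assumes the prose lives in an element with id "story" (purely illustrative).
    const story = document.getElementById("story");
    if (story instanceof HTMLElement) {
      story.style.color = "#333";              // dark grey instead of pure black
      story.style.backgroundColor = "#fdfdf8"; // slightly off-white background
      story.style.lineHeight = "1.6";          // more space between lines
      story.style.fontSize = "1.125rem";       // slightly larger body text
      story.style.maxWidth = "65ch";           // keep line lengths readable
      story.style.margin = "0 auto";           // center the text column
      story.style.setProperty("scroll-behavior", "smooth"); // smoother scrolling
    }

Even just the softer text color and a sane line length go a long way for 20-minute reads.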
Not to mention a (potentially illegal?) 100% overlay for cookies that only has an “accept” button.
EDIT: there is at least a way to reject them by clicking the link to manage cookies. Still debatable whether this is legal, but at the very least, a dark pattern.