
Wait, this is actually pretty good! What interesting use cases have you seen so far?


Well the same principle of false advertising re: context window sizes also applies to its inability to count, no? AI companies claim that their models can do math, so wouldn't a regular developer assume that they can also count?

And if I can't trust a so-called SOTA model to partially answer - say, recall each mention of the word "wizard" instead of just giving me the wrong answer - then why should I trust it to list out specific scenes? That's even harder to benchmark.


Direct quote from Anthropic's website: "Opus - Our most intelligent model, which can handle complex analysis, longer tasks with multiple steps, and higher-order math and coding tasks."

So you tell me: if a regular developer reads the above, how can they surmise that the model which can do higher-order math can't count?


Yes, higher-order math does not include arithmetic; that should not be confusing.


I don't think that this is obvious at all. Yes, AI people who read papers on arxiv and know what "SOTA" stands for know it, but that is no longer the main user base of LLMs.

This is meant to be for the developer who doesn't fit the above profile and thinks a model that has a million token context window and "can handle complex analysis, longer tasks with multiple steps, and higher-order math and coding tasks" (direct quote from Anthropic's website), actually can do those things.


Valid! I think the disparity is that the article appears to be written for a fairly technical crowd, but the expectations appear to come from how these particular models are marketed. Most people who are fine-tuning LLMs or aware of LongRoPE for extending context windows are probably consumers of research/white papers rather than marketing material.

Having read some of your other comments, it appears that part of the issue is that you were marketed a 1 million token context window and research has shown that's not quite the case. That said, the article doesn't do a good job of painting that picture - it's alluded to with "all fail at this task despite having big context windows", but I think it's worth being crystal clear here that the marketing says 1M, that this is disingenuous in your experience, and that it's backed by research findings.


That's true, but the problem of long context understanding (say, "summarize each of the situations where the word 'wizard' is mentioned") remains. And that gets much closer to the insurance policy thing.


“Write me a Python program that extracts context surrounding a word from a long text, then creates a prompt to summarize the context.” Still different than the insurance policy problem.
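
Something like this rough sketch is what I had in mind (the sentence-window size and the prompt wording are arbitrary choices on my part, not anything prescribed):

    import re

    def build_summary_prompt(text, word, window=2):
        # Split into sentences, grab a window of sentences around each mention,
        # and wrap the snippets in a summarization prompt.
        sentences = re.split(r'(?<=[.!?])\s+', text)
        snippets = []
        for i, s in enumerate(sentences):
            if word.lower() in s.lower():
                lo, hi = max(0, i - window), i + window + 1
                snippets.append(" ".join(sentences[lo:hi]))
        joined = "\n---\n".join(snippets)
        return f"Summarize each of the following passages that mention '{word}':\n\n{joined}"

    # e.g. prompt = build_summary_prompt(open("book.txt").read(), "wizard")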


How much context? One sentence? Two? One paragraph? One page? It's very similar to the insurance policy problem - the text surrounding the information you're looking for, which could span one sentence or 10 pages, is just as important as the information itself.


I mean, basically this is the well-known problem with LLMs: they know how to mince words but don't understand meaning. Again, I think you didn't present a good simple example. As presented, the Harry Potter problem is just using the wrong tool for the job and isn't the same as the insurance policy problem.

But at the end of the day, an LLM is right 80% of the time while being 100% confident, 100% of the time, that it has the right answer. You can increase that 80%, but I don't see how the current breed of LLMs can learn to self-doubt enough to keep trying to understand better.


Then why do the creators of this vacuum advertise the fact that it's really good at raking? And unlike your analogy, to actually figure out that it's bad at raking you have to read a bunch of academic papers?


Where have you read the creators of LLMs saying their products are awesome at counting? It's just the opposite.


I'm talking about the fact that they boast about their models having large context windows. And Anthropic says: "Opus - Our most intelligent model, which can handle complex analysis, longer tasks with multiple steps, and higher-order math and coding tasks." So if I were a non-AI expert, would I not infer that because it can do "higher order math tasks" it can also count?


>complex analysis, longer tasks with multiple steps, and higher-order math and coding tasks.

>counting

Pick one.

This is very similar to the "precision" misconception regarding floating point numbers.

The answer isn't wrong, it's just imprecise.
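
Same flavor as the classic float example (plain IEEE 754 behavior, nothing LLM-specific):

    # 0.1 and 0.2 have no exact binary representation, so the sum is off by a
    # tiny amount - not "wrong", just imprecise.
    print(0.1 + 0.2)         # 0.30000000000000004
    print(0.1 + 0.2 == 0.3)  # False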

Hallucinations are a misnomer.

You are trying to get exact integer-to-word accuracy from an architecture that is innately probabilistic and that clashes with the task at the atomic level: words get tokenized, so arithmetic is difficult at the micro scale - the carry bit likely won't make it into the transformer context where it's needed, since most numbers, on average, don't overflow when summed.
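
You can see the micro-scale problem directly in how digits get tokenized (assuming the tiktoken library and its cl100k_base encoding as a stand-in for whatever tokenizer a given model actually uses):

    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for s in ["1234567", "1234567 + 7654321"]:
        ids = enc.encode(s)
        pieces = [enc.decode([i]) for i in ids]
        print(s, "->", pieces)
    # Digits come out grouped into multi-character tokens, so the model never
    # operates on aligned digit columns the way column-wise addition needs.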

It can, however, output a small program - with high confidence - that it can self-evaluate for functional proximity, then use that to help arrive at an answer.

This is a proto-Mixture-of-Experts setup, achieved with another hypervisor or guard-dog LLM.


Why should I? If a person told you that they can multiply, divide, add and subtract, would you not also assume that they can at least count?

The point here is: the justifications from AI engineers for why counting vs math aren't the same task, while valid, are irrelevant because marketing never brings up the limitation in the first place. So any logical person who doesn't know a lot about AI will arrive at a logical, albeit practically incorrect conclusion.


>If a person told you that they can multiply, divide, add and subtract, would you not also assume that they can at least count?

But, to be fair, that's not what they said. They said it can do complex math - not simple math repeated many times within one inference.

The architecture just clashes with the intent too much to arrive at a useful/acceptable answer.

Had you crafted a larger prompt that recursively divides the context into n separate buckets, counts each, and then sums the partial results (inverted-binary-tree style), you'd likely have better luck with the carry bits tallying correctly.
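
A sketch of the one-level version of that (`ask_llm` is a placeholder for whatever completion call you're using, and I let plain Python do the final sum rather than asking the model to tally up the tree, which sidesteps the carry problem entirely):

    def count_in_chunks(text, word, ask_llm, chunk_size=2000):
        # Divide-and-conquer: ask for a per-chunk count, then sum the partials.
        chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
        partials = []
        for chunk in chunks:
            reply = ask_llm(
                f"How many times does the word '{word}' appear in the following "
                f"text? Reply with a single integer.\n\n{chunk}"
            )
            partials.append(int(reply.strip()))
        return sum(partials)  # the summation happens in code, not in the model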


Fair, valid point. I do admit that this is far from a perfect analysis. I do hope, though, that it helps people at least classify their problems into categories where they need to design around the flaw rather than just assuming that the thing “just works”. I appreciate the discussion though!


Do most readers know that if you give a so-called million token context model that many tokens, it'll actually stop paying attention after the first ~30k tokens? And that if they were to try to use this product for anything serious, they would encounter hallucinations and incompleteness that could have material implications?

Not everything needs to be entertaining to be useful.


The point is that this isn't even really useful because it's not a minimum reproduction of the problem they're actually interested in.

LLMs are bad at counting no matter what size of context is provided. If you're going to formulate a thought experiment to illustrate how an LLM stops paying attention well before the context limit, it should be a task that LLMs are known to be good at with smaller contexts. Otherwise you might be entertaining, but you're also misleading.


Well LLMs are claimed to be good at math too, and yet they can't count. Same point with the long contexts. And our actual use case (insurance) does need it to do both.

My hope from this article is to help non-AI experts figure out when they need to design around a flaw versus believe what's marketed.


>Well LLMs are claimed to be good at math too, and yet they can't count.

You're putting a lot of weight on counting. I don't know anyone who wants to use an LLM for counting, of all things, after hearing "good at math". Algebra, Calculus, Statistics - hell, I used Claude 3 for Special Relativity. Those are the things people will care about when you say math, not counting.

Look, just test your use case and report that lol.


Look man, Claude 3, GPT-4, etc. didn't work for my startup out of the box. I thought it would be helpful to tell others what I went through. Why hate on the truth?


Test the LLM on what you want it to do, not on something you think it should be able to do before it can do what you want. It's not hard to understand, and I'm not the only one telling you this.

Your article would have been very helpful if you'd simply done that, but you didn't, so it's not.


But LLMs are good at math; they just aren't good at arithmetic.

https://www.lesswrong.com/posts/qy5dF7bQcFjSKaW58/bad-at-ari...


Please see my comment below, and the "Why should I care" section of the post. Yes, you can count the number of times the word "wizard" is mentioned, but for tasks that aren't quite as cut-and-dried (say, listing out all of the core arguments of a 100-page legal case), you cannot just write a Python script.

The agentic approach falls apart because again, a self-querying mechanism or a multi-agent framework still needs to know where in the document to look for each subset of information. That's why I argue that you need an ontology. And at that point, agents are moot. A small 7b model with a simple prompt suffices, without any of the unreliability of agents. I suggest trying agents on an actually serious document, the problems are pretty evident. That said, I do hope that they get there one day because it will be cool.
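
To make the ontology point concrete, here's a purely illustrative sketch - the concept names, the document index, and the small_llm call are hypothetical placeholders, not our actual schema:

    # Hypothetical ontology: each concept maps to the document regions where
    # the relevant information is known to live.
    ONTOLOGY = {
        "coverage_limits":  ["Section 2", "Schedule A"],
        "exclusions":       ["Section 4"],
        "claims_procedure": ["Section 7", "Appendix B"],
    }

    def extract(document_index, concept, small_llm):
        # document_index: {section_name: section_text}; small_llm: any 7B-class
        # completion function with a simple prompt.
        sections = [document_index[name] for name in ONTOLOGY.get(concept, [])
                    if name in document_index]
        prompt = (f"From the following policy sections, list every statement "
                  f"relevant to '{concept}':\n\n" + "\n\n".join(sections))
        return small_llm(prompt)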


LLMs see tokens, not words, and counting is a problem for them, high context or no.

Maybe the current state-of-the-art LLM can't solve the kind of high-value, long-context problems you have in mind, but what I can tell you is that you won't find that out by asking it to count.


Some counterarguments:

1. If an AI company promises that their LLM has a million token context window, but in practice it only pays attention to the first and last 30k tokens, and then hallucinates, that is a bad practice. And prompt construction does not help here - the issue is with the fundamentals of how LLMs actually work. Proof: https://arxiv.org/abs/2307.03172

2. Regarding writing the code snippet: as I described in my post, the main issue is that the model does not understand the relationships between information in the long document. So yes, it can write a script that counts the number of times the word "wizard" appears, but if I gave it a legal case of similar length, how would it write a script that extracts all of the core arguments that live across tens of pages?


I'd do it like a human would. If a human were reading the legal case, they would have a notepad with them where they would note the locations and summaries of key arguments, page by page. I'd code the LLM to look for something that looks like a core argument on each page (or other meaningful chunk of text) and then have it give a summary if one occurs. I may need to do some few-shot prompting to give it an understanding of what to look for. If you are looking for reliable structured output, you need to formulate your approach to be more algorithmic and use the LLM for its ability to work with chunks of text.
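
Roughly this shape, as a sketch rather than a drop-in solution (`llm` stands in for whatever completion call you use, and the few-shot examples are omitted):

    def extract_core_arguments(pages, llm):
        # Walk the document page by page, keeping a running "notepad" of findings.
        notepad = []
        for page_no, page_text in enumerate(pages, start=1):
            reply = llm(
                "You are reading one page of a legal case. If this page states a "
                "core argument, summarize it in one sentence; otherwise reply NONE.\n\n"
                + page_text
            )
            if reply.strip().upper() != "NONE":
                notepad.append({"page": page_no, "summary": reply.strip()})
        return notepad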


Totally agree there. And that's one of my points: you have to design around this flaw by doing things like what you proposed (or build an ontology like we did, which is also helpful). And the first step in this process is figuring out whether your task falls into a category like the ones I described.

The structured output element is really important too - subject for another post though!


Good share, thank you! Yeah, I think Contextual AI has also been doing some interesting work in this area. Glossary is definitely interesting and an area we're looking into. Curious to see what work is being done with building knowledge graphs; that's another area where we've seen positive results.

