I didn't really trust Google's 1M+ context claims. But I trust this paper, and it suggests 1M+ context will soon be available everywhere.
I know people complain about hardware and compute resources. But this is like complaining about Python's resource usage in the early 90's. Development complexity and developer resources are far more expensive than chips in the long run. I am personally reorganizing my AI organization to move away from complex RAG setups and get comfortable with long-context workflows.
Just to be clear - I also think that inference-optimized chips are the next frontier; Nvidia GPUs were designed and built in a different age than the one we're in now.
Legitimately curious: what is lacking from the playthrough and movie descriptions available now? For games, the only things I feel might not be represented well in text are descriptions of the observer/player reactions and maybe details about their specific choices through a game... kind of like turning a choose-your-own-adventure book into a normal book by tracing one path through it?
I can see Q&A on a movie being useful, but I can't think of how descriptions of the overall movie itself are lacking... I'd definitely be worried about spoilers.
It may provide a way for an AI to learn a completely new topic (a new programming language, stock investing, a sport, a game) zero-shot and immediately apply it, just by feeding a YouTube playlist or Udemy course into the model.
This analogy disregards all the work that happens behind the scenes to make modern Python workloads work.
The current landscape is moving so fast that we don't stop and work on achieving the best version of a specific approach. One could argue that both RAG and 1M+ context lengths are symptoms of a disconnect between what language models are capable of and what people think they can achieve with them.
Within the next 6 months there might be yet another approach to giving LLMs custom memory, and we'll be having the same discussion about it that we're having about the current approaches now.
Stupid question, but I thought transformers have O(n^2) memory usage. With 2M tokens, won't I need dozens and dozens of GPUs just to run the base LLaMA2 models?
FlashAttention(2)[0] reduces context-length space complexity to linear. Compute is still O(n^2) in length though, AFAIK, so we'd expect these long sequence lengths to take some time to compute.
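For intuition, here's a toy blockwise sketch in PyTorch (not FlashAttention itself, which also tiles over keys/values and uses an online softmax): the memory win comes from never materializing the full n×n score matrix, even though the arithmetic is still O(n^2).

```python
import torch

def chunked_attention(q, k, v, block_size=1024):
    """Blockwise attention over query rows: compute is still O(n^2),
    but the full (n x n) score matrix is never held in memory at once."""
    scale = q.shape[-1] ** -0.5
    out = torch.empty_like(q)
    for start in range(0, q.shape[0], block_size):
        q_blk = q[start:start + block_size]           # (block, d)
        scores = (q_blk @ k.T) * scale                # (block, n): one slab of rows
        out[start:start + block_size] = scores.softmax(dim=-1) @ v
    return out

q = k = v = torch.randn(8192, 64)
y = chunked_attention(q, k, v)   # same result as full attention, much smaller peak memory
```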
I'm a bit out of my depth, but I think ultra-long exact-attention work like this also probably has to answer some questions about where to put the KV-cache before it can be used in practice?
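For a sense of scale, here's a back-of-the-envelope KV-cache estimate; the config numbers (a LLaMA2-7B-like model: 32 layers, 4096 hidden dim, fp16 cache) are my assumptions, not something from the paper:

```python
# Rough KV-cache size estimate for a 2048k-token window (assumed LLaMA2-7B-like config).
layers, hidden_dim, bytes_per_value = 32, 4096, 2   # fp16 -> 2 bytes per value
tokens = 2_048_000

# One K vector and one V vector of size hidden_dim per layer, per token.
kv_bytes = 2 * layers * hidden_dim * bytes_per_value * tokens
print(f"~{kv_bytes / 1e12:.2f} TB of KV-cache")      # roughly 1.07 TB
```

So even with linear-memory attention, just storing the cache for a 2M-token sequence is a multi-GPU (or offloading) problem.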
Maybe computer processor hacks are used? Like, it's the equivalent of finding the eigenvalues of a matrix.
I'm not as familiar with CPUs as I am with mathematical concepts, and I don't know what the processor bit-hacking tricks are called. But that's maybe the general idea for data compression for LLMs/transformer models on CPUs, I think.
After all, notice how the data compression improvements come only in multiples of two: 128k tokens and 2048k tokens. There's an implementation-dependent CPU optimization hack going on in there somewhere.
But since this work depends on a strong pre-trained model to extend from, I think it's an open question whether training a byte-level model from scratch with similar tricks would result in the same performance (and whether any organization in the world has the GPUs and chutzpah to do pre-training at these long context lengths...)
Ok so to get a little weirder, could you use a normal tokenizer for pretraining and then go back to bytes for fine tuning along with or before the length extension?
Tokens (be they subword tokens or bytes) are basically the atoms of input to an LM. So you'd have to figure out how to express the relationship between tokens and bytes to the LM. I guess you could do an incremental thing, e.g.:
- make the vocab ~40k subword tokens + 256 byte-tokens (one for each possible byte),
- start training with 100% subword tokens,
- then, after some portion of the training budget has elapsed, randomly replace some tokens with their corresponding byte-tokens,
- then ramp the fraction of replacements up so you're training on 100% byte-tokens for the last xx% of training, without ever exceeding an 8k (or whatever) sequence length,
- then apply the trick from TFA to get xk -> 8xk/64xk bytes of context?
but I'd guess the interesting part of a byte-transformer is multimodality, and we'd need more than a few tricks to get from ^^ to there.
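Roughly what that replacement schedule could look like, as a toy sketch - the vocab layout, the id_to_bytes mapping, and the linear ramp are all made-up assumptions for illustration:

```python
import random

BYTE_OFFSET = 40_000                      # assume byte b gets token id BYTE_OFFSET + b

def to_byte_tokens(token_id, id_to_bytes):
    """Replace one subword token with its constituent byte-tokens."""
    return [BYTE_OFFSET + b for b in id_to_bytes[token_id]]

def mix_tokens_and_bytes(token_ids, id_to_bytes, progress):
    """progress in [0, 1] is the fraction of the training budget elapsed;
    the probability of replacing a token with its bytes ramps up linearly."""
    p_replace = min(1.0, max(0.0, progress))
    out = []
    for t in token_ids:
        if random.random() < p_replace:
            out.extend(to_byte_tokens(t, id_to_bytes))
        else:
            out.append(t)
    return out

# Toy usage: token 7 decomposes into the bytes of "the", token 8 into " cat"
id_to_bytes = {7: b"the", 8: b" cat"}
print(mix_tokens_and_bytes([7, 8, 7], id_to_bytes, progress=0.5))
```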
For what it's worth, we were actually working with LSTMs with nearly a billion params back in the 2016-2017 era. Transformers are far more effective to train and run, but ultimately LSTMs are able to achieve similar results, though they're slower and require more training data.
It's interesting that, according to the results in Tables 5 and 6, adding more context even with LongRoPE makes predictions worse on Books3 and only gives an improvement on Proof-Pile. What's special about the Proof-Pile dataset? Is there some reason we should expect it to have a lot of dependencies at ranges greater than 100k tokens? Should I be surprised or suspicious that performance is flat going from 65k to 131k, but then has a big jump to 262k?
One thing you may have overlooked: Table 5, the Proof-Pile table, only goes up to a 262k evaluation window (meaning that although the model has an extended context window of 2048k under the proposed method, they are not feeding in that many tokens, only 262k - about 13% of the total possible window).
Why? I think this is because Books3 contains, you know, books - including some really long books - while Proof-Pile contains math papers and math stuff, which isn't as long.
So overall, I think what you're seeing is a general trend of increasing perplexity on windows between 256k and 2048k, which is probably not so surprising when you consider the context of the paper: taking a model pre-trained with a much shorter context window and extending that window with a novel technique. It's hard to adapt a model trained to do one thing into doing another thing, and that's what they're doing, so in that context it tracks that the longer the context window, the worse the performance.
Having them available is still hugely beneficial, even if they end up too expensive to use in most production use-cases.
They would still be valuable for prototyping, where fast iteration makes it possible to learn more about the problem you are solving and whether it is even worth solving. They are also valuable for iteration-and-distillation approaches, where you can use the data generated from an expensive model to train a cheaper model.
It seems like a very long input to an LLM can act as additional training data. A possible future is that LLMs will be used to bootstrap AIs that will be "trained" by feeding them a giant prompt at the start. Might end the era of "supervised" learning, "reinforcement" learning, numerical methods and gradient descent -- those techniques are awkward and procrustean. Imagine what you could do if you could improve a neural network - maybe to AGI level - just by talking to it in English?
So it looks like a VERY good idea. Who gets the gist of what I'm saying?
> Imagine what you could do if you could improve a neural network - maybe to AGI level - just by talking to it in English?
You'd get a repeat of Microsoft Tay, the short-lived 2016 chatbot that had to be taken down after not even a full day because 4chan managed to turn it into a full-blown hate spreader in hours [1].
Before something like this can reasonably be released to the Internet at large, we need to figure out how to teach the "ingestion" part to judge the input it's being presented with... basically, AI school. The same way we teach our children that it's not OK to steal other people's stuff, not to be the first to throw punches, not to discriminate against other people, and that there are reasonably trustworthy media and absolutely untrustworthy ones, we need to teach AIs. And we'd also need to figure out ways to teach an AI basic, hard truths: the Holocaust happened, the Earth is a globe and not a pizza, the Earth is not hollow, and the moon landings were real.
At the moment, we're half-assing that with annotated training data (and this is the true moat of OpenAI, not the weights or prompts!), but even with literally millions of hours of training compute invested, the models still fall short of an average high schooler who, after 18 years of "training", can reasonably meet the above criteria for being a productive member of society.
> It's still up for debate as to whether long context windows are worth it.
As someone who's run into problems related to short context windows quite often, can you explain what "worth it" means? Also, when you say "not cheap", what alternative do you have in mind?
AFAIK the few-shot learning abilities of LLMs are much better than what you can achieve with fine-tuning when you only have tiny samples.
Of course that's true for genuinely long context, but it's not clear if it's going to work with the sparse-context hacks intended to keep memory size low.
Having the ability to keep the same coding session going forever, without the LLM forgetting things all the time and making the same mistakes over and over, would be a game changer.
People like RAG-based solutions because you can include references in your answer very easily (e.g., Perplexity, or see the DAnswer "internal search" product launched today on HN). That is extremely hard to make work reliably from a fine-tuned model.
I mean from the perspective of the person doing the search too.
They often want to see the reference - for example, if the LLM is constructing an answer to some company-policy question, it's good to include references to the actual policy documents by URL. This is easy using RAG, but very hard to do reliably with just a fine-tuned LLM.
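A minimal sketch of why that's nearly free with RAG - the retriever and llm objects here are hypothetical stand-ins, not any particular library:

```python
# Hypothetical sketch: each retrieved chunk carries its source URL,
# so references can be attached to the answer mechanically.
def answer_with_references(question, retriever, llm):
    chunks = retriever.search(question, top_k=5)      # assume each chunk has .text and .url
    context = "\n\n".join(f"[{i + 1}] {c.text}" for i, c in enumerate(chunks))
    answer = llm(f"Answer using the numbered sources below.\n\n{context}\n\nQ: {question}")
    references = [f"[{i + 1}] {c.url}" for i, c in enumerate(chunks)]
    return answer, references
```

A fine-tuned model, by contrast, has no record of which training document a given claim came from, which is why reliable citation is so hard there.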
See Perplexity for a great example of this at web scale.
> We don't have enough data on that yet to know for sure. It was just released this week.
I think there is sufficient evidence to believe it works. A bunch of people who do have access have put it through a bunch of real-world exercises to test this, and while GPT-4 128K and Claude 200K both fail, Gemini is passing.
I agree that sending 2M tokens to an LLM is costly at the moment and that RAG has a bunch of advantages in some circumstances.
LLMs in general are likely to be a very inefficient way to solve many problems they're currently being used for. But so long as we don't have something better, brute force remains a valid approach.
A cheap way to solve a problem has always been throwing more compute power at it.
Figuring out new algorithms is monumentally more difficult than adding more hardware, and even when more efficient algorithms are found, we throw more hardware at them and get a million times more done.
It depends on what problem you're solving. If it's a high-frequency request, like a chat response, it's far too inefficient. Most web APIs would consider it bad practice to read 2MB of data on every request, and it's even worse when you add all the LLM computation on top. Instead, use RAG and pull targeted info out of some sort of low-latency database.
However, caching might be a sweet spot for these multimodal, large-context LLMs: take a bunch of documents and perform reasoning tasks to distill the knowledge down into something like a knowledge graph, to be used in RAG.
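One hedged sketch of that "distill once, retrieve cheaply" pattern - long_context_llm and cheap_llm are hypothetical callables, and the keyword matching just stands in for whatever retrieval you'd actually use:

```python
def build_fact_cache(documents, long_context_llm):
    """Offline: one expensive pass over all docs with a long-context model."""
    corpus = "\n\n".join(documents)
    facts = long_context_llm(
        "Extract the key facts from these documents, one fact per line:\n\n" + corpus
    )
    return facts.splitlines()

def answer(question, fact_cache, cheap_llm, top_k=10):
    """Online: cheap retrieval over the distilled facts, no 2MB reads per request."""
    words = question.lower().split()
    relevant = [f for f in fact_cache if any(w in f.lower() for w in words)]
    return cheap_llm("Facts:\n" + "\n".join(relevant[:top_k]) + f"\n\nQ: {question}")
```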
The cool thing about this paper is that it is actually a (relatively) cheap way to solve this particular problem (an LLM with a 2048k window): they take pre-trained models like LLaMA2 and Mistral and extend them to 2048k windows using a novel technique, rather than training a model from scratch at 2048k tokens, which would be prohibitively expensive for mere mortals.
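To make "extending them using a novel technique" a bit more concrete, here's a simplified sketch of positional interpolation: plain linear rescaling of RoPE positions so that a 2048k window maps back into the position range the model was pre-trained on. LongRoPE itself goes further, searching for non-uniform per-dimension rescale factors and extending progressively, so treat this purely as the baseline idea with made-up numbers:

```python
import torch

def rope_angles(head_dim, positions, scale=1.0, base=10000.0):
    """Standard RoPE angles; scale > 1 squeezes longer positions back into
    the range seen during pre-training (plain linear interpolation)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = (positions[:, None] / scale) * inv_freq[None, :]   # (seq, head_dim/2)
    return torch.cos(angles), torch.sin(angles)

# e.g. a model pre-trained at 4k positions, stretched to a 2048k window -> scale 512
sample_positions = torch.arange(0, 2_048_000, 256_000).float()
cos, sin = rope_angles(head_dim=128, positions=sample_positions, scale=512.0)
```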
When we consider the world in its entirety, the mere existence of computer components doesn't signify that we've reached the pinnacle of technological advancement. I firmly believe that our collective intelligence is deeply embedded in our language, whether it's conventional or programming-based. As we witness daily advancements in our language models, we're enhancing the efficacy of what we can currently regard as nascent artificial "intelligence".
This is precisely why programs that are nourished with innovative models, backed by substantial computational power, are capable of developing reasoning akin to Q*. Once these models start to independently foster these advancements and self-improve, we'll witness an unprecedented surge in AI development, surpassing our current capabilities.
I agree with the part where you said our collective intelligence is embedded in language. Intelligence is a social process; none of us are all that great alone. We like to forget that and assume it's all in our heads or in the models.