I didn't really trust Google's 1M+ context claims. But I trust this paper, and it suggests 1M+ context will soon be available everywhere.
I know people complain about hardware and compute resources. But this is like complaining about Python's resource usage in the early 90's. Development complexity and developer resources are far more expensive than chips in the long run. I am personally reorganizing my AI organization to move away from complex RAG setups and get comfortable with long-context workflows.
Just to be clear - I also think that inference-optimized chips are the next frontier; Nvidia GPUs were designed and built in a different age than the one we're in now.
Legitimately curious: what is lacking from the playthrough and movie descriptions available now? For games, the only things I feel might not be represented well in text are descriptions of the observer/player reactions and maybe details about their specific choices through a game... kind of like turning a choose-your-own-adventure book into a normal book by tracing one path through it?
I can see Q&A on a movie being useful, but I can't think of how descriptions of the overall movie itself are lacking... I'd definitely be worried about spoilers.
It may provide a way for an AI to learn a completely new topic (a new programming language, stock investing, a sport, a game) zero-shot and immediately apply it, just by feeding a YouTube playlist or Udemy course into the model.
This analogy disregards all the work that happens behind the scenes to make modern Python workloads work.
The current landscape is moving so fast that we don't stop and work on achieving the best version of a specific approach. One could argue that both RAG and 1M+ context lengths are symptoms of a disconnect between what language models are capable of and what people think they can achieve with them.
Within the next 6 months there might be yet another approach to giving LLMs custom memory, and we'll be having the same discussion about it that we're having about the current approaches now.
Stupid question, but I thought transformers have O(n^2) memory usage. With 2M tokens, won't I need dozens and dozens of GPUs just to run the base LLaMA2 models?
FlashAttention(2)[0] reduces context-length space complexity to linear. Compute is still O(n^2) in length though, AFAIK, so we'd expect these long sequence lengths to take some time to compute.
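For intuition, here's a toy blockwise sketch in PyTorch (not FlashAttention itself, which also tiles over keys/values and uses an online softmax): the memory win comes from never materializing the full n×n score matrix, even though the arithmetic is still O(n^2).

```python
import torch

def chunked_attention(q, k, v, block_size=1024):
    """Blockwise attention over query rows: compute is still O(n^2),
    but the full (n x n) score matrix is never held in memory at once."""
    scale = q.shape[-1] ** -0.5
    out = torch.empty_like(q)
    for start in range(0, q.shape[0], block_size):
        q_blk = q[start:start + block_size]           # (block, d)
        scores = (q_blk @ k.T) * scale                # (block, n): one slab of rows
        out[start:start + block_size] = scores.softmax(dim=-1) @ v
    return out

q = k = v = torch.randn(8192, 64)
y = chunked_attention(q, k, v)   # same result as full attention, much smaller peak memory
```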
I'm a bit out of my depth, but I think ultra-long exact-attention work like this also probably has to answer some questions about where to put the KV-cache before it can be used in practice?
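For a sense of scale, here's a back-of-the-envelope KV-cache estimate; the config numbers (a LLaMA2-7B-like model: 32 layers, 4096 hidden dim, fp16 cache) are my assumptions, not something from the paper:

```python
# Rough KV-cache size estimate for a 2048k-token window (assumed LLaMA2-7B-like config).
layers, hidden_dim, bytes_per_value = 32, 4096, 2   # fp16 -> 2 bytes per value
tokens = 2_048_000

# One K vector and one V vector of size hidden_dim per layer, per token.
kv_bytes = 2 * layers * hidden_dim * bytes_per_value * tokens
print(f"~{kv_bytes / 1e12:.2f} TB of KV-cache")      # roughly 1.07 TB
```

So even with linear-memory attention, just storing the cache for a 2M-token sequence is a multi-GPU (or offloading) problem.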
Maybe computer processor hacks are used? Like, it's the equivalent of finding the eigenvalues of a matrix.
I'm not as familiar with CPUs as I am with mathematical concepts, and I don't know what the processor bit-hacking tricks are called. But that's maybe the general idea for data compression for LLMs/transformer models on CPUs, I think.
After all, notice how the data compression improvements come only in multiples of two: 128k tokens and 2048k tokens. There's an implementation-dependent CPU optimization hack going on in there somewhere.
But since this work depends on a strong pre-trained model to extend from, I think it's an open question whether training a byte-level model from scratch with similar tricks would result in the same performance (and whether any organization in the world has the GPUs and chutzpah to do pre-training at these long context lengths...)
Ok so to get a little weirder, could you use a normal tokenizer for pretraining and then go back to bytes for fine tuning along with or before the length extension?
Tokens (be they subword tokens or bytes) are basically the atoms of input to an LM. So you'd have to figure out how to express the relationship between tokens and bytes to the LM. I guess you could do an incremental thing, e.g.:
- make the vocab ~40k subword tokens + 256 byte-tokens (one for each possible byte),
- start training with 100% subword tokens,
- then, after some portion of the training budget has elapsed, randomly replace some tokens with their corresponding byte-tokens,
- then ramp the fraction of replacements up so you're training on 100% byte-tokens for the last xx% of training, without ever exceeding an 8k (or whatever) sequence length,
- then apply the trick from TFA to get xk -> 8xk/64xk bytes of context?
but I'd guess the interesting part of a byte-transformer is multimodality, and we'd need more than a few tricks to get from ^^ to there.
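Roughly what that replacement schedule could look like, as a toy sketch - the vocab layout, the id_to_bytes mapping, and the linear ramp are all made-up assumptions for illustration:

```python
import random

BYTE_OFFSET = 40_000                      # assume byte b gets token id BYTE_OFFSET + b

def to_byte_tokens(token_id, id_to_bytes):
    """Replace one subword token with its constituent byte-tokens."""
    return [BYTE_OFFSET + b for b in id_to_bytes[token_id]]

def mix_tokens_and_bytes(token_ids, id_to_bytes, progress):
    """progress in [0, 1] is the fraction of the training budget elapsed;
    the probability of replacing a token with its bytes ramps up linearly."""
    p_replace = min(1.0, max(0.0, progress))
    out = []
    for t in token_ids:
        if random.random() < p_replace:
            out.extend(to_byte_tokens(t, id_to_bytes))
        else:
            out.append(t)
    return out

# Toy usage: token 7 decomposes into the bytes of "the", token 8 into " cat"
id_to_bytes = {7: b"the", 8: b" cat"}
print(mix_tokens_and_bytes([7, 8, 7], id_to_bytes, progress=0.5))
```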
For what it's worth, we were actually working with LSTMs with nearly a billion params back in the 2016-2017 era. Transformers are far more effective to train and run, but ultimately LSTMs are able to achieve similar results, though they're slower and require more training data.
It's interesting that, according to the results in Tables 5 and 6, adding more context even with LongRoPE makes predictions worse on Books3 and only gives an improvement on Proof-Pile. What's special about the Proof-Pile dataset? Is there some reason we should expect it to have a lot of dependencies at ranges greater than 100k tokens? Should I be surprised or suspicious that performance is flat going from 65k to 131k, but then has a big jump to 262k?
One thing you may have overlooked: Table 5, the Proof-Pile table, only goes up to a 262k evaluation window (meaning that although the model has an extended context window of 2048k under the proposed method, they are not feeding in that many tokens, only 262k - about 13% of the total possible window).
Why? I think this is because Books3 contains, you know, books - including some really long books - while Proof-Pile contains math papers and math stuff, which isn't as long.
So overall, I think what you're seeing is a general trend of increasing perplexity on windows between 256k and 2048k, which is probably not so surprising when you consider the context of the paper: taking a model pre-trained with a much shorter context window and extending that window with a novel technique. It's hard to adapt a model trained to do one thing into doing another thing, and that's what they're doing, so in that context it tracks that the longer the context window, the worse the performance.
Having them available is still hugely beneficial, even if they end up too expensive to use in most production use-cases.
They would still be valuable for prototyping, where fast iteration makes it possible to learn more about the problem you are solving and whether it is even worth solving. They are also valuable for iteration-and-distillation approaches, where you can use the data generated from an expensive model to train a cheaper model.
It seems like a very long input to an LLM can act as additional training data. A possible future is that LLMs will be used to bootstrap AIs that will be "trained" by feeding them a giant prompt at the start. Might end the era of "supervised" learning, "reinforcement" learning, numerical methods and gradient descent -- those techniques are awkward and procrustean. Imagine what you could do if you could improve a neural network - maybe to AGI level - just by talking to it in English?
So it looks like a VERY good idea. Who gets the gist of what I'm saying?
> Imagine what you could do if you could improve a neural network - maybe to AGI level - just by talking to it in English?
You'd get a repeat of Microsoft Tay, the short-lived 2016 chatbot that had to be taken down after not even a full day because 4chan managed to turn it into a full-blown hate spreader in hours [1].
Before something like this can reasonably be released to the Internet at large, we need to figure out how to teach the "ingestion" part to judge the input it's being presented with... basically, AI school. The same way we teach our children that it's not OK to steal other people's stuff, not to be the first to throw punches, not to discriminate against other people, and that there are reasonably trustworthy media and absolutely untrustworthy ones, we need to teach AIs. And we'd also need to figure out ways to teach an AI basic, hard truths: the Holocaust happened, the Earth is a globe and not a pizza, the Earth is not hollow, and the moon landings were real.
At the moment, we're half-assing that with annotated training data (and this is the true moat of OpenAI, not the weights or prompts!), but even with literally millions of hours of training compute invested, the models still fall short of an average high schooler who, after 18 years of "training", can reasonably meet the above criteria for being a productive member of society.
> It's still up for debate as to whether long context windows are worth it.
As someone who's run into problems related to short context windows quite often, can you explain what "worth it" means? Also, when you say "not cheap", what alternative do you have in mind?
AFAIK the few-shot learning abilities of LLMs are much better than what you can achieve with fine-tuning when you only have tiny samples.
Of course that's true for genuinely long context, but it's not clear if it's going to work with the sparse-context hacks intended to keep memory size low.
Having the ability to keep the same coding session going forever, without the LLM forgetting things all the time and making the same mistakes over and over, would be a game changer.
People like RAG-based solutions because you can include references in your answer very easily (e.g., Perplexity, or see the DAnswer "internal search" product launched today on HN). That is extremely hard to make work reliably from a fine-tuned model.
I mean from the perspective of the person doing the search too.
They often want to see the reference - for example, if the LLM is constructing an answer to some company-policy question, it's good to include references to the actual policy documents by URL. This is easy using RAG, but very hard to do reliably with just a fine-tuned LLM.
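A minimal sketch of why that's nearly free with RAG - the retriever and llm objects here are hypothetical stand-ins, not any particular library:

```python
# Hypothetical sketch: each retrieved chunk carries its source URL,
# so references can be attached to the answer mechanically.
def answer_with_references(question, retriever, llm):
    chunks = retriever.search(question, top_k=5)      # assume each chunk has .text and .url
    context = "\n\n".join(f"[{i + 1}] {c.text}" for i, c in enumerate(chunks))
    answer = llm(f"Answer using the numbered sources below.\n\n{context}\n\nQ: {question}")
    references = [f"[{i + 1}] {c.url}" for i, c in enumerate(chunks)]
    return answer, references
```

A fine-tuned model, by contrast, has no record of which training document a given claim came from, which is why reliable citation is so hard there.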
See Perplexity for a great example of this at web scale.
> We don't have enough data on that yet to know for sure. It was just released this week.
I think there is sufficient evidence to believe it works. A bunch of people who do have access have put it through a bunch of real-world exercises to test this, and while GPT-4 128K and Claude 200K both fail, Gemini is passing.
I agree that sending 2M tokens to an LLM is costly at the moment and that RAG has a bunch of advantages in some circumstances.
LLMs in general are likely to be a very inefficient way to solve many problems they're currently being used for. But so long as we don't have something better, brute force remains a valid approach.
A cheap way to solve a problem has always been throwing more compute power at it.
Figuring out new algorithms is monumentally more difficult than adding more hardware, and even when more efficient algorithms are found, we throw more hardware at them and get a million times more done.
It depends on what problem you're solving. If it's a high-frequency request, like a chat response, it's far too inefficient. Most web APIs would consider it bad practice to read 2MB of data on every request, and it's even worse when you add all the LLM computation on top. Instead, use RAG and pull targeted info out of some sort of low-latency database.
However, caching might be a sweet spot for these multimodal, large-context LLMs: take a bunch of documents and perform reasoning tasks to distill the knowledge down into something like a knowledge graph, to be used in RAG.
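One hedged sketch of that "distill once, retrieve cheaply" pattern - long_context_llm and cheap_llm are hypothetical callables, and the keyword matching just stands in for whatever retrieval you'd actually use:

```python
def build_fact_cache(documents, long_context_llm):
    """Offline: one expensive pass over all docs with a long-context model."""
    corpus = "\n\n".join(documents)
    facts = long_context_llm(
        "Extract the key facts from these documents, one fact per line:\n\n" + corpus
    )
    return facts.splitlines()

def answer(question, fact_cache, cheap_llm, top_k=10):
    """Online: cheap retrieval over the distilled facts, no 2MB reads per request."""
    words = question.lower().split()
    relevant = [f for f in fact_cache if any(w in f.lower() for w in words)]
    return cheap_llm("Facts:\n" + "\n".join(relevant[:top_k]) + f"\n\nQ: {question}")
```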
The cool thing about this paper is that it is actually a (relatively) cheap way to solve this particular problem (an LLM with a 2048k window): they take pre-trained models like LLaMA2 and Mistral and extend them to 2048k windows using a novel technique, rather than training a model from scratch at 2048k tokens, which would be prohibitively expensive for mere mortals.
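To make "extending them using a novel technique" a bit more concrete, here's a simplified sketch of positional interpolation: plain linear rescaling of RoPE positions so that a 2048k window maps back into the position range the model was pre-trained on. LongRoPE itself goes further, searching for non-uniform per-dimension rescale factors and extending progressively, so treat this purely as the baseline idea with made-up numbers:

```python
import torch

def rope_angles(head_dim, positions, scale=1.0, base=10000.0):
    """Standard RoPE angles; scale > 1 squeezes longer positions back into
    the range seen during pre-training (plain linear interpolation)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = (positions[:, None] / scale) * inv_freq[None, :]   # (seq, head_dim/2)
    return torch.cos(angles), torch.sin(angles)

# e.g. a model pre-trained at 4k positions, stretched to a 2048k window -> scale 512
sample_positions = torch.arange(0, 2_048_000, 256_000).float()
cos, sin = rope_angles(head_dim=128, positions=sample_positions, scale=512.0)
```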
When we consider the world in its entirety, the mere existence of computer components doesn't signify that we've reached the pinnacle of technological advancement. I firmly believe that our collective intelligence is deeply embedded in our language, whether it's conventional or programming-based. As we witness daily advancements in our language models, we're enhancing the efficacy of what we can currently regard as nascent artificial "intelligence".
This is precisely why programs that are nourished with innovative models, backed by substantial computational power, are capable of developing reasoning akin to Q*. Once these models start to independently foster these advancements and self-improve, we'll witness an unprecedented surge in AI development, surpassing our current capabilities.
I agree with the part where you said our collective intelligence is embedded in language. Intelligence is a social process; none of us are all that great alone. We like to forget that and assume it's all in our heads or in the models.