So, this is a fairly interesting innovation, with what looks like really good results. The paper sort of buries the lede IMO, in that xVal looks like it could possibly be something that turns next-gen LLMs into zero-shot numeric prediction models.
The big idea is to take any corpus and, instead of tokenizing numbers as individual digits (GPT-2 era) or weirdo floating-point range buckets, map every number to a single token: [NUM]. They then keep a shadow tensor/vector where each [NUM] is paired with the actual floating point (maybe fixed point?) value.
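Roughly, I picture the preprocessing step looking like the toy sketch below (my own paraphrase, not the paper's code; the regex and the [NUM] string are just placeholders):

    import re
    import torch

    NUM_TOKEN = "[NUM]"
    NUM_RE = re.compile(r"-?\d+(?:\.\d+)?")

    def encode_with_num(text):
        # Pull every literal number into a parallel list of floats, then replace
        # each occurrence in the text with the single [NUM] token.
        values = [float(m.group()) for m in NUM_RE.finditer(text)]
        return NUM_RE.sub(NUM_TOKEN, text), torch.tensor(values)

    tokens, vals = encode_with_num("The star has mass 1.989 and radius 6.96")
    # tokens -> "The star has mass [NUM] and radius [NUM]"
    # vals   -> tensor([1.9890, 6.9600])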
When the model predicts a [NUM] token, a number prediction head chooses the value. This lets them train on predicting whether or not a number will be next in a generated text (the model turns out to be really good at this, not surprising). And, how cool -- their loss function can compare the actual number produced by the number head against the true value, giving a high-quality signal for how good the number guess was.
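As I read it, the training objective is just the usual next-token cross-entropy plus a regression term at positions where the target token is [NUM] -- something like this (the names and the loss weighting are my guesses, not the paper's):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class NumberHead(nn.Module):
        # Small scalar head on top of the transformer's hidden states.
        def __init__(self, d_model):
            super().__init__()
            self.proj = nn.Linear(d_model, 1)

        def forward(self, hidden):                # (batch, seq, d_model)
            return self.proj(hidden).squeeze(-1)  # (batch, seq) predicted values

    def xval_style_loss(token_logits, token_targets, value_preds, value_targets, num_token_id):
        # Standard next-token loss over the whole vocabulary, [NUM] included.
        ce = F.cross_entropy(token_logits.transpose(1, 2), token_targets)
        # Regression loss only where the target token is [NUM]; value_targets holds
        # the true number at those positions and can be anything elsewhere.
        mask = token_targets == num_token_id
        mse = F.mse_loss(value_preds[mask], value_targets[mask]) if mask.any() else torch.tensor(0.0)
        return ce + mse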
This works really, really well out of the box for a bunch of math problems up to 5-digit multiplication -- superhuman levels of accuracy -- and vastly beats the SOTA from other foundation models.
They then go on to throw fine-tuning tasks on scientific data at it, and it absolutely does not shit the bed, which is extremely profound to me and something they do not make a big deal of. I would expect some scaled-up models using this technique could be highly useful for a broad range of numeric / stats / prediction work, in the same way that GPT-3 and onward have been highly useful as "text calculators".
There are a few big things I don't like about this encoding scheme, though, including:
* It's not end-to-end. The input string must be parsed before feeding it to the LLM, for finding and substituting any numbers in the string with the special [NUM] token. Parsing is hard -- and prone to ugly edge-case failures, e.g., due to misspellings. I'll keep hoping and wishing for an end-to-end solution.
* It doesn't work well with numbers written out as a mix of digits and words, e.g., "How much is a third of their 7.123% percent ownership of that series of preferred stock?" (note the superfluous "%", which would likely contribute to messing things up).
* It requires that every number be shifted and scaled, respectively, by the mean and standard deviation of all numbers in the training corpus, so that it hopefully falls within the range [-5, 5]; this offsets the 'squashing' of features caused by the LayerNorm before the first residual block (see the rough sketch just after this list).
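To be concrete about that last bullet, the scaling step is basically a z-score over the whole training corpus -- something like this toy version (mine, not the paper's), and a long-tailed corpus will still spill well outside [-5, 5]:

    import torch

    def normalize_corpus_numbers(values, mean, std):
        # Shift and scale by the corpus mean/std so that most numbers land
        # roughly inside [-5, 5]; outliers still escape the range.
        return (values - mean) / std

    vals = torch.tensor([1.0, 100.0, 1e6])
    normed = normalize_corpus_numbers(vals, vals.mean(), vals.std())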
What do you think an end-to-end solution would look like?
I missed the LayerNorm squashing bit; I skimmed that part when I read the paper, so I'll re-read it. From your description, I agree that's a problem. I wonder if something closer to a floating point / exponential representation would be better, if it let you avoid the squashing -- e.g., the model would become a great order-of-magnitude numerist.
For me, an end-to-end solution would look like a model that can 1) read numbers in any format from the string, 2) manipulate and reason about them correctly, and 3) reliably output structured data if requested. No need for special encodings. Obviously, we're nowhere near that!
In the interim, I agree with you that a representation that explicitly encodes a mantissa and an exponent seems like a better stopgap solution. We human beings already do that to some extent -- e.g., we often think in terms of tens, hundreds, thousands, and so on, as numbers get bigger, and in terms of percents, basis points, and decimal points as they get smaller.
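To be concrete, the kind of split I have in mind is just a base-10 mantissa plus an integer exponent, along these lines (a toy sketch, not anything from the paper):

    import math

    def to_mantissa_exponent(x):
        # Base-10 mantissa (magnitude in [1, 10), sign preserved) and integer exponent.
        if x == 0:
            return 0.0, 0
        exp = math.floor(math.log10(abs(x)))
        return x / 10 ** exp, exp

    to_mantissa_exponent(6371.0)   # -> (6.371, 3)
    to_mantissa_exponent(0.00042)  # -> (~4.2, -4)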
I wouldn't be surprised if large models eventually learn, implicitly, to represent numbers internally as mantissas and exponents.
I’ve been playing around with something similar for factual nouns, where the next-token prediction is a placeholder token like [PNOUN] and a downstream predictor then chooses the actual result.
One major problem is that you need to autoregressively access the correct embedding when doing inference, and in particular in batch inference, so waiting for the downstream prediction to complete is going to slow inference down. One way I was considering getting around that was to run all the auxiliary classifiers in parallel with the main LLM and just access those results when needed.
That then runs into the problem that those auxiliary classifiers have to be smallish relative to the main model, and then there are performance questions. A complicated thread to unravel.
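To make the "run them in parallel" idea concrete, the decode step I keep sketching looks roughly like this -- the auxiliary head shares the backbone's forward pass, and its output is only consumed when the placeholder token is actually sampled (every interface here is made up for illustration):

    import torch

    def decode_step(backbone, token_head, aux_head, input_ids, placeholder_id):
        # One autoregressive step. The auxiliary head reuses the same hidden states,
        # so there is no extra round-trip; we just ignore its output unless the
        # sampled token is the placeholder (e.g. [PNOUN] or [NUM]).
        hidden = backbone(input_ids)              # (batch, seq, d_model), hypothetical API
        last = hidden[:, -1]
        next_id = token_head(last).argmax(-1)     # greedy decoding for simplicity
        aux_out = aux_head(last)                  # runs regardless, read only when needed
        needed = next_id == placeholder_id        # per-example mask for batch inference
        return next_id, torch.where(needed, aux_out, torch.full_like(aux_out, float("nan")))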
Their approach to [NUM] doesn't have the same performance implications, but there are some clear issues: large numbers will have blowup problems, and zero and small floats have big problems here.
Err... interesting, maybe -- useful? I don't see why.
There are a near-infinite number of function approximators to choose from, and theory-building science has little use for any of them. Empirical, or heuristic, science already uses much faster and arbitrarily accurate ones.
I don't understand how this is useful. People don't calculate numbers in their heads; they learned to use calculators. LLMs should just learn to access an external API every time they need to make a calculation, instead of spending precious neurons on primitive tech already solved by transistors.
Well, the prediction tests they do would not be possible in that scenario, because they want to fine-tune a model against a set of data in order to make those predictions, and API calls aren't differentiable -> they can't train through them -> they can't use them to learn to predict. So, I think they see a use case here.
If you want to say what you think is important about an article, that's fine, but do it by adding a comment to the thread. Then your view will be on a level playing field with everyone else's: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&so...
Worth a read.