I agree with the point you're making here, but it’s also funny that the description of someone passing a test but not being able to do much without a lot of human supervision is… exactly the description of a human college graduate.
Can anyone estimate the chance that sample exams containing these questions were in its training set?
And that it's just regurgitating the answers someone else wrote?
I imagine the chance is very high, given how much uni lecturers recycle exam questions.
When I was at uni you could just get the last 5 years' worth of questions from the library for almost any subject and guess what the questions were probably going to be. Often they just changed a few numbers.
Teaching undergrads is like a sausage factory. The actual intellectual value for undergrads is in the seminars, and the practical value is in the labs. The rest is showing you can regurgitate what you've been told.
> To the best of my knowledge—and I double-checked—this exam has never before been posted on the public Internet, and could not have appeared in GPT-4’s training data.
The exam, no, but most of the questions most certainly have been. I know this because I've done extremely similar problems for homework and checked my answers online.
You can try phrasing the question in a way it wouldn't normally be phrased, so that a correct answer would still demonstrate understanding of the concept.
I remember Yann LeCun gave an interview and he came up with some random question like "If I'm holding a piece of paper with both of my hands above the desk and I release one, what would happen?" His point was that since the LLM doesn't have a world model, it wouldn't be able to answer these trivial intuitive questions unless it saw something similar in the training set. And then the interviewer tried it and it failed. That was 3.5. I've tried many variations of that class of problem with 4 and it seems to generalize basic physics concepts quite well. So maybe 4 learned basic physics? Why couldn't it learn QM theory as well?
For a college graduate, that is the starting point. Test results are supposed to signal that the person can learn new things. While a fresh graduate needs a lot of supervision, they should quickly become more capable and productive.
For a language model, test results are the end. They are supposed to measure what the model is capable of. If you need better performance, you must train a better model.
It’s the college graduates who aren’t the way you describe, those who show initiative and responsibility in their work, who are the best hires. So not much changes.