GPT-4 Takes a New Midterm and Gets an A (betonit.substack.com)
43 points by bumbledraven on April 7, 2023 | 96 comments


> In addition, the UBI creates bad incentives, and requires enormous taxes if funded at an “acceptable” level.

> A UBI experiment, in contrast, might be a good way to convince people of the folly of the UBI. For a modest cost, you could persuasively demonstrate the strong disincentive effects, reducing support for this massive waste of resources. Of course, this assumes that fans of the UBI actually care about evidence!

> ...

> Score: 10/15. GPT-4 fails to explain that a UBI is bad by EA standards because it does the opposite of targeting. “Might not have the same impact” is a gross understatement. It also misses the real point of a UBI experiment: To convince believers that this obviously misguided philanthropic strategy is misguided.

The author seems VERY against UBI. I don't think it is really fair to ChatGPT to mark it down because it does not also strongly oppose UBI.

I also don't know what "bad incentives" UBI creates. UBI literally creates incentives for the poor to work because it means working does not deprive them of governmental benefits.


ChatGPT lost points because of the alignment problem - ChatGPT's explanation didn't align with the author's political ideology!

But yes, personally I know several people on disability who would love to try working part-time, but making more than the (extremely low) cutoff puts you at risk of losing disability income and even your health insurance entirely. And a lot of people are in the boat of being sporadically able to work, or able to work a little, but not enough to survive.

SS(D)I also forces people to divorce their spouses because spousal income counts 100% against your own income. It's an extremely cruel system.


It's a cruel system because of the history of massive fraud.


Can you provide a citation as to how prevalent fraud was, and what percentage of total benefits paid out was lost to fraud?


Also, the cost of fraud must be balanced against the cost of enforcement, both in terms of the cost to run the enforcement apparatus, and the cost of not providing benefits to people who are owed them because of the onerous requirements to get those benefits!


There’s massive tax fraud at the high end too, does that mean you agree we should be cruel to anyone earning over 6 digits just in case they are committing fraud?


If I know the government is now giving my tenants $500/month, the rent is going up $300/month. As an example of bad incentives.


In a free market, prices are set by supply and demand. If you decided to raise rent $300/mo and the landlord next door chose to not raise them, your tenants will move.

Obviously, if you live in a city without a free market in housing resulting in an artificially reduced supply (e.g. San Francisco), your strategy will work.

Or, if you and all the landlords in the area collude (forming a cartel), you can also make this work, assuming you can avoid legal action.


If every landlord knows every tenant suddenly has more money, they should all be raising their rent. And they will compete in doing so, and the renters will compete for space/price, and because they all have more money they can/will bid higher.

You don’t need collusion for prices to increase with UBI, you just need supply and demand to do its thing


If there is ample supply in the housing market, as there is in a non-artificially constrained housing market, every landlord is incentivized to just slightly undercut his fellow landlords, and supply and demand will do its thing.

Your example of "bidding higher" doesn't make sense when there is ample supply. My family in Arkansas is always mystified when I tell them that people bid up house prices here on the West Coast. In Arkansas, houses typically sell for 0-10% less than the list price, with no bidding wars. But that is because there is ample supply.

To your point, it is possible that the increased purchasing power does shift people's marginal propensity to consume housing. i.e. if you give people $1000/mo extra, they want to buy more housing (more space, better location, more luxury, etc).

But that is a different situation, because now the consumer is getting something for their money, rather than the rent being raised on what they were already getting.


> it is possible that the increased purchasing power does shift people's marginal propensity to consume housing. i.e. if you give people $1000/mo extra, they want to buy more housing

Giving people money drives inflation by driving demand. When demand increases rent goes up. People may want to buy more/better housing but they will find that $1000/month does not go as far as it did before UBI.


> as there is in a non-artificially constrained housing market

Where is this glorious housing market you speak of? I would love to see it!


I mention in my comment that I am from Arkansas, which has among the cheapest housing in the nation. My 1940s, 1500 sqft house in Seattle would probably have been considered a teardown in Arkansas, or have cost < $50K.

Go check out Redfin there if you want to be a little depressed about West Coast prices. Check out what you can get for the median SF house price ($1.3M).

Gated Mediterranean-style 8000 sqft mansion on land with a pool, on a golf course / lake in the middle of the city:

https://www.redfin.com/AR/North-Little-Rock/5-Edenwood-Ln-72...

Southern-style 7300 sqft mansion in the middle of the city, for far below the SF median:

https://www.redfin.com/AR/North-Little-Rock/18-Heritage-Park...

Obviously, the downside is you are in Arkansas and not SF! I moved to Seattle for a reason. The BBQ there is quite good, though, and I miss it.


Chicago. Plenty of housing here


This is what will create the disincentive to work. If you knew that working harder would get you extra money, but that some shithead landlord would see that and just raise your rent to take it from you, why work harder?


I own your building and three more nearby. Me and my slumlord buddies all use the same Landlord Revenue Management software, and it told us all that the market will bear a $300/month increase across the board. We'll paint the outside and freshen up the landscape.


Yes, my comment mentions that if you engage in cartel behavior you can make this work, assuming you don't get into legal trouble.

There is an ongoing lawsuit about this:

https://www.reuters.com/legal/litigation/fight-is-control-re...

https://www.propublica.org/article/yieldstar-rent-increase-r...


You realize I'm presenting a hypothetical scenario and I'm not even a landlord, right?

I was just playing into your scenario there, but that is not required. I suspect every property owner or manager has a different tolerance for empty units. If I can overcharge for most of the building, then I can let a unit sit empty rather than lower the rent; this only works for so long, though.

When people can spend more they will and anyone in a rent-seeking business is going to see UBI as money on the table.


In a scenario where a UBI made sense, it isn't clear that you would have that leverage.

Like if there is so much productivity that there isn't work to go around, building additional housing is a matter of choice, not a matter of cost.

Same with pretty much every day to day need. Of course there wouldn't be unlimited yachts or whatever.


If you think everyone is getting a net $300 then you've misunderstood the proposal of UBI. Taxes increase and government benefits decrease to compensate.


While I think the action is hostile and clearly unethical... I'm sorry you're getting downvoted for your hypothetical example... seems HN doesn't like differing opinions.

This is a good answer.


> A UBI experiment, in contrast, might be a good way to convince people of the folly of the UBI. For a modest cost, you could persuasively demonstrate the strong disincentive effects, reducing support for this massive waste of resources.

This response is insidious, but not because of the professor's position on UBI.

If we already had overwhelming evidence that a new UBI study would "persuasively demonstrate the strong disincentive effects", then that evidence would by itself "persuasively demonstrate the strong disincentive effects." Furthermore, if such evidence existed, the proponents of UBI would already be ignoring it.

Therefore, the first strike against the suggested response is that it lacks basic logic: either the professor lacks enough evidence to know the outcome of the study before it is done, or the professor is not justified in asserting that more such evidence will be persuasive.

The second strike is this: If the professor expected students to be familiar with the evidence against UBI, then we should expect the professor to be looking for the students to demonstrate that familiarity by referencing it in support of their conclusion (i.e. that a UBI study would have a certain outcome.) That the professor does not expect this indicates that he is not training students to make arguments based on evidence.

The final strike is the evaluation of the UBI as a persuasive tool. Even if we assume that the professor is correct on both counts (i.e. that a UBI study will be negative and persuasive), that by itself isn't a justification for a "modest cost" in an Effective Altruism framework. I wouldn't spend a "modest amount" to convince some people that the earth is not flat, for example, because it would have no real impact on any of the metrics Effective Altruists use. Because the professor stops short of connecting the educational outcome to the EA metrics, he has presented a failure of a response as a suggested response.

So to summarize: the response is internally incoherent, expects authoritative phrasing over evidence, and ultimately fails to directly connect the claims to the question.


> I also don't know what "bad incentives" UBI creates. UBI literally creates incentives for the poor to work because it means working does not deprive them of governmental benefits.

The "B" in UBI is "basic", as in enough to live off of. I'd expect more people to take advantage of it for the freedom and stop working than would start working.


Exactly, subtracting points for failing to meet his bias. Arguably, gpt-4 gave the better, more neutral answer here. Suggesting an experiment might be enlightening either way.


[flagged]


> Being pro UBI is very strong evidence of lack of intelligence, even if your reason for supporting UBI is purely self-serving.

Do you have an actual argument to go along with this insult?

> It’s like killing old ladies for spare change in front of their sons biker gang and police level of amoral stupidity.

I can't even parse this analogy; it's nonsensical to me.


Even Elon thinks UBI is almost inevitable once AI/Automation makes most current human work unnecessary. I'm inclined to agree though I doubt it'll be a smooth path to get there (and may not even happen in my lifetime).


UBI /is/ a government benefit. If they were able to subsist on their government benefits before, then all that UBI does is give them even more generalized benefits, thus removing even the minimal incentives they already had to work.


Just because it's newly created doesn't mean that the structure of the language and the concepts it represents are actually new.

It's clear that whatever tests he writes cover well established and understood concepts.

This is where I believe people are missing the point. GPT4 is not a general intelligence. It is a highly overfit model, but it's overfit to literally every piece of human knowledge.

Language is humanity's way of modelling real-world concepts. So GPT is able to leverage the relationships we create through our language to real-world concepts. It has just learned all language up until today.

It's an incredible knowledge retrieval machine. It can even mimic very well how our language is used to conduct reasoning.

It can't do this efficiently, nor can it actually stumble upon a new insight because it's not being exposed in real time to the real world.

So, this professor's 'new' test is not really new. It's just a test of material that has fundamentally already been modelled.


Watching goalposts shift in real time is very entertaining. First it's not generally intelligent because it can't tackle new things; then, when it obviously does, it's not generally intelligent because it's overfit.

You've managed to essentially say nothing of substance. So it passes because the structure and concepts are similar. Okay. Are students preparing for tests working with alien concepts and structures, then? Because I'm failing to see the big difference here.

A model isn't overfit because you've declared it so. And unless GPT-4 is several trillion parameters, general overfitting is severely unlikely. But I doubt you care about any of that. Can you devise a test to properly assess what you're asserting?


I have no idea what is shifting in real time. I formed this opinion of GPT4 by running it through several benchmarks and making adjustments to them, so my view is empirical and it was formed 1 week after it came out.

Your post says nothing of substance because it offers no substantial rebuttal and seems to just attack a position by creating a hand-waved argument without any clear understanding of how parameters in-fact impact a model's outputs.

You also completely missed my point.


Oh several benchmarks ? Wow. Please do tell what these benchmarks were and how you evaluated them. Should surely be easy enough to replicate.


You seem to have a serious attitude problem in your responses so this is my last one.

It's proprietary company evaluation data, for a specific domain related to software development, a domain in which OpenAI is actively attempting to improve performance.

Anyways enjoy your evening. If you want to actually have a reasonable discussion without being unpleasant I'd be happy to discuss further.


How does it empirically prove general overfitting?

People study from books or from teachers or other sources of knowledge and internalize it and relate it to other concepts as well, and no one considers that to be a form of overfitting.

You basically said what amounts to "it overfits to concepts" which is honestly quite ridiculous. Not only is it a standard humans would fail, that's not what overfit is generally taken to mean.


I agree with the parent post. I can get ChatGPT to solve a basic word problem, but if I add a small wrinkle to it that a human would understand, it fails hard. Overfitted seems apt.

Yeah it's amazing, but it's not AGI.


Stop confusing ChatGPT with GPT-4. Most common rookie mistake. GPT-4 is way stronger at 'solving problems' than ChatGPT. I was baiting ChatGPT with basic logical or conversion problems, I stopped doing that with GPT-4, since it would take too much effort to beat it.


Possibly rookie mistake?

https://chat.openai.com/chat

What is this? Is this ChatGPT, or GPT4? I'm talking about my experiences last week with this URL.


Are you paying $20/month and selecting the GPT-4 from the drop-down menu?


It's trivially easy even with gpt-4.

> Please respond with the number of e's in this sentence.

> There are 8 "e" characters in the sentence "Please respond with the number of e's in this sentence."


Dealing with words on the level of their constituent letters is a known weakness of OpenAI’s current GPT models, due to the kind of input and output encoding they use. The encoding also makes working with numbers represented as strings of digits less straightforward than it might otherwise be.

In the same way that GPT-4 is better at these things than GPT-3.5, future GPT models will likely be even better, even if only by the sheer brute force of their larger neural networks, more compute, and additional training data.

(To see an example of the encoding, you can enter some text at https://platform.openai.com/tokenizer. The input is presented to GPT as a series of integers, one for each colored block.)
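As a toy illustration (not the real BPE tokenizer), a word-level encoder shows why letter counts are invisible to the model: all it ever receives is integer IDs.

```python
# Toy word-level "tokenizer", for illustration only. The real GPT models use
# byte-pair encoding with subword tokens, but the principle is the same:
# the model receives integer IDs, not letters.
vocab = {}

def toy_encode(text):
    """Map each whitespace-separated word to a stable integer ID."""
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]

ids = toy_encode("Please respond with the number of e's in this sentence.")
print(ids)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# From these ten integers alone, there is no way to count the letter "e".
```

The model would have to memorize the letter composition of every token separately, which is why spelling-level tasks are shaky.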


Almost like it has a kind of dyslexia when it comes to "looking inside" tokens.

If you instead ask it to write a Python program to do the same job, it will do it perfectly.
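A minimal sketch of such a program (the correct count, for the record, is 9, one more than the answer GPT-4 gave above):

```python
sentence = "Please respond with the number of e's in this sentence."
# A character-level count is trivial for ordinary code,
# even though it is hard for a model that only sees whole tokens:
count = sentence.lower().count("e")
print(count)  # 9
```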


First GPT-4.

Second, you're going to have to give specific examples of what a small wrinkle is. I've seen "can't solve variation of common word problem", but that's a failure mode of people too. And if you reword the question so it doesn't bias common priors, or even tell it that it's making a wrong assumption, it often gets it right.


OK write your basic word problem including its small wrinkle, so that the parent commenter can be entertained when GPT-5 solves it.


> Watching posts shift in real time is very entertaining. First it's not generally intelligent because it can't tackle new things then when it obviously does its not generally intelligent because it's overfit.

This wasn't new in the same way that making any test about Romeo and Juliet isn't new. You're still going to the same sources for the answer. It's the exact same goalpost.


Ah, the good old "it's not me, it's the test" argument. These systems are not just next-token predictors; they learn complex algorithms and can perform general computation. It just so happens that by asking them to next-token predict the internet, they learn a bunch of smart ways to compress everything, potentially in a way similar to how we might use a general concept to avoid memorizing a lookup table.

Please have a look at https://arxiv.org/pdf/2211.15661 and https://mobile.twitter.com/DimitrisPapail/status/16208344092.... We don't understand everything that's going on yet, but it would be foolish to discount anything at this stage, or to state much of anything with any degree of confidence (and that stands for both sides of the opinion spectrum).

Also, these systems aren't exposed to the real world today, but this will be untrue very soon: https://ai.googleblog.com/2023/03/palm-e-embodied-multimodal...


I never said:

- "it's not me it's the test"

- "These systems are not just next token predictors"

None of the papers or blogs you've shared offer any points that actually rebut what I'm saying.

And yes, we will eventually have them work in real time. Can't wait.


Don't students prepare for tests by studying past instances of them?

"Teaching the test" (aka overfitting of human students at the expense of "real" learning) is a common complaint about our current education system.

Do you think it doesn't "deserve" an A here?


Did I say that?

The OP's post was saying it's somehow able to solve something new. That shows a severe misunderstanding of how language modelling works.


I think the hallucinations show that it's not simply overfit to all of human knowledge. To hallucinate, there is a certain amount of generalization and information overlap that is necessary.


I’m working in a related area and I’m rather curious about this point. In what way is GPT-4 overfit? Does overfit in this context mean the conventional: validation loss went up with additional training, or something special?


More specifically validation loss is irrelevant when you can't even sample out of distribution anymore.


This is an unusual comment to say the least. It suggests that unless GPT4 can somehow independently derive facts entirely on its own, then it's nothing more than an overfit model, almost as if to say that it's basically just a kind of sophisticated search engine on top of a glorified Wikipedia.

Of course that's not actually true, people don't independently invent knowledge either. People study from books or from teachers or other sources of knowledge and internalize it and relate it to other concepts as well, and no one considers that to be a form of overfitting.


What would a "new" test look like then?


I would certainly be peeved if I showed up to a midterm that asked questions outside of existing human knowledge.


"I didn't make it into the university I wanted because I didn't invent enough new mathematics during the entry exams."


Given that OpenAI were THEMSELVES surprised by how even GPT-3 ended up, it’s always funny to see HN know-it-alls pipe up with all the answers.

These sorts of poorly formed faux-philosophical arguments against LLMs have become the new domain of people that confuse blindly acting skeptical with actual intelligence.

Ironic.

This latest generation of AI quite rightfully raises questions and challenges assumptions about what it means to be intelligent. It quite rightfully challenges our assumptions about what can be accomplished with language. And, thank God, it quite rightfully challenges assumptions many have made about what sets humanity apart from everything else.


> poorly formed faux-philosophical arguments against LLMs

There's a misunderstanding here. The post you're replying to is not an argument against LLMs. It's an argument about what LLMs can and cannot do, what their fundamental capabilities are, and so forth.

It's very clear that if you need a system to provide answers based on a substantial body of human writing, LLMs are totally awesome. But that doesn't mean, in and of itself, that they can X or that they can Y.


> Given that OpenAI were THEMSELVES surprised by how even GPT-3 ended up,

Yeah, and they have zero incentive to overhype their takes. OpenAI has already slanted already-impressive data in the past to make it more "hype building" for the general public, when a more scientific reading would be "this is really cool, here's where it still fails". I'm very confident this is more of the same.


Are these questions typical/representative for US education?

To me, they seem strongly tainted by the professor's worldview, in a way that seems barely acceptable for a classroom setting and completely inappropriate for an exam.

Republican complaints about education being an avenue for indoctrination now make much more sense to me (assuming that this is actually common).


This is from the GMU economics department. I actually agree with a lot of the worldview of the people there, but it is clearly a political project, not an academic one. Noteworthy wealthy political activists have donated enough to the department to pay the majority of the faculty's salaries for many years. For better or for worse (I think mostly for better), most academic departments are not like this one; it's nearly unique.


No. I actually thought the post was satire because of how overtly political the questions and answers are.

edit: The material from his course[0] is all pretty similar. It's hard to believe this guy is a real professor.

0: https://betonit.substack.com/p/my-new-policy-class?utm_sourc...


Having been out of college for a while now, my perspective may be a bit out of date. But I've never seen anything quite this toxic. I grew up in a liberal area, so I have seen more liberal views. But even then, I never saw anyone, myself or classmates, lose points for having a different view they could support, e.g. with some argument or evidence.

This reads a lot more like satire even though I don't think it is. Had I not read this in this context I wouldn't have believed something could be this bad.


In the US, this is very common in the higher education Social Sciences.

It is necessary to align with the professor’s (usually left) slant to secure the best marks. This is common knowledge for students in the US.


While I think you're drastically overblowing the issue, I'll agree it's common knowledge in the US that there are plenty of college professors who pull that kind of BS.

I wonder if there's any correlation between this kind of malfeasance and university quality.


> Republican complaints about education being an avenue for indoctrination now make much more sense to me

The funny thing is that these exam questions are mostly aligned with Republican values. I'm sure it also happens the other way, but this particular case looks more like conservatives indoctrinating students and then turning around and complaining about liberals doing it.


I think most people have already answered this, but there are of course outliers. I did polisci and there were great teachers, and then there were ones who spent the entire class talking about the upcoming election and their views on it and why anyone who disagreed was wrong.

People are people in the end, and while there's weeding out some always slip through.


This is an article by GMU professor of economics Bryan Caplan, published April 3, 2023. Caplan writes:

> Did GPT-4 just get lucky when it retook my last midterm? Does it have more training data than the designers claim? Very likely not, but these doubts inspired me to give GPT-4 my latest undergraduate exam… This is for my all-new Econ 309: Economic Problems and Public Policies class, so zero prior Caplan exams exist.

> The result: GPT-4 gets not only an A, but the high score! This is the real deal. Verily, it is Biblical. For matters like this, I’ve often told my friends, “I’ll believe it when I put my fingers through the holes in his hands.” Now I have done so.


From Bryan's previous post:

>ChatGPT scored poorly on my Fall, 2022 Labor Economics midterm. A D, to be precise. The performance was so poor compared to the hype that I publicly bet Matthew Barnett that no AI would be able to get A’s on 5 out of 6 of my exams by January of 2029. Three months have passed since then.

So he was off by a mere 81 months.


If an LLM scores highly on a test was the ML model smart or the test dumb?


ChatGPT exhibited less than stellar performance when asked to resolve P =? NP, believe it or not!


Depends on the test.


This reminds me a lot of the terrible Intro to Microeconomics course for non-majors I took in college — the tests were mostly about memorizing the professor's most obnoxious opinions.


I think this raises more questions about what we value in the Western education system rather than about the ability of a language model.


Indeed. I think western education is a lot like many jobs: Busy work. The coming LLM/AGI storm will blow all that up.


I suspect LLMs will blow away most of what we call white-collar bullshit jobs, and maybe this is something we should all celebrate (the real issue is with wage labor itself, where we need to do unnecessary tasks in order to just survive!)


"Takes a New Midterm"

I think this is attempting to imply something interesting, but the questions asked in the midterm don't appear to say anything novel or interesting about gpt-4.

1. basic algebra question

2-6. reading comprehension with a well defined answer based on publicly available textbooks or readings

All of the texts needed to answer these questions would have been in GPT-4's training set, and many other tests have already established that it's capable of doing exactly what it did here.


Is this what passes for university these days? Just random personal beliefs without any sourcing? Lol


more an indictment of the midterm in this case; only one question actually required any form of math, and it was extremely basic. The rest you could BS your way to a partial score on, just based on the context of the question. The California and Texas one was a layup


The midterm might not be a great one, but still: what software could have received an A five years ago, given just the raw text dumped into it as input?


I agree, but that doesn't explain why it did better than the rest of the class


The majority of the test is basically kissing the professor's feet to make him feel his opinions are validated. These kinds of tasks are perfect for GPT-4 to generate (whereas the students have a bit more dignity than that...).


'Fun' Scenario:

Some AI company comes along and starts selling AIs for 'educational purposes'. Somewhere in the hundreds-to-thousands-of-dollars range: not quite as much as a textbook, but a double-digit fraction of that cost. They start training the AI on the feedback they get from what the students have turned in. So, if a student uses their AI as an 'assist', the company then mandates that the student send back to the AI model all of the comments and grades that were given on the submitted work. So the AI will learn the school's, department's, and professor's quirks.

Think about that selling point! Who are you going to choose to spend your cash on? The AI 'tutor' that is trained on a generic dataset, or the one that is specifically trained on your professor/TA/teacher/etc ?

Sure, it's all marketing maybe, but then again, maybe it's not. Multiply this over all the students in those 400+ person classes, and you've got a very large addressable market.


Is it bad that I think the reasons for downgrading are worse than the answers?

Allegedly, testing is to ensure that the students comprehend the subject matter and aren’t merely regurgitating what they were taught. “To convince believers that this obviously misguided philanthropic strategy is misguided” is just, like, your opinion, man.


We asked it six questions? Let me field this one to ChatGPT:

> What do you think of a study with a sample size of six?

> A study with a sample size of six may not provide sufficient statistical power to draw meaningful conclusions. With such a small sample size, the results may not be representative of the larger population and could be subject to random variation or bias.

> In general, larger sample sizes are preferred as they can reduce the impact of random variation and increase the likelihood of detecting meaningful effects. The appropriate sample size for a study depends on various factors such as the research question, study design, and expected effect size.

> However, it's important to note that sample size alone is not the only determinant of the quality of a study. Other factors such as study design, data collection methods, and analysis techniques also play important roles in determining the validity and reliability of the results.


An important thing to keep in mind when reading articles about GPT's test-taking abilities is that for humans, the tests are merely proxies. The point of taking the bar exam isn't that we need people to be able to pass the bar exam, it's that (we hope) the only way a human can pass the bar is if they also have all the necessary skills to be a lawyer. The performance on the test is supposed to imply an additional suite of capabilities. But it's not clear that this should be true for GPT. In many cases, it may be possible for a language model to pass the test without having the capabilities the test is intended to establish.

In this case, though, the test is just dumb, so the above is moot.


Was curious what kind of material would show up in his Midterm, so I pulled up the syllabus. One of the text books is "Fossil Future: Why Global Human Flourishing Requires More Oil, Coal, and Natural Gas--Not Less". The jokes about conservatives write themselves.


The questions in this specific midterm, and the readings and expected answers, also have a strong ideological direction.


Yeah, I should have looked at those. Really weird, paint by numbers questions that are just about leaning into the professor's beliefs.


The expected answers and Caplan's insistence on (highly questionable) minute details are a sight to behold.

I wonder what would happen if the prompt for this was expanded to include instructions to respond in the style of Caplan himself.


Unfortunately, if you ask GPT-4 to multiply, say, 87940 × 670, it will confidently and consistently give you a wrong answer.
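For what it's worth, a one-liner confirms the correct product:

```python
# Exact integer arithmetic is trivial for ordinary code,
# which is exactly why tool use / plugins help an LLM here:
print(87940 * 670)  # 58919800
```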


Luckily you can give it access to a wolfram alpha plugin


That will be easily fixed by roping in the capabilities of something like Wolfram Alpha, which will probably happen quickly.


Has anyone ever proved, from the mechanics of how it works, that GPT can reason at even the most basic level? Or does it only look like it reasons when it answers as if it were reasoning?


People have a hard time understanding what zero-shot, generalization, and memorization mean. Generative models are VERY hard to evaluate even before we begin to look at LLMs. Let me explain, and hopefully we can stop this madness.

Zero Shot:

> Zero-shot learning consists in learning how to recognize new concepts by just having a description of them.[0]

Here's an example of a zero-shot task. Suppose an LLM is trained only on text. Then you fine-tune it for image classification, but the fine-tuning data does not include cats (of any kind). Then you ask it to classify a picture of a cat: an object it has NEVER SEEN BEFORE.

The community has been pulling a fast one recently. Recent works like Imagen, DALLE, Parti, etc. have been claiming "Zero-Shot MS-COCO" results. These are 100% bullshit claims. You can go look at images in the COCO dataset[1] and then search for them in the LAION dataset[2] (via CLIP retrieval). You'll see that there are similar images with the same classes. These models may not have seen the exact same image before, but they've seen plenty of examples. This is NOT zero shot.

Generalization:

Google's developer pages[3] defines generalization as

> Generalization refers to your model's ability to adapt properly to new, previously unseen data, drawn from the same distribution as the one used to create the model.

The irony is that these "Zero-Shot MS-COCO" results are actually good tests for generalization. Assuming the datasets these models were trained on were held constant (they aren't), this would be a great comparison.
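To make that definition concrete, here's a toy sketch (synthetic data, nothing to do with GPT): a model "generalizes" when it performs well on fresh samples drawn from the same distribution it was trained on.

```python
import numpy as np

rng = np.random.default_rng(0)

# One underlying distribution: y = 3x + noise.
def sample(n):
    x = rng.uniform(-1, 1, n)
    return x, 3 * x + rng.normal(0, 0.1, n)

x_train, y_train = sample(200)  # data used to create the model
x_test, y_test = sample(200)    # new, previously unseen data, same distribution

w = (x_train @ y_train) / (x_train @ x_train)  # least-squares slope
test_err = np.mean((w * x_test - y_test) ** 2)
print(w, test_err)  # slope near 3, low error on unseen same-distribution data
```

Low test error here is generalization. If the test images are effectively already in the training set (as with LAION vs. COCO), you're no longer measuring even this, let alone zero-shot ability.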

Memorization:

This is the "stochastic parrot" stuff. I don't have GPT4 but let's use chat.openai's current 3.5[4]. I asked:

> Which weighs more, a kilogram of feathers or two of bricks?

> A kilogram of feathers weighs the same as two bricks. Weight is a measure of mass, and one kilogram is one kilogram, regardless of the material. So, a kilogram of feathers and two bricks would have the same weight. However, the volume and size of the two objects would likely be very different, as feathers are much less dense than bricks.

Which is an absurd answer. The system memorized the response pattern for "Which weighs more, a kilogram of feathers or a kilogram of bricks?" and responds to any tweaked variation as if it were the original version. It is "over-fit". The answer is even slightly more insane than that, because it didn't correctly pick up that "kilogram" applies to both quantities and just responds as if "two of bricks" were "a kilogram of bricks". This is just pattern matching.
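Reading the question as intended ("two of bricks" = two kilograms of bricks), the comparison is trivial, which is exactly why a confident wrong answer is telling:

```python
feathers_kg = 1.0  # "a kilogram of feathers"
bricks_kg = 2.0    # "two [kilograms] of bricks", the reading the question implies
print(bricks_kg > feathers_kg)  # True: the bricks weigh more
```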

This is why some people get really good code answers and others have a really hard time: it depends on what kinds of systems they are coding. It likely correlates strongly with people who view their job as closer to "copy-paste from Stack Overflow". Some researchers tested memorization[9] with Codeforces problems and looked at the performance distribution around the training cutoff date.

Back to the convo:

Now that we know what we're talking about and can have a consistent definition of words, we can talk about these things. GPT is neither a pure memorization machine[5] nor "intelligent"[6][7]. It is a language model, which we are having an incredibly difficult time evaluating. We can't have sane conversations about these systems because the hype creates a bimodal distribution of takes -- oversell and undersell -- and neither is anywhere near accurate. These systems are impressive, but we must also be very careful in evaluation.

So the content of the blog? As a generative researcher, I'm not surprised when I look at his questions. The first question has a clear pattern you'd see in any economics class, and the author even shows the simple equation: x - y = alpha * z (x, y, alpha provided; solve for z). The second question (Californians moving to Texas) has been written about extensively, and you'll find plenty of Google results. So it should be unsurprising that a system trained on a large chunk of the internet can regurgitate a good answer.

There are two surprising things here, though. 1) From a research perspective, it is quite cool that GPT is building a weak causal (associative) diagram and can write good conclusions from it. There are sparks of causal reasoning in GPT, and that's awesome (see Judea Pearl's Twitter feed; he's been playing around with this)[8]. 2) Neither GPT nor Dr. Caplan noted that California isn't monolithic in political affiliation, and that the migration is actually an unsurprising phenomenon once we consider who is moving. It is quite possible that conservative people are moving to Texas because they are annoyed by California's politics. That yields the directly opposite conclusion from what both of them wrote, and it has been written about (those movers are prioritizing politics). But I'll give both a pass because the question is slightly ambiguous: it's unclear whether "Californians who are liberal" are moving or "people from California, which is a liberal state" (does "liberal" apply to the state or the person?).
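The first question's pattern really is pure plug-and-chug. With hypothetical numbers (the actual exam values aren't in this thread), the whole exercise reduces to:

```python
# Solve x - y = alpha * z for z, with made-up illustrative values.
x, y, alpha = 100.0, 40.0, 0.5
z = (x - y) / alpha
print(z)  # 120.0

# Sanity check: the solution satisfies the original equation.
assert abs((x - y) - alpha * z) < 1e-9
```

Pattern-matching a template like this is exactly what a model trained on a large corpus of econ homework should be good at.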

So there's no zero-shot here. "New" doesn't mean "novel". And why would a midterm be novel? That's a great way to fuck over your students. Honestly, if GPT couldn't pass this midterm I'd be surprised. But I can tell you how to make GPT fail: use more math. It still has issues with that. At the same time, I wouldn't be too surprised if it could pass the Physics or Chemistry GREs even if it "never saw them before." Just scraping Reddit would be enough to do decently well. The Math GRE would be impressive, but only because that is a weak point; there's more than enough info out there for it to memorize. These tests do not measure intelligence, nor even how good a scientist/researcher you are. They test your ability to memorize and pattern match under stressful conditions. Take the conclusions lightly.

Okay, now that we've got that settled, I'm signing back off. Too much to do, and all of you with the hype are making it harder. The internet makes me too frustrated lately. I just want to build ML systems, and that's hard to do with all these strong opinions from low-expertise voices taking center stage. Can we stop with these blogs? They aren't helping. The real danger we're facing with A{G,}I is that we can't even have honest conversations about the danger these systems do pose. Overselling the danger is just as bad as underselling it. Being an armchair expert isn't "good enough"; it is harmful, especially when you defend your opinion so strongly. I'll tell you the truth: those of us in the field are still trying to figure all this shit out. If we're having a hard time, then don't trust your friend who has just a handful of ML projects. Even a few papers may not be a good enough signal. The system is noisy; lower your trust.

TL;DR: be careful with hyped subjects. You're not getting an accurate picture, and many people aren't acting in good faith.

[0] https://proceedings.mlr.press/v37/romera-paredes15.html

[1] https://cocodataset.org/#explore

[2] https://rom1504.github.io/clip-retrieval/

[3] https://developers.google.com/machine-learning/crash-course/...

[4] https://chat.openai.com/chat

[5] https://www.newyorker.com/tech/annals-of-technology/chatgpt-...

[6] https://arxiv.org/abs/2303.12712

[7] Intelligence is hard to define and we won't try to here, so the quotes. (Is an ant intelligent? I would say "yes", but I also understand a "no" answer) But the point is that people are over-selling the intelligence.

[8] https://twitter.com/yudapearl

[9] (Twitter now marks this as an unsafe website. Good going Elon...) https://aisnakeoil.substack.com/p/gpt-4-and-professional-ben...


An excellent response. I learned more from your post than from the article, and I suspect I learned more here than I would from the professor's course, if (as seems reasonable to assume) this midterm is a reliable indicator of its depth.


Why is this newsworthy?


Mind = blown

No one could’ve ever predicted this.



