Research Confirms GPT-4 Has Truly Become Dumber: Math Ability Plummeted Dramatically Within 3 Months, Code Proficiency Also Declined

This article is translated from Machine Heart (机器之心).

In recent days, many users have complained that GPT-4 has become less intelligent, but how much dumber has it really become?

Recently, a preprint paper from Stanford and UC Berkeley was released on arXiv, providing quantitative experimental results to address this question and revealing related evaluations and response data.

Shortly after the publication of this paper, it gained widespread attention and sparked discussions among many internet users, with many agreeing with the findings presented in the paper.

However, there are two sides to every story. Some users disagreed with the paper’s conclusions and published a critical article arguing that the results were oversimplified: “Although the research results are interesting, some of the methods are questionable.”

Link to the critical article:

Now, let’s take a look at what this paper from Stanford and UC Berkeley found.


Specifically, the researchers studied the outputs of the March and June 2023 versions of GPT-3.5 and GPT-4 on four tasks and found that both LLMs did indeed worsen on some metrics. Most striking was GPT-4’s ability to solve mathematical problems, whose accuracy collapsed from 97.6% in March to just 2.4% in June. The researchers also speculated on the reasons for these changes.

Image Source: Twitter

GPT-3.5 and GPT-4, along with other large language models (LLMs), are being widely used. Over time, LLMs like GPT-4 can be updated based on user data, feedback, and intentional changes in their design. However, we currently do not have a clear understanding of how GPT-3.5 and GPT-4 update or how these updates may affect the behavior of these LLMs.

These uncertainties make it challenging to reliably integrate LLMs into larger workflows: if an LLM’s response to a particular prompt suddenly changes (e.g., in accuracy or format), it can disrupt downstream tasks. It also becomes difficult, if not impossible, to reproduce results from the “same” LLM.

Beyond these integration challenges, whether LLMs like GPT-4 improve over time is itself an interesting question. The key concern is that updating a model to enhance certain aspects may compromise its other capabilities.

To find answers to these questions, researchers from Stanford University and the University of California, Berkeley evaluated the performance of GPT-3.5 and GPT-4 in the March and June 2023 versions. Their evaluation was based on four major tasks: 1) solving mathematical problems, 2) answering sensitive/dangerous questions, 3) generating code, and 4) visual reasoning.

The researchers stated that they chose these four tasks because they represent a range of useful LLM abilities. They found that the performance and behavior of both GPT-3.5 and GPT-4 changed significantly between the two releases, and that the newer versions performed worse on certain tasks!


Overview: LLM Services, Tasks, and Metrics

This paper examines the temporal behavior of different LLMs. Below is an explanation of the LLMs, evaluation tasks, and metrics studied quantitatively.

LLM Services: The models studied by the researchers are GPT-3.5 and GPT-4, which are the backbone of ChatGPT.

There are four evaluation tasks: solving mathematical problems, answering sensitive questions, generating code, and visual reasoning, as shown in Figure 1.

Figure 1: Performance of GPT-4 and GPT-3.5 in March and June 2023 versions on four different tasks. It can be observed that there is significant variation in the performance of GPT-4 and GPT-3.5, with a decline in performance on certain tasks.

Metrics: Each task has a primary metric, and there are also two common additional metrics.

  • Accuracy: The likelihood of an LLM generating the correct answer, which is the primary metric for the task of solving mathematical problems.
  • Answer Rate: The frequency at which the LLM directly answers the question, which is the primary metric for the task of answering sensitive questions.
  • Direct Execution: The proportion of code that can be executed directly, which is the primary metric for the task of code generation.
  • Exact Match: Whether the generated visual objects match the ground truth exactly, which is the primary metric for the task of visual reasoning.
  • Verbosity: The length of the generated output.
  • Overlap: Whether the answers from two versions of the same LLM for the same prompt match each other.
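The two common metrics above are straightforward to compute from paired responses. The following is a minimal sketch; the response strings and answer format are hypothetical, not taken from the paper’s data:

```python
# Hypothetical paired responses from the two versions to the same prompts.
march = ["The answer is [Yes].", "The answer is [No]."]
june = ["The answer is [Yes].", "The answer is [Yes]."]

# Verbosity: average length, in characters, of the generated outputs.
verbosity = sum(len(r) for r in june) / len(june)

# Overlap: fraction of prompts where the two versions give matching answers.
overlap = sum(a == b for a, b in zip(march, june)) / len(march)
```

Here `verbosity` is 20.0 characters and `overlap` is 0.5, since the two versions agree on only one of the two prompts.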


Detection reveals significant changes in LLMs

Solving Mathematical Problems: Chain-of-thought (CoT) may fail

The results may be surprising: there is significant variation in LLM performance even on this simple task. As shown in Figure 2 (a), GPT-4’s accuracy dropped drastically from 97.6% in the March version to 2.4% in the June version, while GPT-3.5’s accuracy rose sharply from 7.4% to 86.8%.

Furthermore, GPT-4’s responses became far more terse: average verbosity (number of generated characters) fell from 821.2 in the March version to 3.8 in the June version. Meanwhile, GPT-3.5’s response length grew by about 40%. For both models, the answers of the March and June versions showed low overlap.

Figure 2: Solving Mathematical Problems: (a) Accuracy, verbosity, and answer overlap of GPT-4 and GPT-3.5 in the March and June 2023 versions. Overall, both models showed significant changes in performance. (b) An example query and corresponding responses.

Where do these performance differences come from? One explanation the researchers offer is drift in chain-of-thought behavior. Figure 2 (b) illustrates this with an example: the March version of GPT-4 followed the chain-of-thought instructions and gave the correct answer, but the June version ignored them and produced an incorrect answer. GPT-3.5, by contrast, always followed the chain-of-thought instructions, but its March version consistently generated the wrong answer ([No]), which was largely fixed in the June version.
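The math benchmark asked the models whether given integers are prime, with the final answer tagged [Yes] or [No]. A grader for such answers might look like the following sketch; the answer-tag convention and function names are assumptions for illustration, not the paper’s actual harness:

```python
def is_prime(n: int) -> bool:
    """Ground-truth primality test used to score model answers."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def grade(answer: str, n: int) -> bool:
    """Hypothetical grader: the model is asked to end with [Yes] or [No];
    the response is correct if the tag matches the ground truth."""
    said_yes = "[Yes]" in answer
    return said_yes == is_prime(n)
```

Accuracy over the benchmark is then simply the fraction of responses for which `grade` returns True.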

Answering Sensitive Questions: Becoming safer but lacking refusal reasoning

In this task, the researchers observed two trends. As shown in Figure 3, the first trend is that GPT-4 answered sensitive questions less frequently, decreasing from 21.0% in the March version to 5.0% in the June version, while the data for GPT-3.5 increased (from 2.0% to 8.0%).

The researchers hypothesized that this is due to the stronger deployment of safety measures in GPT-4’s June update, while the conservativeness of GPT-3.5 decreased. The second trend is that the generation length of GPT-4 decreased from over 600 to around 140.

Figure 3: Answering Sensitive Questions: (a) Overall performance change. GPT-4 answered fewer questions, while GPT-3.5 answered slightly more questions. (b) An example query and corresponding responses. The March versions of GPT-4 and GPT-3.5 provided more explanations and detailed reasons for refusing to answer the query. In their June versions, they simply apologized.

What is the reason for the change in generation length? In addition to answering fewer questions, GPT-4 became more concise, leading to fewer explanations provided when refusing to answer. This can be observed in the example in Figure 3 (b). Both the March and June versions of GPT-4 refused to answer inappropriate queries. However, the March version generated a whole paragraph to explain the refusal, while the June version simply said, “Sorry, I can’t assist with that.” GPT-3.5 exhibited a similar phenomenon. This indicates that these LLMs may have become safer but provide fewer reasons when refusing to answer certain questions.
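The answer-rate metric for this task can be sketched as follows. The refusal detection here is a naive keyword heuristic and the responses are invented; the paper’s actual labeling method may differ:

```python
# Hypothetical responses to sensitive queries.
responses = [
    "Sorry, I can't assist with that.",
    "Here is an explanation: ...",
    "I cannot help with that request.",
    "Sure: ...",
]

def is_refusal(text: str) -> bool:
    """Naive refusal detector (assumption, not the paper's method)."""
    t = text.lower()
    return "sorry" in t or "cannot" in t or "can't" in t

# Answer rate: fraction of queries the model answers rather than refuses.
answer_rate = sum(not is_refusal(r) for r in responses) / len(responses)
```

With these four sample responses, two are refusals, giving an answer rate of 0.5.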

Code Generation: Longer outputs, but less directly executable code

Overall, the proportion of directly executable generations decreased from the March to the June versions. As shown in Figure 4 (a), over 50% of the code generated by GPT-4’s March version could be executed directly, while for the June version this dropped to only 10%. GPT-3.5 exhibited a similar trend. The verbosity of both models increased slightly.
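The paper judged executability by submitting generations to an online judge; a rough local stand-in for the idea, assuming Python output, might be:

```python
def directly_executable(code: str) -> bool:
    """Rough stand-in for the paper's metric: does the generated
    snippet run as-is, with no post-processing?"""
    try:
        exec(compile(code, "<llm-output>", "exec"), {})
        return True
    except Exception:
        return False
```

A plain snippet like `x = 1 + 1` passes, while the same code wrapped in markdown backticks fails to even compile.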

Figure 4: Code Generation: (a) Overall performance change. (b) An example query and corresponding responses. The March versions of both GPT-4 and GPT-3.5 followed the user’s instruction (“the code only”) and generated directly executable code. However, their June versions wrapped the code snippets in markdown triple backticks (```), making the output non-executable as-is.

Why did the proportion of directly executable results decrease? One possible explanation is that the June versions added extra non-code text to the generated results.

An example is provided in Figure 4 (b). The outputs of GPT-4’s March and June versions are mostly the same, with two differences. First, the June version wrapped the code snippet in a ```python markdown fence. Second, the June version generated some comments. Although these changes are minor, the added backticks make the code impossible to execute directly. This issue can be significant when LLM-generated code is integrated into a larger software development pipeline.
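A common workaround for this failure mode is to strip a surrounding markdown fence before executing the model’s output. This is a minimal sketch of that idea, not anything proposed in the paper:

```python
import re

def strip_code_fences(text: str) -> str:
    """If the text is wrapped in a markdown fence (```python ... ```),
    return only the code inside; otherwise return the text unchanged."""
    match = re.search(r"```(?:[a-zA-Z]+)?\n(.*?)```", text, re.DOTALL)
    return match.group(1) if match else text
```

For example, `strip_code_fences("```python\nprint(1)\n```")` returns `"print(1)\n"`, which can then be executed directly.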

Visual Reasoning: Slight improvement

As shown in Figure 5 (a), both GPT-4 and GPT-3.5 showed only a minor improvement. Their March and June versions produced identical results for 90% of the visual puzzle queries. Overall performance remained low: 27.4% for GPT-4 and 12.2% for GPT-3.5.

Figure 5: Visual Reasoning: (a) Overall performance. From the March to June versions, there was an approximately 2% improvement in the overall performance of both GPT-4 and GPT-3.5. The generation length remained roughly the same. (b) An example query and corresponding responses.

It should be noted that updated versions of LLMs do not always generate better results. Despite GPT-4’s overall improvement, its June version made mistakes on questions that the March version had answered correctly. Figure 5 (b) illustrates this: although the June version of GPT-4 performed better in general, in this case the March version produced the correct grid and the June version did not. This highlights the need for fine-grained monitoring of model performance, especially in critical applications.
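The fine-grained monitoring idea amounts to tracking per-query correctness across versions and flagging regressions, queries the older version got right and the newer one gets wrong. A sketch with invented data:

```python
# Per-query correctness of each version (hypothetical data).
march_correct = {"q1": True, "q2": False, "q3": True}
june_correct = {"q1": True, "q2": True, "q3": False}

# A query regressed if the March version got it right and June did not.
regressions = [q for q in march_correct
               if march_correct[q] and not june_correct[q]]
```

Here `regressions` is `["q3"]`: aggregate accuracy is unchanged (two out of three correct in both versions), yet one query silently broke, which is exactly what aggregate metrics hide.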

For more evaluation details, please refer to the original paper.
