The AI Report #5: GPT-4 aces MIT... or not
Hello and welcome to the fifth edition of the AI Report. We aim to keep you informed about the world of AI, from research to products to everyday life, and shed light on the latest trends. Please subscribe for actionable insights and share this newsletter on social media.
Trends in AI
The MIT Fiasco
Authors from MIT released a paper claiming that GPT-4 can ace MIT's EECS and Math exams. In fact, they claimed it can score 100%.
Except there were numerous issues with the paper. Apparently, they allowed the model to answer questions repeatedly until it got them right, and the evaluation was done by GPT-4 itself.
This Twitter thread lays out what happened pretty well.
Useful nuggets below:
The authors' evaluation uses GPT-4 to score itself, and continues to prompt over and over until the correct answer is reached. This is analogous to someone with the answer sheet telling the student if they've gotten the answer right until they do.
...
In our analysis of the few-shot prompts, we found significant leakage and duplication in the uploaded dataset, such that full answers were being provided directly to GPT 4 within the prompt for it to parrot out as its own.
The takeaway, for many on Twitter, has been that the research is sloppy. That's true, but it is useful to understand what could have caused this to happen, which ties into existing trends in the LLM space.
Prompt, prompt until you succeed
The ML community has produced A LOT of prompting techniques by now: chain-of-thought, tree-of-thoughts, critique (e.g. Reflexion), expert prompting, etc. (all of which the paper uses). Critique, for example, applies a prompt and then, when the answer is wrong, uses some form of reflection or self-feedback to improve it. The authors took this approach to the extreme: they do the prompting in a cascade. From the paper:
We apply few-shot, chain-of-thought, self-critique, and expert prompting as a cascade. Since grading is performed automatically, we apply each method to the questions that the previous methods do not solve perfectly.
Such cascading methods are fine, but of course you need to make sure your evaluation is correct.
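To make this concrete, here is a minimal sketch of such a cascade in Python. The `ask_llm` helper, the `grade` callback, and the prompt wording are illustrative assumptions on my part, not the paper's actual code; the point is the control flow, where we escalate to a heavier strategy only when the grader rejects the answer.

```python
def ask_llm(prompt: str) -> str:
    """Stub for a chat-model call; swap in a real API client here."""
    raise NotImplementedError

def solve_with_cascade(question: str, grade) -> str | None:
    """Escalate through prompting strategies until `grade` accepts an answer."""
    strategies = [
        question,                                          # zero-shot
        f"Q: {question}\nLet's think step by step.",       # chain-of-thought
        f"You are an MIT EECS professor.\nQ: {question}",  # expert prompting
    ]
    for prompt in strategies:
        answer = ask_llm(prompt)
        if grade(question, answer):
            return answer
        # self-critique: feed the rejected answer back for a revision
        revised = ask_llm(
            f"Q: {question}\nDraft answer: {answer}\n"
            "Critique the draft and provide a corrected answer."
        )
        if grade(question, revised):
            return revised
    return None  # every strategy failed
```

Whether this loop measures anything at all depends entirely on what `grade` is, which brings us to the next issue.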
Using GPT-4 to evaluate GPT-4
There has been a trend of using LLMs to evaluate other LLMs. This might be fine in many cases (e.g. if you are evaluating GPT-3 with GPT-4), but the authors ended up using the same LLM (GPT-4) to both solve a problem and grade it.
Note that this is different from many other works that use the same model to critique its own output, since those works still measure performance against some form of ground-truth data.
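For illustration, here is a reconstruction (mine, not the paper's code) of why this combination is circular: the same model answers and grades, and the loop keeps retrying until the grader says yes, so no ground truth is ever consulted.

```python
def flawed_accuracy(questions: list[str], ask_llm, max_attempts: int = 5) -> float:
    """Grade-until-correct with the model as its own judge (the circular setup)."""
    solved = 0
    for q in questions:
        for _ in range(max_attempts):
            answer = ask_llm(f"Q: {q}\nAnswer:")
            verdict = ask_llm(
                f"Q: {q}\nProposed answer: {answer}\n"
                "Is this answer correct? Reply yes or no."
            )
            if verdict.strip().lower().startswith("yes"):
                solved += 1  # counted as solved with no ground truth in sight
                break
    return solved / len(questions)
```

Any question the judge will eventually wave through counts as solved, so the reported score drifts toward 100% regardless of actual correctness.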
Information Leakage
The way they use few-shot prompting is that, for a given question, they find a similar question and feed that similar question and its answer to the LLM in the prompt. See the problem? In many cases, the "similar" question turns out to be identical, so the model has the solution right in its prompt.
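A cheap sanity check would have caught this. The sketch below (the function name and threshold are illustrative, not from the paper) drops few-shot candidates that nearly match the question under evaluation before they can leak the answer into the prompt:

```python
import difflib

def filter_leaky_examples(
    question: str,
    candidates: list[tuple[str, str]],  # (example_question, example_answer) pairs
    threshold: float = 0.9,
) -> list[tuple[str, str]]:
    """Keep only few-shot examples that are not near-duplicates of `question`."""
    safe = []
    for ex_q, ex_a in candidates:
        similarity = difflib.SequenceMatcher(
            None, question.lower(), ex_q.lower()
        ).ratio()
        if similarity < threshold:  # near-identical questions leak the answer
            safe.append((ex_q, ex_a))
    return safe
```

String overlap is a crude proxy (embedding similarity would be more robust), but even this would flag the identical questions described above.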
Parting Thoughts
The community did a swift job of figuring out these problems, thanks especially to the great analysis done by MIT undergrads, which you can read here.
But at the same time, some of the reactions have been less than professional. For what it is worth, the paper is useful in telling us what not to do. It exposes glaring issues with the following:
How we evaluate LLMs and their outputs
The perverse incentives in academia to publish
Papers getting shared around without being vetted, thanks to the current LLM hype cycle
Let’s just hope we can have civil discussions around such important issues in the future.
Papers in AI
📝 [Paper] Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale
Another impressive paper from Meta. It presents Voicebox, a non-autoregressive model for text-guided speech synthesis. Voicebox can do style transfer and some fancy things like text-based editing of speech (e.g. replacing one word in a recording with another).
Project page with video: here
Meta claims the models are too powerful to be open-sourced:
There are many exciting use cases for generative speech models, but because of the potential risks of misuse, we are not making the Voicebox model or code publicly available at this time. While we believe it is important to be open with the AI community and to share our research to advance the state of the art in AI, it’s also necessary to strike the right balance between openness with responsibility.
📝 [Paper] FinGPT: Open-Source Financial Large Language Models
The authors present an open-source equivalent to BloombergGPT.
This is great news; let's wait and see what applications this research brings.
I have a favor to ask you: please take a minute to share this on social media! Please also follow @the_ai_report on Twitter.
Also, help me understand what kind of content you find helpful.