
This ChatGPT prompt and its corresponding (incorrect) response were recently shared and re-posted on LinkedIn countless times, offered as solid proof that AGI is just not there yet. Further re-posts pointed out that re-arranging the prompt to “Which one is higher: 9.11 or 9.9?” guarantees a correct answer, which was taken as further evidence of the brittleness of LLMs.

After evaluating both prompts with a random group of ChatGPT users, we found that in both cases the answer is incorrect about 50% of the time. As some users correctly pointed out, the question is subtly ambiguous: are we referring to the mathematical inequality of two real numbers, to two dates (e.g. September 11 vs. September 9), or to two sub-sections of a document (e.g. chapter 9.11 vs. chapter 9.9)?

We decided to perform a more controlled experiment using the OpenAI API. This gives us full control over both the system prompt and the user prompt, and lets us take sampling uncertainty out of the equation as far as possible, e.g. by setting the temperature very low.

The final results are very interesting!

Our hypotheses can be stated as follows:

  • Given the same prompt, without any additional context, and with the temperature kept close to zero, we should nearly always obtain the same output, with stable log probabilities. While people refer to LLMs as “stochastic”, for a given input an LLM should always generate the same output; the “hallucinations”, or variance, come from the sampling mechanism outside of the LLM, which we can dampen significantly by setting a very low temperature value.
  • Based on our random user tests with ChatGPT, we would expect both the original prompt and the re-worded version to give an incorrect answer about 50% of the time; in other words, without further disambiguation or context, we wouldn’t expect one prompt to perform better than the other.

For our experiment design, we perform the following:

  • We conduct a number of experiments, starting with the original prompt, followed by a series of “interventions”
  • For each experiment/intervention, we execute 1,000 trials
  • We use OpenAI’s most advanced GPT-4o model
  • We set the temperature to 0.1 to essentially eliminate the randomness due to sampling; we experiment with both a random seed and a fixed seed
  • To gauge the “confidence” of the answer, we collect the log probability and calculate the linear probability of the answer in each trial; we plot the Kernel Density Estimate (KDE) of the linear probabilities across the 1,000 trials for each experiment

The full code for our experimental design is available here.
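To make the setup concrete, here is a minimal sketch of a single trial, assuming the OpenAI Python SDK (v1+) and an `OPENAI_API_KEY` in the environment; the helper name and the exact confidence calculation are our own simplifications of the linked code:

```python
import math
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def run_trial(user_prompt, system_prompt=None, seed=None):
    """Ask GPT-4o once and return its answer plus a linear 'confidence' value."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        temperature=0.1,  # dampen sampling randomness as far as possible
        seed=seed,        # None -> random seed; an integer -> fixed seed
        logprobs=True,    # return per-token log probabilities
    )

    choice = response.choices[0]
    answer = choice.message.content
    # Average the token log probabilities and exponentiate to get a value in (0, 1]
    token_logprobs = [t.logprob for t in choice.logprobs.content]
    confidence = math.exp(sum(token_logprobs) / len(token_logprobs))
    return answer, confidence

# 1,000 trials with the original user prompt
results = [run_trial("9.11 or 9.9 — which one is higher?") for _ in range(1_000)]
```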

The user prompt is set to “9.11 or 9.9 — which one is higher?”.

In line with what social media users have reported, GPT-4o gives the correct answer 55% of the time ☹️. The model is also not very certain: across a large number of trials, its “confidence” in the answer is only ~80%.

Figure 1 — Smoothed histogram (KDE) of confidence values (0–100%) across 1,000 trials, when the original user prompt is used; image by the author
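The smoothed histograms in Figures 1 to 4 are standard KDE plots; a minimal sketch, assuming the `results` list from the trial loop above and seaborn/matplotlib installed:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Per-trial confidence values, expressed as percentages
confidences = [confidence * 100 for _, confidence in results]

sns.kdeplot(confidences, fill=True)
plt.xlabel("Confidence (%)")
plt.ylabel("Density")
plt.show()
```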

In the re-worded user prompt, no additional context/disambiguation is provided, but the wording is slightly changed to: “Which one is higher, 9.11 or 9.9?”

Amazingly, and contrary to our ChatGPT user tests, the correct answer is reached 100% of the time across 1,000 trials. Furthermore, the model exhibits very high confidence in its answer 🤔.

Figure 2 — Smoothed histogram (KDE) of confidence values (0–100%) across 1,000 trials, when the original user prompt is slightly re-worded; image by the author

There has been significant work recently on inducing improved “reasoning” capabilities in LLMs, with chain-of-thought (CoT) prompting being the most popular technique. Huang et al. have published a very comprehensive survey of LLM reasoning capabilities.

As such, we modify the original user prompt by also telling the LLM to explain its reasoning. Interestingly enough, the probability of a correct answer improves to 62%; however, the answers come with even greater uncertainty.
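In code, this intervention only touches the user message; with the helper sketched earlier (the exact wording of the reasoning instruction is an assumption):

```python
# Chain-of-thought nudge appended to the user prompt
answer, confidence = run_trial(
    "9.11 or 9.9 — which one is higher? Explain your reasoning."
)
```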

Figure 3 — Smoothed histogram (KDE) of confidence values (0–100%) across 1,000 trials, when the original user prompt is modified to also “explain its reasoning”; image by the author

The final experiment mirrors the previous one, except that we instead bootstrap the system prompt by telling the LLM to “explain its reasoning”. Incredibly, we now see the correct answer 100% of the time, with very high confidence. We see identical results if we use the re-worded user prompt as well.
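With the same helper, the only change is where the instruction lives; the system prompt wording below is an assumption, not the exact string from the linked code:

```python
# Reasoning instruction moved into the system prompt;
# the user prompt is back to the original question
answer, confidence = run_trial(
    "9.11 or 9.9 — which one is higher?",
    system_prompt="You are a helpful assistant. Explain your reasoning.",
)
```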

Figure 4 — Smoothed histogram (KDE) of confidence values (0–100%) across 1,000 trials, with the original user prompt and the system prompt amended with instructions to “explain its reasoning”; image by the author

What started off as a simple experiment to validate some of the statements seen on social media ended up producing some very interesting findings. Let’s summarize the key takeaways:

  • For an identical prompt, with the temperature set very low (essentially eliminating sampling uncertainty) and a fixed seed value, we still see very large variance in log probabilities. A slight variance could be explained by hardware precision, but variance this large is very difficult to explain. It suggests that either (1) the sampling mechanism is a lot more complicated than we assume, or (2) there are additional layers/models upstream that are beyond our control.
  • In line with previous literature, simply instructing the LLM to “explain its reasoning” improves its performance.
  • The system prompt and the user prompt are clearly handled differently. Bootstrapping a role in the system prompt, as opposed to the user prompt, seems to result in significantly better performance.
  • We can clearly see how brittle prompts can be. The key takeaway here is that we should always aim to provide disambiguation and clear context in our prompts.
