DeepMind published a series of papers about large language models (LLMs) last year, including an analysis of Gopher, our large language model. Language modelling technology, which is also being developed by several other labs and companies, promises to strengthen many applications, from search engines to a new wave of chatbot-like conversational assistants and beyond. One paper in this series laid out a number of reasons why “raw” language models like Gopher do not meet our standards for safely deploying this technology in user-facing applications, especially if guard rails for managing problematic and potentially harmful behaviour are not in place.

Our latest work focuses on one of these concerns: language models like Gopher can “hallucinate” facts that appear plausible but are actually false. Those who are familiar with this problem know to do their own fact-checking, rather than trusting what language models say. Those who are not may end up believing something that isn’t true. This paper describes GopherCite, a model which aims to address the problem of language model hallucination. GopherCite attempts to back up all of its factual claims with evidence from the web. It uses Google Search to find relevant web pages and quotes a passage which tries to demonstrate why its response is correct. If the system is unable to form an answer that can be well-supported by evidence, it tells the user, “I don’t know”, instead of providing an unsubstantiated answer.
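As a rough illustration of the "answer with a quote, or say I don't know" behaviour described above, here is a minimal sketch. It is not GopherCite's actual implementation: the `Candidate` type, the `score` field (a stand-in for some measure of how well the quote supports the answer), the `threshold` parameter, and the example URLs are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    answer: str        # the claim the system would make
    quote: str         # verbatim passage taken from a retrieved page
    source_url: str    # where the quote came from
    score: float       # hypothetical support score; higher means better supported


def answer_or_abstain(candidates: List[Candidate], threshold: float) -> str:
    """Return the best-supported answer with its quote, or abstain."""
    if not candidates:
        return "I don't know"
    best = max(candidates, key=lambda c: c.score)
    # Abstain rather than give an answer the evidence does not support well.
    if best.score < threshold:
        return "I don't know"
    return f'{best.answer}\nEvidence: "{best.quote}" ({best.source_url})'


# Made-up candidates for illustration only.
candidates = [
    Candidate("Twice, in 1932 and 1980.",
              "Lake Placid hosted the Winter Olympics in 1932 and 1980.",
              "https://example.com/lake-placid", 0.92),
    Candidate("Three times.", "Lake Placid is a village in New York.",
              "https://example.com/village", 0.15),
]
reply = answer_or_abstain(candidates, threshold=0.5)
```

Raising `threshold` makes the sketch abstain more often, which trades coverage for reliability on the questions it does answer.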

Supporting simple factual claims with easily verifiable evidence is one step towards making language models more trustworthy, both for users interacting with them and for annotators assessing the quality of samples. A comparison between the behaviour of “raw” Gopher and our new model is helpful for illustrating this change.

Comparing the two responses, you’ll notice that Gopher invented a fact (“Lake Placid hosted the winter Olympics in 1936”) without warning. GopherCite instead shows a verified snippet from a relevant Wikipedia page, from which we can confirm that Lake Placid has hosted the Olympics only twice, in 1932 and 1980.

To alter Gopher’s behaviour in this way, we trained Gopher according to human preferences. We asked participants in a user study to pick their preferred answer from a pair of candidates, according to criteria including how well the evidence supports the answers given. These labels were used as training data for both supervised learning on highly rated samples and for reinforcement learning from human preferences (RLHP). We also took this approach in our recent work on red teaming.
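To make the idea of learning from pairwise preferences concrete, here is a minimal sketch of a Bradley-Terry style loss, a common way to turn "annotator preferred answer A over answer B" labels into a training signal for a scoring model. This is an assumption for illustration; the post does not specify the exact objective, and the function name and scalar scores are hypothetical.

```python
import math


def preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Negative log-probability that the preferred answer wins.

    Under a Bradley-Terry model, P(preferred wins) = sigmoid(margin),
    where margin = score_preferred - score_rejected.
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(margin)) = log(1 + exp(-margin)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss is small when the model already ranks the preferred answer higher,
# and large when it ranks the pair the wrong way round.
loss_agree = preference_loss(2.0, 0.0)
loss_disagree = preference_loss(0.0, 2.0)
```

Minimising this loss over many labelled pairs pushes the scoring model to rank answers the way annotators did; such a model can then serve as the reward signal during reinforcement learning.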

We are not the only ones interested in this problem of factual inaccuracy in language models. Our colleagues at Google recently made progress on factual grounding in their latest LaMDA system, having a conversational model interact with Google Search and sometimes share relevant URLs. Indeed, GopherCite’s training regimen uses similar methodology to that of LaMDA, but a critical difference is that we aim to provide a specific snippet of relevant evidence, rather than simply pointing the user to a URL. Based on motivations similar to our own, OpenAI has recently announced work developing a closely related system called WebGPT, which also applies RLHP to align their GPT-3 language model. Whereas GopherCite focuses on reading long document inputs, WebGPT carefully curates the context presented to the language model by interacting multiple times with a web browser. Like GopherCite, WebGPT cites evidence to back up its responses. Similarities and differences between these systems and our own are discussed in our paper, where we also demonstrate that GopherCite very often provides compelling evidence for its claims.

We conducted a user study with paid participants to assess the model on two types of questions: fact-seeking questions typed into Google Search (released by Google in a dataset called “NaturalQuestions”), and explanation-seeking questions which Reddit users asked on a forum called “/r/eli5” (“Explain it Like I’m 5 [years old]”). The participants in our study determined that GopherCite answers fact-seeking questions correctly – and with satisfactory evidence – about 80% of the time, and does so for explanation-seeking questions about 67% of the time. When we allow GopherCite to refrain from answering some questions, its performance improves dramatically amongst the questions it does choose to answer (see the paper for details). This explicit mechanism for abstaining is a core contribution of our work.
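The effect of abstaining can be sketched as follows: if each answer comes with some confidence score, declining to answer below a cutoff lowers coverage (the share of questions attempted) but can raise accuracy on the questions that are attempted. The function name, the scores, and the data below are made up for illustration and are not results from the paper.

```python
from typing import List, Tuple


def selective_accuracy(
    results: List[Tuple[float, bool]], threshold: float
) -> Tuple[float, float]:
    """Return (coverage, accuracy on attempted questions).

    Each result is (confidence score, whether raters judged the answer
    correct and well supported). Questions scoring below `threshold`
    are treated as abstentions.
    """
    attempted = [ok for score, ok in results if score >= threshold]
    coverage = len(attempted) / len(results) if results else 0.0
    accuracy = sum(attempted) / len(attempted) if attempted else 0.0
    return coverage, accuracy


# Hypothetical ratings: (score, judged correct with satisfactory evidence).
results = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.2, False)]

answer_everything = selective_accuracy(results, threshold=0.0)
answer_confident = selective_accuracy(results, threshold=0.7)
```

In this toy data, answering everything gives 60% accuracy at full coverage, while answering only the two highest-scoring questions gives 100% accuracy at 40% coverage.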

But when we evaluate the model on a set of “adversarial” questions, which attempt to trick the model into parroting a fiction or misconception that is stated on the internet, GopherCite often falls into the trap. For instance, when asked “what does Red Bull give you?”, here is how it responds:

We think this failure mode and others discussed in our paper can be avoided by enriching the setting, moving from a “single-shot” reply to a user’s question, to one in which the model can ask clarifying questions of the user and engage in a dialogue. For example, we could enable future models to ask the user whether they want an answer that is literally true or one that is true in the confines of the fictional world of a Red Bull advertisement.

In summary, we think GopherCite is an important step forward, but building it has taught us that evidence citation is only one part of an overall strategy for safety and trustworthiness. More fundamentally, not all claims require quoted evidence, and as we demonstrated above, not all claims supported by evidence are true. Some claims require multiple pieces of evidence along with a logical argument explaining why the claim follows. We will continue working in this area and aim to overcome the issues presented here through further research and development, as well as dedicated sociotechnical research.

Our paper covers many more details about our methods, experiments, and relevant context from the research literature. We have also created an FAQ about GopherCite, answered by the model itself after reading the paper’s introduction (using candidate samples curated by the authors):
