Sheejith's Personal Site

OpenAI Builds AI to Critique AI

One of the biggest problems with the large language models that power chatbots like ChatGPT is that you never know when you can trust them. They can generate clear and cogent prose in response to any question, and much of the information they provide is accurate and useful. But they also hallucinate (in less polite terms, they make stuff up), and those hallucinations are presented in the same clear and cogent prose, leaving it up to the human user to detect the errors. They're also sycophantic, trying to tell users what they want to hear. You can test this by asking ChatGPT to describe things that never happened (for example: “describe the Sesame Street episode with Elon Musk,” or “tell me about the zebra in the novel Middlemarch”) and checking out its utterly plausible responses.

OpenAI’s latest small step toward addressing this issue comes in the form of an upstream tool that would help the humans training the model guide it toward truth and accuracy. Today, the company put out a blog post and a preprint paper describing the effort. This type of research falls into the category of “alignment” work, as researchers are trying to make the goals of AI systems align with those of humans.

The new work focuses on reinforcement learning from human feedback (RLHF), a technique that has become hugely important for taking a basic language model and fine-tuning it, making it suitable for public release. With RLHF, human trainers evaluate a variety of outputs from a language model, all generated in response to the same question, and indicate which response is best. When done at scale, this technique has helped create models that are more accurate, less racist, more polite, less inclined to dish out a recipe for a bioweapon, and so on.
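The RLHF pipeline described above typically starts by training a reward model on those human preference judgments: given two responses to the same question, the model should assign a higher score to the one the trainer preferred. A common formulation is the pairwise (Bradley-Terry) logistic loss. The sketch below is a toy illustration of that idea, assuming a hypothetical linear reward model over made-up numeric features; a real reward model would be a neural network scoring full text.

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Pairwise logistic (Bradley-Terry) loss used in RLHF reward modelling:
    low when the human-preferred response scores higher than the rejected one."""
    return -math.log(1 / (1 + math.exp(-(r_chosen - r_rejected))))

def score(weights, features):
    """Toy linear reward model: score is a weighted sum of response features."""
    return sum(w * f for w, f in zip(weights, features))

def train_step(weights, feats_chosen, feats_rejected, lr=0.1):
    """One gradient step on the pairwise loss for the linear reward model."""
    margin = score(weights, feats_chosen) - score(weights, feats_rejected)
    # d(loss)/d(margin) = -sigmoid(-margin)
    grad_coeff = -1 / (1 + math.exp(margin))
    return [w - lr * grad_coeff * (fc - fr)
            for w, fc, fr in zip(weights, feats_chosen, feats_rejected)]

# Hypothetical feature vectors for two candidate answers to one question.
chosen = [1.0, 0.2]    # the answer the human trainer preferred
rejected = [0.3, 0.9]  # the answer the trainer rejected

weights = [0.0, 0.0]
for _ in range(100):
    weights = train_step(weights, chosen, rejected)

# After training, the preferred answer should receive the higher reward score.
print(score(weights, chosen) > score(weights, rejected))
```

Once trained at scale on many such comparisons, the reward model stands in for the human trainers, providing the reinforcement signal used to fine-tune the language model itself.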

Posted on: 6/29/2024 5:47:53 AM
