How to make AI models more accurate: Embrace failure
Fine-tuning on negative data produced a model well suited for chemistry, where successful experiments are rare. Failed experiments make chemical language models more accurate.
A negative result doesn’t have to be a bad thing. In fact, a failed experiment can be as informative as a successful one. Whether a chemistry experiment yields an unexpected product or no product at all, it offers new insights into the conditions that successful reactions require. IBM Research scientists have now shown the same holds for AI models: a language model they trained to predict chemical reactions was more accurate when tuned on a mix of successful and unsuccessful experiments than when tuned on successful experiments alone.
The scientific community is embracing negative findings more than it was even a decade ago, and the transformer revolution is spotlighting a use for some of that data. A new paper by the IBM Research team, published today in Science Advances,1 describes a transformer-based language model trained on data from “successful” experiments and then fine-tuned in a reinforcement-learning framework that leverages data from at least 40 times as many “unsuccessful” or “negative” experiments.
“You can think of chemistry as having a grammar and syntactic rules,” said IBM Research scientist Mara Graziani, principal investigator on the new study. Language models can learn those rules, she said, and once trained to understand them, they can perform operations based on them.
The researchers used two classes of unsuccessful experiments in this study: those that yielded an unexpected but chemically relevant product, and those that yielded no significant product at all. Graziani and her colleagues believe both types of negative reactions contain information that can be leveraged to better understand the domain-specific language of chemistry.
“It’s easier to learn a language by trial and error, instead of just by repeating correct sentences,” she said. In the domain of chemistry, as with linguistic errors, these negatives aren’t random — failure is informative because each attempt is based on background knowledge and an educated hypothesis. Both classes of negative reaction give feedback on how to build an eventual successful experiment.
Graziani and her colleagues built on IBM’s pioneering work in applying transformer-based language models to chemical language processing, training their own model on chemical reactions extracted from United States Patent and Trademark Office (USPTO) patents. Its language-modeling core uses the same transformer backbone that has since been scaled up to power state-of-the-art large language models, including IBM’s Granite series. They fine-tuned the model on two chemistry datasets: one with more than 500 well-characterized electrophilic aromatic substitution reactions but no negative data, and one with real-world results that included negative data. Negative data could easily be generated for the first set, though, because that reaction type has only a few possible unexpected products.
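Reactions in datasets like USPTO are typically written as SMILES strings, which a transformer treats just like sentences in a language. A minimal sketch of the regex-based tokenization introduced in the Molecular Transformer line of work (the reaction string below is an illustrative Friedel-Crafts acylation, not a sample from the paper’s dataset):

```python
import re

# Regex tokenizer for SMILES, following the scheme from the Molecular
# Transformer work: multi-character atoms ("Cl", "Br"), bracket atoms,
# ring-closure digits, and bond symbols each become a single token.
SMILES_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def tokenize(reaction_smiles: str) -> list[str]:
    """Split a reaction SMILES string into model-ready tokens."""
    return SMILES_REGEX.findall(reaction_smiles)

# Illustrative electrophilic aromatic substitution (acetyl chloride + benzene):
rxn = "CC(=O)Cl.c1ccccc1>>CC(=O)c1ccccc1"
tokens = tokenize(rxn)
# Tokenization is lossless: joining the tokens recovers the original string.
assert "".join(tokens) == rxn
```

From the model’s perspective, predicting the product side of such a string is no different from completing a sentence whose grammar it has learned.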
“Since long before today’s surge in large language models, at IBM we have been pioneers in the use of language models (transformer architectures) for scientific applications,” IBM Research scientist Teodoro Laino added. “Our 2019 paper, one of the first to apply language models in science, became the springboard for the reinforcement learning techniques we now use to extract insights even from unsuccessful experiments.”2
Fine-tuning was meticulous work for the research group, who crafted reward functions to support reinforcement learning from human feedback (RLHF). This approach is common in machine vision and natural language processing, but not in chemistry. The key was building a reward function that made sense for the domain. The friction in working with many negative samples and very few positives is that rewards come from a very sparse space, which is a hard setting for reinforcement learning.
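To see why the reward space is so sparse, consider a toy sketch (a hypothetical reward shape for illustration, not the paper’s actual reward function): a generated reaction earns a positive reward only if it matches one of the few known successes, a small penalty if it matches a known failure, and no signal at all otherwise.

```python
# Hypothetical sparse reward for generated reactions: the positive set is
# tiny, the negative set is large, and everything else returns nothing.
def reaction_reward(predicted: str, positives: set[str], negatives: set[str]) -> float:
    if predicted in positives:
        return 1.0    # rare: few successful reactions exist
    if predicted in negatives:
        return -0.1   # plentiful: most recorded lab outcomes are failures
    return 0.0        # no learning signal for the vast remainder

# Toy data (illustrative reaction strings, not from the paper's dataset):
positives = {"CC(=O)Cl.c1ccccc1>>CC(=O)c1ccccc1"}
negatives = {"CC(=O)Cl.c1ccccc1>>CC(O)c1ccccc1"}
```

With most of the space returning zero or a weak penalty, the policy sees almost no gradient toward success, which is the sparsity problem the researchers had to design around.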
In typical RLHF settings, there is plenty of positive data to teach a model to identify and predict desirable or undesirable outcomes. Not so in the chemistry lab, where breakthroughs are rare amid all the misses. Success came down to using vector representations of chemical reactions in a latent space, originally optimized for token prediction and then further tuned to discriminate the positives among a large number of negative outcomes.
Re-encoding the latent space to embed successful reactions closer to one another made it possible to classify positives against negatives. In this new representational space, finding an optimal margin turns the task into a simple boundary separation between successful and unsuccessful reactions.
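The idea can be sketched with toy vectors (illustrative embeddings and threshold, not the paper’s learned representation): once successful reactions cluster tightly in the re-encoded space, a simple distance-to-centroid test separates positives from negatives.

```python
import numpy as np

# Toy stand-ins for re-encoded reaction embeddings: positives cluster
# near one another, negatives are scattered elsewhere.
positive_embeddings = np.array([[0.9, 1.0], [1.1, 0.9], [1.0, 1.1]])
negative_embeddings = np.array([[-0.8, 0.2], [0.1, -1.0], [-1.1, -0.9]])

centroid = positive_embeddings.mean(axis=0)  # center of the positive cluster

def looks_successful(embedding: np.ndarray, margin: float = 0.5) -> bool:
    """Boundary test: does the embedding fall within the margin of the positive cluster?"""
    return bool(np.linalg.norm(embedding - centroid) < margin)

# All positives fall inside the boundary, all negatives outside.
assert all(looks_successful(e) for e in positive_embeddings)
assert not any(looks_successful(e) for e in negative_embeddings)
```

In the real system the embeddings come from the tuned language model rather than hand-picked coordinates, but the classification step reduces to the same kind of boundary check.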
“Eventually this was the trick,” Graziani said. During training, as the generative model predicted new reactions unseen in the data, the fine-tuned model could determine where each prediction should be embedded, and that position informed the guess of whether the predicted reaction would have worked.
Compared to a model trained on USPTO data but not fine-tuned on negative data, the fine-tuned experimental model performed over 10% better at predicting successful reactions. Because all possible positives had been described in the test dataset, it was clear when the model predicted a successful reaction.
One challenge with this line of research, said Graziani, is not persuading chemists of the value of negative reaction data (numerous perspective articles already champion its importance) but the scarcity of venues willing to publish it. Aside from a few publicly curated datasets, today’s publishing ecosystem offers few outlets for reporting failed experiments, so models remain starved of the data they need.
This problem is rooted in the incentives of academic publishing and career progression, which prize monumental individual contributions over methodical team efforts. The 2015 reproducibility crisis in the social sciences cast new light on the importance of replications in the scientific literature, but it is still rare to see a null result splashed across the front page of an academic journal. This is the landscape in which the new model shines.
“In experimental cases where you might have an abundance of negatives but very few positive samples,” said Graziani, “our approach wins in the sense that it unlocks the learning mechanisms that would otherwise remain stuck in simplified tuning.”
References

1. Negative Chemical Data Boosts Language Models in Reaction Outcome Prediction, Science Advances, 2025
2. Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction, ACS Central Science, 2019