
Lightweight tools for ‘steering’ LLMs down the right path

IBM’s new AI Steerability 360 toolkit lets you build and test pipelines for customizing LLM generation.

The more sophisticated large language models become, the harder they can be to control. Newer models have guardrails to limit the worst kinds of behavior, but guardrails are rarely enough to guarantee that LLMs will do what you want them to do all the time. For enterprise users, that’s a nonstarter.

IBM’s new AI Steerability 360 toolkit is designed to put users in the driver’s seat. The toolkit provides algorithms that can be turned like dials to tweak an LLM’s outputs at different stages in the generative process. It also provides a framework for combining these algorithms into steering ‘pipelines’ that can be systematically evaluated on a given task.

AISteer360 is the third set of technologies developed by IBM Research to help software developers and researchers map, measure, and manage the risks associated with generative AI. IBM’s AI Risk Atlas Nexus, released this summer, maps existing and emerging AI threats, while its In-Context Explainability 360 toolkit helps to expose the ‘thinking’ that goes into generated outputs. AISteer360 rounds out the trio by giving users tools to steer LLMs away from unwanted, and potentially unsafe, behavior before they output a word.

Alignment used to be the catch-all term for the adjustments made to LLMs after pre-training to improve their safety and performance. Alignment is still the goal, but today hundreds of lightweight, low-cost ‘steering’ methods have emerged as alternatives to the time and expense of fine-tuning the model with new data and updating its weights.

AISteer360 offers a new approach to model steering. It organizes leading algorithms into four categories and provides a way to combine them, like snapping together LEGO pieces, to customize LLMs in several ways at once.

“Many users want broad control over how the model answers, but also more nuanced control over the style and formatting of its writing,” said Erik Miehling, a staff research scientist at IBM who led the team behind AISteer360. “The toolkit’s steering pipeline makes this easy. You can define a topic-level content filter through activation steering and make style changes through reward-driven decoding — all in a single operation on the model.”

Pressure points

The algorithms implemented in AISteer360 target four key points in the generative workflow: the prompt, which defines what goes into the model; the model’s weights, which encode its learned behaviors; the model’s internal states, which shape how it processes information at runtime; and the decoding stage, which governs how outputs are selected and expressed.

“Depending on what you want to do, you have four different pressure points you can push on to nudge the model in the direction you want it to go,” said Pierre Dognin, an IBM researcher who helped build AISteer360.

Of the four steering methods, prompting is perhaps the best known. You can change the model’s response simply by rewording a question, providing a few examples, or asking the model to role play. The toolkit contains a “few-shot” prompting algorithm for input control.
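The idea behind few-shot prompting can be sketched in a few lines: worked examples are prepended to the query so the model imitates the demonstrated pattern. (This is an illustrative sketch, not the toolkit’s actual API.)

```python
# Minimal sketch of few-shot prompting: prepend worked (input, output)
# demonstrations to the query so the model imitates the pattern.
# Illustrative only; not the AISteer360 interface.

def build_few_shot_prompt(examples, query):
    """Format (question, answer) pairs as demonstrations, then append the query."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\nQ: {query}\nA:"

examples = [
    ("Translate 'bonjour' to English.", "hello"),
    ("Translate 'merci' to English.", "thank you"),
]
prompt = build_few_shot_prompt(examples, "Translate 'au revoir' to English.")
print(prompt)
```

The resulting string ends with an open `A:`, inviting the model to complete the pattern the demonstrations established.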

“You’re showing the model what to do without necessarily explaining why it’s desirable,” said Miehling.

There’s only so much you can accomplish, however, by fiddling with prompts. More lasting changes can be made by altering the model’s weights, or structure.

This can be done by fine-tuning the model on curated data that captures a desired skill or behavior, or through direct preference optimization (DPO) and related methods, which reinforce desired behavior by having people, and even AI itself, express their preferences for specific model outputs. Specialized models can also be merged into one without retraining from scratch.

These types of structural adjustments make up AISteer360’s second class of controls, targeting the model’s weights. For easy integration, wrappers are provided for Hugging Face’s Transformer Reinforcement Learning (TRL) library and Arcee AI’s MergeKit. The wrappers also make it easy to bring in other methods, including IBM’s lightweight aLoRA adapters, available through Hugging Face’s PEFT library.

A third class of controls targets the model’s internal states. Through activation steering, a latent numerical representation of a desired behavior can be extracted from the model and used to change its hidden state, which is a kind of record of everything the model has processed so far. By manipulating the model’s hidden state, the model can be steered to write in a specific style or avoid certain topics.
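A common recipe for extracting such a latent representation is to take the difference between mean hidden activations on examples that exhibit a behavior and examples that don’t, then add a scaled copy of that vector to the hidden state at inference. The toy sketch below illustrates the arithmetic with random data; it is not the toolkit’s implementation.

```python
import numpy as np

# Toy sketch of activation steering (not the toolkit's implementation):
# the steering vector is the difference between mean hidden activations
# recorded on behavior-positive and behavior-negative prompts. Adding a
# scaled copy to a hidden state nudges it along the behavior direction.

rng = np.random.default_rng(0)
d = 8  # hidden dimension (toy size)

# Stand-ins for hidden states recorded at one layer for two prompt sets.
acts_with_behavior = rng.normal(loc=1.0, size=(16, d))
acts_without = rng.normal(loc=0.0, size=(16, d))

steering_vector = acts_with_behavior.mean(axis=0) - acts_without.mean(axis=0)

def steer(hidden_state, vector, strength=4.0):
    """Shift a hidden state along the steering direction."""
    return hidden_state + strength * vector

h = rng.normal(size=d)
h_steered = steer(h, steering_vector)

# The steered state projects more strongly onto the behavior direction.
unit = steering_vector / np.linalg.norm(steering_vector)
print(h @ unit, h_steered @ unit)
```

In practice the activations come from hooks on a real transformer layer, and the strength is tuned so the intervention changes style or topic without degrading fluency.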

Researchers included two state-control methods in the toolkit: PASTA, which reweights the model’s attention at inference time so that it follows user-defined rules, and CAST, an IBM solution that introduces a condition vector for more selective, nuanced outputs.

Using CAST, you can tell the model to decline a prompt if it contains hate speech or adult content, or if it hits on a topic other than the one you’ve specified. “LLMs see so much data during pre-training that you can actually extract what you're looking for in the latent space itself,” said Inkit Padhi, an IBM researcher who helped develop CAST with then-IBM intern Bruce W. Lee, who is now a senior at the University of Pennsylvania.


A fourth set of controls can be applied during decoding to directly influence what comes out of the model. Researchers included four methods in the toolkit: reward-augmented decoding (RAD), decoding-time alignment (DeAL), Thinking Intervention, which edits the model’s “thoughts” mid-generation to produce safer or more aligned outputs, and IBM’s own solution, self-disciplined autoregressive sampling (SASA), which steers the model’s output, token-by-token, away from toxic language or other unwanted outputs. This self-censoring mechanism could be useful in enterprise settings where maintaining a safe, polite tone is essential.
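The general pattern behind reward-guided decoding methods like RAD and SASA (sketched below in simplified form, not their exact algorithms) is to rescore candidate tokens at each step by combining the model’s log-probability with a reward that penalizes unwanted continuations, then pick the best-scoring token.

```python
import math

# Sketch of the general pattern behind reward-guided decoding (simplified;
# not the exact RAD or SASA algorithms): candidate tokens are rescored by
# adding a weighted reward to the model's log-probability, steering the
# choice away from unwanted continuations token by token.

def rescore(logprobs, reward_fn, context, beta=1.0):
    """Combine model log-probs with a per-candidate reward; higher is better."""
    return {
        tok: lp + beta * reward_fn(context + [tok])
        for tok, lp in logprobs.items()
    }

# Toy step: the model slightly prefers a hostile token, but the reward
# (e.g. a negated toxicity score) pushes decoding toward a neutral one.
logprobs = {"awful": math.log(0.5), "unhelpful": math.log(0.3), "fine": math.log(0.2)}

def reward_fn(seq):
    return -2.0 if seq[-1] == "awful" else 0.0

scores = rescore(logprobs, reward_fn, context=["that", "was"])
best = max(scores, key=scores.get)
print(best)
```

Here the penalty outweighs the model’s preference, so the neutral token “unhelpful” wins over the hostile “awful”; the weight `beta` trades off fluency against the steering objective.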

Comparing steering controls

The steering taxonomy created for AISteer360 — input, structure, state, and output controls — is the foundation for the toolkit’s build-your-own pipeline capability. It allows researchers and developers to assemble multiple steering methods in one workflow as they would in building a classical machine learning pipeline.

“Instead of transforming data, the steering pipeline transforms generation by injecting guidance at different points in the process,” said Miehling. “This modular approach lets you compare how different steering strategies interact and shape the final output.”
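Conceptually, such a pipeline chains controls around a single generation call: input controls transform the prompt before it reaches the model, and output controls transform the decoded text afterward. The class names and interfaces below are invented for illustration; the toolkit’s own abstractions are documented separately.

```python
# Hypothetical sketch of composing steering controls into a pipeline.
# All names here are invented for illustration; they are not the
# AISteer360 interfaces. Input controls transform the prompt; output
# controls transform the decoded text.

class SteeringPipeline:
    def __init__(self, input_controls=None, output_controls=None):
        self.input_controls = input_controls or []
        self.output_controls = output_controls or []

    def generate(self, model_fn, prompt):
        for control in self.input_controls:    # e.g. few-shot formatting
            prompt = control(prompt)
        text = model_fn(prompt)                # the (stub) model call
        for control in self.output_controls:   # e.g. post-hoc cleanup
            text = control(text)
        return text

# Toy usage with stub controls and a stub "model".
pipeline = SteeringPipeline(
    input_controls=[lambda p: f"Answer politely.\n{p}"],
    output_controls=[str.strip],
)
result = pipeline.generate(lambda p: f"  echo: {p}  ", "What is 2+2?")
print(result)
```

State- and weight-level controls would hook into the model call itself rather than wrap it, which is why a real pipeline needs deeper integration than this stub shows.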

AISteer360 also has a benchmarking capability that gives users a way to compare their steering method to other methods on a common task.

To show what’s possible, the team created two benchmarks, one for common-sense question answering and the other for how well a model follows instructions. Performance on the benchmarks is measured by criteria that users define, whether it’s factual accuracy or how reliably the model follows instructions. This helps people understand how different methods influence different behaviors, including those not explicitly targeted.

“If a steering method helps to detoxify model behavior but reduces its fluency, users should be aware of the side-effect,” said Miehling.

Evaluation works like this: because the same steering data can be used with different controls, you can simply apply it in both cases and compare the results. The examples given to the model in a few-shot prompt can also be recycled as a DPO preference dataset, providing an apples-to-apples comparison. “This lets you see how much ‘juice’ a given method can squeeze out of a dataset,” said Miehling.
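Recycling few-shot demonstrations as preference data amounts to a simple format conversion: each demonstrated answer becomes the “chosen” response, paired with a “rejected” alternative (for instance, an unsteered model output). The sketch below uses the common prompt/chosen/rejected convention for DPO-style datasets; the helper names are illustrative.

```python
# Sketch of recycling few-shot demonstrations as DPO-style preference data.
# The record format follows the common prompt/chosen/rejected convention;
# helper names here are illustrative, and "rejected" answers would in
# practice come from unsteered model outputs.

few_shot_examples = [
    {"question": "Capital of France?", "answer": "Paris"},
    {"question": "Capital of Japan?", "answer": "Tokyo"},
]

def to_preference_records(examples, rejected_fn):
    """Turn demonstrations into (prompt, chosen, rejected) triples."""
    return [
        {
            "prompt": ex["question"],
            "chosen": ex["answer"],        # the demonstrated answer
            "rejected": rejected_fn(ex),   # e.g. a base-model answer
        }
        for ex in examples
    ]

records = to_preference_records(few_shot_examples, lambda ex: "I don't know.")
print(records[0])
```

Because the same underlying examples drive both the few-shot prompt and the preference dataset, differences in downstream behavior can be attributed to the steering method rather than the data.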

The team is continuing to build out the toolkit and is inviting the open-source community to contribute its own steering methods and tasks. As more enterprises incorporate generative AI into mission-critical tasks, the need to control model outputs has never been greater.

The team’s comprehensive documentation page provides an overview of steering concepts as well as guided tutorials for building steering controls and pipelines for anyone looking to implement them today.
