Gaetano Rossiello, Shankar Subramaniam
ACM CAIS 2026
If multiple artists are asked to draw a circle by hand, each one will produce something slightly imperfect. Yet the average of their sketches can look strikingly close to the ideal. We investigate whether knowledge from different models can be combined in the same way. We propose to average models based on their kernel: the matrix of pairwise dot products between a model's embeddings of the data samples. Compared to techniques such as weight-averaging, this has the advantage of allowing merging between models that have been trained separately, i.e., from different initializations. We take models that have been trained on disjoint, skewed sets of data and show that simple averaging produces a kernel that moves representationally closer to that of a more accurate model. Empirically, we even find the similarity landscape with respect to the teacher kernels to be convex. We then use a differentiable version of Mutual k-Nearest Neighbors (MKNN) to directly optimize a student network for representational similarity with the average kernel. We find that this provides consistent gains in performance. These findings open the door to a new type of model-merging that does not rely on weight-averaging, and is thus able to accommodate models that are trained from scratch independently. Going further, they hint at a more general framing for model-merging techniques, in which models can be thought of as lying in the same loss basin with respect to their representations.
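To make the kernel-averaging idea concrete, here is a minimal, dependency-free sketch rather than the authors' implementation. It builds each model's kernel from its embeddings, averages the kernels element-wise, and scores two kernels with the standard hard (non-differentiable) mutual k-nearest-neighbor overlap, not the differentiable variant used in the paper. The function names (`kernel`, `average_kernels`, `knn_sets`, `mutual_knn`) and the toy embeddings are hypothetical, and any normalization of embeddings before averaging is left out.

```python
def kernel(embeddings):
    """Gram matrix of one model: K[i][j] is the dot product of samples i and j."""
    return [[sum(a * b for a, b in zip(u, v)) for v in embeddings]
            for u in embeddings]


def average_kernels(kernels):
    """Element-wise mean of several n x n kernels computed on the same samples."""
    n = len(kernels[0])
    return [[sum(K[i][j] for K in kernels) / len(kernels) for j in range(n)]
            for i in range(n)]


def knn_sets(K, k):
    """For each sample, the set of its k most similar *other* samples under K."""
    n = len(K)
    sets = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: K[i][j], reverse=True)
        sets.append(set(others[:k]))
    return sets


def mutual_knn(K_a, K_b, k):
    """Mean fraction of k-nearest neighbors shared between two kernels (0 to 1)."""
    sets_a, sets_b = knn_sets(K_a, k), knn_sets(K_b, k)
    shared = sum(len(a & b) for a, b in zip(sets_a, sets_b))
    return shared / (k * len(K_a))


# Toy embeddings (assumed): four samples seen by two hypothetical models.
# The embedding widths differ (2 vs. 3), but both kernels are 4 x 4.
emb_a = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
emb_b = [[1.0, 0.0, 0.2], [0.8, 0.1, 0.1], [0.0, 1.0, 0.3], [0.2, 0.9, 0.0]]

K_a, K_b = kernel(emb_a), kernel(emb_b)
K_avg = average_kernels([K_a, K_b])
score = mutual_knn(K_avg, K_a, k=1)
```

Because every kernel is indexed by data samples rather than by weights or embedding dimensions, the two models need not share an architecture, an initialization, or even an embedding width, which is what makes this kind of averaging possible for independently trained models.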
Yidi Wu, Thomas Bohnstingl, et al.
ICML 2025
Gosia Lazuka, Andreea Simona Anghel, et al.
SC 2024
Yannis Belkhiter, Seshu Tirupathi, et al.
ICML 2026