SemCLIP: A Semantic Memory-Aligned Vision Language Model
Tanveer Syeda-Mahmood, Niharika DSouza, et al.
NeurIPS 2025
Modern computer vision models commonly rely on passive sensing and process images in their entirety all at once.
Because such models cannot zoom in on task-relevant regions for detailed analysis, this approach is limited for high-resolution, cluttered scenes where only a small area is relevant to the task at hand.
A particularly challenging problem in this context is instance detection, which involves localizing specific object instances given a few visual examples.
We introduce an active sensing system that uses a brain-inspired coarse-to-fine strategy to glimpse over the image by steering a retina-like sensor.
The sensor uses a log-polar pixel layout that facilitates precise localization of task-relevant regions.
Our system can be integrated with various state-of-the-art instance detectors. It improves their performance by up to 90%, making even small models developed for edge devices perform on par with or, in difficult cases, even better than their large counterparts.
Given these performance gains, our model could serve as a complementary component of sensor hardware, enabling active, task-driven sensing.
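To make the log-polar pixel layout mentioned above more concrete, the following is a minimal sketch of sampling an image on a log-polar grid around a fixation point. It is an illustration only and not the authors' sensor: the function names (`log_polar_grid`, `glimpse`), the parameters (`r_min`, `r_max`, `n_rings`, `n_wedges`), and the nearest-neighbour lookup are all assumptions made for this example.

```python
import numpy as np


def log_polar_grid(center, r_min, r_max, n_rings, n_wedges):
    """Return an (n_rings, n_wedges, 2) array of (row, col) sample coordinates.

    Hypothetical helper: ring radii grow geometrically, so samples are dense
    near the fixation point and sparse in the periphery.
    """
    cy, cx = center
    radii = np.geomspace(r_min, r_max, n_rings)
    angles = np.linspace(0.0, 2.0 * np.pi, n_wedges, endpoint=False)
    rr, aa = np.meshgrid(radii, angles, indexing="ij")
    rows = cy + rr * np.sin(aa)
    cols = cx + rr * np.cos(aa)
    return np.stack([rows, cols], axis=-1)


def glimpse(image, center, r_min=1.0, r_max=64.0, n_rings=16, n_wedges=32):
    """Sample `image` on the log-polar grid using nearest-neighbour lookup.

    Coordinates that fall outside the image are clamped to the border
    (an assumption of this sketch, not a property of the described system).
    """
    grid = log_polar_grid(center, r_min, r_max, n_rings, n_wedges)
    h, w = image.shape[:2]
    rows = np.clip(np.rint(grid[..., 0]).astype(int), 0, h - 1)
    cols = np.clip(np.rint(grid[..., 1]).astype(int), 0, w - 1)
    return image[rows, cols]


# Usage: one glimpse centred on pixel (128, 128) of a synthetic image.
image = np.arange(256 * 256, dtype=np.float64).reshape(256, 256)
patch = glimpse(image, center=(128, 128))
```

Under these assumptions, moving `center` corresponds to steering the sensor, and shrinking `r_max` corresponds to a finer glimpse over a smaller region, which is one simple way to picture a coarse-to-fine search.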