This paper addresses text-guided few-shot semantic segmentation (FSS), in which novel classes are segmented using image and text references as in-context examples, without any training. We improve the quality and stability of the segmentation masks produced by FSS by combining it with open-vocabulary zero-shot semantic segmentation (ZSS) built on image and text foundation models. We propose a training-free approach based on multimodal feature matching that segments a target image by identifying regions whose features match both the image and the text references. Experimental results demonstrate that the proposed method outperforms state-of-the-art FSS and ZSS methods.
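The core matching step described above can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes L2-normalized patch embeddings from a CLIP-style encoder, and the function name, feature shapes, fusion rule (simple averaging), and threshold are all hypothetical choices for illustration:

```python
import numpy as np

def match_multimodal_features(target_feats, image_ref_feats, text_ref_feat,
                              threshold=0.5):
    """Training-free multimodal feature matching (illustrative sketch).

    target_feats:    (N, D) patch embeddings of the target image
    image_ref_feats: (K, D) embeddings pooled from reference-image mask regions
    text_ref_feat:   (D,)   embedding of the class text prompt
    All features are assumed to be L2-normalized, so dot products are
    cosine similarities.
    """
    # Similarity of every target patch to every image reference feature,
    # keeping the best-matching reference per patch.
    img_sim = (target_feats @ image_ref_feats.T).max(axis=1)  # (N,)
    # Similarity of every target patch to the text reference.
    txt_sim = target_feats @ text_ref_feat                     # (N,)
    # Fuse the two modalities (simple average here, an assumption)
    # and threshold to obtain a boolean foreground mask.
    score = (img_sim + txt_sim) / 2.0
    return score > threshold
```

Reshaping the returned (N,) mask to the patch grid (H, W) and upsampling would yield a dense segmentation mask; in practice the fusion weights and threshold would be tuned or derived from the references.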