Grounding Everything: Emerging Localization Properties in Vision-Language Transformers. Walid Bousselham, Felix Petersen, et al. 2024. CVPR 2024. Conference paper.
Learning Situation Hyper-Graphs for Video Question Answering. Aisha Urooj Khan, Hilde Kuehne, et al. 2023. CVPR 2023. Conference paper.