Monitoring oil and gas wells is essential for assessing environmental degradation and understanding long‑term impacts such as methane emissions from abandoned and orphaned wells. Satellite imagery combined with machine learning provides scalable capabilities for identifying and characterizing oil and gas infrastructure, yet progress remains limited by the absence of multimodal, multiple‑choice (MCQ) vision–language datasets that support image‑grounded reasoning. Existing resources are almost entirely visual‑only and therefore do not enable systematic evaluation or post‑training of vision–language models (VLMs) for well interpretation. To address this gap, we introduce SatWellMCQ, a vision–language dataset of expert‑verified satellite imagery paired with textual descriptions and multiple‑choice supervision designed for image‑grounded identification and localization of oil wells.
SatWellMCQ provides high‑resolution multispectral Planet satellite imagery coupled with natural‑language annotations that describe well types and their spatial context. Each sample contains one expert‑verified description and three semantically plausible distractor descriptions drawn from other examples, supporting structured MCQ‑based evaluation. All samples were manually verified by a senior domain expert with 100% intra‑expert agreement, ensuring accurate alignment between images, labels, and text. The dataset spans four categories relevant to oil and gas well interpretation—active wells, suspended wells, abandoned wells, and control samples—yielding a balanced distribution for training and evaluation. We publicly release SatWellMCQ to support research on grounded infrastructure understanding and vision–language adaptation.
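As a concrete illustration of this construction, the Python sketch below assembles one MCQ item from a pool of verified records. The record fields, field names, and distractor-sampling policy are assumptions made for illustration, not the released pipeline.

import random
from dataclasses import dataclass

@dataclass
class WellRecord:
    image_path: str   # path to a Planet multispectral image chip (hypothetical field)
    description: str  # expert-verified textual description
    category: str     # active | suspended | abandoned | control

def build_mcq(record, pool, rng, n_distractors=3):
    # Distractor candidates come from *other* samples, so every option is a
    # semantically plausible well description but only one matches this image.
    candidates = [r.description for r in pool if r.description != record.description]
    options = rng.sample(candidates, n_distractors) + [record.description]
    rng.shuffle(options)
    return {
        "image": record.image_path,
        "options": options,
        "answer": options.index(record.description),
    }

rng = random.Random(0)
# pool = load_records(...)         # hypothetical loader for the verified records
# item = build_mcq(pool[0], pool, rng)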
We evaluate SatWellMCQ across a wide range of VLMs in both zero‑shot and supervised fine‑tuning (SFT) settings. In the zero‑shot setup, large‑scale models achieve moderate accuracy despite their broad multimodal capabilities. The best performance is obtained by Qwen3‑VL‑235B with an accuracy of 0.670, followed by other large models with accuracies between 0.416 and 0.600. Compact models transfer poorly in zero‑shot settings (e.g., Granite 3.3 2B at 0.422; Phi‑4‑multimodal‑instruct 6B at 0.376), demonstrating the difficulty of domain‑specific oil well interpretation without targeted supervision. Supervised fine‑tuning on SatWellMCQ leads to substantial improvements for compact models: Granite 3.3 2B improves to 0.722, and Phi‑4‑multimodal‑instruct 6B reaches 0.730, surpassing all zero‑shot baselines. These results show that SatWellMCQ poses a significant challenge to current VLMs while enabling effective domain adaptation through structured MCQ supervision.
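For reference, a minimal sketch of how such a zero-shot MCQ evaluation loop could look is shown below; the prompt wording, the answer-letter parsing, and the model callable are assumptions, since the exact protocol is not specified in this summary.

def format_prompt(options):
    # Render a four-way MCQ prompt; the wording here is illustrative.
    lines = ["Which description matches the satellite image?"]
    lines += [f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def mcq_accuracy(model, items):
    # `model` is any callable (image_path, prompt) -> free-form text reply,
    # standing in for whichever VLM inference API is used.
    correct = 0
    for item in items:
        reply = model(item["image"], format_prompt(item["options"]))
        # Take the first A-D letter in the reply as the predicted option.
        choice = next((c for c in reply.upper() if c in "ABCD"), None)
        correct += int(choice == "ABCD"[item["answer"]])
    return correct / len(items)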
Our main contributions are: (1) We introduce and publicly release the SatWellMCQ dataset, pairing expert‑verified satellite imagery with structured MCQ annotations to support research on grounded infrastructure understanding and vision–language adaptation. (2) We show that SatWellMCQ is challenging for current VLMs: the best zero‑shot large model (Qwen3‑VL‑235B) reaches 0.670 accuracy, while supervised fine‑tuning lifts compact models such as Granite 3.3 2B and Phi‑4‑multimodal‑instruct 6B to 0.722 and 0.730, respectively, demonstrating its value as a post‑training resource.
Overall, SatWellMCQ provides a resource for training and evaluating image‑grounded reasoning in satellite imagery of oil wells and demonstrates the importance of domain‑specific supervision for advancing VLMs in geoscientific applications.