Beyond Visual Fidelity: Benchmarking Super-Resolution Models for Large-Scale Remote Sensing Imagery via Downstream Task Integration

Zhili Li, Kangyang Chai, Zhihao Wang, Xiaowei Jia, Yanhua Li, Gengchen Mai, Sergii Skakun, Dinesh Manocha, Yiqun Xie

arXiv 2026 (v1, preprint; IEEE journal format) · 2026 · ★★★½☆ 3.5/5

My reading notes

Why it matters

Directly relevant to Arun's GeoAI and trustworthy spatiotemporal ML work: a rigorous, evidence-backed warning that fidelity metrics mislead when generative/SR models feed Earth-monitoring pipelines, echoing the spurious-detail / distortion-aware framing of his thesis. Co-authored by spatial-AI researchers from his own venue community (Xiaowei Jia, Gengchen Mai, Yiqun Xie).

Summary

The paper argues that the satellite super-resolution (SR) literature optimizes and reports the wrong thing. Image-fidelity metrics like PSNR and SSIM dominate evaluation, but the actual value of a super-resolved satellite image is whether it helps downstream Earth-monitoring tasks (land-cover segmentation, infrastructure mapping, biophysical regression). The authors build GeoSR-Bench, which they position as the first SR benchmark to connect resolution enhancement directly to downstream task performance rather than perceptual similarity.

The benchmark covers two cross-platform, real (not synthetically downsampled) SR settings, each spanning a roughly 16x jump in ground resolution: coarse-to-medium (MODIS 500 m to Landsat-8 30 m) and medium-to-high (Sentinel-2 10 m to NAIP 0.6 m). The data are spatially co-located, temporally aligned, quality-controlled image pairs drawn from about 36,000 locations, with stratified land-cover sampling that deliberately over-weights texture-rich urban scenes where SR matters most. Each SR setting comes with five pixel-level downstream tasks (segmentation plus regression targets such as gross primary production and canopy height), with labels sourced from established products (ESA WorldCover, USDA CDL, NLCD, Microsoft Buildings/Roads, etc.).
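A quick arithmetic check of the "roughly 16x" figure, using only the nominal resolutions quoted above. This is a minimal sketch: the dictionary keys and the `scale_factor` helper are my own names, not anything from the paper or its code release.

```python
# Nominal ground sample distances in metres, as quoted in the notes above.
# The names below are illustrative, not taken from the GeoSR-Bench code.
SETTINGS = {
    "MODIS -> Landsat-8": (500.0, 30.0),
    "Sentinel-2 -> NAIP": (10.0, 0.6),
}


def scale_factor(coarse_m: float, fine_m: float) -> float:
    """Linear upscaling factor between a coarse and a fine resolution."""
    return coarse_m / fine_m


factors = {name: scale_factor(*gsd) for name, gsd in SETTINGS.items()}

for name, factor in factors.items():
    # factor ** 2 is the number of fine pixels covered by one coarse pixel.
    print(f"{name}: {factor:.2f}x linear, about {factor ** 2:.0f} fine pixels per coarse pixel")
```

Both settings come out at the same linear factor of about 16.7, which is why a single "roughly 16x" label covers both.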

Across 270 experimental settings (9 SR models spanning transformer, neural-operator, GAN, and diffusion families; 3 task models: UNet, SegFormer, Swin Transformer; 5 tasks in each of the 2 SR settings), the headline finding is that fidelity gains do not reliably track downstream gains. The correlation can even be negative, especially among the top-performing models and most notably in the harder Sentinel-2-to-NAIP regime. A top-k correlation analysis sharpens this: agreement between PSNR/SSIM and task utility appears mainly as k grows to include weaker models, whereas within the competitive top groups that actually matter for model selection, the fidelity metrics give little or even misleading guidance. The dataset, trained models, and protocols are open-sourced.
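To make the top-k idea concrete for myself, here is a minimal sketch. Several things in it are assumptions rather than the paper's protocol: I rank models by downstream score to pick the top k, I use Spearman rank correlation, and the PSNR and mIoU numbers are made up purely to illustrate the effect (they are not results from the paper). The first lines only confirm the 9 x 3 x 5 x 2 = 270 count.

```python
from itertools import product

# Experimental grid from the notes: SR models x task models x tasks x SR settings.
n_settings = len(list(product(range(9), range(3), range(5), range(2))))  # 270


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            out[order[pos]] = mean_rank
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation; NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def top_k_correlation(fidelity, utility, k):
    """Spearman correlation over the k models with the highest utility.

    Selecting the top k by downstream utility is an assumption made here.
    """
    top = sorted(range(len(utility)), key=lambda i: utility[i], reverse=True)[:k]
    return spearman([fidelity[i] for i in top], [utility[i] for i in top])


# HYPOTHETICAL scores for 9 SR models (illustration only, not paper results).
psnr = [22, 23, 24, 25, 26, 30, 29, 28, 27]
miou = [0.40, 0.45, 0.50, 0.55, 0.60, 0.70, 0.71, 0.72, 0.73]

rho_all = top_k_correlation(psnr, miou, k=9)   # all models: positive
rho_top4 = top_k_correlation(psnr, miou, k=4)  # top group: negative
print(n_settings, round(rho_all, 3), round(rho_top4, 3))
```

With these toy numbers the correlation over all nine models is strongly positive, while inside the top four it flips to negative. That is the pattern the paper describes: fidelity separates weak models from strong ones but says little about which strong model to pick.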

Key ideas

Takeaways for my work

remote-sensing · super-resolution · benchmark · GeoAI · downstream-task-evaluation