Beyond Visual Fidelity: Benchmarking Super-Resolution Models for Large-Scale Remote Sensing Imagery via Downstream Task Integration

Zhili Li, Kangyang Chai, Zhihao Wang, Xiaowei Jia, Yanhua Li, Gengchen Mai, Sergii Skakun, Dinesh Manocha, Yiqun Xie

arXiv 2026 (v1, preprint; IEEE journal format) · 2026 · ★★★½☆ 3.5/5

My reading notes

Why it matters

Directly relevant to Arun's GeoAI and trustworthy spatiotemporal ML work: a rigorous, evidence-backed warning that fidelity metrics mislead when generative/SR models feed Earth-monitoring pipelines, echoing his spurious-detail / distortion-aware thesis framing. Co-authored by spatial-AI names in his exact venue community (Xiaowei Jia, Gengchen Mai, Yiqun Xie).

Summary

The paper argues that the satellite super-resolution (SR) literature optimizes and reports the wrong thing. Image-fidelity metrics like PSNR and SSIM dominate evaluation, but the actual value of a super-resolved satellite image is whether it helps downstream Earth-monitoring tasks (land-cover segmentation, infrastructure mapping, biophysical regression). The authors build GeoSR-Bench, which they position as the first SR benchmark to connect resolution enhancement directly to downstream task performance rather than perceptual similarity.

The benchmark covers two cross-platform, real (not synthetically downsampled) SR settings, each a roughly 16x jump in spatial resolution: coarse-to-medium (MODIS 500 m to Landsat-8 30 m) and medium-to-high (Sentinel-2 10 m to NAIP 0.6 m). Data are spatially co-located, temporally aligned, and quality-controlled image pairs drawn from about 36,000 locations, with stratified land-cover sampling that deliberately over-weights texture-rich urban scenes where SR matters most. Each SR setting ships five pixel-level downstream tasks (segmentation plus regression targets like gross primary production and canopy height), with labels sourced from established products (ESA WorldCover, USDA CDL, NLCD, Microsoft Buildings/Roads, etc.).

The evaluation covers 270 experimental settings: 9 SR models (spanning transformer, neural-operator, GAN, and diffusion families) x 3 task models (UNet, SegFormer, Swin Transformer) x 5 tasks x 2 SR settings. The headline finding is that fidelity gains do not reliably track downstream gains; the correlation can even be negative, especially among the top-performing models and in the harder Sentinel-2-to-NAIP regime. A top-k correlation analysis sharpens this: agreement between PSNR/SSIM and task utility appears mainly as k grows to include weaker models, whereas within the competitive top groups that matter for model selection, the fidelity metrics give little or even misleading guidance. The dataset, trained models, and protocols are open-sourced.
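To make the top-k idea concrete for myself, here is a minimal sketch of how such an analysis could be computed. It assumes Spearman rank correlation and assumes the top-k group is chosen by downstream utility; the paper may define either differently. The function names and the six (PSNR, mIoU) pairs are invented for illustration and are not results from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for p in range(i, j + 1):
            ranks[order[p]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def top_k_correlation(fidelity, utility, k):
    """Spearman correlation restricted to the k models with highest utility.

    Assumption: the top-k group is selected by downstream utility.
    """
    top = sorted(range(len(utility)), key=lambda i: utility[i], reverse=True)[:k]
    return spearman([fidelity[i] for i in top], [utility[i] for i in top])


# Toy numbers invented for illustration; NOT taken from the paper.
psnr = [24.0, 25.0, 26.0, 29.0, 28.5, 28.0]
miou = [0.40, 0.45, 0.50, 0.60, 0.62, 0.64]

for k in (3, 6):
    print(k, round(top_k_correlation(psnr, miou, k), 3))
```

On these toy numbers the correlation is -1 within the top three models and about 0.77 across all six, which is the qualitative pattern described above: agreement shows up only once weaker models enter the comparison.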

Key ideas

GeoSR-Bench reframes SR evaluation around downstream pixel-level tasks (segmentation and regression) instead of PSNR/SSIM, using two real cross-platform settings: MODIS to Landsat-8 and Sentinel-2 to NAIP.
Uses real, co-located, temporally aligned image pairs (not synthetic downsampling) with stratified urban/non-urban land-cover sampling emphasizing texture-rich urban scenes; about 36,000 locations total (roughly 20k SR pairs plus 16k downstream-labeled sets).
Benchmarks 9 SR models across four paradigms (transformer, neural operator, GAN, diffusion) feeding 3 downstream models over 5 tasks in each of the 2 SR settings, totaling 270 settings (9 x 3 x 5 x 2).
Central result: PSNR/SSIM improvements often do not correlate with downstream task gains, and correlations can be negative, particularly for the strongest models and the harder medium-to-high task.
A top-k correlation analysis shows fidelity-vs-utility agreement strengthens only as k grows to include weaker models; among the competitive top groups the metrics are uninformative or misleading for model selection.
Sharper SR outputs can hallucinate spurious detail that degrades downstream maps, motivating task-integrated SR development rather than fidelity-only optimization.
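A toy sketch of how a fidelity metric and a task metric can rank two outputs in opposite order. This is only an illustration of the mismatch, not the paper's experiment or its explanation of the mechanism: the "task" is a hypothetical threshold classifier standing in for a real segmentation model, and all pixel values are invented.

```python
from math import log10


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB for two equal-length pixel lists."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return float("inf")
    return 10 * log10(max_value ** 2 / mse)


def threshold_accuracy(reference, estimate, threshold=0.5):
    """Fraction of pixels whose thresholded class matches the reference's class.

    Hypothetical stand-in for a downstream segmentation score.
    """
    hits = sum(
        (r >= threshold) == (e >= threshold) for r, e in zip(reference, estimate)
    )
    return hits / len(reference)


# Invented pixel values in [0, 1]; NOT data from the paper.
reference = [0.45, 0.55, 0.45, 0.55, 0.90, 0.10]
# Small errors everywhere, but four pixels end up on the wrong side of 0.5.
output_a = [0.52, 0.48, 0.52, 0.48, 0.90, 0.10]
# Larger errors, but every pixel stays on the correct side of 0.5.
output_b = [0.30, 0.70, 0.30, 0.70, 0.75, 0.25]

for name, out in (("A", output_a), ("B", output_b)):
    print(name, round(psnr(reference, out), 2), round(threshold_accuracy(reference, out), 3))
```

Output A wins on PSNR while output B wins on the task score, so selecting by PSNR alone would pick the less useful output in this toy case.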

Takeaways for my work

For MIRROR/DAFM and any generative spatial model, do not validate solely on reconstruction fidelity: bake a downstream task probe (segmentation/regression accuracy) into the eval harness, since fidelity can anti-correlate with utility.
The 'spurious details hurt decisions' result is direct empirical support for Arun's distortion-aware framing; cite it when arguing generative outputs need physics/task constraints, not just visual realism.
GeoSR-Bench is a ready-made, open multi-platform evaluation suite (MODIS/Landsat/Sentinel-2/NAIP with aligned labels, code at github.com/ai-spatial/GeoSR-Bench) usable as a testbed or comparison baseline for his RS/SR portfolio work.
Engineering note: the 270-setting matrix (SR setting x SR model x task model x task) is a clean template for a reproducible, parallelizable benchmarking pipeline on Slurm.
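A minimal sketch of how the 270-setting matrix could be enumerated and mapped to Slurm array indices. The SR settings and task-model families come from the notes above; the SR model names and task names are placeholders, since I have not copied the actual lists here, and none of this reflects how the GeoSR-Bench repository is organized.

```python
import os

# The two SR settings and three task models are named in the notes above.
SR_SETTINGS = ["modis_to_landsat8", "sentinel2_to_naip"]
TASK_MODELS = ["unet", "segformer", "swin_transformer"]

# Placeholder names: the 9 SR models and the 5 tasks per setting are hypothetical.
SR_MODELS = [f"sr_model_{i}" for i in range(9)]
TASKS = {s: [f"{s}_task_{i}" for i in range(5)] for s in SR_SETTINGS}

# One dict per experiment; each SR setting contributes its own 5 tasks.
GRID = [
    {"sr_setting": s, "sr_model": m, "task_model": t, "task": task}
    for s in SR_SETTINGS
    for m in SR_MODELS
    for t in TASK_MODELS
    for task in TASKS[s]
]


def config_for(index):
    """Return the experiment config for one 0-based array index."""
    return GRID[index]


if __name__ == "__main__":
    # Slurm sets SLURM_ARRAY_TASK_ID for job arrays; fall back to 0 locally.
    index = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    print(len(GRID), config_for(index))
```

Submitted as a job array (for example with `sbatch --array=0-269` on a wrapper script of my own), each array task would pick up exactly one configuration, which keeps runs independent and easy to rerun individually.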

remote-sensing · super-resolution · benchmark · GeoAI · downstream-task-evaluation