Beyond Visual Fidelity: Benchmarking Super-Resolution Models for Large-Scale Remote Sensing Imagery via Downstream Task Integration

Zhili Li, Kangyang Chai, Zhihao Wang, Xiaowei Jia, Yanhua Li, Gengchen Mai, Sergii Skakun, Dinesh Manocha, Yiqun Xie

arXiv 2026 (v2); under review at IEEE TPAMI · 2026 · ★★★★☆ 4/5

My reading notes

Why it matters

Directly relevant to Arun's GeoAI and trustworthy spatiotemporal ML work: it shows that fidelity metrics (PSNR/SSIM) can correlate negatively with downstream Earth-monitoring task performance. That reinforces his "visual sharpness can be spurious / distortion-aware" framing and his MIRROR position that generated detail must be evaluated against real tasks, not just appearance. The metadata-only spatio-temporal join via Apache Sedona, which avoids touching the petabyte-scale pixel archive, also overlaps his ML-systems and spatial-data interests.

Summary

The paper argues that remote-sensing super-resolution (SR) is almost always benchmarked on fidelity and perceptual metrics (PSNR, SSIM, LPIPS), but the real value of a sharper satellite image is whether it helps downstream Earth-observation tasks such as land-cover mapping, infrastructure (building/road) extraction, and biophysical regression (canopy height, gross primary production). To test this, the authors build GeoSR-Bench: a large, globally distributed dataset of spatially co-located, temporally aligned, cloud/snow-filtered image pairs across roughly 36,000 locations, covering two practically important cross-platform SR tasks: MODIS to Landsat-8 (500 m to 30 m) and Sentinel-2 to NAIP (10 m to 0.6 m). Both tasks require about a 16.7x resolution increase. Coincident pairs are discovered at scale (across tens of millions of images) using Apache Sedona spatio-temporal joins over lightweight metadata (image IDs, footprints, timestamps) rather than the petabyte-scale pixel data, then stratified by land cover (urban vs non-urban) for representative sampling.
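
To make the metadata-only pairing idea concrete for myself, here is a minimal pure-Python sketch. It is not the authors' pipeline and does not use the Apache Sedona API: it is a naive nested-loop stand-in for what a distributed spatio-temporal join would do. The `SceneMeta` record, the bounding-box footprints, the 3-day time window, and the sample records are all my assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class SceneMeta:
    """Lightweight metadata record: no pixels, only ID, footprint, timestamp."""
    image_id: str
    bbox: tuple  # (min_lon, min_lat, max_lon, max_lat); assumed footprint form
    acquired: datetime


def bboxes_intersect(a, b):
    """True when two axis-aligned (min_x, min_y, max_x, max_y) boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def coincident_pairs(low_res, high_res, max_gap):
    """Return (low_id, high_id) pairs that overlap in space and lie within max_gap in time.

    Naive O(n*m) loop for clarity; a distributed engine would rely on
    spatial partitioning and indexing instead of comparing every pair.
    """
    pairs = []
    for lo in low_res:
        for hi in high_res:
            close_in_time = abs(lo.acquired - hi.acquired) <= max_gap
            if close_in_time and bboxes_intersect(lo.bbox, hi.bbox):
                pairs.append((lo.image_id, hi.image_id))
    return pairs


# Hypothetical records for illustration only (not from the benchmark).
modis = [
    SceneMeta("modis_A", (-100.0, 40.0, -90.0, 50.0), datetime(2020, 6, 1)),
    SceneMeta("modis_B", (10.0, 0.0, 20.0, 10.0), datetime(2020, 6, 1)),
]
landsat = [
    SceneMeta("landsat_X", (-95.0, 44.0, -93.0, 46.0), datetime(2020, 6, 2)),
    SceneMeta("landsat_Y", (-95.0, 44.0, -93.0, 46.0), datetime(2020, 9, 1)),
]

# The 3-day tolerance is an assumed value, not one stated in these notes.
pairs = coincident_pairs(modis, landsat, max_gap=timedelta(days=3))
print(pairs)

# Scale factors implied by the nominal resolutions quoted above.
modis_to_landsat = 500 / 30
sentinel_to_naip = 10 / 0.6
print(round(modis_to_landsat, 1), round(sentinel_to_naip, 1))
```

Only the pair list leaves this step, so pixel data would be fetched afterwards and only for scenes that actually matched.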

Across a 270-setting evaluation grid, the headline result is that improvements in PSNR/SSIM frequently fail to predict, and can even anti-correlate with, downstream task gains. Pixel-L1 transformer and neural-operator models win on fidelity, while GAN/diffusion models produce sharper but lower-fidelity output that, in the higher-resolution NAIP setting, can yield the largest downstream gains. A persistent gap remains between SR-based and true high-resolution downstream performance, and for targets smaller than a low-resolution pixel, SR offers essentially no help.
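
A small sketch of how I would check the fidelity-versus-utility relationship on my own results: Pearson and Spearman correlation between PSNR and a downstream score, restricted to the k highest-PSNR models. The function names and the numbers are made up for illustration and are not results from the paper; the top-k selection by PSNR is my assumption about what "top-k competitive models" means.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def top_k_correlations(psnr, task_score, k):
    """Pearson and Spearman between PSNR and a task score among the k highest-PSNR models."""
    top = sorted(range(len(psnr)), key=lambda i: psnr[i], reverse=True)[:k]
    p = [psnr[i] for i in top]
    t = [task_score[i] for i in top]
    return pearson(p, t), spearman(p, t)


# Made-up numbers purely for illustration; NOT results from the paper.
psnr = [30.1, 29.8, 29.5, 27.2, 26.9]
f1 = [0.61, 0.63, 0.62, 0.66, 0.68]

pearson_r, spearman_rho = top_k_correlations(psnr, f1, k=5)
print(f"Pearson r = {pearson_r:.2f}, Spearman rho = {spearman_rho:.2f}")
```

In this toy data the higher-PSNR models have the lower F1, so both coefficients come out negative, which is the pattern the paper reports as frequent among competitive models.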

Key ideas

First SR benchmark for remote sensing that ties resolution gains to concrete downstream tasks (5 per SR task, 10 total): river/urban/cropland segmentation plus GPP and canopy-height regression for the coarse-to-medium MODIS-to-Landsat-8 task; buildings, roads, two land-cover datasets, and canopy height for the medium-to-high Sentinel-2-to-NAIP task.
Large 270-setting evaluation grid: 2 cross-platform SR tasks x 9 SR models (transformer ATD/RGT/CFAT/CAMixer, neural operator SRNO, GAN ESRGAN/SeD, diffusion BiDiff/UPSR) x 3 downstream models (U-Net, SegFormer, Swin) x 5 tasks.
Headline finding: gains in PSNR/SSIM often do NOT predict downstream task gains, and the Pearson/Spearman correlations within the top-k competitive models are frequently weak or negative, so fidelity metrics give limited guidance for model selection.
Training-objective effect: pixel-L1 transformer and neural-operator models win on PSNR/SSIM, while GAN/diffusion models produce sharper, more perceptually detailed output yet score worse on fidelity, illustrating the fidelity-vs-utility gap (SeD gives the largest downstream gains on NAIP tasks despite lower PSNR, with improvements ranging from ~4% to over 30%).
A persistent gap remains between SR-based and true high-resolution downstream performance; best models (ATD, SRNO) recover under one-third of the lower-to-higher resolution gap in most MODIS-Landsat cases, and worse models (BiDiff, ESRGAN) sometimes fall below the low-resolution baseline.
Sub-pixel limit: for targets smaller than a low-resolution pixel (TreeFinder individual dead-tree mapping), SR gives near-zero downstream improvement because generated detail need not align with true object locations, making recovery effectively ill-posed.
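
Two of the quantities above are easy to sanity-check in code: the size of the evaluation grid and the "fraction of the gap recovered" idea. The sketch below enumerates the grid from the lists in these notes and defines a hypothetical `gap_recovered` helper. The formula is my assumption about how such a fraction would be computed for a higher-is-better score, the two land-cover dataset names are placeholders, and the example scores are invented.

```python
SR_MODELS = ["ATD", "RGT", "CFAT", "CAMixer", "SRNO", "ESRGAN", "SeD", "BiDiff", "UPSR"]
DOWNSTREAM_MODELS = ["U-Net", "SegFormer", "Swin"]

# Downstream tasks per SR task, as listed in the notes above.
# "land cover A/B" are placeholder names for the two land-cover datasets.
DOWNSTREAM_TASKS = {
    "MODIS->Landsat-8": ["river", "urban", "cropland", "GPP", "canopy height"],
    "Sentinel-2->NAIP": ["buildings", "roads", "land cover A", "land cover B", "canopy height"],
}

# One setting = (SR task, SR model, downstream model, downstream task).
settings = [
    (sr_task, sr_model, ds_model, task)
    for sr_task, tasks in DOWNSTREAM_TASKS.items()
    for sr_model in SR_MODELS
    for ds_model in DOWNSTREAM_MODELS
    for task in tasks
]
print(len(settings))  # 2 * 9 * 3 * 5


def gap_recovered(low_res_score, sr_score, high_res_score):
    """Fraction of the low-res to high-res gap closed by SR (assumed definition).

    Assumes a higher-is-better score such as F1. A value near 1 means SR
    matches true high-resolution input; a negative value means SR did worse
    than the low-resolution baseline.
    """
    gap = high_res_score - low_res_score
    if gap == 0:
        raise ValueError("low-res and high-res scores are equal; gap is undefined")
    return (sr_score - low_res_score) / gap


# Invented scores for illustration only.
partial = gap_recovered(low_res_score=0.50, sr_score=0.56, high_res_score=0.74)
below_baseline = gap_recovered(low_res_score=0.50, sr_score=0.47, high_res_score=0.74)
print(round(partial, 2), round(below_baseline, 3))
```

For lower-is-better metrics such as MAE the same ratio works once the scores are negated, so one helper could cover both the segmentation and the regression tasks.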

Takeaways for my work

For MIRROR and any generative/SR work, adopt task-integrated evaluation as a first-class metric track, not just fidelity; this benchmark is a ready citation that fidelity can be misleading and even anti-correlated with downstream utility.
Strong evidence for the distortion-aware thesis: spurious or misaligned synthetic detail (sharp but wrong) often fails to help downstream maps and sometimes pushes them below the low-resolution baseline, which is exactly the deception/distortion failure mode Arun studies; useful framing for research and faculty statements.
Reusable systems pattern: discovering spatio-temporally coincident pairs across tens of millions of images via Apache Sedona on metadata only (IDs, footprints, timestamps) avoids touching PB-scale pixels, a clean blueprint for his Slurm-trained / K8s-served spatial data pipelines.
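Practical model-selection caution for GeoAI projects: do not rank SR models by PSNR/SSIM alone; report downstream F1/MAE alongside them. When downstream utility matters, especially for fine-scale 0.6 m NAIP tasks, evaluate GAN-based models such as SeD, but validate each candidate, since ESRGAN and BiDiff sometimes fall below the low-resolution baseline.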

remote sensing · super-resolution · GeoAI · benchmark dataset · downstream task evaluation