Beyond Visual Fidelity: Benchmarking Super-Resolution Models for Large-Scale Remote Sensing Imagery via Downstream Task Integration

Zhili Li, Kangyang Chai, Zhihao Wang, Xiaowei Jia, Yanhua Li, Gengchen Mai, Sergii Skakun, Dinesh Manocha, Yiqun Xie

arXiv 2026 (v2); under review at IEEE TPAMI · 2026 · ★★★★☆ 4/5

My reading notes

Why it matters

Directly relevant to Arun's GeoAI and trustworthy spatiotemporal ML work. The paper shows that fidelity metrics (PSNR/SSIM) can correlate negatively with downstream Earth-monitoring task performance. That reinforces his "visual sharpness can be spurious / distortion-aware" framing and his MIRROR position that generated detail must be evaluated against real tasks, not just appearance. The Apache Sedona spatio-temporal join, which indexes a petabyte-scale archive through its metadata, also overlaps his ML-systems and spatial-data interests.

Summary

The paper argues that remote-sensing super-resolution (SR) is almost always benchmarked on visual fidelity metrics (PSNR, SSIM, LPIPS), whereas the real value of a sharper satellite image is whether it helps downstream Earth-observation tasks: land-cover mapping, infrastructure (building/road) extraction, and biophysical regression (canopy height, gross primary production). To test this, the authors build GeoSR-Bench, a large, globally distributed dataset of spatially co-located, temporally aligned, cloud/snow-filtered image pairs across roughly 36,000 locations. It covers two practically important cross-platform SR tasks: MODIS to Landsat-8 (500 m to 30 m) and Sentinel-2 to NAIP (10 m to 0.6 m). Both require about a 16.7x resolution increase (500/30 and 10/0.6 are each roughly 16.7). Coincident pairs are discovered at scale (tens of millions of images) by running Apache Sedona spatio-temporal joins over lightweight metadata (image IDs, footprints, timestamps) rather than over the petabyte-scale pixel data. The matched pairs are then stratified by land cover (urban vs non-urban) for representative sampling.
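
To make the metadata-only join concrete for myself, here is a minimal sketch in plain Python. It is not the authors' Sedona pipeline: the `SceneMeta` record, the bounding-box footprints, the 3-day window, and all scene IDs and coordinates are my own illustrative assumptions, and a nested loop stands in for Sedona's distributed spatial join.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class SceneMeta:
    """Hypothetical lightweight record for one image: no pixels, only join keys."""
    image_id: str
    bbox: tuple  # (min_lon, min_lat, max_lon, max_lat), an assumed footprint format
    acquired: datetime


def footprints_overlap(a, b):
    """True when two axis-aligned bounding boxes intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def coincident_pairs(low_res, high_res, max_gap):
    """Return (low_id, high_id) pairs that overlap in space and are close in time."""
    pairs = []
    for lo in low_res:
        for hi in high_res:
            close_in_time = abs(lo.acquired - hi.acquired) <= max_gap
            if close_in_time and footprints_overlap(lo.bbox, hi.bbox):
                pairs.append((lo.image_id, hi.image_id))
    return pairs


# Made-up scenes: one overlapping and timely, one too late, one far away.
low = [SceneMeta("modis_a", (-100.0, 40.0, -90.0, 50.0), datetime(2020, 6, 1))]
high = [
    SceneMeta("landsat_x", (-95.0, 44.0, -93.0, 46.0), datetime(2020, 6, 2)),
    SceneMeta("landsat_y", (-95.0, 44.0, -93.0, 46.0), datetime(2020, 8, 1)),
    SceneMeta("landsat_z", (10.0, 10.0, 12.0, 12.0), datetime(2020, 6, 1)),
]
pairs = coincident_pairs(low, high, max_gap=timedelta(days=3))
print(pairs)
```

Only `landsat_x` survives both the spatial and the temporal predicate. The point of the design is that nothing here touches pixel data, so the expensive downloads happen only for pairs that already matched.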

Across a 270-setting evaluation grid, the headline result is that improvements in PSNR/SSIM frequently fail to predict downstream task gains and can even anti-correlate with them. Transformer and neural-operator models trained with a pixel-wise L1 loss win on fidelity, while GAN/diffusion models produce sharper but lower-fidelity output that, in the higher-resolution NAIP setting, can yield the largest downstream gains. A persistent gap remains between downstream performance on SR output and on true high-resolution imagery, and for targets smaller than a low-resolution pixel, SR offers essentially no help.
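
A quick way to check whether fidelity and downstream rankings disagree is a rank correlation across models. The sketch below is my own illustration, not the paper's analysis: the Spearman formula is the standard tie-free version, and the PSNR and IoU values are made-up placeholders for four hypothetical models, not results from the paper.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation for tie-free samples of equal length."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Placeholder scores for four hypothetical SR models (NOT from the paper).
psnr = [31.0, 29.5, 27.0, 26.0]
downstream_iou = [0.52, 0.55, 0.61, 0.63]
rho = spearman(psnr, downstream_iou)
print(rho)
```

With these toy numbers the two rankings are exact reverses of each other, so `rho` is -1. On real benchmark results, a value near zero or below would be the signature of the fidelity/downstream mismatch described above.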

Key ideas

Takeaways for my work

remote sensing · super-resolution · GeoAI · benchmark dataset · downstream task evaluation