Remote Sensing Technology and Application, Volume 40, Issue 1, 1(2025)

Remote Sensing Large Models: Review and Future Prospects

Shuaihao ZHANG and Zhigang PAN*
Author Affiliations
  • Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China
    References

    [1] ZHANG L, ZHANG L. Artificial intelligence for remote sensing data analysis: A review of challenges and opportunities. IEEE Geoscience and Remote Sensing Magazine, 10, 270-294(2022).

    [2] SUN Z, SANDOVAL L, CRYSTAL-ORNELAS R et al. A review of earth artificial intelligence. Computers & Geosciences, 159, 105034(2022).

    [3] ZHOU C, LI Q, LI C et al. A comprehensive survey on pretrained foundation models: A history from BERT to ChatGPT. arXiv Preprint(2023).

    [4] GEMINI TEAM, ANIL R, BORGEAUD S et al. Gemini: A family of highly capable multimodal models. arXiv Preprint(2023).

    [5] RADFORD A, KIM J W, HALLACY C et al. Learning transferable visual models from natural language supervision, 8748-8763(2021).

    [6] BOMMASANI R, HUDSON D A, ADELI E et al. On the opportunities and risks of foundation models. arXiv Preprint(2021).

    [7] ROLF E, KLEMMER K, ROBINSON C et al. Mission critical-satellite data is a distinct modality in machine learning. arXiv Preprint(2024).

    [8] KIRILLOV A, MINTUN E, RAVI N et al. Segment anything, 4015-4026(2023).

    [9] DOSOVITSKIY A, BEYER L, KOLESNIKOV A et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv Preprint(2020).

    [10] LIU Z, LIN Y, CAO Y et al. Swin transformer: Hierarchical vision transformer using shifted windows, 10012-10022(2021).

    [11] DAI Z, LIU H, LE Q V et al. Coatnet: Marrying convolution and attention for all data sizes. Advances in Neural Information Processing Systems, 34, 3965-3977(2021).

    [12] YE Q, XU H, XU G et al. mPLUG-Owl: Modularization empowers large language models with multimodality. arXiv Preprint(2023).

    [13] LI J, SELVARAJU R, GOTMARE A et al. Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems, 34, 9694-9705(2021).

    [14] LI J, LI D, XIONG C et al. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation, 12888-12900(2022).

    [15] ALAYRAC J B, DONAHUE J, LUC P et al. Flamingo: A visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35, 23716-23736(2022).

    [16] LIU H, LI C, WU Q et al. Visual instruction tuning. arXiv Preprint(2023).

    [17] PENG Z, WANG W, DONG L et al. Kosmos-2: Grounding multimodal large language models to the world. arXiv Preprint(2023).

    [18] REBUFFI S, BILEN H, VEDALDI A. Learning multiple visual domains with residual adapters. arXiv Preprint(2017).

    [19] XING J, LIU J, WANG J et al. A survey of efficient fine-tuning methods for Vision-Language Models — Prompt and Adapter. Computers & Graphics, 119, 103885(2024).

    [20] ROLF E, KLEMMER K, ROBINSON C et al. Mission critical-satellite data is a distinct modality in machine learning. arXiv Preprint(2024).

    [21] FULLER A, MILLARD K, GREEN J R. Transfer learning with pretrained remote sensing transformers. arXiv Preprint(2022).

    [22] NOMAN M, FIAZ M, CHOLAKKAL H et al. Remote sensing change detection with transformers trained from scratch. IEEE Transactions on Geoscience and Remote Sensing, 62, 1-14(2024).

    [23] WANG D, ZHANG J, DU B et al. An empirical study of remote sensing pretraining. IEEE Transactions on Geoscience and Remote Sensing, 61, 1-20(2023).

    [24] JING L, TIAN Y. Self-supervised visual feature learning with deep neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43, 4037-4058(2020).

    [25] LIU X, ZHANG F, HOU Z et al. Self-supervised learning: Generative or contrastive. IEEE Transactions on Knowledge and Data Engineering, 35, 857-876(2021).

    [26] STOJNIC V, RISOJEVIC V. Self-supervised learning of remote sensing scene representations using contrastive multiview coding, 1182-1191(2021).

    [27] WANG Y, ALBRECHT C M, BRAHAM N A A et al. Self-supervised learning in remote sensing: A review. IEEE Geoscience and Remote Sensing Magazine, 10, 213-247(2022).

    [28] PENG B, HUANG Q, VONGKUSOLKIT J et al. Urban flood mapping with bitemporal multispectral imagery via a self-supervised learning framework. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 14, 2001-2016(2021).

    [29] JIN Q, MA Y, FAN F et al. Adversarial autoencoder network for hyperspectral unmixing. IEEE Transactions on Neural Networks and Learning Systems, 34, 4555-4569(2021).

    [30] BROCK A, DONAHUE J, SIMONYAN K. Large scale GAN training for high fidelity natural image synthesis. arXiv Preprint(2018).

    [31] LEDIG C, THEIS L, HUSZAR F et al. Photo-realistic single image super-resolution using a generative adversarial network, 4681-4690(2017).

    [32] CHEN T, KORNBLITH S, NOROUZI M et al. A simple framework for contrastive learning of visual representations, 1597-1607(2020).

    [33] HE K, FAN H, WU Y et al. Momentum contrast for unsupervised visual representation learning, 9729-9738(2020).

    [34] CHEN X, FAN H, GIRSHICK R et al. Improved baselines with momentum contrastive learning. arXiv Preprint(2020).

    [35] CHEN X, XIE S, HE K. An empirical study of training self-supervised visual transformers. arXiv Preprint(2021).

    [36] CARON M, BOJANOWSKI P, JOULIN A et al. Deep clustering for unsupervised learning of visual features. arXiv Preprint(2018).

    [37] CARON M, MISRA I, MAIRAL J et al. Unsupervised learning of visual features by contrasting cluster assignments. arXiv Preprint(2020).

    [38] GRILL J B, STRUB F, ALTCHÉ F et al. Bootstrap your own latent: A new approach to self-supervised learning. arXiv Preprint(2020).

    [39] CARON M, TOUVRON H, MISRA I et al. Emerging properties in self-supervised vision transformers. arXiv Preprint(2021).

    [40] LONG Y, XIA G S, LI S et al. On creating benchmark dataset for aerial image interpretation: Reviews, guidances, and Million-AID. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 14, 4205-4230(2021).

    [41] JI H, GAO Z, ZHANG Y et al. Few-shot scene classification of optical remote sensing images leveraging calibrated pretext tasks. IEEE Transactions on Geoscience and Remote Sensing, 60, 1-13(2022).

    [42] GONZÁLEZ-SANTIAGO J, SCHENKEL F, MIDDELMANN W. Self-supervised image colorization for semantic segmentation of urban land cover, 3468-3471(2021).

    [43] LI X, WEN C, HU Y et al. Vision-language models in remote sensing: Current progress and future trends. IEEE Geoscience and Remote Sensing Magazine, 12, 32-66(2024).

    [44] CHAPPUIS C, ZERMATTEN V, LOBRY S et al. Prompt-RSVQA: Prompting visual context to a language model for remote sensing visual question answering. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 19, 1372-1381(2022).

    [45] MALL U, PHOO C P, LIU M K et al. Remote sensing vision-language foundation models without annotations via ground remote alignment. arXiv Preprint(2023).

    [46] SMITH M J, FLEMING L, GEACH J E. EarthPT: A time series foundation model for earth observation. arXiv Preprint(2023).

    [47] GUO X, LAO J, DANG B et al. SkySense: A multi-modal remote sensing foundation model towards universal interpretation for earth observation imagery. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16, 27672-27683(2024).

    [48] BOUNTOS N, OUAKNINE A, ROLNICK D. FoMo-Bench: A multi-modal, multi-scale and multi-task forest monitoring benchmark for remote sensing foundation models. arXiv Preprint(2023).

    [49] WEI T, YUAN W, LUO J et al. VLCA: Vision-language aligning model with cross-modal attention for bilingual remote sensing image captioning. Journal of Systems Engineering and Electronics, 34, 9-18(2023).

    [50] KIM W, KIM I. ViLT: Vision-and-language transformer without convolution or region supervision. arXiv Preprint(2021).

    [51] LI L H, ZHANG P, ZHANG H et al. Grounded language-image pre-training. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18, 10965-10975(2022).

    [52] LIU F, CHEN D, GUAN Z et al. RemoteCLIP: A vision language foundation model for remote sensing. IEEE Transactions on Geoscience and Remote Sensing, 62, 1-16(2024).

    [53] ZHANG J, ZHOU Z et al. Text2Seg: Remote sensing image semantic segmentation via text-guided visual foundation models. arXiv Preprint(2023).

    [54] MAZUROWSKI M, DONG H, GU H et al. Segment anything model for medical image analysis: An experimental study. Medical Image Analysis, 89, 102918(2023).

    [55] LIU S, ZENG Z, REN T et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection, 38-55(2025).

    [56] LOBRY S, MARCOS D, MURRAY J et al. RSVQA: Visual question answering for remote sensing data. IEEE Transactions on Geoscience and Remote Sensing, 58, 8555-8566(2020).

    [57] AL RAHHAL M M, BAZI Y, ALSALEH S O et al. Open-ended remote sensing visual question answering with transformers. International Journal of Remote Sensing, 43, 6809-6823(2022).

    [58] BAZI Y, AL RAHHAL M M, MEKHALFI M L et al. Bimodal transformer-based approach for visual question answering in remote sensing imagery. IEEE Transactions on Geoscience and Remote Sensing, 60, 1-11(2022).

    [59] CHAPPUIS C, ZERMATTEN V, LOBRY S et al. Prompt-RSVQA: Prompting visual context to a language model for remote sensing visual question answering, 1372-1381(2022).

    [60] YUAN Z, MOU L, WANG Q et al. From easy to hard: Learning language-guided curriculum for visual question answering on remote sensing data. IEEE Transactions on Geoscience and Remote Sensing, 60, 1-11(2022).

    [61] LOBRY S, TUIA D. Visual question answering on remote sensing images. Advances in Machine Learning and Image Analysis for GeoAI, 237-254(2024).

    [62] MIKRIUKOV G, RAVANBAKHSH M, DEMIR B. Deep unsupervised contrastive hashing for large-scale cross-modal text-image retrieval in remote sensing. arXiv Preprint(2022).

    [63] HU Y, YUAN J, WEN C et al. RSGPT: A remote sensing vision language model and benchmark. arXiv Preprint(2023).

    [64] ZHAN Y, XIONG Z, YUAN Y. SkyEyeGPT: Unifying remote sensing vision-language tasks via instruction tuning with large language model. arXiv Preprint(2024).

    [65] WANG J, LIU F, JIAO L et al. SSCFNet: A spatial-spectral cross fusion network for remote sensing change detection. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 16, 4000-4012(2023).

    [66] LI K, CAO X, MENG D et al. A new learning paradigm for foundation model-based remote-sensing change detection. IEEE Transactions on Geoscience and Remote Sensing, 62, 1-12(2024).

    [67] WANG M, CHEN K, SU L et al. RSBuilding: Towards general remote sensing image building extraction and change detection with Foundation Model. IEEE Transactions on Geoscience and Remote Sensing, 62, 1-17(2024).

    [68] SUN X, WANG P, LU W et al. RingMo: A remote sensing foundation model with masked image modeling. IEEE Transactions on Geoscience and Remote Sensing, 61, 1-22(2023).

    [69] CHEN K, LIU C, CHEN H et al. RSPrompter: Learning to prompt for remote sensing instance segmentation based on visual foundation model. IEEE Transactions on Geoscience and Remote Sensing, 62, 3356074(2024).

    [70] HONG D, ZHANG B, LI X et al. SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46, 5227-5244(2024).

    [71] KUCKREJA K, DANISH M, NASEER M et al. GeoChat: Grounded large vision-language model for remote sensing. arXiv Preprint(2023).

    [72] WANG Y, ALBRECHT C, ZHU X. Self-supervised vision transformers for joint SAR-optical representation learning. arXiv Preprint(2022).

    [73] ZHANG W, CAI M, ZHANG T et al. EarthGPT: A universal multimodal large language model for multisensor image comprehension in remote sensing domain. IEEE Transactions on Geoscience and Remote Sensing, 62, 1-20(2024).

    [74] WANG Z, PRABHA R, HUANG T et al. SkyScript: A large and semantically diverse vision-language dataset for remote sensing. arXiv Preprint(2023).

    [75] LEVKOVICH I, ELYOSEPH Z. Suicide risk assessments through the eyes of ChatGPT-3.5 versus ChatGPT-4: Vignette study. JMIR Mental Health, 10(2023).

    [76] CORLEY I, ROBINSON C, DODHIA R et al. Revisiting pre-trained remote sensing model benchmarks: Resizing and normalization matters. arXiv Preprint(2023).

    [77] DONG R, YUAN S, LUO B et al. Building bridges across spatial and temporal resolutions: Reference-based super-resolution via change priors and conditional diffusion model. arXiv Preprint(2024).

    [78] ROMBACH R, BLATTMANN A, LORENZ D et al. High-resolution image synthesis with latent diffusion models, 10684-10695(2022).

    Paper Information

    Received: Jul. 21, 2024

    Accepted: --

    Published Online: May 22, 2025

    Corresponding Author Email: Zhigang PAN (zgpan@mail.ie.ac.cn)

    DOI:10.11873/j.issn.1004-0323.2025.1.0001
