Remote Sensing Technology and Application, Volume 40, Issue 1, 1(2025)

Remote Sensing Large Models: Review and Future Prospects

Shuaihao ZHANG and Zhigang PAN*
Author Affiliations
  • Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China
    References

    [1] ZHANG L, ZHANG L. Artificial intelligence for remote sensing data analysis: A review of challenges and opportunities. IEEE Geoscience and Remote Sensing Magazine, 10, 270-294(2022).

    [2] SUN Z, SANDOVAL L, CRYSTAL-ORNELAS R et al. A review of earth artificial intelligence. Computers & Geosciences, 159, 105034(2022).

    [3] ZHOU C, LI Q, LI C et al. A comprehensive survey on pretrained foundation models: A history from BERT to ChatGPT. arXiv Preprint(2023).

    [4] GEMINI TEAM, ANIL R, BORGEAUD S et al. Gemini: A family of highly capable multimodal models. arXiv Preprint(2023).

    [5] RADFORD A, KIM J W, HALLACY C et al. Learning transferable visual models from natural language supervision, 8748-8763(2021).

    [6] BOMMASANI R, HUDSON D A, ADELI E et al. On the opportunities and risks of foundation models. arXiv Preprint(2021).

    [7] ROLF E, KLEMMER K, ROBINSON C et al. Mission critical-satellite data is a distinct modality in machine learning. arXiv Preprint(2024).

    [8] KIRILLOV A, MINTUN E, RAVI N et al. Segment anything, 4015-4026(2023).

    [9] DOSOVITSKIY A, BEYER L, KOLESNIKOV A et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv Preprint(2020).

    [10] LIU Z, LIN Y, CAO Y et al. Swin transformer: Hierarchical vision transformer using shifted windows, 10012-10022(2021).

    [11] DAI Z, LIU H, LE Q V et al. Coatnet: Marrying convolution and attention for all data sizes. Advances in Neural Information Processing Systems, 34, 3965-3977(2021).

    [12] YE Q, XU H, XU G et al. mPLUG-Owl: Modularization empowers large language models with multimodality. arXiv Preprint(2023).

    [13] LI J, SELVARAJU R, GOTMARE A et al. Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems, 34, 9694-9705(2021).

    [14] LI J, LI D, XIONG C et al. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation, 12888-12900(2022).

    [15] ALAYRAC J B, DONAHUE J, LUC P et al. Flamingo: A visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35, 23716-23736(2022).

    [16] LIU H, LI C, WU Q et al. Visual instruction tuning. arXiv Preprint(2023).

    [17] PENG Z, WANG W, DONG L et al. Kosmos-2: Grounding multimodal large language models to the world. arXiv Preprint(2023).

    [18] REBUFFI S, BILEN H, VEDALDI A. Learning multiple visual domains with residual adapters. arXiv Preprint(2017).

    [19] XING J, LIU J, WANG J et al. A survey of efficient fine-tuning methods for Vision-Language Models — Prompt and Adapter. Computers & Graphics, 119, 103885(2024).

    [20] ROLF E, KLEMMER K, ROBINSON C et al. Mission critical-satellite data is a distinct modality in machine learning. arXiv Preprint(2024).

    [21] FULLER A, MILLARD K, GREEN J R. Transfer learning with pretrained remote sensing transformers. arXiv Preprint(2022).

    [22] NOMAN M, FIAZ M, CHOLAKKAL H et al. Remote sensing change detection with transformers trained from scratch. IEEE Transactions on Geoscience and Remote Sensing, 62, 1-14(2024).

    [23] WANG D, ZHANG J, DU B et al. An empirical study of remote sensing pretraining. IEEE Transactions on Geoscience and Remote Sensing, 61, 1-20(2023).

    [24] JING L, TIAN Y. Self-supervised visual feature learning with deep neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43, 4037-4058(2020).

    [25] LIU X, ZHANG F, HOU Z et al. Self-supervised learning: Generative or contrastive. IEEE Transactions on Knowledge and Data Engineering, 35, 857-876(2021).

    [26] STOJNIC V, RISOJEVIC V. Self-supervised learning of remote sensing scene representations using contrastive multiview coding, 1182-1191(2021).

    [27] WANG Y, ALBRECHT C M, BRAHAM N A A et al. Self-supervised learning in remote sensing: A review. IEEE Geoscience and Remote Sensing Magazine, 10, 213-247(2022).

    [28] PENG B, HUANG Q, VONGKUSOLKIT J et al. Urban flood mapping with bitemporal multispectral imagery via a self-supervised learning framework. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 14, 2001-2016(2021).

    [29] JIN Q, MA Y, FAN F et al. Adversarial autoencoder network for hyperspectral unmixing. IEEE Transactions on Neural Networks and Learning Systems, 34, 4555-4569(2021).

    [30] BROCK A, DONAHUE J, SIMONYAN K. Large scale GAN training for high fidelity natural image synthesis. arXiv Preprint(2018).

    [31] LEDIG C, THEIS L, HUSZAR F et al. Photo-realistic single image super-resolution using a generative adversarial network, 4681-4690(2017).

    [32] CHEN T, KORNBLITH S, NOROUZI M et al. A simple framework for contrastive learning of visual representations, 1597-1607(2020).

    [33] HE K, FAN H, WU Y et al. Momentum contrast for unsupervised visual representation learning, 9729-9738(2020).

    [34] CHEN X, FAN H, GIRSHICK R et al. Improved baselines with momentum contrastive learning. arXiv Preprint(2020).

    [35] CHEN X, XIE S, HE K. An empirical study of training self-supervised visual transformers. arXiv Preprint(2021).

    [36] CARON M, BOJANOWSKI P, JOULIN A et al. Deep clustering for unsupervised learning of visual features. arXiv Preprint(2018).

    [37] CARON M, MISRA I, MAIRAL J et al. Unsupervised learning of visual features by contrasting cluster assignments. arXiv Preprint(2020).

    [38] GRILL J B, STRUB F, ALTCHÉ F et al. Bootstrap your own latent: A new approach to self-supervised learning. arXiv Preprint(2020).

    [39] CARON M, TOUVRON H, MISRA I et al. Emerging properties in self-supervised vision transformers. arXiv Preprint(2021).

    [40] LONG Y, XIA G S, LI S et al. On creating benchmark dataset for aerial image interpretation: Reviews, guidances, and Million-AID. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 14, 4205-4230(2021).

    [41] JI H, GAO Z, ZHANG Y et al. Few-shot scene classification of optical remote sensing images leveraging calibrated pretext tasks. IEEE Transactions on Geoscience and Remote Sensing, 60, 1-13(2022).

    [42] GONZÁLEZ-SANTIAGO J, SCHENKEL F, MIDDELMANN W. Self-supervised image colorization for semantic segmentation of urban land cover, 3468-3471(2021).

    [43] LI X, WEN C, HU Y et al. Vision-language models in remote sensing: Current progress and future trends. IEEE Geoscience and Remote Sensing Magazine, 12, 32-66(2024).

    [44] CHAPPUIS C, ZERMATTEN V, LOBRY S et al. Prompt-RSVQA: Prompting visual context to a language model for remote sensing visual question answering. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 19, 1372-1381(2022).

    [45] MALL U, PHOO C P, LIU M K et al. Remote sensing vision-language foundation models without annotations via ground remote alignment. arXiv Preprint(2023).

    [46] SMITH M J, FLEMING L, GEACH J E. EarthPT: A time series foundation model for earth observation. arXiv Preprint(2023).

    [47] GUO X, LAO J, DANG B et al. SkySense: A multi-modal remote sensing foundation model towards universal interpretation for earth observation imagery. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16, 27672-27683(2024).

    [48] BOUNTOS N, OUAKNINE A, ROLNICK D. FoMo-Bench: A multi-modal, multi-scale and multi-task forest monitoring benchmark for remote sensing foundation models. arXiv Preprint(2023).

    [49] WEI T, YUAN W, LUO J et al. VLCA: Vision-language aligning model with cross-modal attention for bilingual remote sensing image captioning. Journal of Systems Engineering and Electronics, 34, 9-18(2023).

    [50] KIM W, KIM I. ViLT: Vision-and-language transformer without convolution or region supervision. arXiv Preprint(2021).

    [51] LI L H, ZHANG P, ZHANG H et al. Grounded language-image pre-training. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18, 10965-10975(2022).

    [52] LIU F, CHEN D, GUAN Z et al. RemoteCLIP: A vision language foundation model for remote sensing. IEEE Transactions on Geoscience and Remote Sensing, 62, 1-16(2024).

    [53] ZHANG J, ZHOU Z et al. Text2Seg: Remote sensing image semantic segmentation via text-guided visual foundation models. arXiv Preprint(2023).

    [54] MAZUROWSKI M, DONG H, GU H et al. Segment anything model for medical image analysis: An experimental study. Medical Image Analysis, 89, 102918(2023).

    [55] LIU S, ZENG Z, REN T et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection, 38-55(2025).

    [56] LOBRY S, MARCOS D, MURRAY J et al. RSVQA: Visual question answering for remote sensing data. IEEE Transactions on Geoscience and Remote Sensing, 58, 8555-8566(2020).

    [57] AL RAHHAL M M, BAZI Y, ALSALEH S O et al. Open-ended remote sensing visual question answering with transformers. International Journal of Remote Sensing, 43, 6809-6823(2022).

    [58] BAZI Y, AL RAHHAL M M, MEKHALFI M L et al. Bimodal transformer-based approach for visual question answering in remote sensing imagery. IEEE Transactions on Geoscience and Remote Sensing, 60, 1-11(2022).

    [59] CHAPPUIS C, ZERMATTEN V, LOBRY S et al. Prompt-RSVQA: Prompting visual context to a language model for remote sensing visual question answering, 1372-1381(2022).

    [60] YUAN Z, MOU L, WANG Q et al. From easy to hard: Learning language-guided curriculum for visual question answering on remote sensing data. IEEE Transactions on Geoscience and Remote Sensing, 60, 1-11(2022).

    [61] LOBRY S, TUIA D. Visual question answering on remote sensing images. Advances in Machine Learning and Image Analysis for GeoAI, 237-254(2024).

    [62] MIKRIUKOV G, RAVANBAKHSH M, DEMIR B. Deep unsupervised contrastive hashing for large-scale cross-modal text-image retrieval in remote sensing. arXiv Preprint(2022).

    [63] HU Y, YUAN J, WEN C et al. RSGPT: A remote sensing vision language model and benchmark. arXiv Preprint(2023).

    [64] ZHAN Y, XIONG Z, YUAN Y. SkyEyeGPT: Unifying remote sensing vision-language tasks via instruction tuning with large language model. arXiv Preprint(2024).

    [65] WANG J, LIU F, JIAO L et al. SSCFNet: A spatial-spectral cross fusion network for remote sensing change detection. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 16, 4000-4012(2023).

    [66] LI K, CAO X, MENG D et al. A new learning paradigm for foundation model-based remote-sensing change detection. IEEE Transactions on Geoscience and Remote Sensing, 62, 1-12(2024).

    [67] WANG M, CHEN K, SU L et al. RSBuilding: Towards general remote sensing image building extraction and change detection with Foundation Model. IEEE Transactions on Geoscience and Remote Sensing, 62, 1-17(2024).

    [68] SUN X, WANG P, LU W et al. RingMo: A remote sensing foundation model with masked image modeling. IEEE Transactions on Geoscience and Remote Sensing, 61, 1-22(2023).

    [69] CHEN K, LIU C, CHEN H et al. RSPrompter: Learning to prompt for remote sensing instance segmentation based on visual foundation model. IEEE Transactions on Geoscience and Remote Sensing, 62, 3356074(2024).

    [70] HONG D, ZHANG B, LI X et al. SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46, 5227-5244(2024).

    [71] KUCKREJA K, DANISH M, NASEER M et al. GeoChat: Grounded large vision-language model for remote sensing. arXiv Preprint(2023).

    [72] WANG Y, ALBRECHT C, ZHU X. Self-supervised vision transformers for joint SAR-optical representation learning. arXiv Preprint(2022).

    [73] ZHANG W, CAI M, ZHANG T et al. EarthGPT: A universal multimodal large language model for multisensor image comprehension in remote sensing domain. IEEE Transactions on Geoscience and Remote Sensing, 62, 1-20(2024).

    [74] WANG Z, PRABHA R, HUANG T et al. SkyScript: A large and semantically diverse vision-language dataset for remote sensing. arXiv Preprint(2023).

    [75] LEVKOVICH I, ELYOSEPH Z. Suicide risk assessments through the eyes of ChatGPT-3.5 versus ChatGPT-4: Vignette study. JMIR Mental Health, 10(2023).

    [76] CORLEY I, ROBINSON C, DODHIA R et al. Revisiting pre-trained remote sensing model benchmarks: Resizing and normalization matters. arXiv Preprint(2023).

    [77] DONG R, YUAN S, LUO B et al. Building bridges across spatial and temporal resolutions: Reference-based super-resolution via change priors and conditional diffusion model. arXiv Preprint(2024).

    [78] ROMBACH R, BLATTMANN A, LORENZ D et al. High-resolution image synthesis with latent diffusion models, 10684-10695(2022).

    Paper Information

    Received: Jul. 21, 2024

    Accepted: --

    Published Online: May 22, 2025

    Corresponding Author Email: Zhigang PAN (zgpan@mail.ie.ac.cn)

    DOI:10.11873/j.issn.1004-0323.2025.1.0001
