PREPRINT

This article is a preprint and has not been peer-reviewed.
Results reported in preprints should not be presented in the media as verified information.
A Review of Multimodal Vision-Language Models: Foundations, Applications, and Future Directions
2025-11-01

Large Language Models (LLMs) have rapidly become a central focus of both research and practical applications, owing to their remarkable ability to understand and generate text with a fluency comparable to human communication. Recently, these models have evolved into multimodal large language models (MM-LLMs), extending their capabilities beyond text to images, audio, and video. This advancement has enabled a wide array of applications, including text-to-video synthesis, image captioning, and text-to-speech systems. MM-LLMs are developed either by augmenting existing LLMs with multimodal functionality or by designing multimodal architectures from the ground up. This paper presents a comprehensive review of the current landscape of LLMs with multimodal capabilities, highlighting both foundational and cutting-edge MM-LLMs. It traces the historical development of LLMs, emphasizing the transformative impact of transformer-based architectures such as OpenAI's GPT series and Google's BERT, as well as the role of attention mechanisms in improving model performance. The review also examines key strategies for adapting pre-trained models to specific tasks, including fine-tuning and prompt engineering. Ethical challenges, including data bias and the potential for misuse, are discussed to stress the importance of responsible AI deployment. Finally, we explore the implications of open-source versus proprietary models for advancing research in this field. By synthesizing these insights, this paper underscores the significant potential of MM-LLMs to reshape diverse applications across multiple domains.

Citation:

Singh G. 2025. A Review of Multimodal Vision-Language Models: Foundations, Applications, and Future Directions. PREPRINTS.RU. https://doi.org/10.24108/preprints-3113823

References