Vision-Language Models for Multimedia Applications:

From Foundations to State-of-the-Art

ACM Multimedia Asia 2024 Tutorial

December 3 - 6, 2024

Auckland, New Zealand

Introduction

Vision-Language Models (VLMs) are revolutionizing the multimedia landscape by seamlessly integrating visual and textual data for a wide range of applications, such as image captioning, Visual Question Answering (VQA), and multimodal retrieval. This tutorial will explore both foundational and state-of-the-art VLMs, providing attendees with a deep understanding of how these models function and how they can be applied effectively.

Participants will explore the evolution of VLMs from classical architectures like CNNs and RNNs to cutting-edge transformer-based models such as CLIP and BLIP. The tutorial will also focus on key challenges such as scaling these models, optimizing their performance, and improving their interpretability for real-world multimedia applications.
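
As a small taste of the transformer-based models covered in the later sessions, the sketch below shows how CLIP scores an image against a set of candidate captions in a shared embedding space, the core idea behind zero-shot multimodal retrieval. It is a minimal illustration using the Hugging Face transformers library; the checkpoint name, image URL, and prompts are illustrative assumptions rather than part of the tutorial materials.

```python
# Minimal sketch: zero-shot image-text matching with CLIP
# (checkpoint, image URL, and captions are placeholder choices).
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Any RGB image works; this COCO validation image is just an example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Candidate captions; CLIP embeds image and text in a shared space
# and scores their similarity, enabling zero-shot retrieval.
texts = ["a photo of two cats", "a photo of a dog", "a city skyline at night"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-to-caption similarity scores.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(texts, probs[0].tolist()):
    print(f"{p:.3f}  {caption}")
```

The same image-text similarity mechanism underlies the retrieval and captioning applications discussed in the tutorial.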


Speaker

Dr. Yanbin Liu

Lecturer (Assistant Professor), Auckland University of Technology

Dr. Yanbin Liu is an expert in deep learning and Vision-Language Models, with a focus on their application in multimedia systems. He has published over 30 high-impact research papers in top-tier venues, including CVPR, ICCV, ECCV, and ICLR, amassing over 1,400 citations. Dr. Liu’s research interests center on the integration of visual and textual data, AI-driven content generation, and multimedia retrieval. He has served as Area Chair for ACM Multimedia 2024 and AJCAI 2024, and is a two-time recipient of the CVPR Outstanding Reviewer Award (2021, 2024).


Schedule

Session | Time
Session 1: Introduction to Vision-Language Modeling | 09:00 AM - 09:45 AM
Session 2: Vision-Language Modeling Using Deep Learning | 09:45 AM - 10:30 AM
Morning Tea | 10:30 AM - 10:45 AM
Session 3: Recent Advances in Vision-Language Models | 10:45 AM - 11:30 AM
Session 4: Challenges and Future Directions | 11:30 AM - 12:15 PM