Tutorial: Knowledge-Driven Vision-Language Encoding

CVPR 2023 Tutorial.
Time: Jun 19, 9am-12:30pm PST
Location: East 8, Vancouver Convention Centre, Vancouver, Canada
Zoom: https://cvpr2023.thecvf.com/virtual/2023/tutorial/18576
Presenters: Manling Li, Xudong Lin, Jie Lei
Panelists: Mohit Bansal, Carl Vondrick, Heng Ji, Shih-Fu Chang
Contact: manling2@illinois.edu, xudong.lin@columbia.edu, jielei@cs.unc.edu, mbansal@cs.unc.edu, hengji@illinois.edu, sc250@columbia.edu


About

Recent years have witnessed the great success of vision-language (V+L) pretraining models in multimedia applications, achieved by learning the alignments between vision and text. The understanding of entity knowledge (i.e., objects and object types) is a fundamental ability for a wide variety of V+L tasks, such as image captioning and visual question answering. These tasks also require the capability of understanding relational knowledge (i.e., scene graphs), which can further support compositional visual question answering, scene graph parsing, etc. On top of that, event knowledge (i.e., event types, actions, activities) with event argument structures (i.e., the entities involved and their semantic roles) is critical to support cognition-level visual understanding, such as visual commonsense reasoning, situation recognition, action recognition, and human-object interaction. To track status changes of events and entities, procedural knowledge is induced for video question answering, action recognition, action segmentation, action localization, action prediction, and procedural planning. Beyond explicitly extracted structured knowledge, the implicit knowledge in language models can also benefit vision-language pretraining. Consequently, adding knowledge into vision-language pretraining poses two key challenges: obtaining knowledge at multiple levels, and encoding the structure and semantics of that knowledge.


Figure 1: In this tutorial, we will present advanced vision-language methods that incorporate knowledge from a variety of sources.

In this tutorial, we will comprehensively review existing paradigms for multimedia knowledge discovery and encoding, focusing on their contributions to vision-language pretraining. We categorize the knowledge into internal self-knowledge and external knowledge. Internal knowledge is extracted from the text and vision modalities, such as structured entities, relations, events, and event procedures. We will focus on the structural aspects of this knowledge and address two key challenges: acquiring knowledge and encoding structure across multiple modalities. External knowledge can be obtained from knowledge bases or language models, and we will exemplify its use in assisting commonsense understanding of the vision modality, with a focus on temporal and cognitive aspects. The objective of this tutorial is to introduce participants to recent trends and emerging challenges in knowledge-driven vision-language research, to provide learning resources and tools for obtaining ready-to-use models, and to prompt thorough discussion of the impact of structured knowledge on text and vision learning.



Tutorial Schedule


Session | Duration | Time | Presenter | Link
Motivation | 15min | 9:00-9:15 | Manling | Slides
Factual Knowledge: Information about Instances | 30min | 9:15-9:45 | Manling | Slides
Common Knowledge: Commonsense Knowledge | 15min | 9:45-10:00 | Manling | Slides
Common Knowledge: Procedural Knowledge | 30min | 10:00-10:30 | Xudong | Slides
Model Knowledge: Knowledge from Language Models and Other Models | 30min | 10:30-11:00 | Jie | Slides
Panel: Explicit Knowledge vs Implicit Knowledge - Does knowledge still have value in the era of large-scale pretraining? | 15min | 11:00-11:15 | Mohit, Carl, Xudong, Manling (Moderator) |
Panel: LLMs for Multimodality - What abilities can be borrowed from the language space? | 15min | 11:15-11:30 | Mohit, Carl, Jie, Manling (Moderator) |
Panel: What is missing in cross-modal knowledge learning? | 15min | 11:30-11:45 | Carl, Mohit, Xudong, Manling (Moderator) |
Panel: Open Challenges | 15min | 11:45-12:00 | Carl, Mohit, Xudong, Jie, Manling (Moderator) |
Q&A | 30min | 12:00-12:30 | Shih-Fu, Heng, Mohit, Xudong, Jie, Manling |

Materials

Reading List | Description


Presenters


Panelists


Contact


Please contact Manling Li (manling2@illinois.edu) if you have any questions.