Tutorial: Knowledge-Driven Vision-Language Pretraining

* Preprint. Under review.
Manling Li, Xudong Lin, Jie Lei, Mohit Bansal, Heng Ji, Shih-Fu Chang
Contact: manling2@illinois.edu, xudong.lin@columbia.edu, jielei@cs.unc.edu, mbansal@cs.unc.edu, hengji@illinois.edu, sc250@columbia.edu


Recent years have witnessed the great success of vision-language (V+L) pretraining models in multimedia applications, achieved by learning the alignments between vision and text. Understanding entity knowledge (i.e., objects and object types) is a fundamental ability for a wide variety of V+L tasks, such as image captioning and visual question answering. These tasks also require the capability to understand relational knowledge (i.e., scene graphs), which further supports compositional visual question answering, scene graph parsing, etc. On top of that, event knowledge (i.e., event types, actions, activities) with event argument structures (i.e., the entities involved and their semantic roles) is critical to support cognition-level visual understanding, such as visual commonsense reasoning, situation recognition, action recognition, and human-object interaction. To track status changes of events and entities, procedural knowledge is induced for video question answering, action recognition, action segmentation, action localization, action prediction, and procedural planning. Instead of explicitly acquiring structured knowledge, vision-language pretraining can also benefit from the knowledge in language models. Adding knowledge into vision-language pretraining therefore poses two key challenges: obtaining knowledge at multiple levels, and encoding the structure and semantics of that knowledge.

Figure 1: In this tutorial, we will present advanced vision-language methods that incorporate knowledge from a variety of sources.

In this tutorial, we will comprehensively review existing paradigms for multimedia knowledge discovery and encoding, focusing on their contributions to vision-language pretraining. We categorize the knowledge into internal self-knowledge and external knowledge. Internal knowledge is extracted from the text and vision modalities, such as structured entities, relations, events, and event procedures. We will focus on the structural aspects of this knowledge and address two key challenges: acquiring knowledge and encoding its structure across multiple modalities. External knowledge can be obtained from knowledge bases or language models, and we will exemplify its use in assisting commonsense understanding of vision modalities, with a focus on the temporal and cognitive aspects. The objective of this tutorial is to introduce participants to recent trends and emerging challenges in knowledge-driven vision-language research, and to provide learning resources and tools for obtaining ready-to-use models, prompting thorough discussion of the impact of structured knowledge on text and vision learning.

Tutorial Schedule

Session | Time | Presenter | Link
Motivation | 10 min | Manling | Slides, Videos (to appear)
Knowledge Element: Entities, Relations and Events | 35 min | Manling | Slides, Videos (to appear)
Procedural Knowledge | 35 min | Xudong | Slides, Videos (to appear)
Knowledge Distillation from Language Models and Other Models | 30 min | Jie | Slides, Videos (to appear)
Remaining Challenges | 10 min | Manling | Slides, Videos (to appear)
Panel Discussions about Future Directions | 30 min | Shih-Fu, Heng, Mohit, Manling, Xudong, Jie | Videos (to appear)


Reading List | Description (Under Review)



Please contact Manling Li (manling2@illinois.edu) if you have any questions.