Tutorial:
What is the value of knowledge in the era of large-scale pretraining?
Knowledge-Driven Vision-Language Pretraining

AAAI 2023 Tutorial
Time: Feb 8, 2pm-6pm EST
Location: Room 201, Walter E. Washington Convention Center, Washington DC, USA
Zoom: https://underline.io/events/389/sessions?eventSessionId=14162
Manling Li, Xudong Lin, Jie Lei, Mohit Bansal, Heng Ji, Shih-Fu Chang
Contact: manling2@illinois.edu, xudong.lin@columbia.edu, jielei@cs.unc.edu, mbansal@cs.unc.edu, hengji@illinois.edu, sc250@columbia.edu

About

Recent years have witnessed the great success of vision-language (V+L) pretraining models in multimedia applications, achieved by learning the alignments between vision and text. Understanding entity knowledge (i.e., objects and object types) is a fundamental ability for a wide variety of V+L tasks, such as image captioning and visual question answering. These tasks also require the capability to understand relational knowledge (i.e., scene graphs), which can further support compositional visual question answering, scene graph parsing, etc. On top of that, event knowledge (i.e., event types, actions, activities) with event argument structures (i.e., the entities involved and their semantic roles) is critical to support cognition-level visual understanding, such as visual commonsense reasoning, situation recognition, action recognition and human-object interaction. To track status changes of events and entities, procedural knowledge is induced for video question answering, action recognition, action segmentation, action localization, action prediction and procedural planning. Besides explicitly acquired structured knowledge, the knowledge in language models can also benefit vision-language pretraining. Adding knowledge into vision-language pretraining therefore poses two key challenges: obtaining knowledge at multiple levels, and encoding the structure and semantics of that knowledge.

Figure 1: In this tutorial, we will present advanced vision-language methods that incorporate knowledge from a variety of sources.

In this tutorial, we will comprehensively review existing paradigms for multimedia knowledge discovery and encoding, and focus on their contributions to vision-language pretraining. We categorize the knowledge into internal self-knowledge and external knowledge. Internal knowledge is extracted from the text and vision modalities, such as structured entities, relations, events, and event procedures. We will focus on the structural aspects of this knowledge and address two key challenges: the acquisition of knowledge and the encoding of structure across multiple modalities. External knowledge can be obtained from knowledge bases or language models, and we will exemplify its use in assisting commonsense understanding of vision modalities, with a focus on the temporal and cognitive aspects. The objective of this tutorial is to introduce participants to recent trends and emerging challenges in knowledge-driven vision-language research, to share learning resources and tools for obtaining ready-to-use models, and to prompt thorough discussion of the impact of structured knowledge on text and vision learning.

Tutorial Schedule

Session	Duration	Time	Presenter	Link
Motivation	15min	2:00-2:15	Manling	Slides, Videos (to appear)
Factual Knowledge: Information about Instances	35min	2:15-2:50	Manling	Slides, Videos (to appear)
Common Knowledge: Commonsense Knowledge	10min	2:50-3:00	Manling	Slides, Videos (to appear)
Common Knowledge: Procedural Knowledge	35min	3:00-3:35	Xudong	Slides, Videos (to appear)
Break	25min	3:35-4:00
Model Knowledge: Knowledge from Language Models and Other Models	30min	4:00-4:30	Jie	Slides, Videos (to appear)
Panel: Knowledge vs Large Models - Does knowledge still have value in the era of large-scale pretraining?	30min	4:30-5:00	Mohit, Shih-Fu, Heng, Jie, Manling (Moderator)	Videos (to appear)
Panel: What is missing in cross-modal knowledge learning?	15min	5:00-5:15	Shih-Fu, Mohit, Xudong, Manling (Moderator)	Videos (to appear)
Panel: Open Challenges?	15min	5:15-5:30	Heng, Shih-Fu, Mohit, Manling (Moderator)	Videos (to appear)
Q&A	30min	5:30-6:00	Shih-Fu, Heng, Mohit, Xudong, Jie, Manling	Videos (to appear)