Tutorial:
What is the value of knowledge in the era of large-scale pretraining?
Knowledge-Driven Vision-Language Pretraining
AAAI 2023 Tutorial
Time: Feb 8, 2pm-6pm EST
Location: Room 201, Walter E. Washington Convention Center, Washington DC, USA
Zoom: https://underline.io/events/389/sessions?eventSessionId=14162
Manling Li, Xudong Lin, Jie Lei, Mohit Bansal, Heng Ji, Shih-Fu Chang
Contact: manling2@illinois.edu, xudong.lin@columbia.edu, jielei@cs.unc.edu, mbansal@cs.unc.edu, hengji@illinois.edu, sc250@columbia.edu
About
Recent years have witnessed the great success of vision-language (V+L) pretraining models in multimedia applications, achieved by learning alignments between vision and text. Understanding entity knowledge (i.e., objects and object types) is a fundamental capability required by a wide variety of V+L tasks, such as image captioning and visual question answering. These tasks also require understanding relational knowledge (i.e., scene graphs), which further supports compositional visual question answering, scene graph parsing, etc. On top of that, event knowledge (i.e., event types, actions, activities), together with event argument structures (i.e., the entities involved and their semantic roles), is critical to cognition-level visual understanding, such as visual commonsense reasoning, situation recognition, action recognition, and human-object interaction. To track the status changes of events and entities, procedural knowledge is induced to support video question answering, action recognition, action segmentation, action localization, action prediction, and procedural planning. Beyond explicitly acquired structured knowledge, the implicit knowledge in language models can also benefit vision-language pretraining. Consequently, adding knowledge to vision-language pretraining poses two key challenges: obtaining knowledge at multiple levels, and encoding the structure and semantics of that knowledge.

In this tutorial, we will comprehensively review existing paradigms for multimedia knowledge discovery and encoding, focusing on their contributions to vision-language pretraining. We categorize knowledge into internal self-knowledge and external knowledge. Internal knowledge is extracted from the text and vision modalities, in the form of structured entities, relations, events, and event procedures. We will focus on the structural aspects of this knowledge and address two key challenges: acquiring knowledge and encoding its structure across multiple modalities. External knowledge can be obtained from knowledge bases or language models, and we will exemplify its use in supporting commonsense understanding of the vision modalities, with a focus on temporal and cognitive aspects. The objective of this tutorial is to introduce participants to recent trends and emerging challenges in knowledge-driven vision-language research, to provide learning resources and tools for obtaining ready-to-use models, and to prompt thorough discussion of the impact of structured knowledge on text and vision learning.
Tutorial Schedule
Session | Duration | Time | Presenter | Link |
---|---|---|---|---|
Motivation | 15min | 2:00-2:15 | Manling | Slides, Videos (to appear) |
Factual Knowledge: Information about Instances | 35min | 2:15-2:50 | Manling | Slides, Videos (to appear) |
Common Knowledge: Commonsense Knowledge | 10min | 2:50-3:00 | Manling | Slides, Videos (to appear) |
Common Knowledge: Procedural Knowledge | 35min | 3:00-3:35 | Xudong | Slides, Videos (to appear) |
Break | 25min | 3:35-4:00 | | |
Model Knowledge: Knowledge from Language Models and Other Models | 30min | 4:00-4:30 | Jie | Slides, Videos (to appear) |
Panel: Knowledge vs. Large Models - Does knowledge still have value in the era of large-scale pretraining? | 30min | 4:30-5:00 | Mohit, Shih-Fu, Heng, Jie, Manling (Moderator) | Videos (to appear) |
Panel: What is missing in cross-modal knowledge learning? | 15min | 5:00-5:15 | Shih-Fu, Mohit, Xudong, Manling (Moderator) | Videos (to appear) |
Panel: Open Challenges | 15min | 5:15-5:30 | Heng, Shih-Fu, Mohit, Manling (Moderator) | Videos (to appear) |
Q&A | 30min | 5:30-6:00 | Shih-Fu, Heng, Mohit, Xudong, Jie, Manling | Videos (to appear) |
Materials
Presenters
- Manling Li, UIUC
- Xudong Lin, Columbia University
- Jie Lei, UNC; Meta AI (incoming)
- Mohit Bansal, UNC
- Heng Ji, UIUC
- Shih-Fu Chang, Columbia University
Contact
Please contact Manling Li (manling2@illinois.edu) if you have any questions.