M2E2: Multimedia Event Extraction
Manling Li*, Alireza Zareian*, Qi Zeng, Spencer Whitehead, Di Lu, Heng Ji, Shih-Fu Chang (* equal contribution)
Traditional event extraction methods target a single modality, such as text, images, or videos. However, contemporary journalism distributes news through multimedia content. Motivated by the complementary and holistic nature of multimedia data, we propose MultiMedia Event Extraction (M2E2), a new task that aims to jointly extract events and arguments from multiple modalities.
The input of Multimedia Event Extraction is a multimedia news document, which consists of a set of images and a set of sentences. The objective is twofold:
- Event Extraction aims to identify a set of events and classify them into pre-defined event types. Each event is grounded on a text trigger word (i.e., a text-only event), an image (i.e., an image-only event), or both (i.e., a multimedia event). For example, there are three movement.transport events in Figure 1. The first two are text-only events, with "visited" and "visit" as event mentions. The third movement.transport event is a multimedia event, with "deploy" as the text event mention and the image as the visual event mention.
- Event Argument Extraction extracts a set of arguments for each event. Each argument is classified into an argument role type, and is grounded on a text entity (represented as a text span), an image object (represented as a bounding box), or both. Take the last movement.transport event in Figure 1 as an example. The two bounding boxes are arguments of type vehicle, the text entity "United States" is the agent, and the text entity "soldiers" is the artifact.
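The task definition above can be sketched as a data structure. The following is a hypothetical schema, with illustrative class and field names (not the dataset's actual annotation format): each event carries an optional text trigger and/or image grounding, and each argument is grounded on a text span, a bounding box, or both.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Argument:
    """One argument of an event; illustrative schema, not the official format."""
    role: str                                                 # e.g., "agent", "artifact", "vehicle"
    text_span: Optional[Tuple[int, int]] = None               # character offsets in the sentence
    bounding_box: Optional[Tuple[int, int, int, int]] = None  # (x1, y1, x2, y2) in the image

@dataclass
class Event:
    event_type: str                      # e.g., "movement.transport"
    trigger: Optional[str] = None        # text trigger word, if grounded in text
    image_id: Optional[str] = None       # image identifier, if grounded in an image
    arguments: List[Argument] = field(default_factory=list)

    @property
    def modality(self) -> str:
        # An event grounded in both modalities is a multimedia event.
        if self.trigger and self.image_id:
            return "multimedia"
        return "text-only" if self.trigger else "image-only"

# The third movement.transport event from Figure 1 (spans and boxes are made up):
event = Event(
    event_type="movement.transport",
    trigger="deploy",
    image_id="img_0",
    arguments=[
        Argument(role="agent", text_span=(0, 13)),       # "United States"
        Argument(role="artifact", text_span=(20, 28)),   # "soldiers"
        Argument(role="vehicle", bounding_box=(10, 20, 200, 180)),
    ],
)
print(event.modality)  # multimedia
```

A text-only event such as the "visited" mention would simply omit `image_id`, and an image-only event would omit `trigger`.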
We construct the first benchmark and evaluation dataset for this task, which consists of 245 fully annotated news articles.
We propose a novel method, Weakly Aligned Structured Embedding (WASE), that encodes structured representations of semantic information from textual and visual data into a common embedding space.
As shown in Figure 2, the training phase contains three tasks: text event extraction, visual situation recognition, and cross-media alignment. We learn a cross-media shared encoder, a shared event classifier, and a shared argument classifier. In the testing phase, given a multimedia news article, we encode the sentences and images into the structured common space, and jointly extract textual and visual events and arguments, followed by cross-modal coreference resolution.
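The three-task training setup can be illustrated with a toy multi-task loop. This is a minimal sketch only: the three loss functions below are scalar placeholders standing in for text event extraction, visual situation recognition, and cross-media alignment, and the shared parameters are reduced to a single number, not the paper's actual objectives or model.

```python
# Toy multi-task training: one shared parameter pulled on by three task losses.
# Each quadratic loss is a placeholder for one of the three training tasks.
def text_event_loss(p):    # stand-in for the text event extraction objective
    return (p - 1.0) ** 2

def situation_loss(p):     # stand-in for the visual situation recognition objective
    return (p - 2.0) ** 2

def alignment_loss(p):     # stand-in for the cross-media alignment objective
    return (p - 3.0) ** 2

params = 0.0   # shared encoder/classifier parameters, reduced to a scalar toy
lr = 0.1
for step in range(200):
    # Gradient of the summed multi-task objective (analytic for the toy losses).
    grad = 2 * (params - 1.0) + 2 * (params - 2.0) + 2 * (params - 3.0)
    params -= lr * grad

# The shared parameters settle at a compromise among the three tasks.
print(round(params, 2))  # 2.0
```

The point of the toy is the sharing: because one parameter serves all three losses, gradient descent converges to a value that trades off all three tasks, which is the mechanism that lets a shared encoder and shared classifiers transfer between modalities.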
To construct the structured common space, we represent each image or sentence as a graph, where each node represents an event or entity and each edge represents an argument role. The node and edge embeddings lie in a multimedia common semantic space, as they are trained to resolve event co-reference across modalities and to match images with relevant sentences. This enables us to jointly classify events and argument roles from both modalities.
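The graph-in-common-space idea can be sketched as follows. This is an illustrative toy, not the WASE model: `embed` is a deterministic pseudo-embedding standing in for the learned encoders, and the graph labels are made up to mirror the Figure 1 example.

```python
import numpy as np

DIM = 8

def embed(label: str) -> np.ndarray:
    # Placeholder for a learned encoder: a deterministic unit vector per label.
    seed = int.from_bytes(label.encode("utf-8"), "little") % (2**32)
    v = np.random.default_rng(seed).standard_normal(DIM)
    return v / np.linalg.norm(v)

# A graph per modality: typed nodes (events/entities) and role-labeled edges.
text_graph = {
    "nodes": {"deploy": "event", "United States": "entity", "soldiers": "entity"},
    "edges": [("deploy", "agent", "United States"), ("deploy", "artifact", "soldiers")],
}
image_graph = {
    "nodes": {"img_event": "event", "box_1": "entity", "box_2": "entity"},
    "edges": [("img_event", "vehicle", "box_1"), ("img_event", "vehicle", "box_2")],
}

def node_embeddings(graph):
    # All nodes, textual or visual, are mapped into the same DIM-d space.
    return {name: embed(name) for name in graph["nodes"]}

t_emb = node_embeddings(text_graph)
v_emb = node_embeddings(image_graph)

# Because both modalities share one space, cross-modal event coreference can
# be scored with a plain dot product between candidate event nodes.
score = float(t_emb["deploy"] @ v_emb["img_event"])
```

In the real model the embeddings are trained so that coreferential events across modalities score high, which is what makes a single shared event classifier and argument classifier usable for both sentences and images.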
This research is based upon work supported in part by U.S. DARPA AIDA Program No. FA8750-18-2-0014 and KAIROS Program No. FA8750-19-2-1004. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of DARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.
Manling Li, Alireza Zareian, Qi Zeng, Spencer Whitehead, Di Lu, Heng Ji, Shih-Fu Chang. 2020. Cross-media Structured Common Space for Multimedia Event Extraction. Proceedings of The 58th Annual Meeting of the Association for Computational Linguistics.