M2E2: Multimedia Event Extraction

Manling Li*, Alireza Zareian*, Qi Zeng, Spencer Whitehead, Di Lu, Heng Ji, Shih-Fu Chang (* equal contribution)
Contact: manling2@illinois.edu, az2407@columbia.edu, hengji@illinois.edu, sc250@columbia.edu


Traditional event extraction methods target a single modality, such as text, images, or videos. However, contemporary journalism distributes news through multimedia. Motivated by the complementary and holistic nature of multimedia data, we propose MultiMedia Event Extraction (M2E2), a new task that aims to jointly extract events and arguments from multiple modalities.

Figure 1: An example of Multimedia Event Extraction. An event mention and some event arguments (Agent and Person) are extracted from text, while the vehicle arguments can only be extracted from the image.

Task Definition

The input of Multimedia Event Extraction is a multimedia news document, which consists of a set of images and a set of sentences. The objective is twofold:

  • Event Extraction aims to identify a set of events and classify them into pre-defined event types. Each event is grounded on a text trigger word (i.e., a text-only event), an image (i.e., an image-only event), or both (i.e., a multimedia event). For example, there are three movement.transport events in Figure 1. The first two are text-only events, with "visited" and "visit" as event mentions. The third movement.transport event is a multimedia event, with "deploy" as the text event mention and the image as the visual event mention.
  • Event Argument Extraction extracts a set of arguments for each event. Each argument is classified into an argument role type and is grounded on a text entity (represented as a text span), an image object (represented as a bounding box), or both. Take the last movement.transport event in Figure 1 as an example. The two bounding boxes are arguments of type vehicle, the text entity "United States" is the agent, and the text entity "soldiers" is the artifact.
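
The two objectives above can be pictured as a small data structure. The following is a minimal sketch, not the dataset's actual annotation schema: the class names `Event` and `Argument`, their field names, the image identifier, and the bounding-box coordinates are all assumptions made for illustration. Only the event type, trigger words, and role names come from the Figure 1 example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Argument:
    role: str                                         # e.g. "agent", "artifact", "vehicle"
    text: Optional[str] = None                        # text entity (surface form of the span)
    bbox: Optional[Tuple[int, int, int, int]] = None  # image object as (x1, y1, x2, y2)


@dataclass
class Event:
    event_type: str                                   # e.g. "movement.transport"
    text_trigger: Optional[str] = None                # trigger word, if grounded in text
    image_id: Optional[str] = None                    # image, if grounded visually
    arguments: List[Argument] = field(default_factory=list)

    def modality(self) -> str:
        """Classify the event by the modalities it is grounded on."""
        if self.text_trigger is not None and self.image_id is not None:
            return "multimedia"
        if self.text_trigger is not None:
            return "text-only"
        if self.image_id is not None:
            return "image-only"
        raise ValueError("an event needs a text trigger, an image, or both")


# The third movement.transport event of Figure 1.
# The image id and box coordinates are placeholders, not real annotations.
deploy = Event(
    event_type="movement.transport",
    text_trigger="deploy",
    image_id="image_1",
    arguments=[
        Argument(role="agent", text="United States"),
        Argument(role="artifact", text="soldiers"),
        Argument(role="vehicle", bbox=(10, 40, 120, 160)),
        Argument(role="vehicle", bbox=(150, 35, 260, 170)),
    ],
)
visited = Event(event_type="movement.transport", text_trigger="visited")

print(deploy.modality())   # multimedia
print(visited.modality())  # text-only
```

Keeping the text trigger and the image as optional fields on one record lets a single type cover all three grounding cases, with the modality derived rather than stored.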


We construct the first benchmark and evaluation dataset for this task, which consists of 245 fully annotated news articles.

Download M2E2        Annotation Guideline (v0.1)

Our Approach

We propose a novel method, Weakly Aligned Structured Embedding (WASE), that encodes structured representations of semantic information from textual and visual data into a common embedding space.

Figure 2: Approach overview. During training (left), we jointly train three tasks to establish a cross-media structured embedding space. During testing (right), we jointly extract events and arguments from multimedia articles.

As shown in Figure 2, the training phase contains three tasks: text event extraction, visual situation recognition, and cross-media alignment. We learn a cross-media shared encoder, a shared event classifier, and a shared argument classifier. In the testing phase, given a multimedia news article, we encode the sentences and images into the structured common space, and jointly extract textual and visual events and arguments, followed by cross-modal coreference resolution.
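
To make the final cross-modal coreference step concrete, here is a rough sketch of one plausible linking rule. This page does not spell out the rule, so everything here is an assumption for illustration: the `Mention` type, the `link_mentions` function, the same-type-plus-similarity-threshold criterion, the threshold value, and the similarity scores are all hypothetical.

```python
from typing import Dict, List, NamedTuple, Tuple


class Mention(NamedTuple):
    mention_id: str   # trigger word or image identifier
    event_type: str   # predicted event type


def link_mentions(
    text_mentions: List[Mention],
    image_mentions: List[Mention],
    similarity: Dict[Tuple[str, str], float],
    threshold: float = 0.5,  # assumed value, not taken from the paper
) -> List[Tuple[str, str]]:
    """Pair text and image mentions that share a type and are similar enough."""
    links = []
    for t in text_mentions:
        for v in image_mentions:
            score = similarity.get((t.mention_id, v.mention_id), 0.0)
            if t.event_type == v.event_type and score >= threshold:
                links.append((t.mention_id, v.mention_id))
    return links


# Mentions loosely modeled on Figure 1; the scores are made up.
text_mentions = [
    Mention("visited", "movement.transport"),
    Mention("visit", "movement.transport"),
    Mention("deploy", "movement.transport"),
]
image_mentions = [Mention("image_1", "movement.transport")]
similarity = {
    ("visited", "image_1"): 0.2,
    ("visit", "image_1"): 0.1,
    ("deploy", "image_1"): 0.8,
}

links = link_mentions(text_mentions, image_mentions, similarity)
print(links)  # [('deploy', 'image_1')]
```

Under these assumed scores only "deploy" is merged with the image into a multimedia event, while "visited" and "visit" stay text-only.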

Figure 3: Multimedia structured common space construction. Red pixels represent the attention heatmap.

To construct the structured common space, we represent each image or sentence as a graph, where each node represents an event or entity and each edge represents an argument role. The node and edge embeddings are represented in a multimedia common semantic space, as they are trained to resolve event coreference across modalities and to match images with relevant sentences. This enables us to jointly classify events and argument roles from both modalities.
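
The graph representation described above can be sketched as follows. This is a toy illustration, not the WASE implementation: `StructureGraph`, `cosine`, and `best_match` are hypothetical names, the three-dimensional vectors are hand-picked rather than learned, and nearest-neighbor matching by cosine similarity is a stand-in for whatever comparison the trained model performs in the common space.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class StructureGraph:
    """Graph for one sentence or one image."""
    nodes: Dict[str, List[float]] = field(default_factory=dict)      # node name -> embedding
    edges: List[Tuple[str, str, str]] = field(default_factory=list)  # (event, role, entity)

    def add_node(self, name: str, embedding: List[float]) -> None:
        self.nodes[name] = embedding

    def add_edge(self, event: str, role: str, entity: str) -> None:
        if event not in self.nodes or entity not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.append((event, role, entity))


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def best_match(query: List[float], graph: StructureGraph) -> str:
    """Name of the node in `graph` whose embedding is closest to `query`."""
    return max(graph.nodes, key=lambda name: cosine(query, graph.nodes[name]))


# Sentence graph: an event node linked to its entity nodes by role edges.
sentence = StructureGraph()
sentence.add_node("deploy", [0.9, 0.1, 0.0])
sentence.add_node("United States", [0.1, 0.9, 0.0])
sentence.add_node("soldiers", [0.0, 0.8, 0.6])
sentence.add_edge("deploy", "agent", "United States")
sentence.add_edge("deploy", "artifact", "soldiers")

# Image graph: an event node linked to detected objects (bounding boxes).
image = StructureGraph()
image.add_node("image_event", [0.8, 0.2, 0.1])
image.add_node("box_1", [0.1, 0.2, 0.9])
image.add_node("box_2", [0.0, 0.3, 0.9])
image.add_edge("image_event", "vehicle", "box_1")
image.add_edge("image_event", "vehicle", "box_2")

# Because both graphs live in one space, a text node can be compared to image nodes.
print(best_match(sentence.nodes["deploy"], image))  # image_event
```

Storing both modalities in the same node-and-edge structure is what allows one comparison function, and by extension one classifier, to serve text and images alike.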

  WASE Paper           WASE Code  


This research is based upon work supported in part by U.S. DARPA AIDA Program No. FA8750-18-2-0014 and KAIROS Program No. FA8750-19-2-1004. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of DARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.


Manling Li, Alireza Zareian, Qi Zeng, Spencer Whitehead, Di Lu, Heng Ji, Shih-Fu Chang. 2020. Cross-media Structured Common Space for Multimedia Event Extraction. Proceedings of The 58th Annual Meeting of the Association for Computational Linguistics.

@inproceedings{li2020multimediaevent,
  title={Cross-media Structured Common Space for Multimedia Event Extraction},
  author={Manling Li and Alireza Zareian and Qi Zeng and Spencer Whitehead and Di Lu and Heng Ji and Shih-Fu Chang},
  booktitle={Proceedings of The 58th Annual Meeting of the Association for Computational Linguistics},
  year={2020}
}