Cross-lingual Resources

About

We include several cross-lingual resources here, such as multilingual embeddings and parallel corpora.

Acknowledgement

This research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via contract FA8650-17-C-9116. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.

Multilingual Entity Linker for 282 Languages

Parallel Corpora

The parallel corpus files can be accessed here. Check the table below for the mapping between language pairs and filenames.

Filenames

Language I	Language II	Filename
Amharic	English	am-en.json
Arabic	English	ar-en.json
Bangla	English	bn-en.json
Chinese	English	zh-en.json
Hausa	English	ha-en.json
Hungarian	English	hu-en.json
Persian	English	fa-en.json
Russian	English	ru-en.json
Somali	English	so-en.json
Spanish	English	es-en.json
Tamil	English	ta-en.json
Thai	English	th-en.json
Turkish	English	tr-en.json
Uyghur	English	ug-en.json
Urdu	English	ur-en.json
Uzbek	English	uz-en.json
Vietnamese	English	vi-en.json
Yoruba	English	yo-en.json
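
The table above can be turned into a small lookup helper. The following is a minimal sketch, not part of the released resources: the names `PARALLEL_FILES`, `corpus_path`, and `load_corpus` are hypothetical, and the `root` directory is wherever you saved the downloaded files. The internal layout of each JSON file is not described here, so the sketch returns the parsed object unchanged instead of assuming a schema.

```python
import json
from pathlib import Path

# Language name -> filename, copied from the table above.
PARALLEL_FILES = {
    "Amharic": "am-en.json",
    "Arabic": "ar-en.json",
    "Bangla": "bn-en.json",
    "Chinese": "zh-en.json",
    "Hausa": "ha-en.json",
    "Hungarian": "hu-en.json",
    "Persian": "fa-en.json",
    "Russian": "ru-en.json",
    "Somali": "so-en.json",
    "Spanish": "es-en.json",
    "Tamil": "ta-en.json",
    "Thai": "th-en.json",
    "Turkish": "tr-en.json",
    "Uyghur": "ug-en.json",
    "Urdu": "ur-en.json",
    "Uzbek": "uz-en.json",
    "Vietnamese": "vi-en.json",
    "Yoruba": "yo-en.json",
}


def corpus_path(language, root="."):
    """Return the path of the <language>-English file under `root`."""
    try:
        return Path(root) / PARALLEL_FILES[language]
    except KeyError:
        raise ValueError(f"no parallel corpus listed for {language!r}") from None


def load_corpus(language, root="."):
    """Parse the JSON file and return it as-is (the schema is not assumed)."""
    with open(corpus_path(language, root), encoding="utf-8") as f:
        return json.load(f)
```

For example, `corpus_path("Hausa", "data")` points at `data/ha-en.json`; inspecting `type(load_corpus("Hausa", "data"))` is a reasonable first step before writing any code that depends on the file's structure.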

Embeddings

We provide aligned cross-lingual embeddings for over 200 languages. The complete set of embeddings is accessible HERE. Please refer to THIS PAGE for the mapping between acronyms and languages.
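
Because the embeddings are aligned, vectors from different languages can be compared directly. The sketch below illustrates the idea under a loud assumption: it treats each embedding file as word2vec-style text (an optional `count dim` header line, then one `token v1 v2 ...` line per word). This page does not state the file format, so check the downloaded files before relying on it. The helpers `read_vectors` and `cosine` are hypothetical names, and the vectors in the example are made-up toy values, not taken from the released embeddings.

```python
import math


def read_vectors(lines):
    """Parse 'token v1 v2 ...' lines into {token: [float, ...]}.

    Assumes word2vec-style text; a leading 'count dim' header is skipped.
    """
    vectors = {}
    for i, line in enumerate(lines):
        parts = line.rstrip().split(" ")
        if i == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue  # header line: vocabulary size and dimension
        if len(parts) < 2:
            continue  # blank or malformed line
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy data standing in for two aligned embedding files (values are invented).
en = read_vectors(["2 3", "dog 1.0 0.0 0.0", "cat 0.0 1.0 0.0"])
es = read_vectors(["perro 0.9 0.1 0.0"])

# In an aligned space, a translation pair should score higher than an
# unrelated pair.
print(cosine(en["dog"], es["perro"]) > cosine(en["cat"], es["perro"]))
```

With real files, pass an open file handle (for example `open(path, encoding="utf-8")`) to `read_vectors` in place of the toy lists.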