Knowledge Extraction to Assist Scientific Discovery from Coronavirus Literature

Qingyun Wang, Xuan Wang, Manling Li, Nikolaus Parulian, Heng Ji, Jiawei Han (UIUC)
Shih-Fu Chang (Columbia)
Kyunghyun Cho (NYU)
Contact: hengji@illinois.edu


Download Knowledge Graphs


During this strange time, I'm trying to be useful in any way I can. One of our ongoing efforts aims to help scientists speed up scientific discovery. The framework consists of three steps, and we will post the results as soon as each step is finished:

1. Knowledge extraction from scientific papers about coronavirus (CORD-19 dataset)

  • 07/05: We wrote up a paper describing our current technical components, for reference.
  • 06/15: Updated the KGs at the same download link for 25,534 papers.
  • 04/10: Started finding connections between drugs and genes/chemicals related to COVID-19. Here are some examples.
  • 04/08: Added entity type paths in ontologies.
  • 04/01: Added event extraction results.
  • (1) Extract entities, relations, and events from text following the ontology introduced in the Comparative Toxicogenomics Database, and link entity mentions to external biomedical ontologies. See the source texts and the constructed knowledge graph, which includes 50,752 Gene nodes, 10,781 Disease nodes, 5,738 Chemical nodes, and 535 Organism nodes. These nodes are connected by 133 relation types, including Gene–Chemical Interaction Relationships, Chemical–Disease Associations, Gene–Disease Associations, Chemical–GO Enrichment Associations, and Chemical–Pathway Enrichment Associations. Entities also play roles in 13 event types: Gene expression, Transcription, Localization, Protein catabolism, Binding, Protein modification, Phosphorylation, Ubiquitination, Acetylation, Deacetylation, Regulation, Positive regulation, and Negative regulation. The Knowledge Graph result zip file contains the following files:
    • entity.zip: source texts and entity extraction results;
    • chem_gene_ixns_relation.csv: relation extraction results between chemicals, genes, and species;
    • chem_gene_ixn_types.csv: relation types;
    • chemicals_diseases_relation.csv: relation extraction results between chemicals and diseases;
    • genes_diseases_relation.csv: relation extraction results between genes and diseases;
    • chemicals.csv: a mapping among a chemical's name, id, and its parent ids;
    • genes.csv: a mapping among a gene's name, symbol, id, and its alternative id;
    • diseases.csv: a mapping among a disease's name, id, its parent ids, and its alternative id;
    • protein_event.zip: event extraction results.
  • (2) Extract knowledge from images, and perform cross-media fusion and inference with the results of (1).

2. Link prediction for new hypothesis generation and ranking.

3. Question answering for scientists to search related hypotheses and knowledge graphs, and provide evidence from source text and images.
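
As a rough illustration of how the relation files from step 1 might be explored, the sketch below loads two relation tables into adjacency maps and lists chemical → gene → disease connections. It is only a minimal sketch under stated assumptions, not the pipeline used in this project: the column names (`chemical`, `gene`, `disease`, `relation`) and all sample rows are hypothetical placeholders, so check the header row of the downloaded CSV files and substitute the real column names before use.

```python
import csv
import io
from collections import defaultdict

# Hypothetical in-memory stand-ins for chem_gene_ixns_relation.csv and
# genes_diseases_relation.csv. Column names and rows are placeholders,
# not the actual schema or content of the released files.
SAMPLE_CHEM_GENE = """chemical,gene,relation
chem_A,gene_X,rel_1
chem_B,gene_Y,rel_2
"""

SAMPLE_GENE_DISEASE = """gene,disease
gene_X,disease_1
gene_Y,disease_2
gene_X,disease_2
"""


def load_edges(handle, source_col, target_col):
    """Read a relation CSV and return a mapping: source -> set of targets."""
    edges = defaultdict(set)
    for row in csv.DictReader(handle):
        edges[row[source_col]].add(row[target_col])
    return edges


def two_hop_paths(chem_gene, gene_disease):
    """Return sorted (chemical, gene, disease) triples that share a gene."""
    paths = []
    for chem, genes in chem_gene.items():
        for gene in genes:
            # .get avoids creating empty entries in the defaultdict.
            for disease in gene_disease.get(gene, ()):
                paths.append((chem, gene, disease))
    return sorted(paths)


chem_gene = load_edges(io.StringIO(SAMPLE_CHEM_GENE), "chemical", "gene")
gene_disease = load_edges(io.StringIO(SAMPLE_GENE_DISEASE), "gene", "disease")
paths = two_hop_paths(chem_gene, gene_disease)

for chem, gene, disease in paths:
    print(f"{chem} -> {gene} -> {disease}")
```

To run this against the real files, replace each `io.StringIO(...)` with `open(path, newline="", encoding="utf-8")` and pass the actual column names. Such two-hop paths are only candidate connections for a scientist to inspect; they are not ranked hypotheses.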


COVID-19 Datasets

Latest Full Dataset (Update: ): document_parses, metadata, biorxiv, arxiv, comm_use_subset, non_comm_use_subset, custom_license

Data sources:
  • https://www.ncbi.nlm.nih.gov/research/pubtator/index.html
  • https://www.semanticscholar.org/cord19/download

Related Work:


Acknowledgement:

This research is based upon work supported in part by U.S. DARPA KAIROS Program No. FA8750-19-2-1004, U.S. DARPA AIDA Program # FA8750-18-2-0014, U.S. NSF No. 1741634, the Office of the Director of National Intelligence (ODNI), and Intelligence Advanced Research Projects Activity (IARPA) via contract FA8650-17-C-9116. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of DARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.


References:

  • Qingyun Wang, Manling Li, Xuan Wang, Nikolaus Parulian, Guangxing Han, Jiawei Ma, Jingxuan Tu, Ying Lin, Haoran Zhang, Weili Liu, Aabhas Chauhan, Yingjun Guan, Bangzheng Li, Ruisong Li, Xiangchen Song, Heng Ji, Jiawei Han, Shih-Fu Chang, James Pustejovsky, David Liem, Ahmed Elsayed, Martha Palmer, Jasmine Rah, Clare Voss, Cynthia Schneider, Boyan Onyshkevych. 2020. COVID-19 Literature Knowledge Graph Construction and Drug Repurposing Report Generation. Proc. The 2021 Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL-HLT2021) Demo Track.
  • Lifu Huang, Jonathan May, Xiaoman Pan, Heng Ji, Xiang Ren, Jiawei Han, Lin Zhao and James Hendler. 2017. Liberal Entity Extraction: Rapid Construction of Fine-Grained Entity Typing Systems. Big Data, Mar 2017, 5(1): 19-31.
  • Diya Li, Lifu Huang, Heng Ji and Jiawei Han. 2019. Biomedical Event Extraction based on Knowledge-driven Tree-LSTM. Proc. 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT2019).
  • Diya Li and Heng Ji. 2019. Syntax-aware Multi-task Graph Convolutional Networks for Biomedical Relation Extraction. Proc. EMNLP2019 Workshop on Health Text Mining and Information Analysis.
  • Qingyun Wang, Lifu Huang, Zhiying Jiang, Kevin Knight, Heng Ji, Mohit Bansal and Yi Luan. 2019. PaperRobot: Incremental Draft Generation of Scientific Ideas. Proc. The 57th Annual Meeting of the Association for Computational Linguistics (ACL2019).
  • Chih-Hsuan Wei, Alexis Allot, Robert Leaman, Zhiyong Lu. 2019. PubTator central: automated concept annotation for biomedical full text articles. Nucleic Acids Research, Volume 47, Issue W1, 02 July 2019, Pages W587–W593.
  • Jin Guang Zheng, Daniel Howsmon, Boliang Zhang, Juergen Hahn, Deborah McGuinness, James Hendler and Heng Ji. 2014. Entity Linking for Biomedical Literature. BMC Medical Informatics and Decision Making.