MolT5: Translation between Molecules and Natural Language

Carl Edwards, Tuan Lai, Kevin Ros, Garrett Honke, Kyunghyun Cho, Heng Ji
Contact: hengji@illinois.edu, cne2@illinois.edu

Please email Carl Edwards if you experience any technical issues using our software or need further information.


About

We present MolT5 - a self-supervised learning framework for pretraining models on a vast amount of unlabeled natural language text and molecule strings. MolT5 allows for new, useful, and challenging analogs of traditional vision-language tasks, such as molecule captioning and text-based de novo molecule generation (altogether: translation between molecules and language), which we explore for the first time. Since MolT5 pretrains models on single-modal data, it helps overcome the chemistry domain shortcoming of data scarcity. Furthermore, we consider several metrics, including a new cross-modal embedding-based metric, to evaluate the tasks of molecule captioning and text-based molecule generation. Our results show that MolT5-based models are able to generate outputs, both molecules and captions, which in many cases are high quality.


Figure: An example of both the image captioning task (Chen et al., 2015) and molecule captioning. Molecule captioning is considerably more difficult because of the increased linguistic variety in possible captions.


Software

All code and data for MolT5 can be accessed and downloaded at https://github.com/blender-nlp/MolT5.


Datasets

Evaluation

We provide evaluation code for these new tasks on GitHub.

Installation

The requirements for the evaluation code conda environment are in environment_eval.yml. An environment can be created using the following commands:
conda env create -n MolTextTranslationEval -f environment_eval.yml python=3.9
conda activate MolTextTranslationEval
python -m spacy download en_core_web_sm
pip install git+https://github.com/samoturk/mol2vec
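Once the environment is active, a quick sanity check along these lines (an illustrative sketch, assuming RDKit is provided by environment_eval.yml) can confirm that the main dependencies are importable:

import spacy
from rdkit import Chem
from mol2vec import features  # installed from the samoturk/mol2vec repository above

nlp = spacy.load("en_core_web_sm")   # verifies the downloaded spaCy model loads
mol = Chem.MolFromSmiles("CCO")      # verifies RDKit can parse a simple SMILES
print(mol is not None)               # expected output: True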

Required Downloads for Text2Mol Metric

  • test_outputfinal_weights.320.pt should be placed in "evaluation/t2m_output". It can be downloaded using curl -L https://uofi.box.com/shared/static/es16alnhzfy1hpagf55fu48k49f8n29x --output test_outputfinal_weights.320.pt

Input format

The input format should be a tab-separated txt file with three columns and the header

'SMILES	ground truth	output'
for smiles2caption or
'description	ground truth	output'
for caption2smiles. Please see the example input files in the code repository.
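For illustration, a conforming smiles2caption input file could be produced with a few lines of Python; the SMILES and caption strings below are hypothetical placeholders rather than entries from the dataset:

import csv

# Hypothetical rows: (SMILES, ground-truth caption, model-generated caption).
rows = [
    ("CCO", "The molecule is ethanol.", "The molecule is a primary alcohol."),
]

with open("smiles2caption_input.txt", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["SMILES", "ground truth", "output"])  # required header
    writer.writerows(rows)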

Evaluation Commands

Evaluating SMILES to Caption
  • Evaluate all NLG metrics: python text_translation_metrics.py --input_file smiles2caption_example.txt
  • Evaluate the Text2Mol metric for caption generation: python text_text2mol_metric.py --input_file smiles2caption_example.txt
  • Evaluate the Text2Mol metric for the ground truth: python text_text2mol_metric.py --use_gt

Evaluating Caption to SMILES
  • Evaluate BLEU, exact match, and Levenshtein metrics: python mol_translation_metrics.py --input_file caption2smiles_example.txt
  • Evaluate fingerprint metrics: python fingerprint_metrics.py --input_file caption2smiles_example.txt
  • Evaluate the Text2Mol metric for molecule generation: ./mol_text2mol_metric.sh caption2smiles_example.txt
  • Evaluate the Text2Mol metric for the ground truth: python mol_text2mol_metric.py --use_gt
  • Evaluate the FCD metric for molecule generation: python fcd_metric.py --input_file caption2smiles_example.txt
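As a rough illustration of what the molecule-side metrics capture, the sketch below computes a canonical-SMILES exact match and a Morgan-fingerprint Tanimoto similarity with RDKit for a single hypothetical ground-truth/output pair; it is a simplified stand-in, not the repository's implementation:

from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

gt_smiles = "CCO"   # hypothetical ground-truth SMILES
out_smiles = "OCC"  # hypothetical model output

gt_mol = Chem.MolFromSmiles(gt_smiles)
out_mol = Chem.MolFromSmiles(out_smiles)

# Exact match after canonicalization (robust to different atom orderings).
exact_match = Chem.MolToSmiles(gt_mol) == Chem.MolToSmiles(out_mol)

# Tanimoto similarity between Morgan (radius-2) fingerprints.
fp_gt = AllChem.GetMorganFingerprintAsBitVect(gt_mol, 2, nBits=2048)
fp_out = AllChem.GetMorganFingerprintAsBitVect(out_mol, 2, nBits=2048)
tanimoto = DataStructs.TanimotoSimilarity(fp_gt, fp_out)

print(exact_match, tanimoto)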

HuggingFace model checkpoints

All of our HuggingFace checkpoints are available under the laituan245 organization on the HuggingFace Hub (https://huggingface.co/laituan245).

Pretrained MolT5-based checkpoints include:

  • molt5-small
  • molt5-base
  • molt5-large

You can also easily find our fine-tuned caption2smiles and smiles2caption models. For example, molt5-large-smiles2caption is a molt5-large model that has been further fine-tuned for the task of molecule captioning (i.e., smiles2caption).

Example usage for molecule captioning (i.e., smiles2caption):
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("laituan245/molt5-large-smiles2caption", model_max_length=512)
model = T5ForConditionalGeneration.from_pretrained('laituan245/molt5-large-smiles2caption')

# SMILES string of the molecule to be captioned
input_text = 'C1=CC2=C(C(=C1)[O-])NC(=CC2=O)C(=O)O'
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
outputs = model.generate(input_ids, num_beams=5, max_length=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Example usage for molecule generation (i.e., caption2smiles):
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("laituan245/molt5-large-caption2smiles", model_max_length=512)
model = T5ForConditionalGeneration.from_pretrained('laituan245/molt5-large-caption2smiles')

# Natural-language description of the target molecule
input_text = 'The molecule is a monomethoxybenzene that is 2-methoxyphenol substituted by a hydroxymethyl group at position 4. It has a role as a plant metabolite. It is a member of guaiacols and a member of benzyl alcohols.'
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids, num_beams=5, max_length=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
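Because generated strings are not guaranteed to be valid SMILES, it can be worth checking the decoded output with RDKit before using it downstream. Continuing the example above (an optional add-on, not part of the original snippet):

from rdkit import Chem

# 'outputs' and 'tokenizer' come from the caption2smiles example above.
generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
mol = Chem.MolFromSmiles(generated)
if mol is not None:
    print("Valid SMILES (canonical form):", Chem.MolToSmiles(mol))
else:
    print("Output is not a parseable SMILES string:", generated)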

Acknowledgement

We would like to thank Martin Burke for his helpful discussion. This research is based upon work supported by the Molecule Maker Lab Institute: an AI research institute program supported by NSF under award No. 2019897 and No. 2034562. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.


References

Carl Edwards, Tuan Lai, Kevin Ros, Garrett Honke, Kyunghyun Cho, Heng Ji. Translation between Molecules and Natural Language. EMNLP 2022.

@inproceedings{edwards-etal-2022-translation,
    title = "Translation between Molecules and Natural Language",
    author = "Edwards, Carl and Lai, Tuan and Ros, Kevin and Honke, Garrett and Cho, Kyunghyun and Ji, Heng",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.emnlp-main.26",
    doi = "10.18653/v1/2022.emnlp-main.26",
    pages = "375--413",
}