MolT5: Translation between Molecules and Natural Language

Carl Edwards, Tuan Lai, Kevin Ros, Garrett Honke, Kyunghyun Cho, Heng Ji

Please email Carl Edwards if you experience any technical issues using our software or need further information.


We present MolT5 - a self-supervised learning framework for pretraining models on a vast amount of unlabeled natural language text and molecule strings. MolT5 allows for new, useful, and challenging analogs of traditional vision-language tasks, such as molecule captioning and text-based de novo molecule generation (altogether: translation between molecules and language), which we explore for the first time. Since MolT5 pretrains models on single-modal data, it helps overcome the chemistry domain shortcoming of data scarcity. Furthermore, we consider several metrics, including a new cross-modal embedding-based metric, to evaluate the tasks of molecule captioning and text-based molecule generation. Our results show that MolT5-based models are able to generate outputs, both molecules and captions, which in many cases are high quality.

Figure: An example of both the image captioning task (Chen et al., 2015) and molecule captioning. Molecule captioning is considerably more difficult because of the increased linguistic variety in possible captions.


All code and data for MolT5 can be accessed and downloaded at



We provide evaluation code for these new tasks on GitHub.


The requirements for the evaluation code conda environment are in environment_eval.yml. An environment can be created using the following commands:
conda env create -n MolTextTranslationEval -f environment_eval.yml python=3.9
conda activate MolTextTranslationEval
python -m spacy download en_core_web_sm
pip install git+

Required Downloads for Text2Mol Metric

  • should be placed in "evaluation/t2m_output". It can be downloaded using curl -L --output
If GitHub LFS fails:

Input format

The input should be a tab-separated .txt file with three columns and one of the following header lines:

'SMILES	ground truth	output'
for smiles2caption or
'description	ground truth	output'
for caption2smiles. Please see example input in the code repository.
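As a rough illustration of the format described above, the following sketch writes and reads a three-column, tab-separated file with the smiles2caption header. The helper names (`write_predictions`, `read_predictions`), the output file name, and the placeholder captions are assumptions made for this example; they are not part of the evaluation code.

```python
import os
import tempfile

# Header for smiles2caption as given above; for caption2smiles the first
# column would be "description" instead of "SMILES".
HEADER = ("SMILES", "ground truth", "output")


def write_predictions(path, rows, header=HEADER):
    """Write (input, ground truth, output) triples as tab-separated text."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\t".join(header) + "\n")
        for row in rows:
            if len(row) != len(header):
                raise ValueError(f"expected {len(header)} columns, got {len(row)}")
            if any("\t" in field or "\n" in field for field in row):
                raise ValueError("fields must not contain tabs or newlines")
            f.write("\t".join(row) + "\n")


def read_predictions(path):
    """Return (header, rows) from a tab-separated predictions file."""
    with open(path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f if line.strip()]
    header = tuple(lines[0].split("\t"))
    rows = [tuple(line.split("\t")) for line in lines[1:]]
    return header, rows


if __name__ == "__main__":
    # Hypothetical file name and placeholder captions, for illustration only.
    path = os.path.join(tempfile.mkdtemp(), "my_smiles2caption_output.txt")
    rows = [("C1=CC2=C(C(=C1)[O-])NC(=CC2=O)C(=O)O",
             "reference caption goes here",
             "generated caption goes here")]
    write_predictions(path, rows)
    print(read_predictions(path))
```

Rejecting tabs and newlines inside fields keeps every record on a single line with exactly three columns, which is what a simple tab-split reader expects.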

Evaluation Commands

Evaluating SMILES to Caption

Code | Evaluation
python --input_file smiles2caption_example.txt | Evaluate all NLG metrics.
python --input_file smiles2caption_example.txt | Evaluate Text2Mol metric for caption generation.
python --use_gt | Evaluate Text2Mol metric for the ground truth.

Evaluating Caption to SMILES

Code | Evaluation
python --input_file caption2smiles_example.txt | Evaluate BLEU, Exact match, and Levenshtein metrics.
python --input_file caption2smiles_example.txt | Evaluate fingerprint metrics.
./ caption2smiles_example.txt | Evaluate Text2Mol metric for molecule generation.
python --use_gt | Evaluate Text2Mol metric for the ground truth.
python --input_file caption2smiles_example.txt | Evaluate FCD metric for molecule generation.
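To give a feel for two of the string-level metrics named above for caption2smiles, here is a minimal sketch of exact match and Levenshtein (edit) distance over pairs of ground-truth and generated SMILES strings. It is not the repository's evaluation script: the function names are hypothetical, and it assumes the strings are compared as-is, without any canonicalization.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def string_metrics(pairs):
    """Average exact match and Levenshtein distance over (ground truth, output) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no pairs to evaluate")
    exact = sum(gt == out for gt, out in pairs) / len(pairs)
    distance = sum(levenshtein(gt, out) for gt, out in pairs) / len(pairs)
    return {"exact_match": exact, "levenshtein": distance}


if __name__ == "__main__":
    # Toy SMILES pairs, for illustration only.
    print(string_metrics([("CCO", "CCO"), ("CCO", "CCN")]))
```

Higher exact match is better, while lower Levenshtein distance is better. Two different SMILES strings can encode the same molecule, so string metrics alone can understate quality, which is one reason to also look at fingerprint-based and embedding-based metrics.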

HuggingFace model checkpoints

All of our HuggingFace checkpoints are located here.

Pretrained MolT5-based checkpoints include:

You can also easily find our fine-tuned caption2smiles and smiles2caption models. For example, molt5-large-smiles2caption is a molt5-large model that has been further fine-tuned for the task of molecule captioning (i.e., smiles2caption).

Example usage for molecule captioning (i.e., smiles2caption):
# Requires the transformers and sentencepiece packages.
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the checkpoint fine-tuned for molecule captioning.
tokenizer = T5Tokenizer.from_pretrained("laituan245/molt5-large-smiles2caption", model_max_length=512)
model = T5ForConditionalGeneration.from_pretrained("laituan245/molt5-large-smiles2caption")

# The input is a SMILES string; the output is a natural language caption.
input_text = "C1=CC2=C(C(=C1)[O-])NC(=CC2=O)C(=O)O"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

# Decode with beam search and print the best caption.
outputs = model.generate(input_ids, num_beams=5, max_length=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Example usage for molecule generation (i.e., caption2smiles):
# Requires the transformers and sentencepiece packages.
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the checkpoint fine-tuned for text-based molecule generation.
tokenizer = T5Tokenizer.from_pretrained("laituan245/molt5-large-caption2smiles", model_max_length=512)
model = T5ForConditionalGeneration.from_pretrained("laituan245/molt5-large-caption2smiles")

# The input is a natural language description; the output is a SMILES string.
input_text = "The molecule is a monomethoxybenzene that is 2-methoxyphenol substituted by a hydroxymethyl group at position 4. It has a role as a plant metabolite. It is a member of guaiacols and a member of benzyl alcohols."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

# Decode with beam search and print the best SMILES string.
outputs = model.generate(input_ids, num_beams=5, max_length=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))


We would like to thank Martin Burke for his helpful discussion. This research is based upon work supported by the Molecule Maker Lab Institute: an AI research institute program supported by NSF under award No. 2019897 and No. 2034562. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.


Carl Edwards, Tuan Lai, Kevin Ros, Garrett Honke, Kyunghyun Cho, Heng Ji. Translation between Molecules and Natural Language. EMNLP 2022.

@inproceedings{edwards-etal-2022-translation,
    title = "Translation between Molecules and Natural Language",
    author = "Edwards, Carl and Lai, Tuan and Ros, Kevin and Honke, Garrett and Cho, Kyunghyun and Ji, Heng",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "",
    doi = "10.18653/v1/2022.emnlp-main.26",
    pages = "375--413",
    abstract = "We present MolT5 - a self-supervised learning framework for pretraining models on a vast amount of unlabeled natural language text and molecule strings. MolT5 allows for new, useful, and challenging analogs of traditional vision-language tasks, such as molecule captioning and text-based de novo molecule generation (altogether: translation between molecules and language), which we explore for the first time. Since MolT5 pretrains models on single-modal data, it helps overcome the chemistry domain shortcoming of data scarcity. Furthermore, we consider several metrics, including a new cross-modal embedding-based metric, to evaluate the tasks of molecule captioning and text-based molecule generation. Our results show that MolT5-based models are able to generate outputs, both molecules and captions, which in many cases are high quality.",
}