Language Models for PHA design (LaMP): Inverse PHA design from property prediction to conditional molecule generation

Research output: Thesis › Master's thesis

Abstract

Due to the significant increase in the use of petroleum-based plastics and their harmful impact on the environment, replacing them with more environmentally responsible and sustainable alternatives has become increasingly necessary. To address this materials development need, polyhydroxyalkanoates (PHAs), a class of biosynthetic, biodegradable polymers, have received great attention as a sustainable alternative to petroleum-based plastics. With over 160 identified monomers, PHAs offer diverse possibilities for designing bioplastics tailored to specific applications. Their thermal, mechanical, and chemical properties, coupled with biodegradability and biocompatibility, make PHAs an attractive alternative.

However, despite their enormous potential, the number of possible PHA monomer combinations is so great that it is almost impossible to investigate these compounds case by case. Moreover, given the vast chemical space, it is difficult to design and develop new PHA-based polymers with targeted properties for a wide range of applications. One possible solution is inverse molecular design, which uses deep learning techniques for de novo generation of molecules, starting from the desired properties. The SMILES (Simplified Molecular Input Line Entry System) notation represents molecules as strings, allowing pretrained large-scale language models to be applied to molecular design.
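
To illustrate how a SMILES string becomes a token sequence that a language model can consume, here is a minimal sketch in Python. The regular expression is a simplified pattern chosen for this illustration, the function name `tokenize_smiles` is hypothetical, and the example molecule (3-hydroxybutyric acid, a common PHA building block) is chosen for illustration. The thesis may use a different tokenizer, for example the subword tokenizers that ship with RoBERTa and GPT2.

```python
import re

# Simplified SMILES token pattern (an assumption for illustration, not the
# thesis's tokenizer): bracket atoms, two-letter halogens, organic-subset
# atoms, aromatic atoms, ring-closure labels, then bond/branch symbols.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|[NOSPFI]|[bcnosp]|%[0-9]{2}|[0-9]|[()=#\-+\\/:~@?>*$.])"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into a list of atom, bond, and branch tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Reject input containing characters the pattern does not cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


# 3-hydroxybutyric acid written as SMILES.
print(tokenize_smiles("CC(O)CC(=O)O"))
```

Each token can then be mapped to an integer id, which is the form of input that sequence models operate on.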

This thesis introduces Language Models for PHA design (LaMP), a set of language models developed to discover new PHAs whose properties suit specific applications. Pretrained language models such as RoBERTa and GPT2 are fine-tuned to predict the properties of PHAs, GPT2 is used to generate molecules, and finally the prediction and generation models are combined in a conditional variational autoencoder (cVAE) structure to build an end-to-end model for conditional generation. Our results show that the language models are capable of learning the structure-property relationships of PHAs and of generating new PHAs based on targeted properties.
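
For orientation, the standard training objective of a conditional VAE is the conditional evidence lower bound shown below. The notation is an assumption made for this sketch: $x$ is a molecule (for example a SMILES string), $c$ is the vector of target properties, $z$ is the latent variable, and $\theta$, $\phi$ are decoder and encoder parameters. The abstract does not state the exact loss used in the thesis, so this is the textbook form rather than the author's formulation.

```latex
\log p_\theta(x \mid c) \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x, c)}\!\left[ \log p_\theta(x \mid z, c) \right]
  \;-\; D_{\mathrm{KL}}\!\left( q_\phi(z \mid x, c) \,\|\, p(z \mid c) \right)
```

The first term rewards accurate reconstruction of the molecule given the latent code and the condition. The second keeps the encoder's posterior close to the prior, so that new molecules can be sampled for a chosen $c$ at generation time.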
Original language: English
Qualification: Master Degree
Awarding Institution
  • Aalto University
Supervisors/Advisors
  • Laukkanen, Anssi, Advisor
  • Garg, Vikas Kumar, Supervisor, External person
Award date: 22 Jan 2024
Publisher
Publication status: Published - 22 Jan 2024
MoE publication type: G2 Master's thesis, polytechnic Master's thesis
