Abstract
High-throughput screening has made it possible to analyse a wide range of conditions and generate large data sets in a short time. Existing machine learning techniques can process such data and identify correlations, but elucidating causal relationships remains a challenge. In this project, a metabolic model of Trichoderma reesei RUT-C30 was combined with machine learning to develop two machine learning (ML) models: a linear elastic net model and a multilayer neural network model. Simulating the metabolic model under the growth conditions and using the simulation results as input data to the ML models provided preliminary insights into causal relationships between T. reesei total protein concentration and different growth conditions.
To construct the ML models, growth and total protein concentration data were collected by high-throughput screening on Phenotype MicroArray™ (PM) plates pre-filled with different carbon and nitrogen sources. In addition, cultivations were performed in pH-adjusted media with different combinations of carbon and nitrogen sources. Growth under similar conditions was then simulated with a metabolic model of T. reesei RUT-C30 by flux balance analysis using the COBRApy Python package. Linear and multilayer neural network models relating the simulated metabolic fluxes to the measured protein concentration were constructed using algorithms from the scikit-learn library and from the TensorFlow and Keras Python libraries, respectively. The scalability of the observed relationships between protein production and environmental conditions was tested with bioreactor cultivations in 1 L reactors using selected nutrient sources. Total protein concentration and biomass (as dry weight) were determined from the cultures. The bioreactor conditions were simulated with the metabolic model, and the protein concentrations predicted by the ML models were compared with the measured total protein concentrations. The correlation between predicted and measured protein concentration was 0.748 for the elastic net model and 0.787 for the neural network model. The correlation between the two models was 0.890, i.e., their predictive abilities were similar. Given the differences in growth conditions between small-scale and bioreactor cultures, the correlations were surprisingly high. Thus, active metabolic pathways appear to be an important factor in determining protein production.
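The flux balance analysis step described above can be illustrated on a toy network. The following is a minimal sketch, not the thesis workflow: it assumes a hypothetical three-reaction network and solves the underlying linear program with SciPy's `linprog` rather than with COBRApy and the T. reesei RUT-C30 model. The function name `toy_fba`, the parameter `uptake_limit`, the stoichiometry, and the flux bounds are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog


def toy_fba(uptake_limit: float) -> np.ndarray:
    """Maximise the biomass flux of a hypothetical 3-reaction network.

    Reactions (columns): v0 uptake (-> A), v1 conversion (A -> B),
    v2 biomass drain (B ->). Metabolites (rows): A, B.
    """
    # Stoichiometric matrix S; the steady-state constraint is S @ v = 0.
    S = np.array([
        [1.0, -1.0, 0.0],   # A: produced by v0, consumed by v1
        [0.0, 1.0, -1.0],   # B: produced by v1, consumed by v2
    ])
    # linprog minimises, so the objective is negated to maximise v2.
    c = np.array([0.0, 0.0, -1.0])
    # Assumed flux bounds: the growth condition enters through the uptake bound.
    bounds = [(0.0, uptake_limit), (0.0, 1000.0), (0.0, 1000.0)]
    result = linprog(c, A_eq=S, b_eq=np.zeros(2), bounds=bounds, method="highs")
    if not result.success:
        raise RuntimeError(result.message)
    return result.x


# Simulated flux vector for one hypothetical growth condition.
fluxes = toy_fba(uptake_limit=10.0)
print(fluxes)
```

In this toy network the uptake bound limits every flux, so changing `uptake_limit` changes the whole simulated flux vector. Flux vectors of this kind, one per growth condition, are what would serve as input features for the ML models.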
According to both models, the most important factors positively affecting protein production were the transport reactions of glutamate and trehalose and the metabolic fluxes involved in amino acid and carbohydrate metabolism, such as glycolysis. Uncovering causal relationships would make it possible to select targets whose modification could improve protein production and make bioprocesses more cost-effective. Protein production in cells is important because proteins are widely required in various industries, such as food, textiles, biofuels, and pharmaceuticals.
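How a linear model on simulated fluxes can be fitted, evaluated by correlation, and inspected for influential features is sketched below. This is a minimal illustration on synthetic data, not the thesis data, features, or hyperparameters: the feature matrix, the coefficient values, the train/test split, `alpha`, `l1_ratio`, and the function name `fit_and_evaluate` are all assumptions. It covers only the elastic net model, not the neural network.

```python
import numpy as np
from sklearn.linear_model import ElasticNet


def fit_and_evaluate(seed: int = 0):
    """Fit an elastic net on synthetic 'flux' features and score it."""
    rng = np.random.default_rng(seed)

    # Synthetic stand-in for simulated fluxes: 60 conditions x 8 fluxes.
    n_samples, n_fluxes = 60, 8
    X = rng.normal(size=(n_samples, n_fluxes))

    # Assumed ground truth: only the first two fluxes affect the response.
    true_coef = np.array([2.0, -1.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
    y = X @ true_coef + 0.1 * rng.normal(size=n_samples)

    # Hold out the last 20 conditions for evaluation.
    X_train, X_test = X[:40], X[40:]
    y_train, y_test = y[:40], y[40:]

    # Elastic net combines L1 and L2 penalties; these values are assumptions.
    model = ElasticNet(alpha=0.05, l1_ratio=0.5)
    model.fit(X_train, y_train)

    # Pearson correlation between predicted and held-out values.
    predicted = model.predict(X_test)
    r = np.corrcoef(predicted, y_test)[0, 1]

    # Feature indices ordered by absolute coefficient, largest first.
    ranking = np.argsort(-np.abs(model.coef_))
    return model, r, ranking


model, r, ranking = fit_and_evaluate()
print(f"correlation on held-out data: {r:.3f}")
print("features ranked by |coefficient|:", ranking)
```

Ranking features by coefficient magnitude shows which inputs the linear model relies on most. A positive coefficient indicates a positive association with the response in the fitted model, which on its own does not establish causality.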
Original language | English |
---|---|
Awarding Institution | |
Supervisors/Advisors | |
Award date | 29 Apr 2024 |
Publisher | |
Publication status | Published - 2024 |
MoE publication type | G2 Master's thesis, polytechnic Master's thesis |
Keywords
- Trichoderma reesei
- RUT-C30
- machine learning
- metabolic modelling