TY - JOUR
T1 - Machine learning enables identification of an alternative yeast galactose utilization pathway
AU - Harrison, Marie Claire
AU - Ubbelohde, Emily J.
AU - LaBella, Abigail L.
AU - Opulente, Dana A.
AU - Wolters, John F.
AU - Zhou, Xiaofan
AU - Shen, Xing Xing
AU - Groenewald, Marizeth
AU - Hittinger, Chris Todd
AU - Rokas, Antonis
N1 - Publisher Copyright:
© 2024 the Author(s).
PY - 2024/4/30
Y1 - 2024/4/30
N2 - How genomic differences contribute to phenotypic differences is a major question in biology. The recently characterized genomes, isolation environments, and qualitative patterns of growth on 122 sources and conditions of 1,154 strains from 1,049 fungal species (nearly all known) in the yeast subphylum Saccharomycotina provide a powerful, yet complex, dataset for addressing this question. We used a random forest algorithm trained on these genomic, metabolic, and environmental data to predict growth on several carbon sources with high accuracy. Known structural genes involved in assimilation of these sources and presence/absence patterns of growth in other sources were important features contributing to prediction accuracy. By further examining growth on galactose, we found that it can be predicted with high accuracy from either genomic (92.2%) or growth data (82.6%) but not from isolation environment data (65.6%). Prediction accuracy was even higher (93.3%) when we combined genomic and growth data. After the GALactose utilization genes, the most important feature for predicting growth on galactose was growth on galactitol, raising the hypothesis that several species in two orders, Serinales and Pichiales (containing the emerging pathogen Candida auris and the genus Ogataea, respectively), have an alternative galactose utilization pathway because they lack the GAL genes. Growth and biochemical assays confirmed that several of these species utilize galactose through an alternative oxidoreductive D-galactose pathway, rather than the canonical GAL pathway. Machine learning approaches are powerful for investigating the evolution of the yeast genotype–phenotype map, and their application will uncover novel biology, even in well-studied traits.
AB - How genomic differences contribute to phenotypic differences is a major question in biology. The recently characterized genomes, isolation environments, and qualitative patterns of growth on 122 sources and conditions of 1,154 strains from 1,049 fungal species (nearly all known) in the yeast subphylum Saccharomycotina provide a powerful, yet complex, dataset for addressing this question. We used a random forest algorithm trained on these genomic, metabolic, and environmental data to predict growth on several carbon sources with high accuracy. Known structural genes involved in assimilation of these sources and presence/absence patterns of growth in other sources were important features contributing to prediction accuracy. By further examining growth on galactose, we found that it can be predicted with high accuracy from either genomic (92.2%) or growth data (82.6%) but not from isolation environment data (65.6%). Prediction accuracy was even higher (93.3%) when we combined genomic and growth data. After the GALactose utilization genes, the most important feature for predicting growth on galactose was growth on galactitol, raising the hypothesis that several species in two orders, Serinales and Pichiales (containing the emerging pathogen Candida auris and the genus Ogataea, respectively), have an alternative galactose utilization pathway because they lack the GAL genes. Growth and biochemical assays confirmed that several of these species utilize galactose through an alternative oxidoreductive D-galactose pathway, rather than the canonical GAL pathway. Machine learning approaches are powerful for investigating the evolution of the yeast genotype–phenotype map, and their application will uncover novel biology, even in well-studied traits.
KW - AI
KW - fungal evolution
KW - GAL pathway
KW - galactitol
KW - primary metabolism
UR - http://www.scopus.com/inward/record.url?scp=85191619709&partnerID=8YFLogxK
U2 - 10.1073/pnas.2315314121
DO - 10.1073/pnas.2315314121
M3 - Article
C2 - 38669185
AN - SCOPUS:85191619709
SN - 0027-8424
VL - 121
JO - Proceedings of the National Academy of Sciences of the United States of America
JF - Proceedings of the National Academy of Sciences of the United States of America
IS - 18
M1 - e2315314121
ER -