TbasCO: Trait-based Comparative ’Omics Identifies Ecosystem-Level and Niche- Differentiating Adaptations of an Engineered Microbiome

A grand challenge in microbial ecology is disentangling the traits of individual populations within complex communities. Various cultivation-independent approaches have been used to infer traits based on the presence of marker genes. However, marker genes are not linked to traits with complete fidelity, nor do they capture important attributes, such as the timing of expression or coordination among traits. To address this, we present an approach for assessing the trait landscape of microbial communities by statistically defining a trait attribute as shared transcriptional pattern across multiple organisms. Leveraging the KEGG pathway database as a trait library and the Enhanced Biological Phosphorus Removal (EBPR) model microbial ecosystem, we demonstrate that a majority (65%) of traits present in 10 or more genomes have niche-differentiating expression attributes. For example, while 14 genomes containing the high-affinity phosphorus transporter pstABCS display a canonical attribute (e.g. up-regulation under phosphorus starvation), we identified another attribute shared by 11 genomes where transcription was highest under high phosphorus conditions. Taken together, we provide a novel framework for revealing hidden metabolic versatility when investigating genomic data alone by assigning trait-attributes through genome-resolved time-series metatranscriptomics.

INTRODUCTION 52 8 random selection, or the highest scoring pair. In the latter case, a correction for multiple 167 testing is implemented. This process is repeated N-times, where N equals the number of 168 genes in any given trait. The background distribution for traits is determined by first 169 randomly sampling two genomes, identifying the overlap in annotations, and finally 170 artificially defining a trait containing N annotations. For each annotation in the trait, the 171 distances are calculated between genome A and genome B, as described in the previous 172 section. As modules vary in size, this process is repeated for traits of different sizes. 173

Identifying Attributes 174
TbasCO provides both a cluster-based and pair-wise approach to identify 175 attributes. In both methods, the distance between expression patterns of a trait between 176 two genomes is first calculated based on a composite Z score of the PC and NRED for 177 each gene composing the trait. In the cluster-based analysis, the distances are 178 subsequently clustered using the Louvain clustering algorithm to identify trait attributes. 179 To determine if an attribute is significantly similar or not, a one-sided T-test between the 180 attribute and the random background distribution of traits is conducted. This is done for 181 both cluster-based and model-based comparisons. Many traits are complex and 182 represented in databases such as KEGG by numerous alternative routes. To deal with 183 this complexity, each pathway is expanded into the Disjunctive Normative Form (DNF). 184 Due to the extremely high number of DNFs for some traits, attributes are pruned based 185 on a strict requirement of 100% completion. 186

Distance Calculations 187
To determine the similarity in expression patterns between genes, two distance 188 metrics are calculated: the PC between RNAseq counts across samples, and the NRED, 189 where ranks are a measure of relative abundance of RNA in each sample, normalized 190 the abundance of RNA in the corresponding genome. These distance scores are 191 converted to Z scores using a background distribution of distances between randomly 192 sampled genes as previously described. To determine statistically significant similarities 193 between the expression patterns of a trait between two genomes, a composite distance 194 score is calculated based on the distance between genes in two different genomes. For 195 each of these genes the PC and NRED are calculated and transformed to Z scores and 196 combined as (-1*PC + NRED). The distance of the trait between two genomes is defined 197 as the average of these composite distance scores, and then normalized by the Jaccard 198 distance between these genomes. 199  Table 1, Supplementary Figures 2-4). 217 genes that together form a trait (Figure 1). First, our results showed that there is 309 statistically significant transcriptional conservation of genes at the community level; genes 310 that share an annotation were significantly more similar than expected using two different 311 distance metrics (NRED: p-value < 2.2e-16, PC: p-value < 2.2e-16). Extending this 312 statistical analysis to the trait level, we identified 1674 attributes distributed across the 66 313 genomes. On average, we identified 9.12 genomes per attribute (SD -5.22), with a 314 minimum of 3 genomes and a maximum of 35 ( Figure 3A). Based on these statistics, we 315 defined redundant attributes as those two standard deviations above the mean (19 316 genomes). With this cutoff applied, we identified 79 redundant trait attributes mostly 317 belonging to pathways among carbohydrate metabolism, purine metabolism, and fatty 318 acid metabolism categories (Table 2). Of 290 traits, we identified 97 traits with two or 319 more attributes identified (33%). Of these, traits in 10 or more genomes were twice as 320 likely to have two or more attributes (65%), suggesting that divergent expression patterns 321 for a trait are common, and may represent a niche-differentiating feature ( Figure 3A). 322

Data and Code Availability
Henceforth, when multiple attributes are identified for a trait, we refer to these as niche-323 differentiating attributes. 324 From the ecosystem perspective, a clear phylogenetic signal is observed in the 325 distribution of attributes, as genomes cluster together by shared trait attributes by phylum 15 and Proteobacteria clustering together, respectively ( Figure 3B). For simplicity, we filtered 328 the network to only include nodes with more than 5 connections. Highly redundant trait 329 attributes belonged to modules in the lipid metabolism, energy metabolism, and 330 nucleotide metabolism KEGG functional categories. In contrast, more specialized trait 331 attributes on the periphery of the network or amongst group-specific clusters such as 332 within the Actinobacteria or subsets of the Proteobacteria belonged to amino acid 333 metabolism, biosynthesis of terpenoids and polyketides, metabolism of cofactors and 334 vitamins, and carbohydrate metabolism KEGG modules. Pathways of note that showed 335 a high level of redundancy include the TCA cycle, isoleucine biosynthesis, acyl-CoA 336 synthesis, threonine biosynthesis, and cytochrome c oxidase activity (Table 2)  As noted previously, one of the most striking findings is that a majority, 65% of 357 traits present in 10 or more genomes have multiple expression attributes. Thus, it seems 358 that while the presence of marker genes suggests many organisms share a particular 359 trait, the presence of niche-differentiating expression profiles suggest an alternative story, 360 that there is a level of hidden metabolic diversity. whereas the Accumulibacter clade IA has a non-canonical expression (as in Figure 4, 394 Attribute 2) [31]. By assigning trait attributes, we are able to extend these findings beyond 395 Accumulibacter to other flanking community members in the SBR ecosystem suggesting 396 that there are conserved ecological pressures driving niche differentiating expression 397 patterns in pstABCS within the EBPR community. 398 Bacteroidetes (BAC1). The attribute identified for napAB was also more highly expressed 417 anaerobically and included CAPIA, CAPIIA, ALIC1, REYR2, RUBRI1, and BEIJ3. 418

Community Members
Interestingly, this napAB attribute had expression patterns that quickly decreased in the 419 first aerobic time point, suggesting a tighter regulation than Attribute 1 for narGH. 420 Together, this suggests that the regulation of denitrification within the EBPR ecosystem 421 is a niche-differentiating feature whereby the induction of denitrification pathways occurs 422 either anaerobically or only after anaerobic carbon contact. 423 A smaller cohort contained the genetic repertoire to reduce nitrite to nitrogen gas 424 and exhibited hallmark anaerobic-aerobic expression patterns ( Figure 5E We next analyzed the distribution of trait-attributes of vitamin and amino acid 499 pathways among these genomes to understand how these biosynthetic pathways are 500 expressed similarly or differently in the EBPR SBR ecosystem ( Figure 6B and C). 501 Members of the Proteobacteria containing thiamine and cobalamin biosynthetic pathways 502 all express these traits similarly ( Figure 6B). However, the pantothenate synthesis 503 pathway contains two trait-attributes and is expressed differently among two cohorts. In 504 the first attribute, RUN1, TET1, CAULO1, CAPIA, and PSEUDO1 express the 505 pantothenate pathway similarly. However, OBS1 and TET2 express the pantothenate 506 pathway differently ( Figure 6B). Because tetrahydrofolate can be synthesized through 507 different metabolic routes, we analyzed the differences in trait attribute expression for all 508 routes in genomes that contained sufficient coverage of this trait. Members of the 23 Expression of various groups of amino acids show more differentiated patterns of 512 expression for genomes with these pathways. Several amino acids also contain different 513 metabolic routes for biosynthesis, and we analyzed all trait attributes for each amino acid 514 for all routes grouped by type ( Figure 6C). For the charged amino acids arginine, histidine, 515 and lysine, members of the Proteobacteria and Bacteroidetes cluster within their 516 phylogenetic groups, respectively, with lysine and histidine expressed differently among 517 these groups ( Figure 6C). In contrast, arginine is expressed similarly among all 518 Proteobacteria genomes. Among the polar charged amino acids, TET2 is the only 519 genome among the top 15 genomes that contains the metabolic pathway to synthesize 520 serine ( Figure 6A). Several groups contain the pathway for threonine synthesis, and 521 expression of different threonine routes are differentiated among the Proteobacteria, 522 Bacteroidetes, and Tetrasphaera spp., but mostly clusters phylogenetically ( Figure 6C). 523 Notably, the expression patterns for the cysteine and proline biosynthetic pathways do 524 not cluster phylogenetically, such as both Tetrasphaera genomes expressing the proline 525 pathway more similarly to other Proteobacteria and Bacteroidetes ( Figure 6C). The few 526 lineages that can synthesize tyrosine and phenylalanine (CAPIA, CAPIIA, RAM1, 527 PSEUDO1) show different patterns of expression. These results show that beyond the 528 presence or absence of key vitamin cofactor and amino acid biosynthetic pathways, 529 EBPR SBR organisms also display coherent and differentiated patterns of expression for 530 these traits, of which the functional consequences remain to be further understood. 531 532 533 534

CONCLUSIONS AND FUTURE PERSPECTIVES 535
In this work, we applied a novel trait-based 'omics pipeline to a semi-complex, 536 engineered bioreactor microbial community to explore ecosystem-level and niche-537 differentiating traits. Through assembling high-quality MAGs of the EBPR SBR 538 community and using a time-series metatranscriptomics experiment, we were able to 539 extend functional predictions and ecosystem inferences beyond hypotheses made from 540 gene presence/absence data. Using our novel trait-based comparative 'omics pipeline, 541 we identified how similarities and differences in the expression of significant EBPR traits 542 are conferred among community members such as phosphorus cycling, denitrification, 543 and amino acid metabolism. Specifically, we demonstrate that traits with similar 544 expression profiles may be clustered into attributes providing a new layer to trait-based 545 approaches. 546 We believe that identifying expression-based attributes will be a powerful tool to 547 explore microbial traits in natural, engineered, and host-associated microbiomes. Outside