MIBiG 3.0: a community-driven effort to annotate experimentally validated biosynthetic gene clusters

Abstract With an ever-increasing amount of (meta)genomic data being deposited in sequence databases, (meta)genome mining for natural product biosynthetic pathways occupies a critical role in the discovery of novel pharmaceutical drugs, crop protection agents and biomaterials. The genes that encode these pathways are often organised into biosynthetic gene clusters (BGCs). In 2015, we defined the Minimum Information about a Biosynthetic Gene cluster (MIBiG): a standardised data format that describes the minimally required information to uniquely characterise a BGC. We simultaneously constructed an accompanying online database of BGCs, which has since been widely used by the community as a reference dataset for BGCs and was expanded to 2021 entries in 2019 (MIBiG 2.0). Here, we describe MIBiG 3.0, a database update comprising large-scale validation and re-annotation of existing entries and 661 new entries. Particular attention was paid to the annotation of compound structures and biological activities, as well as protein domain selectivities. Together, these new features keep the database up-to-date, and will provide new opportunities for the scientific community to use its freely available data, e.g. for the training of new machine learning models to predict sequence-structure-function relationships for diverse natural products. MIBiG 3.0 is accessible online at https://mibig.secondarymetabolites.org/.

INTRODUCTION
Across all kingdoms of life, organisms produce specialised metabolites: molecules that are produced by bacteria, fungi and plants to gain an advantage over their competitors in challenging environments. Specialised metabolites, also referred to as secondary metabolites or natural products, exhibit a wide variety of biological activities, including many that are useful for pharmaceutical and agricultural applications, e.g. antibiotics, anti-cancer drugs, pesticides and herbicides. The production of specialised metabolites is typically encoded by biosynthetic gene clusters (BGCs): groups of co-localised and co-regulated genes that jointly encode a biosynthetic pathway. Therefore, microbial and plant genomes can be mined for novel specialised metabolite production by detecting BGCs and predicting their encoded products and functions. Similar to how the relationship between DNA, mRNA and protein describes the flow of information in cells, we can define a 'central dogma' of specialised metabolism: a BGC sequence encodes a set of enzymes, which together assemble a compound structure (or a cocktail of structural analogues), which in turn dictates specialised metabolite function. Understanding how information is translated from sequence to structure to function is key to natural product discovery. To address the first stage, sequence information, various tools have been developed that automatically detect BGCs from DNA sequence, including antiSMASH and its siblings fungiSMASH and plantiSMASH (1,2), GECCO (3), DeepBGC (4), RiPP-Miner (5) and PRISM 4 (6).
To facilitate dereplication and comparative analysis of predicted BGCs with known BGCs, and to characterise the interplay between sequence, structure and function, standardised data annotation and storage are essential. To this purpose, we developed the Minimum Information about a Biosynthetic Gene cluster (MIBiG) standard and built a database which contains standardised entries for experimentally validated BGCs of known function (7,8). Each entry minimally contains information about the nucleotide entry and coordinates of the genomic locus involved, the producing organism's taxonomy, biosynthetic class, name of the produced compound(s), and literature reference(s). There are also various optional fields for non-minimal entries, including fields for gene function, product structure and bioactivity, crosslinks to chemical structure databases such as NP Atlas (9) and PubChem (10), and monomer identity. With MIBiG 2.0 containing over 2000 entries, the database has become an important reference for many researchers that mine genomes for natural products. For example, it has been used to estimate the potential for biosynthetic novelty in large-scale microbiome studies (11,12), to identify conserved amino acids playing key roles in catalytic activities across enzyme families (13), to help guide natural product discovery efforts towards high-potential taxa (14), and to train machine-learning algorithms for natural product activity prediction (15).
Here, we present MIBiG 3.0: an update designed to increase the number of non-minimal entries in our database and to add new data entries through a large-scale community annotation effort. We focused on three features: the characterisation and cross-linking of 1188 chemical structures, the annotation of 1002 bioactivities of BGC products, and the validation and annotation of 2020 protein domain substrates of nonribosomal peptide synthetases (NRPSs). In addition, we added 661 novel BGCs that were published since the last database update, and removed 69 duplicate and low-quality entries (Figure 1). Together, these additions keep the database current, and provide unique opportunities for exploring complex sequence-structure-function relationships in diverse natural product domains.

Manual curation through crowdsourcing and mass online 'annotathons'
As authors themselves typically have the best understanding of the BGC they have studied, we greatly encourage natural product researchers to submit their BGCs to MIBiG during the process of publishing their work. To this purpose, MIBiG supplies an online form through which researchers can request a unique MIBiG identifier and submit their experimentally verified BGCs, pre- or post-publication. Since MIBiG version 2.0, this has yielded 97 manually submitted, high-quality entries which have now been incorporated into MIBiG 3.0. Still, far more BGCs are published than are manually submitted to MIBiG.
With an increasing number of papers describing novel BGCs being published every year, manually annotating, validating and adding BGCs to MIBiG has become a mammoth task. Therefore, we took to social media to gauge the community's interest in participating in an online annotation event. We received many positive responses, with 86 people from four different continents volunteering to participate in our MIBiG 'annotathons'. We organised eight three-hour online sessions, accommodating different timezones, with various breakout rooms dedicated to specific annotation tasks: annotating new clusters, annotating and cross-linking compound structures, annotating compound bioactivities, and assigning substrate selectivities to NRPS protein domains. We prepared multiple instruction videos and assigned an expert to each of the breakout rooms who could be directly approached with questions from annotators to ensure that annotation quality was consistent. In addition, one of our annotators at the CINVESTAV research institute mobilised fourteen MSc Integrative Biology students of their 2021 Bacterial Genomics class to annotate compound bioactivities under supervision. Finally, we resolved 125 database issues that were raised by users on our GitHub page, redefining BGC boundaries, correcting biosynthetic classes, adding and removing literature references, fixing compound structures, and removing duplicate entries.

Annotating and cross-linking compound structures
Since version 2.0, compound structures in MIBiG have been cross-linked to the NP Atlas database: a database containing structures of natural products isolated from bacteria and fungi. During the preparations for version 3.0, we collaborated with the NP Atlas team to (i) add structures for compounds in SMILES format (16), including stereochemical information where possible and (ii) cross-link them to five databases of chemical structures: NP Atlas (9), PubChem, ChemSpider (17), LOTUS (18), and ChEMBL (19). If compound entries were found in multiple databases, SMILES strings from NP Atlas were prioritised. SMILES strings were also collected for existing entries that were already cross-linked to a database but did not report a SMILES string. Correctness of SMILES syntax was validated with PIKAChU (20).
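
The prioritisation rule described above can be illustrated with a minimal Python sketch. It is not the curation code used for MIBiG: the function name `pick_smiles`, the ordering of the databases after NP Atlas, and the example SMILES strings (two spellings of acetic acid, used only as placeholders) are all assumptions made for illustration. The text only states that NP Atlas strings take priority when a compound is found in several databases.

```python
from __future__ import annotations

# Assumed priority order. Only "NP Atlas first" comes from the text; the
# order of the remaining four databases is an illustrative assumption.
DATABASE_PRIORITY = ["NP Atlas", "PubChem", "ChemSpider", "LOTUS", "ChEMBL"]


def pick_smiles(candidates: dict[str, str]) -> tuple[str, str] | None:
    """Return (database, smiles) from the highest-priority database.

    `candidates` maps a database name to the SMILES string found there.
    Returns None when no listed database provides a non-empty string.
    """
    for database in DATABASE_PRIORITY:
        smiles = candidates.get(database)
        if smiles:
            return database, smiles
    return None


# Placeholder input: the same compound found in two databases.
example = {"PubChem": "CC(=O)O", "NP Atlas": "CC(O)=O"}
chosen = pick_smiles(example)
print(chosen)
```

Taking the string from a single preferred source, rather than merging strings from several, keeps one canonical representation per compound; syntax validation (performed with PIKAChU in this work) would be a separate step and is not shown here.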

Annotating compound bioactivities
To improve MIBiG as a resource for machine learning models predicting sequence-structure-function relationships, we added bioactivity data for 1002 compounds and chemical target data for 95 compounds. Of these annotations, 708 were transferred from the dataset assembled by Walker and Clardy, who designed a machine learning model to predict BGC function from sequence (15). To keep annotations consistent, we assigned all existing and novel bioactivities to 68 standardised functional categories (Supplementary Table S1).
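
Assigning free-text activity labels to a fixed set of categories can be sketched as a lookup against a controlled vocabulary. The Python example below is a hypothetical illustration only: the synonym table, the two category names and the function `standardise_activity` are assumptions, and the actual 68 categories are those listed in Supplementary Table S1.

```python
from __future__ import annotations

# Illustrative synonym table (assumption). The real controlled vocabulary
# is the list of 68 categories in Supplementary Table S1.
SYNONYMS = {
    "antibacterial": "antibacterial",
    "anti-bacterial": "antibacterial",
    "antifungal": "antifungal",
    "anti-fungal": "antifungal",
}


def standardise_activity(label: str) -> str | None:
    """Map a free-text label to a standard category, or None if unknown."""
    # Lower-case and collapse whitespace before the lookup.
    key = " ".join(label.lower().split())
    return SYNONYMS.get(key)


raw_labels = ["Anti-Bacterial", "  antifungal ", "herbicidal"]
mapped = [standardise_activity(label) for label in raw_labels]
print(mapped)
```

Returning `None` for unrecognised labels, instead of guessing a category, leaves those cases for manual curation.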

Annotating NRPS protein domains
To concretise the relationship between NRPS sequence and the structure of its produced nonribosomal peptide (NRP), we annotated and validated the substrate selectivities of 2775 NRPS adenylation (A) domains. A-domains dictate which monomers (predominantly amino acids) are incorporated into (hybrid) NRP scaffolds. Substrate annotation can be performed at different levels: we can define the pre-tailored substrate precursor (e.g. L-aspartic acid); the substrate as recognised by the A-domain (e.g. (3R)-3-hydroxy-L-aspartic acid); or the post-tailored integrated monomer that ends up in the final NRP scaffold (e.g. (3R)-3-hydroxy-D-aspartic acid). We chose to annotate the substrates as recognised by the A-domain, as this best reflects the biological relationship between A-domain and incorporated monomer. In addition to substrate identity, we also recorded evidence for substrate selectivity in the form of an evidence code and literature references. To this purpose, we added 13 evidence codes to the JSON schema which is used to standardise MIBiG entries (Table 1).
After community annotation, substrate naming was homogenised and each stereochemically ambiguous substrate was manually curated by an expert. Where stereochemistry could be inferred from structure, this is reflected in the substrate name for each stereocenter. Exceptions are amino acid names, which are assumed to be in their L-configuration. To avoid any ambiguity in substrate naming, we also linked each of our 274 unique substrate names to an isomeric SMILES string.

Table 1 footnote: As indicated, some evidence codes are only accepted as evidence for substrate specificity when combined with a second evidence code that provides further support for a data entry. Thirteen evidence codes were newly introduced in MIBiG 3.0. ACVS assay: δ-(L-α-aminoadipyl)-L-cysteinyl-D-valine synthetase assay, specific for measuring penicillin production. HPLC: high-performance liquid chromatography. NMR: nuclear magnetic resonance.
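
The rule that some evidence codes only count when combined with a second code can be expressed as a small validation check. The Python sketch below is hypothetical: the record layout, the placeholder code names (`code_A` to `code_D`) and the assumption that two support-requiring codes can support each other are illustrative choices. They do not reproduce the MIBiG JSON schema or the actual assignments in Table 1.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Placeholder code names (assumption). Which real evidence codes need a
# second, supporting code is defined in Table 1, not here.
STANDALONE_CODES = {"code_A", "code_B"}
SUPPORT_REQUIRED_CODES = {"code_C", "code_D"}


@dataclass
class SubstrateAnnotation:
    """Hypothetical record for one A-domain substrate annotation."""
    substrate_name: str
    evidence_codes: list[str] = field(default_factory=list)
    references: list[str] = field(default_factory=list)


def has_sufficient_evidence(annotation: SubstrateAnnotation) -> bool:
    """Accept one standalone code, or two distinct support-requiring codes."""
    codes = set(annotation.evidence_codes)
    if codes & STANDALONE_CODES:
        return True
    # Assumption: two different support-requiring codes back each other up.
    return len(codes & SUPPORT_REQUIRED_CODES) >= 2


strong = SubstrateAnnotation("L-valine", ["code_A"])
weak = SubstrateAnnotation("L-valine", ["code_C"])
combined = SubstrateAnnotation("L-valine", ["code_C", "code_D"])
print(has_sufficient_evidence(strong),
      has_sufficient_evidence(weak),
      has_sufficient_evidence(combined))
```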

Taking the 'minimal' out of MIBiG
While MIBiG 2.0 serves an important role in the community as a reference database to quickly identify whether a BGC is similar to any known BGCs, its utility as a resource for exploring sequence-structure-function relationships could be improved. This can mainly be explained by the high number of minimal entries in the database: entries that only contain sequence and compound information that could be augmented by adding further standardised annotations. For MIBiG 3.0, we aimed to promote as many existing and novel entries as possible to non-minimal entries by annotating compound structures (1188), bioactivities (1002) and NRPS substrates (2020). In total, we added 661 novel BGCs and 4871 separate data entries to our database, increasing our number of non-minimal entries from 486 to 928 (Figure 1, Supplementary Figure S1). MIBiG 3.0 now contains 2502 entries, spanning 16 phyla across 5 kingdoms of life (Table 2).

Streamlining research into the central dogma of specialised metabolism
With 905 NRPS and modular Type I PKS BGCs in MIBiG 3.0, modular BGCs constitute a substantial part of our database. Modular systems are characterised by enzyme complexes comprising repeating domain architectures, which collectively assemble a natural product scaffold. When the substrate selectivities of the recognition domains are known (acyltransferase (AT) domains for PKS and A-domains for NRPS), these consistent architectures make it possible to predict the structure of chemical scaffolds with reasonable accuracy. Most AT domains in PKS systems recognise one of two substrates, malonyl-CoA or methylmalonyl-CoA, and excellent bioinformatics tools exist to distinguish between the two (21). However, for A-domains in NRPS systems, which recognise over 500 known substrates (22), substrate prediction is a greater challenge, which will require substantially more data to obtain models of comparably predictive power. Therefore, we decided to make the annotation of the substrate selectivity of NRPS A-domains a major focus of MIBiG 3.0. MIBiG 3.0 now contains annotations for 2775 A-domains (compared to 755 annotations in MIBiG 2.0; Figure 1B), covering 274 unique substrates which are identified by stereochemically curated isomeric SMILES strings (Figure 2; Supplementary Table S2). This makes MIBiG the largest resource for A-domain substrate data, containing 3-4 times as many labelled data points as the training sets used for the A-domain selectivity predictors SANDPUMA (23) and NRPSPredictor2 (24). We hope that eventually this dataset will be leveraged to train an improved A-domain substrate predictor, which can in turn be integrated into tools like antiSMASH to improve NRP scaffold structure prediction.
Since version 2.0, we have added structural identifiers of 1188 compounds to our database in SMILES format (16), increasing the number of BGCs with structural data from 1347 to 1860 (Figure 1). By pulling SMILES strings directly from cross-linked databases where possible, we avoid conflicts caused by versioning and SMILES formatting. Additionally, we linked 1002 additional compounds to 51 unique bioactivities, creating opportunities for computationally predicting compound bioactivity from structure. For a further 95 compounds, we were also able to annotate their molecular targets (Figure 1B).
By centering MIBiG 3.0 around the annotation of substrate building blocks, compound structures, and bioactivities, we aspired to streamline future research into all aspects of sequence-structure-function relationships that lie at the heart of natural product research. All data can be easily downloaded and parsed in bulk from our database in JSON and GenBank format or accessed on an entry-by-entry basis through our searchable online repository. As such, we hope that MIBiG 3.0 will prove an important resource for future machine learning endeavours that aim to decode the central dogma of specialised metabolism.
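
As an illustration of bulk parsing, the following Python sketch reads every JSON file in a local directory and counts which top-level keys occur. It assumes only that the downloaded data have been unpacked into a directory containing one JSON file per entry. The function names are hypothetical, and the demonstration runs on two placeholder files written to a temporary directory, not on real MIBiG entries, so no field names of the MIBiG schema are implied.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def load_entries(directory):
    """Yield (file name, parsed JSON) for every .json file in a directory."""
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            yield path.name, json.load(handle)


def top_level_key_counts(directory):
    """Count how many entries contain each top-level key."""
    counts = Counter()
    for _, entry in load_entries(directory):
        if isinstance(entry, dict):
            counts.update(entry.keys())
    return counts


# Demonstration on placeholder files (not real MIBiG entries).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "entry1.json").write_text(
        json.dumps({"alpha": 1, "beta": 2}), encoding="utf-8")
    Path(tmp, "entry2.json").write_text(
        json.dumps({"alpha": 3}), encoding="utf-8")
    demo_counts = top_level_key_counts(tmp)

print(dict(demo_counts))
```

Counting keys before writing any field-specific code is a cheap way to see which optional annotations are populated across entries, which matters when only a subset of entries is non-minimal.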

DATA AVAILABILITY
The MIBiG Repository is available at https://mibig.secondarymetabolites.org/. There is no access restriction for academic or commercial use of the repository and its data. The source code components, JSON-formatted data standard, and SQL schema for the MIBiG Repository are available on GitHub (https://github.com/mibig-secmet) under an OSI-approved Open Source licence.