BiG-SCAPE Workflows
BiG-SCAPE 2 can be run in three workflows: cluster, query, and benchmark. BiG-SCAPE Cluster performs clustering of BGCs into GCFs. This is the equivalent of running BiG-SCAPE 1’s bigscape.py. With BiG-SCAPE Query you can search for BGCs that show similarity to a user-provided query BGC .gbk, and BiG-SCAPE Benchmark compares the results of a BiG-SCAPE 2 Cluster mode run, a BiG-SCAPE 1 run, or a BiG-SLiCE run against a user-provided set of BGC <-> GCF assignments.
Clustering mode - BiG-SCAPE 2 performs clustering of BGCs into GCFs. This is the equivalent of running BiG-SCAPE 1’s bigscape.py.
BiG-SCAPE 2 has primarily been designed to work with antiSMASH-processed .gbk files, and will (recursively) look for these files in the --input-dir/--gbk-dir.
By default, only files with the strings cluster (antiSMASH 4) or region (antiSMASH 5+) in their name will be included, and files with the string final in their name will be excluded. These defaults can be adjusted with the --include-gbk and --exclude-gbk parameters. The filenames will correspond to the BGC names in the output and interactive visualization.
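The default name filter described above can be sketched in a few lines of Python. This is only an illustration of the stated rule (plain substring matching, with exclusion taking precedence); the function name and example file names are hypothetical and this is not BiG-SCAPE's own implementation.

```python
def is_included(filename, include=("cluster", "region"), exclude=("final",)):
    """Return True if a .gbk file name passes the include/exclude strings.

    Assumption: both checks are simple substring tests on the file name,
    and an excluded string always wins over an included one.
    """
    has_include = any(token in filename for token in include)
    has_exclude = any(token in filename for token in exclude)
    return has_include and not has_exclude


# Hypothetical file names, for illustration only.
names = [
    "genomeA.region001.gbk",
    "genomeA.final.gbk",
    "genomeB.cluster003.gbk",
    "genomeB.gbk",
]
selected = [name for name in names if is_included(name)]
```

Passing different tuples for include and exclude mimics what adjusting --include-gbk and --exclude-gbk would do under these assumptions.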
If two different files share the same name, BiG-SCAPE 2 can, in principle, handle these conflicts, as internally each file is hashed based on its content. However, we strongly advise against using duplicated names for different files, as it will likely lead to confusion when processing and interpreting results.
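To illustrate why content hashing separates same-named files, here is a minimal sketch. The choice of SHA-256 and the helper names are assumptions made for the example; the hash function BiG-SCAPE 2 actually uses is not stated here.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Hash the file content (SHA-256 is an assumption, not BiG-SCAPE's choice)."""
    return hashlib.sha256(data).hexdigest()


def index_by_content(files):
    """Group file names by content hash.

    `files` is an iterable of (name, content_bytes) pairs.
    """
    index = {}
    for name, data in files:
        index.setdefault(content_hash(data), []).append(name)
    return index


# Two different files that share a name still end up under two distinct keys.
example = index_by_content([
    ("bgc.region001.gbk", b"LOCUS record one"),
    ("bgc.region001.gbk", b"LOCUS record two"),
])
```

The records stay distinct internally, but both would still show up under the same name in the output, which is the source of the confusion warned about above.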
Furthermore, non-antiSMASH-processed .gbk files can, in principle, be used by toggling the --force-gbk option. BiG-SCAPE 2 will still require these alternative .gbk files to contain CDS and Sequence features, and you will need to either adjust the --include-gbk defaults or add ‘region’/‘cluster’ to the file names. Please note that this feature is in beta and has not been extensively tested, so use it with caution.
.gbk files can also be filtered based on their antiSMASH class/category, by making use of the --include-class/-category and --exclude-class/-category options, and based on the presence of specific profile HMM domains (commonly, Pfam domains) listed in a text file and provided to the --domain-includelist-all/-any options.
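The difference between the "all" and "any" domain include lists can be expressed with set operations. The sketch below assumes the text file holds one domain accession per line; that file layout, the function names, and the accessions used are hypothetical.

```python
def parse_include_list(text):
    """Parse a domain list, assuming one accession per line (an assumption)."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def passes_domain_filter(bgc_domains, include_list, mode="any"):
    """Check a BGC's domains against an include list.

    mode="all": every listed domain must be present in the BGC.
    mode="any": at least one listed domain must be present.
    """
    present = set(bgc_domains)
    wanted = set(include_list)
    if mode == "all":
        return wanted <= present
    if mode == "any":
        return bool(wanted & present)
    raise ValueError(f"unknown mode: {mode!r}")


# Hypothetical accessions, for illustration only.
wanted = parse_include_list("PF00001\nPF00002\n")
bgc = ["PF00001", "PF00003"]
```

With these inputs the BGC passes in "any" mode and fails in "all" mode, because only one of the two listed domains is present.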
If two CDS features overlap (e.g. due to splicing events), BiG-SCAPE allows an overlap of up to 10% of the length of the shortest CDS. If more overlap is detected, BiG-SCAPE discards the shortest feature from the analysis. Similarly, if two detected protein domains overlap within the same CDS, BiG-SCAPE allows an overlap of up to 10% of the length of the shortest domain; beyond that, it keeps only the best-scoring domain. These percentages can be updated in the config.yml file.
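The CDS overlap rule can be sketched as follows. This is an illustrative reading of the rule, assuming features are half-open (start, end) intervals and that conflicts are resolved greedily from longest to shortest; the function names are hypothetical and the real implementation may differ.

```python
def overlap_length(a, b):
    """Number of shared positions between two half-open (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def exceeds_allowed_overlap(a, b, max_fraction=0.10):
    """True if the overlap is larger than max_fraction of the shorter interval."""
    shortest = min(a[1] - a[0], b[1] - b[0])
    return overlap_length(a, b) > max_fraction * shortest


def drop_overlapping_cds(features, max_fraction=0.10):
    """Keep longer features first and drop shorter ones that overlap too much."""
    kept = []
    for feature in sorted(features, key=lambda f: f[1] - f[0], reverse=True):
        if not any(exceeds_allowed_overlap(feature, k, max_fraction) for k in kept):
            kept.append(feature)
    return sorted(kept)


# (50, 80) lies entirely inside (0, 100) and is dropped; the 5 bp overlap
# between (0, 100) and (95, 200) is within the 10% allowance.
kept = drop_overlapping_cds([(0, 100), (95, 200), (50, 80)])
```

For domains, the same overlap test would apply, but the tie-break described in the text is the domain score rather than the length.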
Minimum and maximum BGC length can also be set (config.yml: MIN/MAX_BGC_LENGTH).
If desired, BiG-SCAPE 2 can make use of two types of (antiSMASH-processed) reference gene clusters: MIBiG .gbk files and user-defined reference .gbk files. Please see this section for more detail.
BiG-SCAPE 2 uses a profile HMM (phmm) database for domain detection, and does this with the hmmscan tool from the HMMER suite.
Most commonly, the Pfam phmm database is used for this, and therefore BiG-SCAPE expects all phmms in this database to follow Pfam formatting. The current release of the Pfam database can be downloaded here. If the .hmm file has already been pressed and the pressed files are located in the same folder as the Pfam .hmm file, BiG-SCAPE will use these pressed files. If this is not the case, BiG-SCAPE will run hmmpress (and thus requires the user to have write permissions to the given Pfam folder).
In principle, however, a user can define any given phmm database so long as Pfam formatting is used.
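Before a run, it can be useful to check whether a phmm database has already been pressed. The sketch below relies on the fact that hmmpress writes four index files next to the .hmm file (.h3f, .h3i, .h3m, .h3p); the helper names and the Pfam-A.hmm file name are only examples, and this is not how BiG-SCAPE itself performs the check.

```python
import os
from pathlib import Path

# Index files that hmmpress writes alongside the .hmm file.
PRESSED_SUFFIXES = (".h3f", ".h3i", ".h3m", ".h3p")


def missing_pressed_files(hmm_path):
    """Return the names of pressed files that are not present next to hmm_path."""
    hmm_path = Path(hmm_path)
    return [
        hmm_path.name + suffix
        for suffix in PRESSED_SUFFIXES
        if not Path(str(hmm_path) + suffix).exists()
    ]


def can_press(hmm_path):
    """hmmpress writes into the database folder, so that folder must be writable."""
    return os.access(Path(hmm_path).parent, os.W_OK)


def hmmpress_command(hmm_path):
    """Command line that would press the database (it is not executed here)."""
    return ["hmmpress", str(hmm_path)]
```

If missing_pressed_files returns an empty list, the database is ready for hmmscan; otherwise can_press indicates whether pressing in place is possible with the current permissions.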
Note: BiG-SCAPE 2 is not yet equipped to handle runs with multiple phmm databases being stored in the same SQLite DB, so we advise starting with a fresh SQLite DB anytime a user wants to use a different phmm database.
BiG-SCAPE 2 output can be divided into several file categories:
- Log files (.log and .config.log)
- SQLite database ([parent_folder_name].db), which stores all record data and can be reused for subsequent runs. It additionally stores all generated edges between record pairs as well as the GCF assignments for all runs in which it has been used.
- The main [output_dir]/index.html file, which hosts the full interactive visualization.
  - Launch the interactive output by clicking on the index.html file or opening the file with any web browser. This file is located in the root of the output folder. When opening the visualization page, you will be asked to load your result database file. This file can be found under the output folder, in the same location as the index.html file. Since the index.html does not contain any run data (it loads this from the selected database), any index.html file from any run can load data from any output database.
- Other output files, stored in /output_files and in each relevant /run_cutoff folder, including:
  - GCF tree .newick files, and the fasta alignment files used to construct these trees
  - A .network tsv file for each relevant run/cutoff. This file can be imported into external network visualization software such as Cytoscape. In essence, this file contains an edge list with a number of node and edge attributes:
    - BGC record name
    - Record type
    - Record number
    - ORF-based comparable region coordinates
    - BiG-SCAPE distance and distance components (DSS, Jaccard and Adjacency indexes)
    - Alignment mode
    - Extend strategy
  - A clustering_c[cutoff].tsv file with the BGC -> GCF assignments.
  - A record_annotations.tsv file with information about each BGC that was successfully processed in the input. This includes:
    - BGC record name
    - Record type
    - Record number
    - antiSMASH class and category
    - Organism
    - Taxonomy
    - Description feature from the .gbk file
  - A full.network tsv file per run, which contains the ‘raw’ distances before any cutoffs are applied (equivalent to running BiG-SCAPE 2 with a GCF_cutoff of 0).
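Because the output database is a regular SQLite file, it can be inspected with Python's standard sqlite3 module. The sketch below makes no assumptions about the BiG-SCAPE schema: it simply lists whichever tables exist and counts their rows. The function name is hypothetical, and the path you pass in is whatever [parent_folder_name].db your run produced.

```python
import sqlite3
from contextlib import closing


def table_row_counts(db_path):
    """Return {table_name: row_count} for every user table in a SQLite file."""
    counts = {}
    with closing(sqlite3.connect(db_path)) as conn:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
                "ORDER BY name"
            )
        ]
        for name in names:
            # Table names come from sqlite_master itself, so quoting them
            # as identifiers is sufficient here.
            counts[name] = conn.execute(
                f'SELECT COUNT(*) FROM "{name}"'
            ).fetchone()[0]
    return counts
```

This only reads from the database, so it does not affect the file's reuse in later runs.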
BiG-SCAPE 2 lets users interact with their result data via an interactive UI, which can be accessed by launching the main output_dir/index.html file. Upon doing so, the user will be prompted to select a SQLite database file to read from. Upon loading the SQLite database, the user will then be able to navigate their data in this UI, which consists of the following elements (Fig. 2):
- 1) Theme selection (Auto, Light, Dark)
- 2) Bin selection, contents of this dropdown menu will depend on which, if any, classification mode was used
- 3) Database selection
- 4) Run selection, this dropdown menu will feature every run_cutoff that was run and saved to the loaded database, e.g. if run_1 was done with cutoffs 0.3 and 0.5, the menu lists run_1_c0.3 and run_1_c0.5.
- 5) Overview section, featured at the top half of the page, and which can also be re-centered with the ‘Overview’ button.
- a) Run Information section, features a selection of modes and arguments used in the currently selected run.
- b) Input Data section, features information on the total number of genomes, BGCs, and distribution of BGCs per genome and per antiSMASH class.
- c) Network Overview section, features information on GCF characteristics, as well as a GCF/Genome heatmap where GCFs/Genomes can be clustered based on several metrics. NOTE: while genomes can be clustered by accession, this compares the accessions solely in terms of matching characters, and does not make use of a formal taxonomy.
- 6) Network section, featured at the bottom half of the page, can be navigated to with the ‘Network’ button.
- a) Connected component (CC) table, contains a list of all CCs, with respective families and number of records. This table can be filtered by making use of the ‘Filter Table’ fields, which accept several descriptors. Once a selection is made, it can also be downloaded as a .tsv file using the ‘Download Current Selection’ button.
- b) A network can be visualized by selecting a specific CC, in which case the nodes and edges of the respective CC will be loaded into view. The entire run’s set of nodes and edges can also be viewed using the ‘Visualize All’ button, but for large datasets it is possible that loading the entire dataset is too computationally intensive.
- i) Reference nodes will be circled in blue.
- ii) In query mode, the query node will be circled in green.
- iii) When using Protocluster or Protocore as the record type, nodes in topologically connected components will be faded.
- c) Nodes can also be searched/filtered within a loaded network by making use of the (advanced) search, and the results of this selection can also be downloaded using the ‘download’ button.
- d) Node&GCF Detail section
- i) Nodes can be selected by hovering and clicking on them, and they will appear in the Node&GCF Detail section. Here, the top half features a visual representation of the BGCs and their domains (which can be expanded), while the bottom half shows a list of the selected BGCs and the families they belong to.
- ii) Clicking a GCF in the Node&GCF Detail section will trigger a pop up window that displays the GCF tree, with a highlighted tree exemplar: the record that all other family members were aligned to. Hovering over each domain will display the domain’s accession and name, score, and location within its ORF. Hovering over each ORF will display its position in the BGC. Domain display can be toggled off.
- iii) Here, and in d) when using Protocluster or Protocore as the record type, the entire .gbk is shown, and domains/ORFs not belonging to the relevant record are faded.
Fig 2. Screenshot of the BiG-SCAPE 2.0 user interface (example: JK1 dataset, CC ## run in Cluster mode, using --record-type protocluster, --mix, --classify none, and otherwise with all default parameters).
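The connected components (CCs) listed in the Network section are a standard graph concept: groups of records linked, directly or indirectly, by edges. As a minimal sketch, they can be recomputed from any edge list with a breadth-first search. The function and record names are hypothetical, and this is not the code the UI runs.

```python
from collections import defaultdict, deque


def connected_components(nodes, edges):
    """Return a list of components, each a sorted list of node names."""
    edges = list(edges)
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    # Include nodes that only appear in the edge list, preserving order.
    all_nodes = list(dict.fromkeys(list(nodes) + [n for edge in edges for n in edge]))

    seen = set()
    components = []
    for start in all_nodes:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in adjacency[node] - seen:
                seen.add(neighbour)
                queue.append(neighbour)
        components.append(sorted(component))
    return components


# Hypothetical record names: three linked BGCs and one singleton.
ccs = connected_components(
    ["bgc_a", "bgc_b", "bgc_c", "bgc_d"],
    [("bgc_a", "bgc_b"), ("bgc_b", "bgc_c")],
)
```

Here ccs contains one component with three records and one singleton, which corresponds to two rows in a CC table.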
The BiG-SCAPE 2 Query workflow was designed to facilitate searches for BGCs that show similarity to a user-provided query BGC .gbk.
In this mode, the user must provide the path to a query BGC .gbk, which can be present in the /input_dir or anywhere else. All remaining BGCs residing in the /input_dir will be considered by BiG-SCAPE Query as references. Additionally, MIBiG references and other user-defined references can be used, following the same usage as in BiG-SCAPE Cluster.
By default, BiG-SCAPE 2 will perform one set of query-vs-all comparisons. Alternatively, with --propagate, this first set of comparisons is followed by an iterative set of reference-vs-reference comparisons which will propagate the connected component until no more edges are created. Lastly, in both cases, any missing edges between newly connected reference nodes are calculated.
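The propagation logic described above can be sketched as an iterative frontier expansion. Everything in this example is an assumption made for illustration: the function name, the use of a generic distance callable, and the rule that an edge exists when the distance is strictly below the cutoff. It is not BiG-SCAPE's implementation.

```python
def query_component(query, references, distance, cutoff, propagate=False):
    """Grow the query's connected component and return (nodes, edges).

    Edges are stored as sorted name pairs so (a, b) and (b, a) are the same edge.
    """
    connected = {query}
    remaining = set(references)
    edges = set()
    frontier = {query}

    while frontier:
        newly_connected = set()
        for source in frontier:
            for ref in remaining:
                if distance(source, ref) < cutoff:
                    edges.add(tuple(sorted((source, ref))))
                    newly_connected.add(ref)
        remaining -= newly_connected
        connected |= newly_connected
        # Without propagation, stop after the query-vs-all round.
        frontier = newly_connected if propagate else set()

    # In both cases, fill in missing edges between connected reference nodes.
    refs = sorted(connected - {query})
    for i, a in enumerate(refs):
        for b in refs[i + 1:]:
            if (a, b) not in edges and distance(a, b) < cutoff:
                edges.add((a, b))
    return connected, edges


# Toy example: records placed on a line, distance = absolute difference.
positions = {"query": 0.0, "ref_a": 1.0, "ref_b": 2.0, "ref_c": 10.0}


def toy_distance(a, b):
    return abs(positions[a] - positions[b])


refs = ["ref_a", "ref_b", "ref_c"]
plain = query_component("query", refs, toy_distance, cutoff=1.5)
propagated = query_component("query", refs, toy_distance, cutoff=1.5, propagate=True)
```

In the toy data, ref_b is too far from the query to be found directly, but it is reached through ref_a once propagation is enabled; ref_c stays disconnected in both cases.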
Further Input and Output features are consistent with BiG-SCAPE Cluster.
In the benchmarking workflow, BiG-SCAPE 2 compares the results of a BiG-SCAPE 2 Cluster mix mode run (see more detail here), a BiG-SCAPE 1 mix mode run, or a BiG-SLiCE (v1 or v2) run against a user-provided benchmark set of BGC <-> GCF assignments.
BiG-SCAPE Benchmark requires the user to provide a GCF assignments file, i.e. a tab-separated file which must have the same format as the clustering_cutoff.tsv output file. A header line that starts with a # can also be added.
Note: in cases where a CC column is present in the clustering_cutoff.tsv, this column needs to be added to the GCF assignments file purely for formatting reasons. Its contents will not be used by BiG-SCAPE Benchmark, so any arbitrary number can be filled in.
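When preparing a GCF assignments file, a quick sanity check is to parse it the way a simple reader would: skip header lines starting with # and split the remaining lines on tabs. The sketch below is a generic reader; the column positions (record name first, family last) and the example column names and values are hypothetical, so check them against your own clustering_cutoff.tsv.

```python
def read_assignments(lines, record_col=0, family_col=-1):
    """Parse tab-separated BGC -> GCF assignments, skipping '#' header lines.

    Assumption: the record name is in the first column and the family in
    the last one; adjust record_col and family_col to match your file.
    """
    assignments = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        assignments[fields[record_col]] = fields[family_col]
    return assignments


# Hypothetical content, including a placeholder CC column filled with 0.
example_lines = [
    "#record\tcc\tfamily\n",
    "bgc_a\t0\tfamily_1\n",
    "bgc_b\t0\tfamily_1\n",
    "bgc_c\t0\tfamily_2\n",
]
curated = read_assignments(example_lines)
```

To check a real file, pass an open file handle instead of the example list.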
Additional required input is the output directory of the BiG-SCAPE/SLiCE run to be analyzed, i.e. what was originally passed to -o (a mix network/bin must be present).
For both BiG-SCAPE v1 and v2, bigscape benchmark will look for the generated output .tsv files to obtain the computed GCFs. If multiple runs are present in the output folder, all cutoffs used in the most recent run (where a mix network/bin must be present) will be analyzed.
When benchmarking BiG-SLiCE, BiG-SCAPE Benchmark will read directly from its SQLite database and look for the most recent run for each unique cutoff/threshold.
BiG-SCAPE Benchmark outputs a subfolder for each analyzed cutoff (cutoff_[used cutoff]) with various external cluster evaluation metrics based on the overlap between formed BiG-SCAPE GCFs and provided GCF assignments:
- Entropy and purity for each generated GCF (Entropies_[label].tsv, Purities_[label].tsv)
- Confusion matrix .tsv and heatmap visualization .png showing how many members of each provided family end up in which BiG-SCAPE/SLiCE GCF (Fig X.).
- Main Summary_[label].tsv that contains various statistics:
  - The number and average size of curated and computed families.
  - V-measure and its components homogeneity and completeness.
  - Average Purity and Entropy.
  - Fractions of correct/wrong associations per BGC, denoting how formed associations agree with the curated set of assignments (i.e. members of different curated families in the same computed family are seen as wrong associations).
  - Fractions of present/missing associations per BGC, denoting how many expected associations are present based on the curated set of assignments.
Finally, these metrics are summarized over all cutoffs found in the BiG-SCAPE/SLiCE run in Benchmark_summary_[label].tsv, as well as in a Scores_per_cutoff_[label].png plot focused on the V-measure (Fig 3.).
Fig 3. (left) Heatmap visualization of the confusion matrix showing the overlap of membership assignments between curated and computed GCF. (right) Plotted summary of benchmark metrics per used cutoff.
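For readers unfamiliar with these metrics, the sketch below computes per-family purity and the V-measure (with homogeneity and completeness) from two parallel label lists, using the standard textbook definitions. It is a self-contained illustration with hypothetical function names and toy labels, not the code BiG-SCAPE Benchmark runs, and details such as the logarithm base may differ.

```python
import math
from collections import Counter, defaultdict


def entropy(counts):
    """Shannon entropy (natural log) of a collection of counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def conditional_entropy(labels_a, labels_b):
    """H(A | B): entropy of A within each group of B, weighted by group size."""
    n = len(labels_a)
    groups = defaultdict(Counter)
    for a, b in zip(labels_a, labels_b):
        groups[b][a] += 1
    return sum(
        sum(counter.values()) / n * entropy(counter.values())
        for counter in groups.values()
    )


def v_measure(curated, computed):
    """Return (homogeneity, completeness, v_measure)."""
    h_curated = entropy(Counter(curated).values())
    h_computed = entropy(Counter(computed).values())
    homogeneity = 1.0 if h_curated == 0 else 1 - conditional_entropy(curated, computed) / h_curated
    completeness = 1.0 if h_computed == 0 else 1 - conditional_entropy(computed, curated) / h_computed
    if homogeneity + completeness == 0:
        return homogeneity, completeness, 0.0
    v = 2 * homogeneity * completeness / (homogeneity + completeness)
    return homogeneity, completeness, v


def purity_per_family(curated, computed):
    """Fraction of each computed family taken up by its most common curated label."""
    groups = defaultdict(Counter)
    for cur, comp in zip(curated, computed):
        groups[comp][cur] += 1
    return {fam: max(c.values()) / sum(c.values()) for fam, c in groups.items()}


# Toy labels: one entry per BGC, curated family vs. computed family.
curated = ["A", "A", "B", "B"]
perfect = ["gcf1", "gcf1", "gcf2", "gcf2"]
lumped = ["gcf1", "gcf1", "gcf1", "gcf1"]
```

With the perfect assignment all scores are 1. Lumping everything into one family keeps completeness at 1 but drops homogeneity, and therefore the V-measure, to 0, while the purity of that single family is 0.5.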
Note: As you can see, BiG-SCAPE Benchmark is quite a different use-case when compared to the Cluster and Query workflows. Therefore, in the following sections, when we say BiG-SCAPE 2, we are primarily referring to BiG-SCAPE 2 Cluster & BiG-SCAPE 2 Query.