Introduction

Overview

Teaching: 5 min
Exercises: 0 min

Questions

How do I run BGCflow in my computer?

Objectives

Getting a copy of BGCflow in your own machine

Welcome to the BGCFlow tutorial! BGCFlow is a stack of bioinformatic tools that enables you to analyze biosynthetic gene clusters (BGCs) across large collections of genomes (pangenomes) using the Snakemake workflow manager. In this tutorial, you will learn how to set up and use BGCFlow to streamline your BGC analysis.

Before you start

Before you dive into using BGCFlow, there are a few prerequisites you need to have in place

Access to a Linux Environment

BGCFlow is currently tested on Linux (specifically Ubuntu 22). If you’re a Windows user, don’t worry, you have options too:
You can set up a Linux virtual machine (VM) using cloud providers like Microsoft Azure, AWS, or Google Cloud.
Windows users can also utilize Windows Subsystem for Linux (WSL2) to create a Linux-like environment.
Conda or Mamba Package Manager

BGCflow requires a UNIX based environment with Conda installed. If you’re not familiar with Conda, you can find installation instructions here. Additionally, we recommend considering Mamba, an alternative package manager. If you’re not using Mambaforge, you can install Mamba into any other Conda-based Python distribution with the following command:
conda install -n base -c conda-forge mamba
With these essentials in place, you’re ready to proceed with setting up BGCFlow!

Installing BGCFlow environments

Creating a Conda Environment

Let’s start by creating a dedicated Conda environment for BGCFlow:

# create and activate a new conda environment
mamba create -n bgcflow -c conda-forge python=3.11 pip openjdk -y # also install java for metabase

PS: This environment includes Java / OpenJDK which is required for serving metabase.

Installing the BGCFlow wrapper

Next, we’ll install the BGCFlow wrapper, a command-line interface that makes working with BGCFlow’s Snakemake workflows easier:

# activate bgcflow environment
conda activate bgcflow

# install `BGCFlow` wrapper
pip install bgcflow_wrapper

# make sure to use bgcflow_wrapper version >= 0.2.7
bgcflow --version # show the wrapper version

# Verify the installation and view available command line arguments
bgcflow --help

The above command will display the help function of the wrapper, providing a comprehensive list of available command line arguments:

$ bgcflow --help
Usage: bgcflow [OPTIONS] COMMAND [ARGS]...

  A snakemake wrapper and utility tools for BGCFlow
  (https://github.com/NBChub/bgcflow)

Options:
  --version   Show the version and exit.
  -h, --help  Show this message and exit.

Commands:
  build       Build Markdown report or use dbt to build DuckDB database.
  clone       Get a clone of BGCFlow to local directory.
  deploy      [EXPERIMENTAL] Deploy BGCFlow locally using snakedeploy.
  get-result  View a tree of a project results or get a copy using Rsync.
  init        Create projects or initiate BGCFlow config from template.
  pipelines   Get description of available pipelines from BGCFlow.
  run         A snakemake CLI wrapper to run BGCFlow.
  serve       Serve static HTML report or other utilities (Metabase,...
  sync        Upload and sync DuckDB database to Metabase.

Additional pre-requisites

While the wrapper installation covers the basics, there are a few additional steps to ensure smooth operation:

Adjust Conda Channel Priorities: Set the Conda channel priorities to ‘flexible’ for a more adaptable package resolution: With the environment activated, install or setup this configurations:
- Set conda channel priorities to flexible
```
conda config --set channel_priority disabled
conda config --describe channel_priority
```
  With these steps completed, you’ve laid a solid foundation for using BGCFlow effectively.

Deploying a local copy of BGCFlow

Before you can start genome mining on your dataset, you’ll need to deploy the BGCFlow workflow locally. Follow the steps below to get BGCFlow set up on your machine.

Cloning BGCFlow Repository

Navigate to the directory where you want to store your BGCFlow installation. Make sure you have a decent amount of storage space available.
Run the following command to clone the BGCFlow repository to your chosen destination:
```
bgcflow clone <my BGCFlow folder>
```
Replace <my BGCFlow folder> with the path to your desired destination directory.

Additionally, you can specify a specific branch or release tags of BGCFlow to clone using the --branch option. By default, the main branch will be cloned.

bgcflow clone --branch v0.8.1 <my BGCFlow folder>

You can also manually grab other releases from the the github repository tags and click on the release/tags of your choice. On the assets, you can click the source code link to download it.

Initiating configuration from template

To run BGCFlow, users need to set up a Snakemake configuration file and the PEP file for each projects. To initiate an example config.yaml from template and run an example project, do:

bgcflow init

You can modify the configuration located in config/config.yaml to point out to the right project PEPs.

projects:
  - name: config/Lactobacillus_delbruecki/project_config.yaml

Running the Lactobacillus tutorial dataset

For the tutorial, we will be running a small dataset of Lactobacillus delbrueckii genomes from NCBI.

You can do a dry-run (checking the workflow without actually running it) using:

bgcflow run -n

When you are ready to run the workflow, remove the -n flag. You can see the full parameters by running bgcflow run --help. For now, let it be and skip to the next section of the tutorial.

Download the latest release/tags

The command bgcflow clone utilizes git to get the latest branch of BGCFlow. You can also grab other releases from the the github repository tags at https://github.com/NBChub/bgcflow/tags and click on the latest release/tags. On the assets, you can click the source code link to download it.

Key Points

A Snakemake environment is required to run BGCflow

bgcflow_wrapper provides shortcut to commonly used Snakemake commands and other tools integrated in BGCFlow

BGCflow project structure

Overview

Teaching: 5 min
Exercises: 0 min

Questions

How do I set my BGCflow project?

Objectives

Setting up configuration and files to start a BGCflow project

Setting up your Project Config

Let’s move into our freshly cloned BGCflow folder

cd bgcflow

To run BGCFlow, we need to activate the BGCFlow conda environment and invoke the command line interface:

conda activate bgcflow
bgcflow --help

Setting up your project samples

In this tutorial, we will analyse 18 publicly available Streptomyces venezuelae genomes. To do that, we will need to prepare a sample.csv file with a list of the genome accession ids. Please find the sample.csv for the tutorial here: https://github.com/NBChub/bgcflow_tutorial/blob/gh-pages/files/samples.csv.

This is how the top of the table looks like:

genome_id	source	organism	genus	species	strain	closest_placement_reference
GCF_000253235.1	ncbi				ATCC 10712
GCF_001592685.1	ncbi				WAC04657
…	…	…	…	..	…	…

There are two required columns that needs to be filled:

genome_id [required]: The genome accession ids (with genome version for ncbi and patric genomes). For custom fasta file provided by users, it should refer to the fasta file names stored in data/raw/fasta/ directory with .fna extension.
source [required]: Source of the genome to be analyzed choose one of the following: custom, ncbi, patric. Where:
- custom : for user provided genomes (.fna) in the data/raw/fasta directory with genome ids as filenames
- ncbi : for list of public genome accession IDs that will be downloaded from the NCBI refseq (GCF…) or genbank (GCA…) database
- patric: for list of public genome accession IDs that will be downloaded from the PATRIC database

The other optional metadata that we can provide are:

organism [optional] : name of the organism that is same as in the fasta header
genus [optional] : genus of the organism. Ideally identified with GTDBtk.
species [optional] : species epithet (the second word in a species name) of the organism. Ideally identified with GTDBtk.
strain [optional] : strain id of the organism
closest_placement_reference [optional]: if known, the closest NCBI genome to the organism. Ideally identified with GTDBtk.

For now, we will only provide the strain information and leave the taxonomic information blank. For publicly available genomes, BGCflow will try to fetch the taxonomic placement from GTDB release 202.

To set up the config, we will create a project folder named s_venezuelae inside the config directory:

mkdir -p config/s_venezuelae
wget -O config/s_venezuelae/samples.csv https://raw.githubusercontent.com/NBChub/bgcflow_tutorial/gh-pages/files/samples.csv

Define your project in the config.yaml

Now that we have our sample file ready, we need to define our project in the config.yaml. An example of the config file can be found in the config/examples/_config_example.yaml. Let’s copy that file and use it as a template:

cp config/examples/_config_example.yaml config/config.yaml

Let’s open the newly created config.yaml. The newly copied config.yaml will have this example project defined:

# Project 1
  - name: example
    samples: config/examples/_genome_project_example/samples.csv
    rules: config/examples/_genome_project_example/project_config.yaml
    prokka-db: config/examples/_genome_project_example/prokka-db.csv
    gtdb-tax: config/examples/_genome_project_example/gtdbtk.bac120.summary.tsv

Replace those lines with this:

# Project 1
  - name: s_venezuelae
    samples: config/s_venezuelae/samples.csv
#    rules: config/examples/_genome_project_example/project_config.yaml
#    prokka-db: config/examples/_genome_project_example/prokka-db.csv
#    gtdb-tax: config/examples/_genome_project_example/gtdbtk.bac120.summary.tsv

For now, we will just use two mandatory variables:

name : name of your project
samples : a csv file containing a list of genome ids for analysis with multiple sources mentioned. Genome ids must be unique.

PS: Commented lines will not be processed.

Checking the jobs with Snakemake dry-run

Before, we use the -n when calling snakemake. This parameters means a dry-run, which is used to simulate all the jobs that will be run given the information provided in the config file. You can also add the parameters -r to see the reason why those jobs are generated.

Let’s do a dry run to check if our project is successfully configured:

snakemake -n -r

Challenges

How many jobs will be generated?

How many antismash jobs will be generated?

Why do each jobs only run with 1 core?

What do you think localrule all does?

Key Points

A project requires a PEP configuration and a .csv file containing a list of genomes to analyse

The project is then defined in the config.yaml file

An example of a project config can be found in config/examples

Selecting rules for analysis

Overview

Teaching: 5 min
Exercises: 0 min

Questions

How do I set which rules to run?

Objectives

Getting able to select specific BGCflow rules to run

Global Rules

In the previous tutorial, when doing a dry-run, Snakemake will generate this jobs:

Job stats:
job                          count    min threads    max threads
-------------------------  -------  -------------  -------------
all                              1              1              1
antismash                       18              1              1
antismash_db_setup               1              1              1
antismash_summary                1              1              1
downstream_bgc_prep              1              1              1
extract_meta_prokka             18              1              1
extract_ncbi_information         1              1              1
fix_gtdb_taxonomy                1              1              1
format_gbk                      18              1              1
gtdb_prep                       18              1              1
ncbi_genome_download            18              1              1
prokka                          18              1              1
write_dependency_versions        1              1              1
total                          115              1              1

In BGCflow, rules are defined globally in the config.yaml. Rules can be turned on or off by setting the value for each specific rules with TRUE or FALSE.

#### GLOBAL RULE CONFIGURATION ####
# This section configures the rules to run globally.
# Use project specific rule configurations if you want to run different rules for each projects.
# rules: set value to TRUE if you want to run the analysis or FALSE if you don't
rules:
  seqfu: FALSE
  mlst: FALSE
  refseq-masher: FALSE
  mash: FALSE
  fastani: FALSE
  checkm: FALSE
  prokka-gbk: FALSE
  antismash-summary: TRUE
...

Notice that the only rules set to TRUE are antismash-summary. So why does Snakemake generate a total of 115 jobs? That is because to generate the final output of antismash-summary, other job outputs are required. Snakemake will trace back all the file dependency of the final output and call every jobs require to generate that file.

The rule setting in the config.yaml applies globally. So if you several projects that require the same analysis, the rule setting will be applied to all those projects.

Project specific rule

What if I want to run different analysis for each projects? For this application, you can override the global rule setting by giving each project a .yaml file containing project specific rule setting. The file contents is exactly the same as the global rule configuration:

#### RULE CONFIGURATION ####
# rules: set value to TRUE if you want to run the analysis or FALSE if you don't
rules:
  seqfu: TRUE
  mlst: FALSE
  refseq-masher: FALSE
  mash: FALSE
  fastani: FALSE
  checkm: FALSE
  prokka-gbk: FALSE
  antismash-summary: FALSE
  antismash-zip: FALSE
  query-bigslice: FALSE
  bigscape: FALSE
  bigslice: FALSE
  automlst-wrapper: FALSE
  arts: FALSE
  roary: FALSE
  eggnog: FALSE
  eggnog-roary: FALSE
  diamond: FALSE
  diamond-roary: FALSE
  deeptfactor: FALSE
  deeptfactor-roary: FALSE
  cblaster-genome: FALSE
  cblaster-bgc: FALSE

In this rule, we only set to run seqfu, which will quickly give us a summary statistics of our genome sequences. Copy and paste the rule setting above to a new file called project_config.yaml, and save it in the project config folder: config/s_venezuelae/project_config.yaml.

Now, we need to edit the config.yaml so that BGCflow knows that we would like to override the global rule configuration for our s_venezuelae project. Replace the rules values with this:

# Project 1
  - name: s_venezuelae
    samples: config/s_venezuelae/samples.csv
    rules: config/s_venezuelae/project_config.yaml
#    prokka-db: config/examples/_genome_project_example/prokka-db.csv
#    gtdb-tax: config/examples/_genome_project_example/gtdbtk.bac120.summary.tsv

Let’s do a dry-run again:

snakemake -n -r

Challenges

How many jobs will be generated?

Do we succeed in overriding the global rule settting?

Key Points

BGCflow rules can be selected by giving a TRUE value to the rules config

Global rules applied to all projects and defined in the config.yaml

Project specific rules can be given to each project and will override the global rule

Project specific rules are written as a separate .yaml file

BGCflow data structure

Overview

Teaching: 0 min
Exercises: 5 min

Questions

Where can I find the analysis result of my run?

Objectives

Finding relevant BGCflow analysis result

Exploring the output of BGCflow

In this workshop, we have finished running all analysis for our s_venezuelae project. You can find the result in the VM at: /datadrive/bgcflow/data.

tree -L 2 /datadrive/bgcflow/data/

You can also generate a symlink to that directory so you can explore it using VS Code:

tree -L 3 /datadrive/bgcflow/data/

BGCflow adopt the cookiecutter data science directory structure. Output files can be found in the data directory, and are splitted into three different stages:

The processed directory contained most of the output required for downstream analysis
The interim directory contained a direct output from each rules that are not fined tuned for downstream analysis
The raw directory contained user provided input files

Give yourself time to look through the different output directories.

Which files are important?

It depends. Different research questions will require different analysis, and therefore different files are required for downstream analysis. In the Natural Products Genome Mining group, we aim to aid students and researchers by giving an example of Jupyter notebooks to process each output types. It is a work in progress and we are open to anyone who would like to contribute.

In the next session, I will give an example to do exploratory data analysis on BiG-SLICE query result against the BiG-FAM database.

Key Points

BGCflow adopt the cookiecutter data science directory structure

Output files can be found in the data directory, and are splitted into three different stages

The processed directory contained most of the output required for downstream analysis

Part I: Exploring BiG-SLICE query result

Overview

Teaching: 0 min
Exercises: 20 min

Questions

How do I explore BiG-SLICE query result?

Objectives

Visualize query hits to BiG-FAM GCF models with Cytoscape

Exploring BiG-SLICE Query result

In this episode, we will explore BiG-SLICE query hits of S. venezuelae genomes with the BiG-FAM database (version 1.0.0, run 6). First, let us grab the BiG-SLICE query result from bgcflow run.

If you haven’t done so, generate a symlink to the bgcflow run on the VM:

ln -s /datadrive/bgcflow bgcflow

Open up this folder /home/bgcflow/datadrive/bgcflow/data/processed/s_venezuelae/bigslice/query_as_6.0.1 and download the query_network.csv and gcf_summary.csv to your local computer.

Using Cytoscape, load and visualize your BiG-FAM hits following this video:

1. Importing network and annotation from tables

2. Filtering top n hits to disentagle the network

Enriching the annotation with other results from BGCflow run

As we can see, without further annotation about the BGC information, the network does not give us much information. In the next step, we will combine summary tables from two other BGCflow outputs:

BiG-SCAPE
GTDB taxonomy

We have prepared a jupyter notebook table that will generate the annotation required to enrich our network.

Getting Jupyter up and running

On your home directory, create a folder called s_venezuelae_tutorial/notebook. You can then download the .ipynb file of this episode and run it from the directory you just made.

mkdir -p s_venezuelae_tutorial/notebook
wget -O s_venezuelae_tutorial/notebook/05-bigslice_query.ipynb {link to .ipynb file}

If you’re using the VM for the workshop, activate the bgc_analytics conda environment by:

conda activate bgc_analytics

Then, run jupyter lab with:

jupyter lab --no-browser

VS code will forward a link to the jupyter session that you can open in your local machine.

Go to your notebook directory and start exploring the notebook to generate the table

Key Points

BGCflow returns an edge table of your BGC query to the top 10 hits of GCF models in the BiG-FAM database

Exploring BiG-SLICE query result

Overview

Teaching: 0 min
Exercises: 20 min

Questions

How do I annotate BiG-SLICE query result?

Objectives

Enrich BiG-FAM hits with other BGCflow results

In this episode, we will explore BiG-SLICE query hits of S. venezuelae genomes with the BiG-FAM database (version 1.0.0, run 6). You can download the .ipynb file of this episode and run it from your own directory.

BGCflow Paths Configuration
Raw BiG-SLICE query hits
A glimpse of the data distribution
Annotate Network with information from BiG-SCAPE and GTDB
Import the annotation to Cytoscape

Libraries & Functions

# load libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
from pathlib import Path
import json

# generate data
def gcf_hits(df_gcf, cutoff=900):
    """
    Filter bigslice result based on distance threshold to model and generate data.
    """
    mask = df_gcf.loc[:, "membership_value"] <= cutoff
    df_gcf_filtered = df_gcf[mask]
    bgcs = df_gcf_filtered.bgc_id.unique()
    gcfs = df_gcf_filtered.gcf_id.unique()
    print(
    f"""BiG-SLICE query with BiG-FAM run 6, distance cutoff {cutoff}
Number of bgc hits : {len(bgcs)}/{len(df_gcf.bgc_id.unique())}
Number of GCF hits : {len(gcfs)}""")
    return df_gcf_filtered

# visualization
def plot_overview(data):
    """
    Plot BGC hits distribution from BiG-SLICE Query
    """
    ranks = data.gcf_id.value_counts().index
    sns.set_theme()
    sns.set_context("paper")
    fig, axes = plt.subplots(1, 2, figsize=(25, 20))
    plt.figure(figsize = (25,25))

    # first plot
    sns.histplot(data=data, x='membership_value', 
                 kde=True, ax=axes[0],)

    # second plot
    sns.boxplot(data=data, y='gcf_id', x='membership_value', 
                orient='h', ax=axes[1], order=ranks)

    # Add in points to show each observation
    sns.stripplot(x="membership_value", y="gcf_id", data=data,
                  jitter=True, size=3, linewidth=0, color=".3", 
                  ax=axes[1], orient='h', order=ranks)
    return

BGCflow Paths Configuration

Customize the cell below to your BGCflow result paths

# interim data
bigslice_query = Path("/datadrive/home/matinnu/bgcflow_data/interim/bigslice/query/s_venezuelae_antismash_6.0.1/")

# processed data
bigslice_query_processed = Path("/datadrive/bgcflow/data/processed/s_venezuelae/bigslice/query_as_6.0.1/")
bigscape_result = Path("/datadrive/home/matinnu/bgcflow_data/processed/s_venezuelae/bigscape/for_cytoscape_antismash_6.0.1/2022-06-21 02_46_35_df_clusters_0.30.csv")
gtdb_table = Path("/datadrive/home/matinnu/bgcflow_data/processed/s_venezuelae/tables/df_gtdb_meta.csv")

# output path
output_path = Path("../tables/bigslice")
output_path.mkdir(parents=True, exist_ok=True)

Raw BiG-SLICE query hits

First, let’s see the raw data from BiG-SLICE query hits. We have extracted individual tables from the SQL database.

! tree /datadrive/bgcflow/data/interim/bigslice/query/s_venezuelae_antismash_6.0.1/

[01;34m/datadrive/bgcflow/data/interim/bigslice/query/s_venezuelae_antismash_6.0.1/[00m
├── 1.db
├── bgc.csv
├── bgc_class.csv
├── bgc_features.csv
├── cds.csv
├── gcf_membership.csv
├── hsp.csv
├── hsp_alignment.csv
├── hsp_subpfam.csv
├── schema.csv
└── sqlite_sequence.csv

0 directories, 11 files

We will first look at these two tables:

bgc table or the file bgc.csv
gcf_membership table or the file gcf_membership.csv

The data from the interim folder has been processed for downstream analysis in the processed directory:

! tree /datadrive/bgcflow/data/processed/s_venezuelae/bigslice/query_as_6.0.1/

[01;34m/datadrive/bgcflow/data/processed/s_venezuelae/bigslice/query_as_6.0.1/[00m
├── gcf_summary.csv
├── gcf_summary.json
└── query_network.csv

0 directories, 3 files

# load the two tables
df_bgc = pd.read_csv(bigslice_query / "bgc.csv")
# BGC table from bigslice
df_bgc.loc[:, 'genome_id'] = [Path(i).name for i in df_bgc.orig_folder] # will be put in the main bgcflow code
# GCF membership table from bigslice
df_gcf_membership = pd.read_csv(bigslice_query / "gcf_membership.csv")

The gcf_membership table lists the top 10 closest BiG-FAM GCF models (order shown as rank column) and the euclidean distance to the model (membership_value). Smaller membership_value means that our BGC has a closer or more similar features with the models. Note that we are querying against run 6 in the BiG-FAM model, with threshold of 900, Therefore, if the membership_value is above 900, it is less likely that our BGC belongs to that gcf model.

df_gcf_membership.head()

	gcf_id	bgc_id	membership_value	rank
0	203601	448	669	0
1	208355	448	798	1
2	200946	448	826	2
3	207795	448	882	3
4	213140	448	885	4

The bgc_id column in gcf_membership table refers to the id column in bgc table. Therefore, we can enrich our hits with the metadata contained in bgc table

df_bgc.head()

	id	name	type	length_nt	orig_folder	orig_filename	genome_id
0	1	data/interim/bgcs/s_venezuelae/6.0.1/GCF_00863...	as6	29650	data/interim/bgcs/s_venezuelae/6.0.1/GCF_00863...	NZ_CP029197.1.region004.gbk	GCF_008639165.1
1	2	data/interim/bgcs/s_venezuelae/6.0.1/GCF_00870...	as6	11401	data/interim/bgcs/s_venezuelae/6.0.1/GCF_00870...	NZ_CP029195.1.region011.gbk	GCF_008705255.1
2	3	data/interim/bgcs/s_venezuelae/6.0.1/GCF_00025...	as6	10390	data/interim/bgcs/s_venezuelae/6.0.1/GCF_00025...	NC_018750.1.region027.gbk	GCF_000253235.1
3	4	data/interim/bgcs/s_venezuelae/6.0.1/GCF_02104...	as6	25706	data/interim/bgcs/s_venezuelae/6.0.1/GCF_02104...	NZ_JAJNOJ010000002.1.region009.gbk	GCF_021044745.1
4	5	data/interim/bgcs/s_venezuelae/6.0.1/GCF_00929...	as6	20390	data/interim/bgcs/s_venezuelae/6.0.1/GCF_00929...	NZ_CP023693.1.region020.gbk	GCF_009299385.1

A glimpse of the data distribution

Sanity Check: How many gene clusters predicted by antiSMASH?

print(f"There are {len(df_bgc)} BGCs predicted from {len(df_bgc.orig_folder.unique())} genomes.")

There are 515 BGCs predicted from 18 genomes.

How many BGCs can be assigned to BiG-FAM gene cluster families?

BiG-SLICE calculate the feature distance of a BGC to BiG-FAM models (which is a centroid of each Gene Cluster Families generated from 1.2 million BGCs). Though it’s not that accurate, it can give us a glimpse of the distribution of our BGCs within the database.

sns.set_theme()
sns.set_context("paper")
sns.histplot(df_gcf_membership, x='membership_value', kde=True)
for c in [900, 1200, 1500]:
    gcf_hits(df_gcf_membership, c)

BiG-SLICE query with BiG-FAM run 6, distance cutoff 900
Number of bgc hits : 382/515
Number of GCF hits : 114
BiG-SLICE query with BiG-FAM run 6, distance cutoff 1200
Number of bgc hits : 487/515
Number of GCF hits : 220
BiG-SLICE query with BiG-FAM run 6, distance cutoff 1500
Number of bgc hits : 504/515
Number of GCF hits : 351

png

Depending on the distance cutoffs, we can assign our BGCs to a different numbers of GCF model. The default cutoffs is 900 (run 6). In our data, 382 out of 515 BGCs can be assigned to 114 BiG-FAM GCF. Do note that the number of assigned GCF can be smaller if we only consider the first hit (the query returns 10 hits).

Smaller number means a closer distance to the GCF model. For further analysis, we will stick with the default cutoff.

A closer look to BiG-FAM distributions

BGCflow already cleans the data for downstream processing. The processed bigslice query can be found in bgcflow/data/processed/s_venezuelae/bigslice/query_as_6.0.1/.

data = pd.read_csv(bigslice_query_processed / "query_network.csv")
plot_overview(data)

png

<Figure size 1800x1800 with 0 Axes>

On the figure above, we can see the distance distribution of our query to the model, and how each models have varying degree of distances. But, this data includes the top 10 hits, so 1 BGCs can be assigned to multiple GCFs.

Let’s see again only for the first hit.

n_hits = 1 # get only the first hit
n_hits_only = data.loc[:, "rank"].isin(np.arange(n_hits))
data_1 = data[n_hits_only]
print(f"For the top {n_hits} hit, {len(data_1.bgc_id.unique())} BGCs can be mapped to {len(data_1.gcf_id.unique())} GCF")
plot_overview(data_1)

For the top 1 hit, 382 BGCs can be mapped to 83 GCF

png

<Figure size 1800x1800 with 0 Axes>

Annotate Network with information from BiG-SCAPE and GTDB

These network will only be meaningful when we enrich it with metadata. We can use information from our BiG-SCAPE runs, taxonomic information from GTDB-tk, and other tables generated by BGCflow.

# Enrich with BiG-SCAPE
df_annotation = pd.read_csv(bigscape_result, index_col=0)
df_annotation.loc[:, "bigslice_query"] = "query"
for i in data["gcf_id"].unique():
    df_annotation.loc[i, "bigslice_query"] = "model"

# enrich with GTDB
df_gtdb = pd.read_csv(gtdb_table).set_index("genome_id")
for i in df_annotation.index:
    genome_id = df_annotation.loc[i, "genome_id"]
    if type(genome_id) == str:
        for item in ["gtdb_release", "Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species", "Organism"]:
            df_annotation.loc[i, item] = df_gtdb.loc[genome_id, item]

# enrich with bgc info - will be put in the main bgcflow code
bgc_info = df_bgc.copy()
bgc_info["bgc_id"] = [str(i).replace(".gbk", "") for i in bgc_info["orig_filename"]]
bgc_info = bgc_info.set_index("bgc_id")
for i in df_annotation.index:
    try:
        df_annotation.loc[i, "on_contig_edge"] = bgc_info.loc[i, "on_contig_edge"]
        df_annotation.loc[i, "length_nt"] = bgc_info.loc[i, "length_nt"]
    except KeyError:
        pass

df_annotation.head()

	product	bigscape_class	genome_id	accn_id	gcf_0.30	gcf_0.40	gcf_0.50	Clan Number	fam_id_0.30	fam_type_0.30	...	Domain	Phylum	Class	Order	Family	Genus	Species	Organism	length_nt
NC_018750.1.region001	ectoine	Others	GCF_000253235.1	NC_018750.1	1911.0	1911.0	1911.0	2045.0	2.0	known_family	...	d__Bacteria	p__Actinobacteriota	c__Actinomycetia	o__Streptomycetales	f__Streptomycetaceae	g__Streptomyces	venezuelae	s__Streptomyces venezuelae	10417.0
NC_018750.1.region002	terpene	Terpene	GCF_000253235.1	NC_018750.1	2109.0	2109.0	2109.0	NaN	14.0	unknown_family	...	d__Bacteria	p__Actinobacteriota	c__Actinomycetia	o__Streptomycetales	f__Streptomycetaceae	g__Streptomyces	venezuelae	s__Streptomyces venezuelae	20954.0
NC_018750.1.region003	T3PKS.NRPS.NRPS-like.T1PKS	PKS-NRP_Hybrids	GCF_000253235.1	NC_018750.1	1913.0	1913.0	1913.0	NaN	22.0	known_family	...	d__Bacteria	p__Actinobacteriota	c__Actinomycetia	o__Streptomycetales	f__Streptomycetaceae	g__Streptomyces	venezuelae	s__Streptomyces venezuelae	99490.0
NC_018750.1.region004	terpene.lanthipeptide-class-ii	Others	GCF_000253235.1	NC_018750.1	2111.0	2111.0	2111.0	36.0	20.0	unknown_family	...	d__Bacteria	p__Actinobacteriota	c__Actinomycetia	o__Streptomycetales	f__Streptomycetaceae	g__Streptomyces	venezuelae	s__Streptomyces venezuelae	29652.0
NC_018750.1.region005	lanthipeptide-class-iv	RiPPs	GCF_000253235.1	NC_018750.1	2399.0	2399.0	2399.0	552.0	17.0	known_family	...	d__Bacteria	p__Actinobacteriota	c__Actinomycetia	o__Streptomycetales	f__Streptomycetaceae	g__Streptomyces	venezuelae	s__Streptomyces venezuelae	22853.0

5 rows × 23 columns

df_annotation.to_csv("../tables/bigslice/enriched_query_annotation.csv")

Import the annotation to Cytoscape

Download the enriched_query_annotation.csv and import it to enrich the nodes in Cytoscape network. Using the new annotations, play around and explore the network to find interesting BGCs and their BiG-FAM models.

Key Points

Different BGCflow outputs can be combined to enrich BiG-SLICE query network

Exploring BGCFlow result with MKDocs

Overview

Teaching: 5 min
Exercises: 0 min

Questions

How do I look at BGCFlow results?

Objectives

Starting an mkdocs server of BGCFlow results

Exploring and using BGCFlow Jupyter Notebooks template

Pre-requisite

Activate the BGCFlow conda environment (see Part 1 - Introduction)
```
conda activate bgcflow
```

Getting a demo report

Download an example report from Zenodo

wget -O Lactobacillus_delbrueckii_report.zip https://zenodo.org/record/7740349/files/Lactobacillus_delbrueckii_report.zip?download=1
unzip Lactobacillus_delbrueckii_report.zip

Starting BGCFlow mkdocs server

Move to the report directory and use bgcflow serve command to start up a http file server and mkdocs:

cd Lactobacillus_delbrueckii_report/Lactobacillus_delbrueckii
bgcflow serve --project Lactobacillus_delbrueckii

The report can be accessed in http://localhost:8001/

Key Points

For each BGCFlow projects, the result is structured as an interactive Markdown documentation that can be modified

Reports are made using Jupyter notebooks running on Python or R environments

Exploring BGCFlow result database using Metabase

Overview

Teaching: 5 min
Exercises: 0 min

Questions

How do I explore and query tabular data from BGCFlow result?

Objectives

Start a local Metabase server

Loadi BGCFlow OLAP database (DuckDB) in metabase

Pre-requisite

Activate the BGCFlow conda environment and make sure you have a clone of BGCFlow installed locally (see Part 1 - Introduction)
```
cd <your BGCFlow path>
conda activate bgcflow
```
Download and extract a BGCFlow project demo (see Part 7)

Starting Metabase server

Make sure you are in your bgcflow directory. Metabase can then be started using:
```
bgcflow serve --metabase
```
Metabase will be launched in locally in http://localhost:3000/

Setting a local Metabase account

Follow the steps to set up your account
Choose DuckDB as the database type and give the DuckDB Path: /home/<my_user_name>/Lactobacillus_delbrueckii_report/Lactobacillus_delbrueckii/dbt_as_6.1.1/

Key Points

All of the tabular ouputs from BGCFlow are loaded and transformed into an OLAP database (DuckDB)

A business intelligence tools such as Metabase, can be used to interactively explore the database and build SQL queries

Customizing BGCFlow config for analyzing fungal genomes

Overview

Teaching: 5 min
Exercises: 0 min

Questions

How can I run BGCFlow for fungal genomes using a custom configuration?

Objectives

Understand how to customize BGCFlow configuration for fungal genomes

Learn how to run BGCFlow with a specific configuration for fungal genome analysis

Pre-requisite

Activate the BGCFlow conda environment and make sure you have a clone of BGCFlow installed locally (see Part 1 - Introduction)

cd <your BGCFlow path>
git checkout dev-0.9.0-1
conda activate bgcflow

Make sure to have the bgcflow_wrapper version 0.3.5 or above:

bgcflow --version
bgcflow_wrapper, version 0.3.5

Set up Fungi Project Configuration

Inside the config folder, create a new project folder named fungi, the final project configuration will look like this:

config/
├── fungi
│   ├── project_config.yaml # BGCFlow project config
│   └── samples.csv # a csv files listing all your samples
└── config.yaml # BGCFlow global config

Create the project_config.yaml and edit the contain to have this values:

name: Fungi
pep_version: 2.1.0
description: "Example of a fungi project."
input_folder: "."
input_type: "gbk" # BGCFlow does not have annotation tools for eukaryotes, so we will use the genbank from NCBI
sample_table: samples.csv

#### RULE CONFIGURATION ####
# rules: set value to TRUE if you want to run the analysis or FALSE if you don't
rules:
  seqfu: TRUE
  mash: TRUE
  fastani: FALSE
  checkm: FALSE
  gtdbtk: FALSE
  prokka-gbk: FALSE
  antismash: TRUE
  query-bigslice: TRUE
  bigscape: TRUE
  bigslice: TRUE
  automlst-wrapper: FALSE
  arts: FALSE
  roary: FALSE
  eggnog: TRUE
  eggnog-roary: FALSE
  deeptfactor: FALSE
  deeptfactor-roary: FALSE
  cblaster-genome: TRUE
  cblaster-bgc: TRUE

Please note that some tools are not designed for for analysing fungal genomes.
Create or edit samples.csv to have this values:

genome_id,source,organism,genus,species,strain,closest_placement_reference,input_file
GCA_014117465.1,ncbi,,,,,,
GCA_014784225.2,ncbi,,,,,,
GCA_030515275.1,ncbi,,,,,,
GCA_014117485.1,ncbi,,,,,,

Update the global configuration

Now that we have our project ready, we need to update the global configuration so it is registered as one of the BGCFlow project to run
Edit the config/config.yaml file to include only the fungal project to run:

# This file should contain everything to configure the workflow on a global scale.

#### PROJECT INFORMATION ####
# This section control your project configuration.
# Each project are separated by "-".
# A project can be defined as (1) a yaml object or (2) a Portable Encapsulated Project (PEP) file.
# (1) To define project as a yaml object, it must contain the variable "name" and "samples".
#   - name : name of your project
#   - samples : a csv file containing a list of genome ids for analysis with multiple sources mentioned. Genome ids must be unique.
#   - rules: a yaml file containing project rule configurations. This will override global rule configuration.
#   - prokka-db (optional): list of the custom accessions to use as prokka reference database.
#   - gtdb-tax (optional): output summary file of GTDB-tk with "user_genome" and "classification" as the two minimum columns
# (2) To define project using PEP file, only variable "name" should be given that points to the location of the PEP yaml file.
#   - pep: path to PEP .yaml file. See project example_pep for details.
# PS: the variable pep and name is an alias

projects:
# Project 2 (PEP file)
  - pep: config/fungi/project_config.yaml

bgc_projects:
  - pep: config/lanthipeptide_lactobacillus/project_config.yaml
...

Running the main BGCFlow pipeline

Run the workflow by changing the config by:

snakemake --use-conda -c 8 --config "taxonomic_mode=fungi"

Key Points

BGCFlow can be customized for analyzing fungal genomes using specific configurations

Running BGCFlow with these configurations allows for targeted analysis of fungal genomes

BGCflow Tutorial

Introduction

Overview

Before you start

Access to a Linux Environment

Conda or Mamba Package Manager

Installing BGCFlow environments

Creating a Conda Environment

Installing the BGCFlow wrapper

Additional pre-requisites

Deploying a local copy of BGCFlow

Cloning BGCFlow Repository

Initiating configuration from template

Running the Lactobacillus tutorial dataset

Download the latest release/tags

Key Points

BGCflow project structure

Overview

Setting up your Project Config

Setting up your project samples

Define your project in the config.yaml

Checking the jobs with Snakemake dry-run

Challenges

Key Points

Selecting rules for analysis

Overview

Global Rules

Project specific rule

Challenges

Key Points

BGCflow data structure

Overview

Exploring the output of BGCflow

Which files are important?

Key Points

Part I: Exploring BiG-SLICE query result

Overview

Exploring BiG-SLICE Query result

1. Importing network and annotation from tables

2. Filtering top n hits to disentagle the network

Enriching the annotation with other results from BGCflow run

Getting Jupyter up and running

Key Points

Exploring BiG-SLICE query result

Overview

Table of Contents

Libraries & Functions

BGCflow Paths Configuration

Raw BiG-SLICE query hits

A glimpse of the data distribution

Sanity Check: How many gene clusters predicted by antiSMASH?

How many BGCs can be assigned to BiG-FAM gene cluster families?

A closer look to BiG-FAM distributions

Annotate Network with information from BiG-SCAPE and GTDB

Import the annotation to Cytoscape

Key Points

Exploring BGCFlow result with MKDocs

Overview

Pre-requisite

Getting a demo report

Starting BGCFlow mkdocs server

Key Points

Exploring BGCFlow result database using Metabase

Overview

Pre-requisite

Starting Metabase server

Setting a local Metabase account

Key Points

Customizing BGCFlow config for analyzing fungal genomes

Overview

Pre-requisite

Set up Fungi Project Configuration

Update the global configuration

Running the main BGCFlow pipeline

Key Points