This lesson is in the early stages of development (Alpha version)

Exploring BiG-SLICE query result

Overview

Teaching: 0 min
Exercises: 20 min
Questions
  • How do I annotate BiG-SLICE query result?

Objectives
  • Enrich BiG-FAM hits with other BGCflow results

In this episode, we will explore BiG-SLICE query hits of S. venezuelae genomes with the BiG-FAM database (version 1.0.0, run 6). You can download the .ipynb file of this episode and run it from your own directory.

Table of Contents

  1. BGCflow Paths Configuration
  2. Raw BiG-SLICE query hits
  3. A glimpse of the data distribution
  4. Annotate Network with information from BiG-SCAPE and GTDB
  5. Import the annotation to Cytoscape

Libraries & Functions

# load libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
from pathlib import Path
import json
# generate data
def gcf_hits(df_gcf, cutoff=900):
    """
    Filter bigslice result based on distance threshold to model and generate data.
    """
    mask = df_gcf.loc[:, "membership_value"] <= cutoff
    df_gcf_filtered = df_gcf[mask]
    bgcs = df_gcf_filtered.bgc_id.unique()
    gcfs = df_gcf_filtered.gcf_id.unique()
    print(
    f"""BiG-SLICE query with BiG-FAM run 6, distance cutoff {cutoff}
Number of bgc hits : {len(bgcs)}/{len(df_gcf.bgc_id.unique())}
Number of GCF hits : {len(gcfs)}""")
    return df_gcf_filtered

# visualization
def plot_overview(data):
    """
    Plot BGC hits distribution from BiG-SLICE Query
    """
    ranks = data.gcf_id.value_counts().index
    sns.set_theme()
    sns.set_context("paper")
    fig, axes = plt.subplots(1, 2, figsize=(25, 20))
    plt.figure(figsize = (25,25))

    # first plot
    sns.histplot(data=data, x='membership_value', 
                 kde=True, ax=axes[0],)

    # second plot
    sns.boxplot(data=data, y='gcf_id', x='membership_value', 
                orient='h', ax=axes[1], order=ranks)

    # Add in points to show each observation
    sns.stripplot(x="membership_value", y="gcf_id", data=data,
                  jitter=True, size=3, linewidth=0, color=".3", 
                  ax=axes[1], orient='h', order=ranks)
    return

BGCflow Paths Configuration

Customize the cell below to your BGCflow result paths

# interim data
bigslice_query = Path("/datadrive/home/matinnu/bgcflow_data/interim/bigslice/query/s_venezuelae_antismash_6.0.1/")

# processed data
bigslice_query_processed = Path("/datadrive/bgcflow/data/processed/s_venezuelae/bigslice/query_as_6.0.1/")
bigscape_result = Path("/datadrive/home/matinnu/bgcflow_data/processed/s_venezuelae/bigscape/for_cytoscape_antismash_6.0.1/2022-06-21 02_46_35_df_clusters_0.30.csv")
gtdb_table = Path("/datadrive/home/matinnu/bgcflow_data/processed/s_venezuelae/tables/df_gtdb_meta.csv")

# output path
output_path = Path("../tables/bigslice")
output_path.mkdir(parents=True, exist_ok=True)

Raw BiG-SLICE query hits

First, let’s see the raw data from BiG-SLICE query hits. We have extracted individual tables from the SQL database.

! tree /datadrive/bgcflow/data/interim/bigslice/query/s_venezuelae_antismash_6.0.1/
/datadrive/bgcflow/data/interim/bigslice/query/s_venezuelae_antismash_6.0.1/
├── 1.db
├── bgc.csv
├── bgc_class.csv
├── bgc_features.csv
├── cds.csv
├── gcf_membership.csv
├── hsp.csv
├── hsp_alignment.csv
├── hsp_subpfam.csv
├── schema.csv
└── sqlite_sequence.csv

0 directories, 11 files

We will first look at these two tables:

The data from the interim folder has been processed for downstream analysis in the processed directory:

! tree /datadrive/bgcflow/data/processed/s_venezuelae/bigslice/query_as_6.0.1/
/datadrive/bgcflow/data/processed/s_venezuelae/bigslice/query_as_6.0.1/
├── gcf_summary.csv
├── gcf_summary.json
└── query_network.csv

0 directories, 3 files
# load the two tables
df_bgc = pd.read_csv(bigslice_query / "bgc.csv")
# BGC table from bigslice
df_bgc.loc[:, 'genome_id'] = [Path(i).name for i in df_bgc.orig_folder] # will be put in the main bgcflow code
# GCF membership table from bigslice
df_gcf_membership = pd.read_csv(bigslice_query / "gcf_membership.csv")

The gcf_membership table lists the top 10 closest BiG-FAM GCF models (order shown as rank column) and the euclidean distance to the model (membership_value). Smaller membership_value means that our BGC has a closer or more similar features with the models. Note that we are querying against run 6 in the BiG-FAM model, with threshold of 900, Therefore, if the membership_value is above 900, it is less likely that our BGC belongs to that gcf model.

df_gcf_membership.head()
gcf_id bgc_id membership_value rank
0 203601 448 669 0
1 208355 448 798 1
2 200946 448 826 2
3 207795 448 882 3
4 213140 448 885 4

The bgc_id column in gcf_membership table refers to the id column in bgc table. Therefore, we can enrich our hits with the metadata contained in bgc table

df_bgc.head()
id name type on_contig_edge length_nt orig_folder orig_filename genome_id
0 1 data/interim/bgcs/s_venezuelae/6.0.1/GCF_00863... as6 0 29650 data/interim/bgcs/s_venezuelae/6.0.1/GCF_00863... NZ_CP029197.1.region004.gbk GCF_008639165.1
1 2 data/interim/bgcs/s_venezuelae/6.0.1/GCF_00870... as6 0 11401 data/interim/bgcs/s_venezuelae/6.0.1/GCF_00870... NZ_CP029195.1.region011.gbk GCF_008705255.1
2 3 data/interim/bgcs/s_venezuelae/6.0.1/GCF_00025... as6 0 10390 data/interim/bgcs/s_venezuelae/6.0.1/GCF_00025... NC_018750.1.region027.gbk GCF_000253235.1
3 4 data/interim/bgcs/s_venezuelae/6.0.1/GCF_02104... as6 0 25706 data/interim/bgcs/s_venezuelae/6.0.1/GCF_02104... NZ_JAJNOJ010000002.1.region009.gbk GCF_021044745.1
4 5 data/interim/bgcs/s_venezuelae/6.0.1/GCF_00929... as6 0 20390 data/interim/bgcs/s_venezuelae/6.0.1/GCF_00929... NZ_CP023693.1.region020.gbk GCF_009299385.1

A glimpse of the data distribution

Sanity Check: How many gene clusters predicted by antiSMASH?

print(f"There are {len(df_bgc)} BGCs predicted from {len(df_bgc.orig_folder.unique())} genomes.")
There are 515 BGCs predicted from 18 genomes.

How many BGCs can be assigned to BiG-FAM gene cluster families?

BiG-SLICE calculate the feature distance of a BGC to BiG-FAM models (which is a centroid of each Gene Cluster Families generated from 1.2 million BGCs). Though it’s not that accurate, it can give us a glimpse of the distribution of our BGCs within the database.

sns.set_theme()
sns.set_context("paper")
sns.histplot(df_gcf_membership, x='membership_value', kde=True)
for c in [900, 1200, 1500]:
    gcf_hits(df_gcf_membership, c)
BiG-SLICE query with BiG-FAM run 6, distance cutoff 900
Number of bgc hits : 382/515
Number of GCF hits : 114
BiG-SLICE query with BiG-FAM run 6, distance cutoff 1200
Number of bgc hits : 487/515
Number of GCF hits : 220
BiG-SLICE query with BiG-FAM run 6, distance cutoff 1500
Number of bgc hits : 504/515
Number of GCF hits : 351

png

Depending on the distance cutoffs, we can assign our BGCs to a different numbers of GCF model. The default cutoffs is 900 (run 6). In our data, 382 out of 515 BGCs can be assigned to 114 BiG-FAM GCF. Do note that the number of assigned GCF can be smaller if we only consider the first hit (the query returns 10 hits).

Smaller number means a closer distance to the GCF model. For further analysis, we will stick with the default cutoff.

A closer look to BiG-FAM distributions

BGCflow already cleans the data for downstream processing. The processed bigslice query can be found in bgcflow/data/processed/s_venezuelae/bigslice/query_as_6.0.1/.

data = pd.read_csv(bigslice_query_processed / "query_network.csv")
plot_overview(data)

png

<Figure size 1800x1800 with 0 Axes>

On the figure above, we can see the distance distribution of our query to the model, and how each models have varying degree of distances. But, this data includes the top 10 hits, so 1 BGCs can be assigned to multiple GCFs.

Let’s see again only for the first hit.

n_hits = 1 # get only the first hit
n_hits_only = data.loc[:, "rank"].isin(np.arange(n_hits))
data_1 = data[n_hits_only]
print(f"For the top {n_hits} hit, {len(data_1.bgc_id.unique())} BGCs can be mapped to {len(data_1.gcf_id.unique())} GCF")
plot_overview(data_1)
For the top 1 hit, 382 BGCs can be mapped to 83 GCF

png

<Figure size 1800x1800 with 0 Axes>

Annotate Network with information from BiG-SCAPE and GTDB

These network will only be meaningful when we enrich it with metadata. We can use information from our BiG-SCAPE runs, taxonomic information from GTDB-tk, and other tables generated by BGCflow.

# Enrich with BiG-SCAPE
df_annotation = pd.read_csv(bigscape_result, index_col=0)
df_annotation.loc[:, "bigslice_query"] = "query"
for i in data["gcf_id"].unique():
    df_annotation.loc[i, "bigslice_query"] = "model"

# enrich with GTDB
df_gtdb = pd.read_csv(gtdb_table).set_index("genome_id")
for i in df_annotation.index:
    genome_id = df_annotation.loc[i, "genome_id"]
    if type(genome_id) == str:
        for item in ["gtdb_release", "Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species", "Organism"]:
            df_annotation.loc[i, item] = df_gtdb.loc[genome_id, item]

# enrich with bgc info - will be put in the main bgcflow code
bgc_info = df_bgc.copy()
bgc_info["bgc_id"] = [str(i).replace(".gbk", "") for i in bgc_info["orig_filename"]]
bgc_info = bgc_info.set_index("bgc_id")
for i in df_annotation.index:
    try:
        df_annotation.loc[i, "on_contig_edge"] = bgc_info.loc[i, "on_contig_edge"]
        df_annotation.loc[i, "length_nt"] = bgc_info.loc[i, "length_nt"]
    except KeyError:
        pass
df_annotation.head()
product bigscape_class genome_id accn_id gcf_0.30 gcf_0.40 gcf_0.50 Clan Number fam_id_0.30 fam_type_0.30 ... Domain Phylum Class Order Family Genus Species Organism on_contig_edge length_nt
NC_018750.1.region001 ectoine Others GCF_000253235.1 NC_018750.1 1911.0 1911.0 1911.0 2045.0 2.0 known_family ... d__Bacteria p__Actinobacteriota c__Actinomycetia o__Streptomycetales f__Streptomycetaceae g__Streptomyces venezuelae s__Streptomyces venezuelae 0.0 10417.0
NC_018750.1.region002 terpene Terpene GCF_000253235.1 NC_018750.1 2109.0 2109.0 2109.0 NaN 14.0 unknown_family ... d__Bacteria p__Actinobacteriota c__Actinomycetia o__Streptomycetales f__Streptomycetaceae g__Streptomyces venezuelae s__Streptomyces venezuelae 0.0 20954.0
NC_018750.1.region003 T3PKS.NRPS.NRPS-like.T1PKS PKS-NRP_Hybrids GCF_000253235.1 NC_018750.1 1913.0 1913.0 1913.0 NaN 22.0 known_family ... d__Bacteria p__Actinobacteriota c__Actinomycetia o__Streptomycetales f__Streptomycetaceae g__Streptomyces venezuelae s__Streptomyces venezuelae 0.0 99490.0
NC_018750.1.region004 terpene.lanthipeptide-class-ii Others GCF_000253235.1 NC_018750.1 2111.0 2111.0 2111.0 36.0 20.0 unknown_family ... d__Bacteria p__Actinobacteriota c__Actinomycetia o__Streptomycetales f__Streptomycetaceae g__Streptomyces venezuelae s__Streptomyces venezuelae 0.0 29652.0
NC_018750.1.region005 lanthipeptide-class-iv RiPPs GCF_000253235.1 NC_018750.1 2399.0 2399.0 2399.0 552.0 17.0 known_family ... d__Bacteria p__Actinobacteriota c__Actinomycetia o__Streptomycetales f__Streptomycetaceae g__Streptomyces venezuelae s__Streptomyces venezuelae 0.0 22853.0

5 rows × 23 columns

df_annotation.to_csv("../tables/bigslice/enriched_query_annotation.csv")

Import the annotation to Cytoscape

Download the enriched_query_annotation.csv and import it to enrich the nodes in Cytoscape network. Using the new annotations, play around and explore the network to find interesting BGCs and their BiG-FAM models.

Key Points

  • Different BGCflow outputs can be combined to enrich BiG-SLICE query network