Exploring BiG-SLICE query result

Overview

Teaching: 0 min
Exercises: 20 min

Questions

How do I annotate BiG-SLICE query result?

Objectives

Enrich BiG-FAM hits with other BGCflow results

In this episode, we will explore BiG-SLICE query hits of S. venezuelae genomes with the BiG-FAM database (version 1.0.0, run 6). You can download the .ipynb file of this episode and run it from your own directory.

BGCflow Paths Configuration
Raw BiG-SLICE query hits
A glimpse of the data distribution
Annotate Network with information from BiG-SCAPE and GTDB
Import the annotation to Cytoscape

Libraries & Functions

# load libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
from pathlib import Path
import json

# generate data
def gcf_hits(df_gcf, cutoff=900):
    """
    Filter bigslice result based on distance threshold to model and generate data.
    """
    mask = df_gcf.loc[:, "membership_value"] <= cutoff
    df_gcf_filtered = df_gcf[mask]
    bgcs = df_gcf_filtered.bgc_id.unique()
    gcfs = df_gcf_filtered.gcf_id.unique()
    print(
    f"""BiG-SLICE query with BiG-FAM run 6, distance cutoff {cutoff}
Number of bgc hits : {len(bgcs)}/{len(df_gcf.bgc_id.unique())}
Number of GCF hits : {len(gcfs)}""")
    return df_gcf_filtered

# visualization
def plot_overview(data):
    """
    Plot BGC hits distribution from BiG-SLICE Query
    """
    ranks = data.gcf_id.value_counts().index
    sns.set_theme()
    sns.set_context("paper")
    fig, axes = plt.subplots(1, 2, figsize=(25, 20))
    plt.figure(figsize = (25,25))

    # first plot
    sns.histplot(data=data, x='membership_value', 
                 kde=True, ax=axes[0],)

    # second plot
    sns.boxplot(data=data, y='gcf_id', x='membership_value', 
                orient='h', ax=axes[1], order=ranks)

    # Add in points to show each observation
    sns.stripplot(x="membership_value", y="gcf_id", data=data,
                  jitter=True, size=3, linewidth=0, color=".3", 
                  ax=axes[1], orient='h', order=ranks)
    return

BGCflow Paths Configuration

Customize the cell below to your BGCflow result paths

# interim data
bigslice_query = Path("/datadrive/home/matinnu/bgcflow_data/interim/bigslice/query/s_venezuelae_antismash_6.0.1/")

# processed data
bigslice_query_processed = Path("/datadrive/bgcflow/data/processed/s_venezuelae/bigslice/query_as_6.0.1/")
bigscape_result = Path("/datadrive/home/matinnu/bgcflow_data/processed/s_venezuelae/bigscape/for_cytoscape_antismash_6.0.1/2022-06-21 02_46_35_df_clusters_0.30.csv")
gtdb_table = Path("/datadrive/home/matinnu/bgcflow_data/processed/s_venezuelae/tables/df_gtdb_meta.csv")

# output path
output_path = Path("../tables/bigslice")
output_path.mkdir(parents=True, exist_ok=True)

Raw BiG-SLICE query hits

First, let’s see the raw data from BiG-SLICE query hits. We have extracted individual tables from the SQL database.

! tree /datadrive/bgcflow/data/interim/bigslice/query/s_venezuelae_antismash_6.0.1/

[01;34m/datadrive/bgcflow/data/interim/bigslice/query/s_venezuelae_antismash_6.0.1/[00m
├── 1.db
├── bgc.csv
├── bgc_class.csv
├── bgc_features.csv
├── cds.csv
├── gcf_membership.csv
├── hsp.csv
├── hsp_alignment.csv
├── hsp_subpfam.csv
├── schema.csv
└── sqlite_sequence.csv

0 directories, 11 files

We will first look at these two tables:

bgc table or the file bgc.csv
gcf_membership table or the file gcf_membership.csv

The data from the interim folder has been processed for downstream analysis in the processed directory:

! tree /datadrive/bgcflow/data/processed/s_venezuelae/bigslice/query_as_6.0.1/

[01;34m/datadrive/bgcflow/data/processed/s_venezuelae/bigslice/query_as_6.0.1/[00m
├── gcf_summary.csv
├── gcf_summary.json
└── query_network.csv

0 directories, 3 files

# load the two tables
df_bgc = pd.read_csv(bigslice_query / "bgc.csv")
# BGC table from bigslice
df_bgc.loc[:, 'genome_id'] = [Path(i).name for i in df_bgc.orig_folder] # will be put in the main bgcflow code
# GCF membership table from bigslice
df_gcf_membership = pd.read_csv(bigslice_query / "gcf_membership.csv")

The gcf_membership table lists the top 10 closest BiG-FAM GCF models (order shown as rank column) and the euclidean distance to the model (membership_value). Smaller membership_value means that our BGC has a closer or more similar features with the models. Note that we are querying against run 6 in the BiG-FAM model, with threshold of 900, Therefore, if the membership_value is above 900, it is less likely that our BGC belongs to that gcf model.

df_gcf_membership.head()

	gcf_id	bgc_id	membership_value	rank
0	203601	448	669	0
1	208355	448	798	1
2	200946	448	826	2
3	207795	448	882	3
4	213140	448	885	4

The bgc_id column in gcf_membership table refers to the id column in bgc table. Therefore, we can enrich our hits with the metadata contained in bgc table

df_bgc.head()

	id	name	type	length_nt	orig_folder	orig_filename	genome_id
0	1	data/interim/bgcs/s_venezuelae/6.0.1/GCF_00863...	as6	29650	data/interim/bgcs/s_venezuelae/6.0.1/GCF_00863...	NZ_CP029197.1.region004.gbk	GCF_008639165.1
1	2	data/interim/bgcs/s_venezuelae/6.0.1/GCF_00870...	as6	11401	data/interim/bgcs/s_venezuelae/6.0.1/GCF_00870...	NZ_CP029195.1.region011.gbk	GCF_008705255.1
2	3	data/interim/bgcs/s_venezuelae/6.0.1/GCF_00025...	as6	10390	data/interim/bgcs/s_venezuelae/6.0.1/GCF_00025...	NC_018750.1.region027.gbk	GCF_000253235.1
3	4	data/interim/bgcs/s_venezuelae/6.0.1/GCF_02104...	as6	25706	data/interim/bgcs/s_venezuelae/6.0.1/GCF_02104...	NZ_JAJNOJ010000002.1.region009.gbk	GCF_021044745.1
4	5	data/interim/bgcs/s_venezuelae/6.0.1/GCF_00929...	as6	20390	data/interim/bgcs/s_venezuelae/6.0.1/GCF_00929...	NZ_CP023693.1.region020.gbk	GCF_009299385.1

A glimpse of the data distribution

Sanity Check: How many gene clusters predicted by antiSMASH?

print(f"There are {len(df_bgc)} BGCs predicted from {len(df_bgc.orig_folder.unique())} genomes.")

There are 515 BGCs predicted from 18 genomes.

How many BGCs can be assigned to BiG-FAM gene cluster families?

BiG-SLICE calculate the feature distance of a BGC to BiG-FAM models (which is a centroid of each Gene Cluster Families generated from 1.2 million BGCs). Though it’s not that accurate, it can give us a glimpse of the distribution of our BGCs within the database.

sns.set_theme()
sns.set_context("paper")
sns.histplot(df_gcf_membership, x='membership_value', kde=True)
for c in [900, 1200, 1500]:
    gcf_hits(df_gcf_membership, c)

BiG-SLICE query with BiG-FAM run 6, distance cutoff 900
Number of bgc hits : 382/515
Number of GCF hits : 114
BiG-SLICE query with BiG-FAM run 6, distance cutoff 1200
Number of bgc hits : 487/515
Number of GCF hits : 220
BiG-SLICE query with BiG-FAM run 6, distance cutoff 1500
Number of bgc hits : 504/515
Number of GCF hits : 351

png

Depending on the distance cutoffs, we can assign our BGCs to a different numbers of GCF model. The default cutoffs is 900 (run 6). In our data, 382 out of 515 BGCs can be assigned to 114 BiG-FAM GCF. Do note that the number of assigned GCF can be smaller if we only consider the first hit (the query returns 10 hits).

Smaller number means a closer distance to the GCF model. For further analysis, we will stick with the default cutoff.

A closer look to BiG-FAM distributions

BGCflow already cleans the data for downstream processing. The processed bigslice query can be found in bgcflow/data/processed/s_venezuelae/bigslice/query_as_6.0.1/.

data = pd.read_csv(bigslice_query_processed / "query_network.csv")
plot_overview(data)

png

<Figure size 1800x1800 with 0 Axes>

On the figure above, we can see the distance distribution of our query to the model, and how each models have varying degree of distances. But, this data includes the top 10 hits, so 1 BGCs can be assigned to multiple GCFs.

Let’s see again only for the first hit.

n_hits = 1 # get only the first hit
n_hits_only = data.loc[:, "rank"].isin(np.arange(n_hits))
data_1 = data[n_hits_only]
print(f"For the top {n_hits} hit, {len(data_1.bgc_id.unique())} BGCs can be mapped to {len(data_1.gcf_id.unique())} GCF")
plot_overview(data_1)

For the top 1 hit, 382 BGCs can be mapped to 83 GCF

png

<Figure size 1800x1800 with 0 Axes>

Annotate Network with information from BiG-SCAPE and GTDB

These network will only be meaningful when we enrich it with metadata. We can use information from our BiG-SCAPE runs, taxonomic information from GTDB-tk, and other tables generated by BGCflow.

# Enrich with BiG-SCAPE
df_annotation = pd.read_csv(bigscape_result, index_col=0)
df_annotation.loc[:, "bigslice_query"] = "query"
for i in data["gcf_id"].unique():
    df_annotation.loc[i, "bigslice_query"] = "model"

# enrich with GTDB
df_gtdb = pd.read_csv(gtdb_table).set_index("genome_id")
for i in df_annotation.index:
    genome_id = df_annotation.loc[i, "genome_id"]
    if type(genome_id) == str:
        for item in ["gtdb_release", "Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species", "Organism"]:
            df_annotation.loc[i, item] = df_gtdb.loc[genome_id, item]

# enrich with bgc info - will be put in the main bgcflow code
bgc_info = df_bgc.copy()
bgc_info["bgc_id"] = [str(i).replace(".gbk", "") for i in bgc_info["orig_filename"]]
bgc_info = bgc_info.set_index("bgc_id")
for i in df_annotation.index:
    try:
        df_annotation.loc[i, "on_contig_edge"] = bgc_info.loc[i, "on_contig_edge"]
        df_annotation.loc[i, "length_nt"] = bgc_info.loc[i, "length_nt"]
    except KeyError:
        pass

df_annotation.head()

	product	bigscape_class	genome_id	accn_id	gcf_0.30	gcf_0.40	gcf_0.50	Clan Number	fam_id_0.30	fam_type_0.30	...	Domain	Phylum	Class	Order	Family	Genus	Species	Organism	length_nt
NC_018750.1.region001	ectoine	Others	GCF_000253235.1	NC_018750.1	1911.0	1911.0	1911.0	2045.0	2.0	known_family	...	d__Bacteria	p__Actinobacteriota	c__Actinomycetia	o__Streptomycetales	f__Streptomycetaceae	g__Streptomyces	venezuelae	s__Streptomyces venezuelae	10417.0
NC_018750.1.region002	terpene	Terpene	GCF_000253235.1	NC_018750.1	2109.0	2109.0	2109.0	NaN	14.0	unknown_family	...	d__Bacteria	p__Actinobacteriota	c__Actinomycetia	o__Streptomycetales	f__Streptomycetaceae	g__Streptomyces	venezuelae	s__Streptomyces venezuelae	20954.0
NC_018750.1.region003	T3PKS.NRPS.NRPS-like.T1PKS	PKS-NRP_Hybrids	GCF_000253235.1	NC_018750.1	1913.0	1913.0	1913.0	NaN	22.0	known_family	...	d__Bacteria	p__Actinobacteriota	c__Actinomycetia	o__Streptomycetales	f__Streptomycetaceae	g__Streptomyces	venezuelae	s__Streptomyces venezuelae	99490.0
NC_018750.1.region004	terpene.lanthipeptide-class-ii	Others	GCF_000253235.1	NC_018750.1	2111.0	2111.0	2111.0	36.0	20.0	unknown_family	...	d__Bacteria	p__Actinobacteriota	c__Actinomycetia	o__Streptomycetales	f__Streptomycetaceae	g__Streptomyces	venezuelae	s__Streptomyces venezuelae	29652.0
NC_018750.1.region005	lanthipeptide-class-iv	RiPPs	GCF_000253235.1	NC_018750.1	2399.0	2399.0	2399.0	552.0	17.0	known_family	...	d__Bacteria	p__Actinobacteriota	c__Actinomycetia	o__Streptomycetales	f__Streptomycetaceae	g__Streptomyces	venezuelae	s__Streptomyces venezuelae	22853.0

5 rows × 23 columns

df_annotation.to_csv("../tables/bigslice/enriched_query_annotation.csv")

Import the annotation to Cytoscape

Download the enriched_query_annotation.csv and import it to enrich the nodes in Cytoscape network. Using the new annotations, play around and explore the network to find interesting BGCs and their BiG-FAM models.

Key Points

Different BGCflow outputs can be combined to enrich BiG-SLICE query network

previous episode

BGCflow Tutorial

next episode