# Custom tutorial
This tutorial demonstrates the flexibility of modifying BGCFlow's template notebooks, or of generating new Markdown reports from them. We will go through a simple analysis of filtering results from different genome mining tools to find a BGC of interest.
## Prerequisites
This tutorial utilizes the example dataset used in the BGCFlow manuscript. To run this tutorial, you will need to finish the `mq_saccharopolyspora` run and build the report.
### MQ_Saccharopolyspora Project Configuration
Create a new project configuration folder in the `config` folder and copy all the necessary project files there. Then run and build the report as follows:
```bash
# Finish running BGCFlow
bgcflow run

# Build the report
bgcflow build report
```
Once the run finishes, check the results in the processed directory (`data/processed/mq_saccharopolyspora`); it should contain a folder structure similar to this:
```
.
├── antismash
├── bigscape
├── bigslice
├── data_warehouse
├── docs
│   ├── antismash.ipynb
│   ├── antismash.md
│   ├── arts.ipynb
│   ├── arts.md
│   ├── assets
│   ├── bigscape.ipynb
│   ├── bigscape.md
│   ├── index.md
│   ├── query-bigslice.ipynb
│   └── query-bigslice.md
├── log_changes
├── main.py
├── metadata
├── mkdocs.yml
├── overrides
├── README.md
└── tables
```
BGCFlow generates starter Jupyter notebook templates in the `docs` folder. These notebooks are converted to Markdown files, which `mkdocs` then uses to generate a static HTML site. For more details about customizing the HTML report, see https://squidfunk.github.io/mkdocs-material/.
## Default conda environments for the notebooks
When you run `bgcflow build report`, you can see which conda environments are used by Snakemake to generate the report. You can either reuse these existing environments or set them up yourself from the recipes. BGCFlow uses two environments to make reports, a pure Python one and a mixed R/Python one:

- Python - `bgcflow_notes.yaml`
- R - `r_notebook.yaml`
You can install the conda environments from those YAML files with:

```bash
# use conda or mamba
mamba env create -f bgcflow_notes.yaml  # or r_notebook.yaml
```
## Adding a new notebook to the analysis
To add your own analysis, start a Jupyter session using one of the environments above and create a new notebook inside the `docs` folder. You can also download this tutorial notebook here. Let's give our notebook the name `custom_tutorial.ipynb`.
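If you prefer to scaffold the file from code rather than the Jupyter UI, here is a minimal sketch; it assumes `nbformat` is available, which is standard in Jupyter installations:

```python
import nbformat

# create an empty notebook with a single markdown cell as a starting point
nb = nbformat.v4.new_notebook()
nb.cells.append(nbformat.v4.new_markdown_cell("# Custom tutorial"))
nbformat.write(nb, "custom_tutorial.ipynb")
```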
## Tutorial 1: Using Tables to Filter and Identify Unique Polyketide BGCs
In this notebook, we aim to identify unique polyketide biosynthetic gene clusters (BGCs) that have no hits in the BiG-FAM or MIBiG databases but have been identified by ARTS. This analysis can serve as a first step in the exploratory discovery of potentially novel polyketide BGCs with unique functions.
### Setting up libraries and environment variables
We will use Python and pandas for data manipulation and filtering. First, let's import the necessary Python libraries and define the directory containing our report data.
??? code "Code"

    ```python
    from pathlib import Path
    import json

    import pandas as pd
    from IPython.display import display, Markdown, HTML
    from itables import to_html_datatable as DT
    import itables.options as opt

    opt.css = """
    .itables table td { font-style: italic; font-size: .8em;}
    .itables table th { font-style: oblique; font-size: .8em; }
    """
    opt.classes = ["display", "compact"]
    opt.lengthMenu = [5, 10, 20, 50, 100, 200, 500]

    # Define the directory containing the report
    report_directory = Path("../")
    ```
Note from the code that the report directory is located one directory above this notebook (which lives in the `docs` folder).
The `metadata` folder records the important software versions and other information about the BGCFlow run. First, we will fetch the `antiSMASH` version used in the run from the metadata.
??? code "Code"

    ```python
    # Load the dependency versions
    dependency_versions_file = report_directory / "metadata/dependency_versions.json"
    with open(dependency_versions_file, "r") as file:
        dependency_versions = json.load(file)

    # Extract the version of antiSMASH used
    antismash_version = dependency_versions["antismash"]
    display(Markdown(f"> antiSMASH version is: `{antismash_version}`"))
    ```
> antiSMASH version is: `7.1.0`
### Setting up input files
First, we need to find all the tables and files required for our analysis. Here, we will need the results from antiSMASH, BiG-SCAPE, the BiG-FAM query, and ARTS2. Most of the tables can be found in the `tables` folder, such as the antiSMASH summary region table:
??? code "Code"

    ```python
    # Define the paths to the input files
    antismash_regions_file = report_directory / f"tables/df_regions_antismash_{antismash_version}.csv"
    display(Markdown(f">`{antismash_regions_file}`"))
    ```
> `../tables/df_regions_antismash_7.1.0.csv`
Some other tables are located in their tool-specific directories, such as the BiG-SCAPE results:
??? code "Code"

    ```python
    # Define the directory containing the BiG-SCAPE data
    bigscape_directory = report_directory / f"bigscape/for_cytoscape_antismash_{antismash_version}/"

    # Find the cluster table file
    cluster_table_file = [i for i in bigscape_directory.glob("*_df_clusters_0.30.csv")][0]
    display(Markdown(f">`{cluster_table_file}`"))
    ```
> `../bigscape/for_cytoscape_antismash_7.1.0/2024-04-22 20_44_19_df_clusters_0.30.csv`
??? code "Code"

    ```python
    # Define the directory containing the query data
    bigfam_query = report_directory / f"bigslice/query_as_{antismash_version}/query_network.csv"
    display(Markdown(f">`{bigfam_query}`"))
    ```
> `../bigslice/query_as_7.1.0/query_network.csv`
Other tables are created by the Jupyter notebook templates and can usually be downloaded interactively from the HTML report. By convention, these are stored in the `assets` directory within the `docs` folder.
??? code "Code"

    ```python
    arts_table_file = Path(f"assets/tables/arts_hits_as{antismash_version}.csv")
    display(Markdown(f">`{arts_table_file}`"))
    ```
> `assets/tables/arts_hits_as7.1.0.csv`
### Using pandas and itables to show tables interactively
While pandas can show a nice summary of a table, it does not do so interactively, and its output does not render well in the HTML report. We can use itables to display our tables as interactive datatables that we can sort, paginate, scroll, or filter. To enable this feature in the final report, we convert the tables to HTML datatables and display them with IPython.
??? code "Code"

    ```python
    # Load the data from the input files
    df_antismash_regions = pd.read_csv(antismash_regions_file)

    # Correct similarity values and fill null values with 0
    df_antismash_regions["similarity"] = df_antismash_regions["similarity"].apply(lambda x: 1 if x > 1 else x).fillna(0)

    df_bigscape = pd.read_csv(cluster_table_file)
    df_arts_hits = pd.read_csv(arts_table_file)
    df_bigfam_hits = pd.read_csv(bigfam_query)
    ```
??? code "Code"

    ```python
    display(Markdown(f">`{antismash_regions_file}`"))
    display(HTML(DT(df_antismash_regions, scrollX=True)))
    ```
> `../tables/df_regions_antismash_7.1.0.csv`

*Interactive table preview (columns: bgc_id, genome_id, region, accession, start_pos, end_pos, contig_edge, product, region_length, most_similar_known_cluster_id, most_similar_known_cluster_description, most_similar_known_cluster_type, similarity, source, gbk_path); rendered as a datatable in the HTML report.*
??? code "Code"

    ```python
    display(Markdown(f">`{cluster_table_file}`"))
    display(HTML(DT(df_bigscape, scrollX=True)))
    ```
> `../bigscape/for_cytoscape_antismash_7.1.0/2024-04-22 20_44_19_df_clusters_0.30.csv`

*Interactive table preview (columns: bgc_id, product, bigscape_class, genome_id, accn_id, gcf_0.30, gcf_0.40, gcf_0.50, Clan Number, fam_id_0.30, fam_type_0.30, fam_known_compounds_0.30).*
??? code "Code"

    ```python
    display(Markdown(f">`{arts_table_file}`"))
    display(HTML(DT(df_arts_hits, scrollX=True)))
    ```
> `assets/tables/arts_hits_as7.1.0.csv`

*Interactive table preview (columns: pkey, profile, genome_id, name, product, start, stop, strand, scaffold, sequence_id, type, transl_table, core_table_fkey, Dup, HGT, Res, bgc_id, gene, description, function, bgctable_fkey, BGC, duptable_fkey, evalue, bitscore, gid, seqtitle, hits).*
??? code "Code"

    ```python
    display(Markdown(f">`{bigfam_query}`"))
    display(HTML(DT(df_bigfam_hits, scrollX=True)))
    ```
> `../bigslice/query_as_7.1.0/query_network.csv`

*Interactive table preview (columns: gcf_id, bgc_id, membership_value, rank).*
Note that these tables are downsampled by default to prevent the Markdown report from becoming too heavy; see https://mwouts.github.io/itables/downsampling.html for more details. We do not recommend using the HTML report to show big tables, as its purpose is to give a quick summary of the analysis.
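If a preview gets cut off too aggressively, itables exposes the downsampling threshold as an option. A minimal sketch, assuming the default `maxBytes` cap is what triggers the downsampling:

```python
import itables.options as opt

# raise the per-table size cap before rendering (the value here is an
# arbitrary example); setting it to 0 disables downsampling entirely
opt.maxBytes = 2**20
```

Keep in mind that every extra row is embedded in the generated Markdown, so the report grows accordingly.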
### Exploratory Data Analysis
#### BiG-SCAPE filtering
Let's start by looking at the BiG-SCAPE category of unknown PKS. We will build a filter following this logic:

1. **Define classes to filter by**: We select all possible PKS categories in BiG-SCAPE and put them in a list named `bigscape_class`, containing `PKSI`, `PKSother`, and `PKS-NRP_Hybrids`. To see all available values, run `df_bigscape["bigscape_class"].unique()`.
2. **Filter for unknown families**: The column `fam_type_0.30` defines whether a BGC belongs to a known or unknown GCF at the cutoff value of 0.3. We set the string variable `family_type` to `"unknown_family"`, which is then used to filter rows of the DataFrame by family type.
3. **Create masks for filtering**:
    - `mask1`: a boolean mask that checks whether the values in the column `fam_type_0.30` of `df_bigscape` equal `family_type` (`"unknown_family"`). This mask is true for rows where the family type is unknown.
    - `mask2`: a boolean mask that checks whether the values in the column `bigscape_class` are within the list `bigscape_class` defined at the start. This mask is true for rows that match any of the specified classes (`PKSI`, `PKSother`, `PKS-NRP_Hybrids`).
4. **Filter the DataFrame**: `df_bigscape` is filtered using the logical AND (`&`) of `mask1` and `mask2`, so only rows where both conditions are true (the family type is `"unknown_family"` and the BiG-SCAPE class is one of the specified classes) are selected.
??? code "Code"

    ```python
    bigscape_class = ['PKSI', 'PKSother', 'PKS-NRP_Hybrids']
    family_type = "unknown_family"
    mask1 = df_bigscape["fam_type_0.30"] == family_type
    mask2 = df_bigscape["bigscape_class"].isin(bigscape_class)
    df_bigscape_PKS_unknown = df_bigscape[mask1 & mask2]
    display(HTML(DT(df_bigscape_PKS_unknown, scrollX=True)))
    ```
*Interactive table preview of `df_bigscape_PKS_unknown` (same columns as the BiG-SCAPE cluster table above).*
??? code "Code"

    ```python
    number_of_gcf = len(df_bigscape_PKS_unknown["fam_id_0.30"].unique())
    text = f"""We found {df_bigscape_PKS_unknown.shape[0]} BGC regions in the category of {', '.join(bigscape_class)} which belong to {family_type.replace('_', ' ')}. \
    These BGCs can be grouped into {number_of_gcf} GCFs."""
    Markdown(text)
    ```
We found 123 BGC regions in the category of PKSI, PKSother, PKS-NRP_Hybrids which belong to unknown family. These BGCs can be grouped into 70 GCFs.
#### AntiSMASH KnownClusterBlast filtering
We will now look at the KnownClusterBlast similarity score and decide if we need to filter our search further. We will subset the antiSMASH BGC regions table to contain only the BGCs identified in the previous step and observe their KnownClusterBlast similarity values. We will then narrow down the search by setting a threshold on the similarity score.
??? code "Code"

    ```python
    # Extract the BGC IDs of the unknown families
    unknown_family_bgcs = df_bigscape_PKS_unknown.bgc_id.to_list()
    ```
??? code "Code"

    ```python
    mask_unknown_family = df_antismash_regions.bgc_id.isin(unknown_family_bgcs)
    columns_to_show = ["bgc_id", "genome_id", "product", "similarity",
                       "most_similar_known_cluster_id", "most_similar_known_cluster_description",
                       "region", "contig_edge", "region_length"]
    display(HTML(DT(df_antismash_regions[mask_unknown_family].loc[:, columns_to_show], scrollX=True)))
    ```
*Interactive table preview (columns: bgc_id, genome_id, product, similarity, most_similar_known_cluster_id, most_similar_known_cluster_description, region, contig_edge, region_length).*
??? code "Code"

    ```python
    similarity_cutoff = 0.30
    mask_similarity = df_antismash_regions.similarity < similarity_cutoff
    text2 = f"From those **{df_antismash_regions[mask_unknown_family].shape[0]} BGCs**, we actually still have a few BGCs with high similarity based on KnownClusterBlast. \
    Let us focus on BGCs with low similarity **(<{similarity_cutoff})**, and now we are left with **{df_antismash_regions[mask_unknown_family & mask_similarity].shape[0]} BGCs**."
    Markdown(text2)
    ```
From those 123 BGCs, we actually still have a few BGCs with high similarity based on KnownClusterBlast. Let us focus on BGCs with low similarity (<0.3), and now we are left with 67 BGCs.
??? code "Code"

    ```python
    display(HTML(DT(df_antismash_regions[mask_unknown_family & mask_similarity].loc[:, columns_to_show], scrollX=True)))
    ```
*Interactive table preview (same columns as above, now restricted to low-similarity BGCs).*
#### BiG-FAM query filtering
We will also remove all BGCs that have a match to other BGCs in the BiG-FAM database.
??? code "Code"

    ```python
    mask_no_bigfam_hit = ~df_antismash_regions.bgc_id.isin(df_bigfam_hits.bgc_id.unique())
    display(HTML(DT(df_antismash_regions[mask_unknown_family & mask_similarity & mask_no_bigfam_hit].loc[:, columns_to_show], scrollX=True)))
    ```
*Interactive table preview (same columns as above, now excluding BGCs with BiG-FAM hits).*
#### ARTS hits filtering
We will now select only BGCs in close proximity to one of the ARTS2 models. Note that ARTS2 actually has several criteria to prioritise BGCs with possible antibiotic activity; here we simply search for all BGCs that have hits to an ARTS2 profile (a stricter variant is sketched after the table below).
??? code "Code"

    ```python
    mask_arts_hit = df_antismash_regions.bgc_id.isin(df_arts_hits.bgc_id.dropna().unique())
    df_pks_filtered = df_antismash_regions[mask_unknown_family & mask_similarity & mask_no_bigfam_hit & mask_arts_hit]
    display(HTML(DT(df_pks_filtered.loc[:, columns_to_show], scrollX=True)))
    ```
*Interactive table preview of `df_pks_filtered` (same columns as above).*
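If you want to lean harder on the ARTS2 evidence, the hits table also carries the Dup, HGT, and Res flags, which in the raw CSV are encoded as ✔/✖ strings (as seen in the aggregation step of Tutorial 2). A minimal sketch of a stricter mask under that assumption:

```python
# hypothetical stricter ARTS filter: keep only BGCs whose hits are supported
# by gene duplication or resistance evidence ('✔'/'✖' encoding assumed)
strict_hits = df_arts_hits[(df_arts_hits["Dup"] == "✔") | (df_arts_hits["Res"] == "✔")]
mask_arts_strict = df_antismash_regions.bgc_id.isin(strict_hits.bgc_id.dropna().unique())
df_pks_strict = df_antismash_regions[mask_unknown_family & mask_similarity & mask_no_bigfam_hit & mask_arts_strict]
```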
### Conclusion for Tutorial 1
??? code "Code"

    ```python
    text3 = f"We are now left with **{df_pks_filtered.shape[0]} PKS BGCs** of interest which can be further investigated in-depth."
    Markdown(text3)
    ```
We are now left with **15 PKS BGCs** of interest which can be further investigated in-depth.
This tutorial helped us narrow down our search to potentially novel PKSs. Nevertheless, each BGC still needs an in-depth look at its structure to see whether it has all the genes required for natural product biosynthesis.
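To hand this shortlist over for such an investigation, you can write it out next to the other report assets. A minimal sketch, where the output filename is our own choice rather than a BGCFlow convention:

```python
# save the shortlisted BGCs for downstream inspection
# (the filename here is a hypothetical choice)
output_table = Path("assets/tables/pks_bgc_candidates.csv")
output_table.parent.mkdir(parents=True, exist_ok=True)
df_pks_filtered.loc[:, columns_to_show].to_csv(output_table, index=False)
```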
We also recommend trying out Metabase for a more interactive way of exploring the datasets; see the how-to guide in our Wiki.
## Tutorial 2: Exploring Integrated Network
In the previous tutorial, we showed how to create a custom notebook and how to explore and filter the data using the tables generated by BGCFlow.
In this notebook, we will do an integrated analysis utilizing networks of knowledge bases from different tools: `BiG-SCAPE`, antiSMASH `KnownClusterBlast`, the `BiG-FAM` database query, and `ARTS2`. We will generate a `graphml` file with annotations that can be loaded into Cytoscape, and we will also attempt to visualize the network in this notebook.
??? code "Code"

    ```python
    import pandas as pd
    import networkx as nx
    from pathlib import Path
    import json
    import numpy as np
    import plotly.graph_objects as go
    import seaborn as sns
    from IPython.display import display, Markdown, HTML

    from itables import to_html_datatable as DT
    import itables.options as opt

    opt.css = """
    .itables table td { font-style: italic; font-size: .8em;}
    .itables table th { font-style: oblique; font-size: .8em; }
    """
    opt.classes = ["display", "compact"]
    opt.lengthMenu = [5, 10, 20, 50, 100, 200, 500]


    def create_node_trace(G, node_trace_category, color, showtextlabel=False, nodesize=10, nodeopacity=0.8,
                          nodesymbol="circle", linewidth=1, linecolor="black", textposition="top center", showlegend=False):
        """
        Create a node trace for a given graph.

        Parameters:
        G (networkx.Graph): The graph to create the node trace for.
        node_trace_category (str): The category of the node trace.
        color (str): The color of the nodes.
        showtextlabel (bool): Whether to show text labels for the nodes.
        nodesize (int): The size of the nodes.
        nodeopacity (float): The opacity of the nodes.
        nodesymbol (str): The symbol used for the nodes.
        linewidth (int): The width of the lines.
        linecolor (str): The color of the lines.
        textposition (str): The position of the text labels.
        showlegend (bool): Whether to show the legend.

        Returns:
        go.Scatter: The node trace.
        """
        if showtextlabel:
            markermode = "markers+text"
        else:
            markermode = "markers"
        nodes = np.array([node for node in G.nodes() if G.nodes[node]["node_trace"] == node_trace_category])
        pos = np.array([G.nodes[node]['pos'] for node in nodes.flatten()]).reshape(-1, 2)
        xs, ys = pos[:, 0], pos[:, 1]
        texts = np.array([G.nodes[node]['text'] for node in nodes if "text" in G.nodes[node].keys()])
        node_trace = go.Scatter(
            x=xs.tolist(),
            y=ys.tolist(),
            text=texts.tolist(),
            textposition=textposition,
            mode=markermode,
            hoverinfo='text',
            name=node_trace_category,
            showlegend=showlegend,
            marker=dict(
                symbol=nodesymbol,
                opacity=nodeopacity,
                showscale=False,
                color=color,
                size=nodesize,
                line=dict(width=linewidth, color=linecolor)))
        return node_trace


    def create_edge_trace(Graph, name, showlegend=False, color='#888', width=0.5, opacity=0.8, dash="solid"):
        """
        Create an edge trace for a given graph.

        Parameters:
        Graph (networkx.Graph): The graph to create the edge trace for.
        name (str): The name of the edge trace.
        showlegend (bool): Whether to show the legend.
        color (str): The color of the edges.
        width (float): The width of the edges.
        opacity (float): The opacity of the edges.
        dash (str): The style of the edges.

        Returns:
        go.Scatter: The edge trace.
        """
        edge_trace = go.Scatter(
            x=[],
            y=[],
            name=name,
            opacity=opacity,
            line=dict(width=width, color=color, dash=dash),
            hoverinfo='none',
            mode='lines',
            showlegend=showlegend)
        edges = np.array([edge for edge in Graph.edges() if Graph.edges[edge]["relation_type"] == name])
        pos = np.array([Graph.nodes[e]['pos'] for e in edges.flatten()]).reshape(-1, 2)
        xs = np.insert(pos[:, 0], np.arange(2, len(pos[:, 0]), 2), None)
        ys = np.insert(pos[:, 1], np.arange(2, len(pos[:, 1]), 2), None)
        edge_trace['x'] = xs
        edge_trace['y'] = ys
        return edge_trace


    def get_graph_stats(g, graph_name):
        """
        Get statistics for a given graph.

        Parameters:
        g (networkx.Graph): The graph to get statistics for.
        graph_name (str): The name of the graph.

        Returns:
        str: A string containing the statistics for the graph.
        """
        num_nodes_g = g.number_of_nodes()
        num_edges_g = g.number_of_edges()
        avg_degree_g = sum(dict(g.degree()).values()) / num_nodes_g
        num_bgc_g = sum(1 for _, data in g.nodes(data=True) if data.get('node_trace') == 'BGC')
        return f" - {graph_name} : **{num_nodes_g}** total nodes (**{num_bgc_g}** BGC nodes), **{num_edges_g}** edges, {avg_degree_g:.2f} average degree"
    ```
### Handling Graphs: Reading, Filtering, and Merging Networks
We will start by defining some parameters for filtering:
??? code "Code"

    ```python
    bigscape_cutoff = "0.30"
    bigfam_rank_filter = 1  # only select first hits of bigfam models
    knownclusterblast_similarity_cutoff = 0.8  # select antismash knownclusterblast hits above 0.8
    ```
Then we will list all the files that we need:
??? code "Code"

    ```python
    report_dir = Path("../")
    dependency_version = report_dir / "metadata/dependency_versions.json"
    with open(dependency_version, "r") as file:
        dependency_version = json.load(file)

    antismash_version = dependency_version["antismash"]
    assets_dir = report_dir / "docs/assets"
    integrated_bigscape_network = assets_dir / f"data/bigscape_{bigscape_cutoff}_as{antismash_version}.graphml"
    bigfam_network = assets_dir / f"data/query_bigfam_as{antismash_version}_network.json"
    arts_hits = assets_dir / f"tables/arts_hits_as{antismash_version}.csv"
    ```
We will start by reading the input files into memory:
??? code "Code"

    ```python
    # read ARTS table
    df_arts_hits = pd.read_csv(arts_hits)

    integrated_bigscape_network_graph = nx.read_graphml(integrated_bigscape_network)

    # Iterate over the nodes of integrated_bigscape_network_graph
    for n, data in integrated_bigscape_network_graph.nodes(data=True):
        # If node_trace is not in the node's attributes, add it with a default value
        if 'node_trace' not in data:
            if n.startswith('BGC'):
                integrated_bigscape_network_graph.nodes[n]['node_trace'] = 'MIBIG_BiG-SCAPE'
            else:
                integrated_bigscape_network_graph.nodes[n]['node_trace'] = 'BGC'

    with open(bigfam_network, "r") as f:
        graph_data = json.load(f)

    bigfam_network_graph = nx.readwrite.json_graph.node_link_graph(graph_data)
    for n, data in bigfam_network_graph.nodes(data=True):
        bigfam_network_graph.nodes[n]['node_trace'] = bigfam_network_graph.nodes[n]['node_type']
    bigfam_network_graph = nx.relabel_nodes(bigfam_network_graph, lambda x: str(x))
    ```
Then we continue by filtering the networks based on our cutoff definitions. In this first part, we split the combined BiG-SCAPE / antiSMASH KnownClusterBlast network back into its component parts:
??? code "Code"

    ```python
    # Filter edges with relation_type = 'bigscape_similarity'
    edges_with_bigscape_similarity = [(u, v) for u, v, d in integrated_bigscape_network_graph.edges(data=True) if d['relation_type'] == 'bigscape_similarity']

    # Filter edges with relation_type = 'knownclusterblast' and similarity above the threshold
    edges_with_knownclusterblast = [(u, v) for u, v, d in integrated_bigscape_network_graph.edges(data=True) if d['relation_type'] == 'knownclusterblast' and d['similarity'] > knownclusterblast_similarity_cutoff]

    # Create subgraphs
    bigscape_network_graph_knownclusterblast = integrated_bigscape_network_graph.edge_subgraph(edges_with_knownclusterblast)
    bigscape_network_graph_bigscape_similarity = integrated_bigscape_network_graph.edge_subgraph(edges_with_bigscape_similarity)

    # Calculate stats for bigscape_network_graph
    display(Markdown(get_graph_stats(integrated_bigscape_network_graph, "integrated_bigscape_network_graph")))

    # Calculate stats for bigscape_network_graph_knownclusterblast
    display(Markdown(get_graph_stats(bigscape_network_graph_knownclusterblast, f"bigscape_network_graph_knownclusterblast with similarity > {knownclusterblast_similarity_cutoff}")))

    # Calculate stats for bigscape_network_graph_bigscape_similarity
    display(Markdown(get_graph_stats(bigscape_network_graph_bigscape_similarity, "bigscape_network_graph_bigscape_similarity")))
    ```
- integrated_bigscape_network_graph : **898** total nodes (**739** BGC nodes), **1809** edges, 4.03 average degree
- bigscape_network_graph_knownclusterblast with similarity > 0.8 : **156** total nodes (**135** BGC nodes), **135** edges, 1.73 average degree
- bigscape_network_graph_bigscape_similarity : **740** total nodes (**737** BGC nodes), **1201** edges, 3.25 average degree
Next, we will filter the BiG-FAM hits based on the top taxa distribution in the BiG-FAM database:
??? code "Code"

    ```python
    bigfam_models_stats = pd.DataFrame.from_dict({n: data for n, data in bigfam_network_graph.nodes(data=True) if data["node_trace"] != "BGC"}).T

    deleted_bigfam = []
    bigfam_taxa_cutoff = 0.3
    bigfam_filter = bigfam_models_stats[bigfam_models_stats.top_taxa_proportion <= bigfam_taxa_cutoff]
    for n in bigfam_filter.index:
        try:
            bigfam_network_graph.remove_node(n)
            deleted_bigfam.append(n)
        except nx.NetworkXError as e:
            print(e)

    deleted_bigfam = ', '.join([str(i) for i in deleted_bigfam])
    display(Markdown(f"Deleted BigFam models: {deleted_bigfam}"))
    ```
Deleted BigFam models: 202087, 203037, 210179, 213140, 208355, 202433, 225711, 201637, 201682, 201608, 202162, 205957, 215277, 202082, 200980, 201592, 218961, 227201, 201830, 201902
??? code "Code"

    ```python
    # Filter edges with rank below the cutoff (i.e., rank = 0 for bigfam_rank_filter = 1)
    edges_with_rank_filtered = [(u, v) for u, v, d in bigfam_network_graph.edges(data=True) if d['rank'] < bigfam_rank_filter]

    # Create subgraph
    bigfam_network_graph_filtered = bigfam_network_graph.edge_subgraph(edges_with_rank_filtered)

    # Iterate over the edges of bigfam_network_graph_filtered
    for u, v, data in bigfam_network_graph_filtered.edges(data=True):
        # Set a default relation_type attribute if it is missing
        if "relation_type" not in data.keys():
            data["relation_type"] = "bigfam_similarity"

    # Calculate stats for bigfam_network_graph
    display(Markdown(get_graph_stats(bigfam_network_graph, "bigfam_network_graph")))

    # Calculate stats for bigfam_network_graph_filtered
    display(Markdown(get_graph_stats(bigfam_network_graph_filtered, "bigfam_network_graph_filtered")))
    ```
- bigfam_network_graph : **470** total nodes (**367** BGC nodes), **362** edges, 1.54 average degree
- bigfam_network_graph_filtered : **424** total nodes (**331** BGC nodes), **331** edges, 1.56 average degree
Once we have cleaned up the subgraphs, we recombine them all again:
??? code "Code"

    ```python
    # Combine graphs
    G = nx.compose(bigscape_network_graph_bigscape_similarity, bigscape_network_graph_knownclusterblast)
    G = nx.compose(G, bigfam_network_graph_filtered)
    display(Markdown(get_graph_stats(G, "integrated_graph")))
    ```
- integrated_graph : **855** total nodes (**739** BGC nodes), **1667** edges, 3.90 average degree
Finally, we will add the metadata of the ARTS2 hits for prioritization:
??? code "Code"

    ```python
    def custom_agg(x):
        if x.dtypes == object:
            return ', '.join(x)
        else:
            return x.sum()

    df_arts_hits_filtered = df_arts_hits[(~df_arts_hits.bgc_id.isna()) & (df_arts_hits.hits > 1)].loc[:, ["bgc_id", "profile", "hits", "Dup", "HGT", "Res", "BGC"]]
    df_arts_hits_filtered = df_arts_hits_filtered.apply(lambda x: x.map(lambda y: y.replace('✔', '1').replace('✖', '0') if isinstance(y, str) else y))
    df_arts_hits_filtered = df_arts_hits_filtered.apply(lambda x: x.map(lambda y: int(y) if isinstance(y, str) and y.isdigit() else y))
    df_arts_hits_filtered = df_arts_hits_filtered.groupby("bgc_id").agg(custom_agg)
    df_arts_hits_filtered = df_arts_hits_filtered.rename(columns={c: f"ARTS_{c}" for c in df_arts_hits_filtered.columns})
    display(HTML(DT(df_arts_hits_filtered.reset_index(), scrollX=True)))
    ```
*Interactive table preview (columns: bgc_id, ARTS_profile, ARTS_hits, ARTS_Dup, ARTS_HGT, ARTS_Res, ARTS_BGC).*
This table is then loaded as node information into the graph:
??? code "Code"

    ```python
    for n in df_arts_hits_filtered.index:
        for c in df_arts_hits_filtered.columns:
            G.nodes[n][c] = df_arts_hits_filtered.loc[n, c]
    ```
This is followed by some sanitization:
??? code "Code"

    ```python
    # Get self-loops (materialize the iterator before removing edges)
    self_loops = list(nx.selfloop_edges(G))

    # Check if there are any self-loops
    if self_loops:
        # Remove self-loops
        G.remove_edges_from(self_loops)

    # Iterate over the nodes of the graph, getting the node and its attributes
    for n, data in G.nodes(data=True):
        # Create a list of keys to remove after iterating over the dictionary
        keys_to_remove = []
        # Iterate over the items in the attributes dictionary
        for k, v in data.items():
            # Check if the value is not of a type compatible with GraphML
            if isinstance(v, list):
                data[k] = ", ".join([str(i) for i in v])
            elif v is None:
                # Add the key to the list of keys to remove
                keys_to_remove.append(k)
            elif not isinstance(v, (int, float, str, bool, np.int64)):
                print(f"Node {n} has attribute {k} of incompatible type {type(v)}")
        # Remove the keys with None values
        for key in keys_to_remove:
            del G.nodes[n][key]
    ```
And now, the graph is ready and can be downloaded for display in network visualization tools:
??? code "Code"

    ```python
    outfile = Path(f"assets/integrated_network__bigscape_{bigscape_cutoff}__bigfam_{bigfam_rank_filter}__knownclusterblast_{knownclusterblast_similarity_cutoff}.graphml")
    outfile.parent.mkdir(parents=True, exist_ok=True)
    nx.write_graphml(G, outfile)
    display(Markdown(f"[Download Graph]({str(outfile)})" + '{:target="_blank" .md-button}'))
    ```
### Network Visualization
While we recommend dedicated network visualization tools such as Cytoscape or Gephi, we can also attempt to visualize the network in Python:
??? code "Code"

    ```python
    # define layout options
    options = {
        'prog': 'twopi',
        # 'args': ' '.join(['-Gstart=10', '-Goverlap_scaling=-100'])
    }

    # position nodes
    pos = nx.nx_agraph.pygraphviz_layout(G, **options)  # , args='-Goverlap=false -Elen=weight'
    for n, p in pos.items():
        G.nodes[n]['pos'] = p
        # set up text display
        node_trace = G.nodes[n]["node_trace"]
        if "text" not in G.nodes[n].keys():
            text = [n]
            for k, v in G.nodes[n].items():
                if node_trace == "BGC":
                    if k in ["genome_id", "product", "bigscape_class", "Organism"]:
                        text.append(f"{k} : {v}")
            text = "<br>".join(text)
            G.nodes[n]['text'] = text

    bigscape_class_labels = set([data['bigscape_class'] for n, data in G.nodes(data=True) if data["node_trace"] == "BGC"])
    bigscape_class_colors = sns.color_palette("colorblind", len(bigscape_class_labels)).as_hex()

    # define visualization
    edge_annotation_map = {'bigscape_similarity': {'color': 'black', 'width': 10},
                           'knownclusterblast': {'color': 'grey', 'width': 0.1},
                           'bigfam_similarity': {'color': 'grey', 'width': 0.1},
                           }

    node_annotation_map = {'MIBIG_BiG-SCAPE': {'color': 'green', 'node_symbol': 'star'},
                           'MIBIG_knownclusterblast': {'color': 'blue', 'node_symbol': 'star'},
                           'BGC': {'color': 'grey', 'node_symbol': 'circle'},
                           'BiG-FAM GCFs': {'color': 'yellow', 'node_symbol': 'triangle-up'},
                           }
    ```
??? code "Code"

    ```python
    traces = []
    node_trace = []
    edge_trace = []

    for e in edge_annotation_map.keys():
        dash = "solid"
        if 'knownclusterblast' in e:
            dash = "dot"
        edge_trace = create_edge_trace(G, e, color=edge_annotation_map[e]['color'], dash=dash, showlegend=True)
        traces.append(edge_trace)

    for trace in node_annotation_map.keys():
        nodeopacity = 0.5
        showtextlabel = False
        linecolor = "black"
        linewidth = 0.5
        textposition = "top left"
        node_size = 8
        if trace in bigscape_class_labels:
            nodeopacity = 0.8
        node_trace = create_node_trace(G, trace, node_annotation_map[trace]['color'], showtextlabel=showtextlabel,
                                       nodesymbol=node_annotation_map[trace]['node_symbol'], nodeopacity=nodeopacity,
                                       showlegend=True, linecolor=linecolor, linewidth=linewidth, nodesize=node_size,
                                       textposition=textposition)
        traces.append(node_trace)

    G_arts = G.copy()
    for n in G_arts.nodes:
        if n in df_arts_hits_filtered.index:
            G_arts.nodes[n]["node_trace"] = "ARTS_hits"

    nodeopacity = 0.5
    showtextlabel = False
    node_symbol = "circle"
    node_color = "grey"
    linecolor = "red"
    linewidth = 1
    textposition = "top left"
    node_size = 8
    node_trace = create_node_trace(G_arts, "ARTS_hits", node_color, showtextlabel=showtextlabel,
                                   nodesymbol=node_symbol, nodeopacity=nodeopacity,
                                   showlegend=True, linecolor=linecolor, linewidth=linewidth, nodesize=node_size,
                                   textposition=textposition)
    traces.append(node_trace)

    fig = go.Figure(data=traces,
                    layout=go.Layout(
                        paper_bgcolor='rgba(0,0,0,0)',
                        plot_bgcolor='white',
                        showlegend=True,
                        hovermode='closest',
                        margin=dict(b=20, l=5, r=5, t=40),
                        xaxis=dict(showgrid=False, zeroline=False, showticklabels=False, linecolor='black', mirror=True, linewidth=1),
                        yaxis=dict(showgrid=False, zeroline=False, showticklabels=False, linecolor='black', mirror=True, linewidth=1),
                        width=800, height=600)
                    )

    outfile = Path("assets/figures/integrated_network.html")
    outfile.parent.mkdir(parents=True, exist_ok=True)
    fig.write_html(outfile)
    display(HTML(filename=str(outfile)))
    ```
### Adding adjacency edges
If you have high-quality (complete) genomes, it might be interesting to see the position of each BGC region relative to the others in a genome. Unfortunately, physics-based layout algorithms in Python are limited, so it is better to use Cytoscape or Gephi to visualize this network.
??? code "Code"

    ```python
    antismash_table = report_dir / f"tables/df_regions_antismash_{antismash_version}.csv"
    df_antismash = pd.read_csv(antismash_table)

    # Create a new graph
    filtered_graph = nx.Graph()

    # Iterate over the nodes in the original graph
    for node, data in G.nodes(data=True):
        # If the node meets the condition, add it to the new graph
        if data.get('node_trace') == 'BGC':
            filtered_graph.add_node(node, **data)

    # Now add only the edges that connect the nodes in the new graph
    for u, v, data in G.edges(data=True):
        if filtered_graph.has_node(u) and filtered_graph.has_node(v):
            filtered_graph.add_edge(u, v, **data)

    for i in df_antismash.index:
        current_bgc = df_antismash.loc[i].to_dict()
        next_index = i + 1
        if next_index < len(df_antismash):
            neighbor_bgc = df_antismash.loc[next_index].to_dict()
            if current_bgc["accession"] == neighbor_bgc["accession"]:
                distance = neighbor_bgc["start_pos"] - current_bgc["end_pos"]
                assert distance > 0
                if not filtered_graph.has_edge(current_bgc["bgc_id"], neighbor_bgc["bgc_id"]):
                    filtered_graph.add_edge(current_bgc["bgc_id"], neighbor_bgc["bgc_id"], distance_bp=distance, relation_type="genomic_adjacency")

    display(Markdown(get_graph_stats(filtered_graph, "integrated_graph_with_adjacency")))
    ```
- integrated_graph_with_adjacency : **739** total nodes (**739** BGC nodes), **1598** edges, 4.32 average degree
??? code "Code"

    ```python
    # Get self-loops (materialize the iterator before removing edges)
    self_loops = list(nx.selfloop_edges(filtered_graph))

    # Check if there are any self-loops
    if self_loops:
        # Remove self-loops
        filtered_graph.remove_edges_from(self_loops)

    # Iterate over the nodes of the graph, getting the node and its attributes
    for n, data in filtered_graph.nodes(data=True):
        # Create a list of keys to remove after iterating over the dictionary
        keys_to_remove = []
        # Iterate over the items in the attributes dictionary
        for k, v in data.items():
            # Check if the value is not of a type compatible with GraphML
            if isinstance(v, (list, tuple)):
                data[k] = ", ".join([str(i) for i in v])
            elif v is None:
                # Add the key to the list of keys to remove
                keys_to_remove.append(k)
            elif not isinstance(v, (int, float, str, bool, np.int64)):
                print(f"Node {n} has attribute {k} of incompatible type {type(v)}")
        # Remove the keys with None values
        for key in keys_to_remove:
            del filtered_graph.nodes[n][key]

    outfile = Path(f"assets/data/bigscape_{bigscape_cutoff}_as{antismash_version}_with_genomic_position.graphml")
    outfile.parent.mkdir(parents=True, exist_ok=True)
    nx.write_graphml(filtered_graph, outfile)
    display(Markdown(f"[Download Graph]({str(outfile)})" + '{:target="_blank" .md-button}'))
    ```
## Displaying the notebook in the report
To display this notebook in the HTML report, we need to convert it into a Markdown file.
First, we need to serve the BGCFlow report with this command:

```bash
bgcflow serve --project <project name>
```

The report will be served locally at http://localhost:8001/.
Then, to generate the Markdown, use nbconvert in the terminal:

```bash
# cd to the docs folder containing this notebook
jupyter nbconvert --to markdown \
    --execute "custom_tutorial.ipynb" \
    --output "custom_tutorial.md" \
    --template "admonition" \
    --TemplateExporter.extra_template_basedirs="../../../../workflow/notebook/nb_convert"
```
You can also use the `--no-input` flag if you don't want to show the code cells.
Then, add the Markdown file to the `mkdocs.yml` navigation:
```yaml
extra:
  social:
  - icon: fontawesome/brands/twitter
    link: https://twitter.com/NPGMgroup
  - icon: fontawesome/brands/github
    link: https://github.com/NBChub/bgcflow
markdown_extensions:
- attr_list
- admonition
- pymdownx.details
- pymdownx.superfences
nav:
- Home: index.md
- QC and Data Selection:
  - seqfu: seqfu.md
  - mash: mash.md
  - fastani: fastani.md
  - checkm: checkm.md
- Functional Annotation:
  - prokka-gbk: prokka-gbk.md
  - deeptfactor: deeptfactor.md
- Genome Mining:
  - antismash: antismash.md
  - query-bigslice: query-bigslice.md
  - bigscape: bigscape.md
  - bigslice: bigslice.md
  - arts: arts.md
  - cblaster-genome: cblaster-genome.md
  - cblaster-bgc: cblaster-bgc.md
- Phylogenomic Placement:
  - automlst-wrapper: automlst-wrapper.md
- Comparative Genomics:
  - roary: roary.md
  - eggnog-roary: eggnog-roary.md
- Custom Reports:
  - How to Add Custom Analysis to the BGCFlow Report: custom_tutorial.md
theme:
  icon:
    admonition:
      note: octicons/tag-16
      code: material/code-tags
...
```
Once you have updated the `mkdocs.yml` file, the site will be regenerated to show your newly added report. The report can then be opened at http://localhost:8001/custom_tutorial/.