# Custom tutorial
This tutorial demonstrates the flexibility of modifying BGCFlow's template notebooks, or of generating new Markdown reports from them. We will go through a simple analysis of filtering results from different genome mining tools to find a BGC of interest.
## Prerequisites
This tutorial utilizes the example dataset used in the BGCFlow manuscript. To run this tutorial, you will need to finish the `mq_saccharopolyspora` run and build the report.
### MQ_Saccharopolyspora Project Configuration
Create a new project configuration folder in the `config` folder and copy all the necessary project files there. Then run and build the report as follows:
```bash
# Finish running BGCFlow
bgcflow run

# Build the report
bgcflow build report
```
Once the run finishes, check the results in the processed directory (`data/processed/mq_saccharopolyspora`); it should contain a folder structure similar to this:
```
.
├── antismash
├── bigscape
├── bigslice
├── data_warehouse
├── docs
│   ├── antismash.ipynb
│   ├── antismash.md
│   ├── arts.ipynb
│   ├── arts.md
│   ├── assets
│   ├── bigscape.ipynb
│   ├── bigscape.md
│   ├── index.md
│   ├── query-bigslice.ipynb
│   └── query-bigslice.md
├── log_changes
├── main.py
├── metadata
├── mkdocs.yml
├── overrides
├── README.md
└── tables
```
BGCFlow generates starter Jupyter notebook templates in the `docs` folder. These notebooks are converted to Markdown files, which `mkdocs` then uses to generate a static HTML site. For more details about customizing the HTML report, see https://squidfunk.github.io/mkdocs-material/.
## Default conda environments for the notebooks
When you run `bgcflow build report`, you can see which conda environments are used by Snakemake to generate the report. You can either reuse these existing environments or set them up yourself from the recipes. BGCFlow uses two environments to make reports, a pure Python one and a mixed R/Python one:

- Python - `bgcflow_notes.yaml`
- R - `r_notebook.yaml`
You can install the conda environments from those YAML files with:

```bash
# use conda or mamba
mamba env create -f bgcflow_notes.yaml  # or r_notebook.yaml
```
## Adding a new notebook to the analysis
To add your own analysis, start a Jupyter session using one of the environments above and create a new notebook inside the `docs` folder. You can also download this tutorial notebook here. Let's give our notebook the name `custom_tutorial.ipynb`.
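If you prefer to scaffold the file from code rather than the Jupyter UI, here is a minimal sketch; it assumes `nbformat` is available, which is standard in Jupyter installations:

```python
import nbformat

# create an empty notebook with a single markdown cell as a starting point
nb = nbformat.v4.new_notebook()
nb.cells.append(nbformat.v4.new_markdown_cell("# Custom tutorial"))
nbformat.write(nb, "custom_tutorial.ipynb")
```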
## Tutorial 1: Using Tables to Filter and Identify Unique Polyketide BGCs
In this notebook, we aim to identify unique polyketide biosynthetic gene clusters (BGCs) that have no hits in the BiG-FAM or MIBiG databases but have been identified by ARTS. This analysis can serve as a first step in the exploratory discovery of potentially novel polyketide BGCs with unique functions.
### Setting up libraries and environment variables
We will use Python and pandas for data manipulation and filtering. First, let's import the necessary Python libraries and define the directory containing our report data.
??? code "Code"

    ```python
    from pathlib import Path
    import json

    import pandas as pd
    from IPython.display import display, Markdown, HTML
    from itables import to_html_datatable as DT
    import itables.options as opt

    opt.css = """
    .itables table td { font-style: italic; font-size: .8em;}
    .itables table th { font-style: oblique; font-size: .8em; }
    """
    opt.classes = ["display", "compact"]
    opt.lengthMenu = [5, 10, 20, 50, 100, 200, 500]

    # Define the directory containing the report
    report_directory = Path("../")
    ```
Note from the code that the report directory is located one directory above this notebook (which lives in the `docs` folder).
The `metadata` folder records the important software versions and other information about the BGCFlow run. First, we will fetch the `antiSMASH` version used in the run from the metadata.
??? code "Code"

    ```python
    # Load the dependency versions
    dependency_versions_file = report_directory / "metadata/dependency_versions.json"
    with open(dependency_versions_file, "r") as file:
        dependency_versions = json.load(file)

    # Extract the version of antiSMASH used
    antismash_version = dependency_versions["antismash"]
    display(Markdown(f"> antiSMASH version is: `{antismash_version}`"))
    ```
> antiSMASH version is: `7.1.0`
### Setting up input files
First, we need to find all the tables and files required for our analysis. Here, we will need the results from antiSMASH, BiG-SCAPE, the BiG-FAM query, and ARTS2. Most of the tables can be found in the `tables` folder, such as the antiSMASH summary region table:
??? code "Code"

    ```python
    # Define the paths to the input files
    antismash_regions_file = report_directory / f"tables/df_regions_antismash_{antismash_version}.csv"
    display(Markdown(f">`{antismash_regions_file}`"))
    ```
> `../tables/df_regions_antismash_7.1.0.csv`
Some other tables are located in their tool-specific directories, such as the BiG-SCAPE results:
??? code "Code"

    ```python
    # Define the directory containing the BiG-SCAPE data
    bigscape_directory = report_directory / f"bigscape/for_cytoscape_antismash_{antismash_version}/"

    # Find the cluster table file
    cluster_table_file = [i for i in bigscape_directory.glob("*_df_clusters_0.30.csv")][0]
    display(Markdown(f">`{cluster_table_file}`"))
    ```
> `../bigscape/for_cytoscape_antismash_7.1.0/2024-04-22 20_44_19_df_clusters_0.30.csv`
??? code "Code"

    ```python
    # Define the directory containing the query data
    bigfam_query = report_directory / f"bigslice/query_as_{antismash_version}/query_network.csv"
    display(Markdown(f">`{bigfam_query}`"))
    ```
> `../bigslice/query_as_7.1.0/query_network.csv`
Other tables are created by the Jupyter notebook templates and can usually be downloaded interactively from the HTML report. By convention, these are stored in the `assets` directory within the `docs` folder.
??? code "Code"

    ```python
    arts_table_file = Path(f"assets/tables/arts_hits_as{antismash_version}.csv")
    display(Markdown(f">`{arts_table_file}`"))
    ```
> `assets/tables/arts_hits_as7.1.0.csv`
### Using pandas and itables to show tables interactively
While pandas can show a nice summary of a table, it does not do so interactively, and its output does not render well in the HTML report. We can use itables to display our tables as interactive datatables that we can sort, paginate, scroll, or filter. To enable this feature in the final report, we convert the tables to HTML datatables and display them with IPython.
??? code "Code"

    ```python
    # Load the data from the input files
    df_antismash_regions = pd.read_csv(antismash_regions_file)

    # Correct similarity values and fill null values with 0
    df_antismash_regions["similarity"] = df_antismash_regions["similarity"].apply(lambda x: 1 if x > 1 else x).fillna(0)

    df_bigscape = pd.read_csv(cluster_table_file)
    df_arts_hits = pd.read_csv(arts_table_file)
    df_bigfam_hits = pd.read_csv(bigfam_query)
    ```
??? code "Code"

    ```python
    display(Markdown(f">`{antismash_regions_file}`"))
    display(HTML(DT(df_antismash_regions, scrollX=True)))
    ```
> `../tables/df_regions_antismash_7.1.0.csv`

*Interactive table preview (columns: bgc_id, genome_id, region, accession, start_pos, end_pos, contig_edge, product, region_length, most_similar_known_cluster_id, most_similar_known_cluster_description, most_similar_known_cluster_type, similarity, source, gbk_path); rendered as a datatable in the HTML report.*
??? code "Code"

    ```python
    display(Markdown(f">`{cluster_table_file}`"))
    display(HTML(DT(df_bigscape, scrollX=True)))
    ```
> `../bigscape/for_cytoscape_antismash_7.1.0/2024-04-22 20_44_19_df_clusters_0.30.csv`

*Interactive table preview (columns: bgc_id, product, bigscape_class, genome_id, accn_id, gcf_0.30, gcf_0.40, gcf_0.50, Clan Number, fam_id_0.30, fam_type_0.30, fam_known_compounds_0.30).*
??? code "Code"

    ```python
    display(Markdown(f">`{arts_table_file}`"))
    display(HTML(DT(df_arts_hits, scrollX=True)))
    ```
> `assets/tables/arts_hits_as7.1.0.csv`

*Interactive table preview (columns: pkey, profile, genome_id, name, product, start, stop, strand, scaffold, sequence_id, type, transl_table, core_table_fkey, Dup, HGT, Res, bgc_id, gene, description, function, bgctable_fkey, BGC, duptable_fkey, evalue, bitscore, gid, seqtitle, hits).*
??? code "Code"

    ```python
    display(Markdown(f">`{bigfam_query}`"))
    display(HTML(DT(df_bigfam_hits, scrollX=True)))
    ```
> `../bigslice/query_as_7.1.0/query_network.csv`

*Interactive table preview (columns: gcf_id, bgc_id, membership_value, rank).*
Note that these tables are downsampled by default to prevent the Markdown report from becoming too heavy; see https://mwouts.github.io/itables/downsampling.html for more details. We do not recommend using the HTML report to show big tables, as its purpose is to give a quick summary of the analysis.
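If a preview gets cut off too aggressively, itables exposes the downsampling threshold as an option. A minimal sketch, assuming the default `maxBytes` cap is what triggers the downsampling:

```python
import itables.options as opt

# raise the per-table size cap before rendering (the value here is an
# arbitrary example); setting it to 0 disables downsampling entirely
opt.maxBytes = 2**20
```

Keep in mind that every extra row is embedded in the generated Markdown, so the report grows accordingly.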
### Exploratory Data Analysis
#### BiG-SCAPE filtering
Let's start by looking at the BiG-SCAPE category of unknown PKS. We will build a filter following this logic:

1. **Define classes to filter by**: We select all possible PKS categories in BiG-SCAPE and put them in a list named `bigscape_class`, containing `PKSI`, `PKSother`, and `PKS-NRP_Hybrids`. To see all available values, run `df_bigscape["bigscape_class"].unique()`.
2. **Filter for unknown families**: The column `fam_type_0.30` defines whether a BGC belongs to a known or unknown GCF at the cutoff value of 0.3. We set the string variable `family_type` to `"unknown_family"`, which is then used to filter rows of the DataFrame by family type.
3. **Create masks for filtering**:
    - `mask1`: a boolean mask that checks whether the values in the column `fam_type_0.30` of `df_bigscape` equal `family_type` (`"unknown_family"`). This mask is true for rows where the family type is unknown.
    - `mask2`: a boolean mask that checks whether the values in the column `bigscape_class` are within the list `bigscape_class` defined at the start. This mask is true for rows that match any of the specified classes (`PKSI`, `PKSother`, `PKS-NRP_Hybrids`).
4. **Filter the DataFrame**: `df_bigscape` is filtered using the logical AND (`&`) of `mask1` and `mask2`, so only rows where both conditions are true (the family type is `"unknown_family"` and the BiG-SCAPE class is one of the specified classes) are selected.
??? code "Code"

    ```python
    bigscape_class = ['PKSI', 'PKSother', 'PKS-NRP_Hybrids']
    family_type = "unknown_family"
    mask1 = df_bigscape["fam_type_0.30"] == family_type
    mask2 = df_bigscape["bigscape_class"].isin(bigscape_class)
    df_bigscape_PKS_unknown = df_bigscape[mask1 & mask2]
    display(HTML(DT(df_bigscape_PKS_unknown, scrollX=True)))
    ```
*Interactive table preview of `df_bigscape_PKS_unknown` (same columns as the BiG-SCAPE cluster table above).*
??? code "Code"

    ```python
    number_of_gcf = len(df_bigscape_PKS_unknown["fam_id_0.30"].unique())
    text = f"""We found {df_bigscape_PKS_unknown.shape[0]} BGC regions in the category of {', '.join(bigscape_class)} which belong to {family_type.replace('_', ' ')}. \
    These BGCs can be grouped into {number_of_gcf} GCFs."""
    Markdown(text)
    ```
We found 123 BGC regions in the category of PKSI, PKSother, PKS-NRP_Hybrids which belong to unknown family. These BGCs can be grouped into 70 GCFs.
#### AntiSMASH KnownClusterBlast filtering
We will now look at the KnownClusterBlast similarity score and decide if we need to filter our search further. We will subset the antiSMASH BGC regions table to contain only the BGCs identified in the previous step and observe their KnownClusterBlast similarity values. We will then narrow down the search by setting a threshold on the similarity score.
??? code "Code"

    ```python
    # Extract the BGC IDs of the unknown families
    unknown_family_bgcs = df_bigscape_PKS_unknown.bgc_id.to_list()
    ```
??? code "Code"

    ```python
    mask_unknown_family = df_antismash_regions.bgc_id.isin(unknown_family_bgcs)
    columns_to_show = ["bgc_id", "genome_id", "product", "similarity",
                       "most_similar_known_cluster_id", "most_similar_known_cluster_description",
                       "region", "contig_edge", "region_length"]
    display(HTML(DT(df_antismash_regions[mask_unknown_family].loc[:, columns_to_show], scrollX=True)))
    ```
*Interactive table preview (columns: bgc_id, genome_id, product, similarity, most_similar_known_cluster_id, most_similar_known_cluster_description, region, contig_edge, region_length).*
??? code "Code"

    ```python
    similarity_cutoff = 0.30
    mask_similarity = df_antismash_regions.similarity < similarity_cutoff
    text2 = f"From those **{df_antismash_regions[mask_unknown_family].shape[0]} BGCs**, we actually still have a few BGCs with high similarity based on KnownClusterBlast. \
    Let us focus on BGCs with low similarity **(<{similarity_cutoff})**, and now we are left with **{df_antismash_regions[mask_unknown_family & mask_similarity].shape[0]} BGCs**."
    Markdown(text2)
    ```
From those 123 BGCs, we actually still have a few BGCs with high similarity based on KnownClusterBlast. Let us focus on BGCs with low similarity (<0.3), and now we are left with 67 BGCs.
??? code "Code"

    ```python
    display(HTML(DT(df_antismash_regions[mask_unknown_family & mask_similarity].loc[:, columns_to_show], scrollX=True)))
    ```
*Interactive table preview (same columns as above, now restricted to low-similarity BGCs).*
#### BiG-FAM query filtering
We will also remove all BGCs that have a match to other BGCs in the BiG-FAM database.
??? code "Code"

    ```python
    mask_no_bigfam_hit = ~df_antismash_regions.bgc_id.isin(df_bigfam_hits.bgc_id.unique())
    display(HTML(DT(df_antismash_regions[mask_unknown_family & mask_similarity & mask_no_bigfam_hit].loc[:, columns_to_show], scrollX=True)))
    ```
*Interactive table preview (same columns as above, now excluding BGCs with BiG-FAM hits).*
#### ARTS hits filtering
We will now select only BGCs in close proximity to one of the ARTS2 models. Note that ARTS2 actually has several criteria to prioritise BGCs with possible antibiotic activity; here we simply search for all BGCs that have hits to an ARTS2 profile (a stricter variant is sketched after the table below).
??? code "Code"

    ```python
    mask_arts_hit = df_antismash_regions.bgc_id.isin(df_arts_hits.bgc_id.dropna().unique())
    df_pks_filtered = df_antismash_regions[mask_unknown_family & mask_similarity & mask_no_bigfam_hit & mask_arts_hit]
    display(HTML(DT(df_pks_filtered.loc[:, columns_to_show], scrollX=True)))
    ```
*Interactive table preview of `df_pks_filtered` (same columns as above).*
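If you want to lean harder on the ARTS2 evidence, the hits table also carries the Dup, HGT, and Res flags, which in the raw CSV are encoded as ✔/✖ strings (as seen in the aggregation step of Tutorial 2). A minimal sketch of a stricter mask under that assumption:

```python
# hypothetical stricter ARTS filter: keep only BGCs whose hits are supported
# by gene duplication or resistance evidence ('✔'/'✖' encoding assumed)
strict_hits = df_arts_hits[(df_arts_hits["Dup"] == "✔") | (df_arts_hits["Res"] == "✔")]
mask_arts_strict = df_antismash_regions.bgc_id.isin(strict_hits.bgc_id.dropna().unique())
df_pks_strict = df_antismash_regions[mask_unknown_family & mask_similarity & mask_no_bigfam_hit & mask_arts_strict]
```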
### Conclusion for Tutorial 1
??? code "Code"

    ```python
    text3 = f"We are now left with **{df_pks_filtered.shape[0]} PKS BGCs** of interest which can be further investigated in-depth."
    Markdown(text3)
    ```
We are now left with **15 PKS BGCs** of interest which can be further investigated in-depth.
This tutorial helped us narrow down our search to potentially novel PKSs. Nevertheless, each BGC still needs an in-depth look at its structure to see whether it has all the genes required for natural product biosynthesis.
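To hand this shortlist over for such an investigation, you can write it out next to the other report assets. A minimal sketch, where the output filename is our own choice rather than a BGCFlow convention:

```python
# save the shortlisted BGCs for downstream inspection
# (the filename here is a hypothetical choice)
output_table = Path("assets/tables/pks_bgc_candidates.csv")
output_table.parent.mkdir(parents=True, exist_ok=True)
df_pks_filtered.loc[:, columns_to_show].to_csv(output_table, index=False)
```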
We also recommend trying out Metabase for a more interactive way of exploring the datasets; see the how-to guide in our Wiki.
## Tutorial 2: Exploring Integrated Network
In the previous tutorial, we showed how to create a custom notebook and how to explore and filter the data using the tables generated by BGCFlow.
In this notebook, we will do an integrated analysis utilizing networks of knowledge bases from different tools: `BiG-SCAPE`, antiSMASH `KnownClusterBlast`, the `BiG-FAM` database query, and `ARTS2`. We will generate a `graphml` file with annotations that can be loaded into Cytoscape, and we will also attempt to visualize the network in this notebook.
??? code "Code"

    ```python
    import pandas as pd
    import networkx as nx
    from pathlib import Path
    import json
    import numpy as np
    import plotly.graph_objects as go
    import seaborn as sns
    from IPython.display import display, Markdown, HTML

    from itables import to_html_datatable as DT
    import itables.options as opt

    opt.css = """
    .itables table td { font-style: italic; font-size: .8em;}
    .itables table th { font-style: oblique; font-size: .8em; }
    """
    opt.classes = ["display", "compact"]
    opt.lengthMenu = [5, 10, 20, 50, 100, 200, 500]


    def create_node_trace(G, node_trace_category, color, showtextlabel=False, nodesize=10, nodeopacity=0.8,
                          nodesymbol="circle", linewidth=1, linecolor="black", textposition="top center", showlegend=False):
        """
        Create a node trace for a given graph.

        Parameters:
        G (networkx.Graph): The graph to create the node trace for.
        node_trace_category (str): The category of the node trace.
        color (str): The color of the nodes.
        showtextlabel (bool): Whether to show text labels for the nodes.
        nodesize (int): The size of the nodes.
        nodeopacity (float): The opacity of the nodes.
        nodesymbol (str): The symbol used for the nodes.
        linewidth (int): The width of the lines.
        linecolor (str): The color of the lines.
        textposition (str): The position of the text labels.
        showlegend (bool): Whether to show the legend.

        Returns:
        go.Scatter: The node trace.
        """
        if showtextlabel:
            markermode = "markers+text"
        else:
            markermode = "markers"
        nodes = np.array([node for node in G.nodes() if G.nodes[node]["node_trace"] == node_trace_category])
        pos = np.array([G.nodes[node]['pos'] for node in nodes.flatten()]).reshape(-1, 2)
        xs, ys = pos[:, 0], pos[:, 1]
        texts = np.array([G.nodes[node]['text'] for node in nodes if "text" in G.nodes[node].keys()])
        node_trace = go.Scatter(
            x=xs.tolist(),
            y=ys.tolist(),
            text=texts.tolist(),
            textposition=textposition,
            mode=markermode,
            hoverinfo='text',
            name=node_trace_category,
            showlegend=showlegend,
            marker=dict(
                symbol=nodesymbol,
                opacity=nodeopacity,
                showscale=False,
                color=color,
                size=nodesize,
                line=dict(width=linewidth, color=linecolor)))
        return node_trace


    def create_edge_trace(Graph, name, showlegend=False, color='#888', width=0.5, opacity=0.8, dash="solid"):
        """
        Create an edge trace for a given graph.

        Parameters:
        Graph (networkx.Graph): The graph to create the edge trace for.
        name (str): The name of the edge trace.
        showlegend (bool): Whether to show the legend.
        color (str): The color of the edges.
        width (float): The width of the edges.
        opacity (float): The opacity of the edges.
        dash (str): The style of the edges.

        Returns:
        go.Scatter: The edge trace.
        """
        edge_trace = go.Scatter(
            x=[],
            y=[],
            name=name,
            opacity=opacity,
            line=dict(width=width, color=color, dash=dash),
            hoverinfo='none',
            mode='lines',
            showlegend=showlegend)
        edges = np.array([edge for edge in Graph.edges() if Graph.edges[edge]["relation_type"] == name])
        pos = np.array([Graph.nodes[e]['pos'] for e in edges.flatten()]).reshape(-1, 2)
        xs = np.insert(pos[:, 0], np.arange(2, len(pos[:, 0]), 2), None)
        ys = np.insert(pos[:, 1], np.arange(2, len(pos[:, 1]), 2), None)
        edge_trace['x'] = xs
        edge_trace['y'] = ys
        return edge_trace


    def get_graph_stats(g, graph_name):
        """
        Get statistics for a given graph.

        Parameters:
        g (networkx.Graph): The graph to get statistics for.
        graph_name (str): The name of the graph.

        Returns:
        str: A string containing the statistics for the graph.
        """
        num_nodes_g = g.number_of_nodes()
        num_edges_g = g.number_of_edges()
        avg_degree_g = sum(dict(g.degree()).values()) / num_nodes_g
        num_bgc_g = sum(1 for _, data in g.nodes(data=True) if data.get('node_trace') == 'BGC')
        return f" - {graph_name} : **{num_nodes_g}** total nodes (**{num_bgc_g}** BGC nodes), **{num_edges_g}** edges, {avg_degree_g:.2f} average degree"
    ```
### Handling Graphs: Reading, Filtering, and Merging Networks
We will start by defining some parameters for filtering:
??? code "Code"

    ```python
    bigscape_cutoff = "0.30"
    bigfam_rank_filter = 1  # only select first hits of bigfam models
    knownclusterblast_similarity_cutoff = 0.8  # select antismash knownclusterblast hits above 0.8
    ```
Then we will list all the files that we need:
??? code "Code"

    ```python
    report_dir = Path("../")
    dependency_version = report_dir / "metadata/dependency_versions.json"
    with open(dependency_version, "r") as file:
        dependency_version = json.load(file)

    antismash_version = dependency_version["antismash"]
    assets_dir = report_dir / "docs/assets"
    integrated_bigscape_network = assets_dir / f"data/bigscape_{bigscape_cutoff}_as{antismash_version}.graphml"
    bigfam_network = assets_dir / f"data/query_bigfam_as{antismash_version}_network.json"
    arts_hits = assets_dir / f"tables/arts_hits_as{antismash_version}.csv"
    ```
We will start by reading the input files into memory:
??? code "Code"

    ```python
    # read ARTS table
    df_arts_hits = pd.read_csv(arts_hits)

    integrated_bigscape_network_graph = nx.read_graphml(integrated_bigscape_network)

    # Iterate over the nodes of integrated_bigscape_network_graph
    for n, data in integrated_bigscape_network_graph.nodes(data=True):
        # If node_trace is not in the node's attributes, add it with a default value
        if 'node_trace' not in data:
            if n.startswith('BGC'):
                integrated_bigscape_network_graph.nodes[n]['node_trace'] = 'MIBIG_BiG-SCAPE'
            else:
                integrated_bigscape_network_graph.nodes[n]['node_trace'] = 'BGC'

    with open(bigfam_network, "r") as f:
        graph_data = json.load(f)

    bigfam_network_graph = nx.readwrite.json_graph.node_link_graph(graph_data)
    for n, data in bigfam_network_graph.nodes(data=True):
        bigfam_network_graph.nodes[n]['node_trace'] = bigfam_network_graph.nodes[n]['node_type']
    bigfam_network_graph = nx.relabel_nodes(bigfam_network_graph, lambda x: str(x))
    ```
Then we continue by filtering the networks based on our cutoff definitions. In this first part, we split the combined BiG-SCAPE / antiSMASH KnownClusterBlast network back into its component parts:
??? code "Code"

    ```python
    # Filter edges with relation_type = 'bigscape_similarity'
    edges_with_bigscape_similarity = [(u, v) for u, v, d in integrated_bigscape_network_graph.edges(data=True) if d['relation_type'] == 'bigscape_similarity']

    # Filter edges with relation_type = 'knownclusterblast' and similarity above the threshold
    edges_with_knownclusterblast = [(u, v) for u, v, d in integrated_bigscape_network_graph.edges(data=True) if d['relation_type'] == 'knownclusterblast' and d['similarity'] > knownclusterblast_similarity_cutoff]

    # Create subgraphs
    bigscape_network_graph_knownclusterblast = integrated_bigscape_network_graph.edge_subgraph(edges_with_knownclusterblast)
    bigscape_network_graph_bigscape_similarity = integrated_bigscape_network_graph.edge_subgraph(edges_with_bigscape_similarity)

    # Calculate stats for bigscape_network_graph
    display(Markdown(get_graph_stats(integrated_bigscape_network_graph, "integrated_bigscape_network_graph")))

    # Calculate stats for bigscape_network_graph_knownclusterblast
    display(Markdown(get_graph_stats(bigscape_network_graph_knownclusterblast, f"bigscape_network_graph_knownclusterblast with similarity > {knownclusterblast_similarity_cutoff}")))

    # Calculate stats for bigscape_network_graph_bigscape_similarity
    display(Markdown(get_graph_stats(bigscape_network_graph_bigscape_similarity, "bigscape_network_graph_bigscape_similarity")))
    ```
- integrated_bigscape_network_graph : **898** total nodes (**739** BGC nodes), **1809** edges, 4.03 average degree
- bigscape_network_graph_knownclusterblast with similarity > 0.8 : **156** total nodes (**135** BGC nodes), **135** edges, 1.73 average degree
- bigscape_network_graph_bigscape_similarity : **740** total nodes (**737** BGC nodes), **1201** edges, 3.25 average degree
Next, we will filter the BiG-FAM hits based on the top taxa distribution in the BiG-FAM database:
??? code "Code"

    ```python
    bigfam_models_stats = pd.DataFrame.from_dict({n: data for n, data in bigfam_network_graph.nodes(data=True) if data["node_trace"] != "BGC"}).T

    deleted_bigfam = []
    bigfam_taxa_cutoff = 0.3
    bigfam_filter = bigfam_models_stats[bigfam_models_stats.top_taxa_proportion <= bigfam_taxa_cutoff]
    for n in bigfam_filter.index:
        try:
            bigfam_network_graph.remove_node(n)
            deleted_bigfam.append(n)
        except nx.NetworkXError as e:
            print(e)

    deleted_bigfam = ', '.join([str(i) for i in deleted_bigfam])
    display(Markdown(f"Deleted BigFam models: {deleted_bigfam}"))
    ```
Deleted BigFam models: 202087, 203037, 210179, 213140, 208355, 202433, 225711, 201637, 201682, 201608, 202162, 205957, 215277, 202082, 200980, 201592, 218961, 227201, 201830, 201902
??? code "Code"

    ```python
    # Filter edges with rank below the cutoff (i.e., rank = 0 for bigfam_rank_filter = 1)
    edges_with_rank_filtered = [(u, v) for u, v, d in bigfam_network_graph.edges(data=True) if d['rank'] < bigfam_rank_filter]

    # Create subgraph
    bigfam_network_graph_filtered = bigfam_network_graph.edge_subgraph(edges_with_rank_filtered)

    # Iterate over the edges of bigfam_network_graph_filtered
    for u, v, data in bigfam_network_graph_filtered.edges(data=True):
        # Set a default relation_type attribute if it is missing
        if "relation_type" not in data.keys():
            data["relation_type"] = "bigfam_similarity"

    # Calculate stats for bigfam_network_graph
    display(Markdown(get_graph_stats(bigfam_network_graph, "bigfam_network_graph")))

    # Calculate stats for bigfam_network_graph_filtered
    display(Markdown(get_graph_stats(bigfam_network_graph_filtered, "bigfam_network_graph_filtered")))
    ```
- bigfam_network_graph : **470** total nodes (**367** BGC nodes), **362** edges, 1.54 average degree
- bigfam_network_graph_filtered : **424** total nodes (**331** BGC nodes), **331** edges, 1.56 average degree
Once we have cleaned up the subgraphs, we recombine them all again:
??? code "Code"

    ```python
    # Combine graphs
    G = nx.compose(bigscape_network_graph_bigscape_similarity, bigscape_network_graph_knownclusterblast)
    G = nx.compose(G, bigfam_network_graph_filtered)
    display(Markdown(get_graph_stats(G, "integrated_graph")))
    ```
- integrated_graph : **855** total nodes (**739** BGC nodes), **1667** edges, 3.90 average degree
Finally, we will add the metadata of the ARTS2 hits for prioritization:
??? code "Code"

    ```python
    def custom_agg(x):
        if x.dtypes == object:
            return ', '.join(x)
        else:
            return x.sum()

    df_arts_hits_filtered = df_arts_hits[(~df_arts_hits.bgc_id.isna()) & (df_arts_hits.hits > 1)].loc[:, ["bgc_id", "profile", "hits", "Dup", "HGT", "Res", "BGC"]]
    df_arts_hits_filtered = df_arts_hits_filtered.apply(lambda x: x.map(lambda y: y.replace('✔', '1').replace('✖', '0') if isinstance(y, str) else y))
    df_arts_hits_filtered = df_arts_hits_filtered.apply(lambda x: x.map(lambda y: int(y) if isinstance(y, str) and y.isdigit() else y))
    df_arts_hits_filtered = df_arts_hits_filtered.groupby("bgc_id").agg(custom_agg)
    df_arts_hits_filtered = df_arts_hits_filtered.rename(columns={c: f"ARTS_{c}" for c in df_arts_hits_filtered.columns})
    display(HTML(DT(df_arts_hits_filtered.reset_index(), scrollX=True)))
    ```
*Interactive table preview (columns: bgc_id, ARTS_profile, ARTS_hits, ARTS_Dup, ARTS_HGT, ARTS_Res, ARTS_BGC).*
This table is then loaded as node information into the graph:
??? code "Code"

    ```python
    for n in df_arts_hits_filtered.index:
        for c in df_arts_hits_filtered.columns:
            G.nodes[n][c] = df_arts_hits_filtered.loc[n, c]
    ```
This is followed by some sanitization:
??? code "Code"

    ```python
    # Get self-loops (materialize the iterator before removing edges)
    self_loops = list(nx.selfloop_edges(G))

    # Check if there are any self-loops
    if self_loops:
        # Remove self-loops
        G.remove_edges_from(self_loops)

    # Iterate over the nodes of the graph, getting the node and its attributes
    for n, data in G.nodes(data=True):
        # Create a list of keys to remove after iterating over the dictionary
        keys_to_remove = []
        # Iterate over the items in the attributes dictionary
        for k, v in data.items():
            # Check if the value is not of a type compatible with GraphML
            if isinstance(v, list):
                data[k] = ", ".join([str(i) for i in v])
            elif v is None:
                # Add the key to the list of keys to remove
                keys_to_remove.append(k)
            elif not isinstance(v, (int, float, str, bool, np.int64)):
                print(f"Node {n} has attribute {k} of incompatible type {type(v)}")
        # Remove the keys with None values
        for key in keys_to_remove:
            del G.nodes[n][key]
    ```
And now, the graph is ready and can be downloaded for display in network visualization tools:
??? code "Code"

    ```python
    outfile = Path(f"assets/integrated_network__bigscape_{bigscape_cutoff}__bigfam_{bigfam_rank_filter}__knownclusterblast_{knownclusterblast_similarity_cutoff}.graphml")
    outfile.parent.mkdir(parents=True, exist_ok=True)
    nx.write_graphml(G, outfile)
    display(Markdown(f"[Download Graph]({str(outfile)})" + '{:target="_blank" .md-button}'))
    ```
### Network Visualization
While we recommend dedicated network visualization tools such as Cytoscape or Gephi, we can also attempt to visualize the network in Python:
??? code "Code"

    ```python
    # define layout options
    options = {
        'prog': 'twopi',
        # 'args': ' '.join(['-Gstart=10', '-Goverlap_scaling=-100'])
    }

    # position nodes
    pos = nx.nx_agraph.pygraphviz_layout(G, **options)  # , args='-Goverlap=false -Elen=weight'
    for n, p in pos.items():
        G.nodes[n]['pos'] = p
        # set up text display
        node_trace = G.nodes[n]["node_trace"]
        if "text" not in G.nodes[n].keys():
            text = [n]
            for k, v in G.nodes[n].items():
                if node_trace == "BGC":
                    if k in ["genome_id", "product", "bigscape_class", "Organism"]:
                        text.append(f"{k} : {v}")
            text = "<br>".join(text)
            G.nodes[n]['text'] = text

    bigscape_class_labels = set([data['bigscape_class'] for n, data in G.nodes(data=True) if data["node_trace"] == "BGC"])
    bigscape_class_colors = sns.color_palette("colorblind", len(bigscape_class_labels)).as_hex()

    # define visualization
    edge_annotation_map = {'bigscape_similarity': {'color': 'black', 'width': 10},
                           'knownclusterblast': {'color': 'grey', 'width': 0.1},
                           'bigfam_similarity': {'color': 'grey', 'width': 0.1},
                           }

    node_annotation_map = {'MIBIG_BiG-SCAPE': {'color': 'green', 'node_symbol': 'star'},
                           'MIBIG_knownclusterblast': {'color': 'blue', 'node_symbol': 'star'},
                           'BGC': {'color': 'grey', 'node_symbol': 'circle'},
                           'BiG-FAM GCFs': {'color': 'yellow', 'node_symbol': 'triangle-up'},
                           }
    ```
??? code "Code"

    ```python
    traces = []
    node_trace = []
    edge_trace = []

    for e in edge_annotation_map.keys():
        dash = "solid"
        if 'knownclusterblast' in e:
            dash = "dot"
        edge_trace = create_edge_trace(G, e, color=edge_annotation_map[e]['color'], dash=dash, showlegend=True)
        traces.append(edge_trace)

    for trace in node_annotation_map.keys():
        nodeopacity = 0.5
        showtextlabel = False
        linecolor = "black"
        linewidth = 0.5
        textposition = "top left"
        node_size = 8
        if trace in bigscape_class_labels:
            nodeopacity = 0.8
        node_trace = create_node_trace(G, trace, node_annotation_map[trace]['color'], showtextlabel=showtextlabel,
                                       nodesymbol=node_annotation_map[trace]['node_symbol'], nodeopacity=nodeopacity,
                                       showlegend=True, linecolor=linecolor, linewidth=linewidth, nodesize=node_size,
                                       textposition=textposition)
        traces.append(node_trace)

    G_arts = G.copy()
    for n in G_arts.nodes:
        if n in df_arts_hits_filtered.index:
            G_arts.nodes[n]["node_trace"] = "ARTS_hits"

    nodeopacity = 0.5
    showtextlabel = False
    node_symbol = "circle"
    node_color = "grey"
    linecolor = "red"
    linewidth = 1
    textposition = "top left"
    node_size = 8
    node_trace = create_node_trace(G_arts, "ARTS_hits", node_color, showtextlabel=showtextlabel,
                                   nodesymbol=node_symbol, nodeopacity=nodeopacity,
                                   showlegend=True, linecolor=linecolor, linewidth=linewidth, nodesize=node_size,
                                   textposition=textposition)
    traces.append(node_trace)

    fig = go.Figure(data=traces,
                    layout=go.Layout(
                        paper_bgcolor='rgba(0,0,0,0)',
                        plot_bgcolor='white',
                        showlegend=True,
                        hovermode='closest',
                        margin=dict(b=20, l=5, r=5, t=40),
                        xaxis=dict(showgrid=False, zeroline=False, showticklabels=False, linecolor='black', mirror=True, linewidth=1),
                        yaxis=dict(showgrid=False, zeroline=False, showticklabels=False, linecolor='black', mirror=True, linewidth=1),
                        width=800, height=600)
                    )

    outfile = Path("assets/figures/integrated_network.html")
    outfile.parent.mkdir(parents=True, exist_ok=True)
    fig.write_html(outfile)
    display(HTML(filename=str(outfile)))
    ```
### Adding adjacency edges
If you have high-quality (complete) genomes, it might be interesting to see the position of each BGC region relative to the others in a genome. Unfortunately, physics-based layout algorithms in Python are limited, so it is better to use Cytoscape or Gephi to visualize this network.
??? code "Code"

    ```python
    antismash_table = report_dir / f"tables/df_regions_antismash_{antismash_version}.csv"
    df_antismash = pd.read_csv(antismash_table)

    # Create a new graph
    filtered_graph = nx.Graph()

    # Iterate over the nodes in the original graph
    for node, data in G.nodes(data=True):
        # If the node meets the condition, add it to the new graph
        if data.get('node_trace') == 'BGC':
            filtered_graph.add_node(node, **data)

    # Now add only the edges that connect the nodes in the new graph
    for u, v, data in G.edges(data=True):
        if filtered_graph.has_node(u) and filtered_graph.has_node(v):
            filtered_graph.add_edge(u, v, **data)

    for i in df_antismash.index:
        current_bgc = df_antismash.loc[i].to_dict()
        next_index = i + 1
        if next_index < len(df_antismash):
            neighbor_bgc = df_antismash.loc[next_index].to_dict()
            if current_bgc["accession"] == neighbor_bgc["accession"]:
                distance = neighbor_bgc["start_pos"] - current_bgc["end_pos"]
                assert distance > 0
                if not filtered_graph.has_edge(current_bgc["bgc_id"], neighbor_bgc["bgc_id"]):
                    filtered_graph.add_edge(current_bgc["bgc_id"], neighbor_bgc["bgc_id"], distance_bp=distance, relation_type="genomic_adjacency")

    display(Markdown(get_graph_stats(filtered_graph, "integrated_graph_with_adjacency")))
    ```
- integrated_graph_with_adjacency : **739** total nodes (**739** BGC nodes), **1598** edges, 4.32 average degree
??? code "Code"

    ```python
    # Get self-loops (materialize the iterator before removing edges)
    self_loops = list(nx.selfloop_edges(filtered_graph))

    # Check if there are any self-loops
    if self_loops:
        # Remove self-loops
        filtered_graph.remove_edges_from(self_loops)

    # Iterate over the nodes of the graph, getting the node and its attributes
    for n, data in filtered_graph.nodes(data=True):
        # Create a list of keys to remove after iterating over the dictionary
        keys_to_remove = []
        # Iterate over the items in the attributes dictionary
        for k, v in data.items():
            # Check if the value is not of a type compatible with GraphML
            if isinstance(v, (list, tuple)):
                data[k] = ", ".join([str(i) for i in v])
            elif v is None:
                # Add the key to the list of keys to remove
                keys_to_remove.append(k)
            elif not isinstance(v, (int, float, str, bool, np.int64)):
                print(f"Node {n} has attribute {k} of incompatible type {type(v)}")
        # Remove the keys with None values
        for key in keys_to_remove:
            del filtered_graph.nodes[n][key]

    outfile = Path(f"assets/data/bigscape_{bigscape_cutoff}_as{antismash_version}_with_genomic_position.graphml")
    outfile.parent.mkdir(parents=True, exist_ok=True)
    nx.write_graphml(filtered_graph, outfile)
    display(Markdown(f"[Download Graph]({str(outfile)})" + '{:target="_blank" .md-button}'))
    ```
## Displaying the notebook in the report
To display this notebook in the HTML report, we need to convert it into a Markdown file.
First, we need to serve the BGCFlow report with this command:

```bash
bgcflow serve --project <project name>
```

The report will be served locally at http://localhost:8001/.
Then, to generate the Markdown, use nbconvert in the terminal:

```bash
# cd to the docs folder containing this notebook
jupyter nbconvert --to markdown \
    --execute "custom_tutorial.ipynb" \
    --output "custom_tutorial.md" \
    --template "admonition" \
    --TemplateExporter.extra_template_basedirs="../../../../workflow/notebook/nb_convert"
```
You can also use the `--no-input` flag if you don't want to show the code cells.
Then, add the Markdown file to the `mkdocs.yml` navigation:
```yaml
extra:
  social:
  - icon: fontawesome/brands/twitter
    link: https://twitter.com/NPGMgroup
  - icon: fontawesome/brands/github
    link: https://github.com/NBChub/bgcflow
markdown_extensions:
- attr_list
- admonition
- pymdownx.details
- pymdownx.superfences
nav:
- Home: index.md
- QC and Data Selection:
  - seqfu: seqfu.md
  - mash: mash.md
  - fastani: fastani.md
  - checkm: checkm.md
- Functional Annotation:
  - prokka-gbk: prokka-gbk.md
  - deeptfactor: deeptfactor.md
- Genome Mining:
  - antismash: antismash.md
  - query-bigslice: query-bigslice.md
  - bigscape: bigscape.md
  - bigslice: bigslice.md
  - arts: arts.md
  - cblaster-genome: cblaster-genome.md
  - cblaster-bgc: cblaster-bgc.md
- Phylogenomic Placement:
  - automlst-wrapper: automlst-wrapper.md
- Comparative Genomics:
  - roary: roary.md
  - eggnog-roary: eggnog-roary.md
- Custom Reports:
  - How to Add Custom Analysis to the BGCFlow Report: custom_tutorial.md
theme:
  icon:
    admonition:
      note: octicons/tag-16
      code: material/code-tags
...
```
Once you have updated the `mkdocs.yml` file, the site will be regenerated to show your newly added report. The report can then be opened at http://localhost:8001/custom_tutorial/.