BGCflow data structure
Overview
Teaching: 0 min
Exercises: 5 min
Questions
Where can I find the analysis results of my run?
Objectives
Finding the relevant BGCflow analysis results
Exploring the output of BGCflow
In this workshop, we have finished running all analyses for our s_venezuelae project. You can find the results on the VM at /datadrive/bgcflow/data. To list its first two levels, run:
tree -L 2 /datadrive/bgcflow/data/
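If tree is not installed on your machine, find can produce a comparable listing. The sketch below is only an illustration: the DATA_DIR variable and the list_levels helper are names introduced here and are not part of BGCflow.

```shell
# Assumption: DATA_DIR points at the BGCflow output directory from this lesson.
DATA_DIR="${DATA_DIR:-/datadrive/bgcflow/data}"

# list_levels DIR DEPTH: print every path under DIR down to DEPTH levels, sorted.
# (Hypothetical helper, roughly comparable to `tree -L DEPTH DIR`.)
list_levels() {
    find "$1" -maxdepth "$2" | sort
}

if [ -d "$DATA_DIR" ]; then
    list_levels "$DATA_DIR" 2
else
    echo "Directory not found: $DATA_DIR"
fi
```

The output is a flat list of paths rather than an indented tree, but it covers the same directories.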
You can also create a symlink to that directory so you can explore it in VS Code. To look one level deeper into the directory, run:
tree -L 3 /datadrive/bgcflow/data/
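The symlink mentioned above could be created along these lines. This is a minimal sketch: the link location (LINK_PATH, a folder name inside your home directory) is an assumption made for illustration, so adjust it to wherever your VS Code session opens.

```shell
# Assumption: DATA_DIR is the BGCflow output directory from this lesson.
DATA_DIR="${DATA_DIR:-/datadrive/bgcflow/data}"
# Assumption: LINK_PATH is a hypothetical location for the link in your home directory.
LINK_PATH="${LINK_PATH:-$HOME/bgcflow_data}"

# -s makes a symbolic link, -f replaces an existing link of the same name,
# -n treats an existing link to a directory as a file instead of descending into it.
ln -sfn "$DATA_DIR" "$LINK_PATH"

# Print the target the link points to, as a quick check.
readlink "$LINK_PATH"
```

Once the link exists, the data directory should be browsable from the folder that contains it.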
BGCflow adopts the cookiecutter data science directory structure. Output files can be found in the data directory, and are split into three different stages:
- The processed directory contains most of the output required for downstream analysis
- The interim directory contains the direct output of each rule, which is not fine-tuned for downstream analysis
- The raw directory contains user-provided input files
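To get a quick feel for how much each stage holds, you could count the files under each of the three directories. The sketch below assumes the raw, interim and processed directories sit directly under the data directory; summarize_stages is a hypothetical helper written for this lesson, not a BGCflow command.

```shell
# Assumption: DATA_DIR is the BGCflow output directory from this lesson.
DATA_DIR="${DATA_DIR:-/datadrive/bgcflow/data}"

# summarize_stages DIR: report the number of regular files in each stage directory.
summarize_stages() {
    for stage in raw interim processed; do
        if [ -d "$1/$stage" ]; then
            # tr strips the padding that some wc implementations add
            n=$(find "$1/$stage" -type f | wc -l | tr -d ' ')
            echo "$stage: $n files"
        else
            echo "$stage: not found"
        fi
    done
}

summarize_stages "$DATA_DIR"
```

Symlinked files are not counted, because find -type f only matches regular files.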
Give yourself time to look through the different output directories.
Which files are important?
It depends. Different research questions require different analyses, and therefore different files for downstream analysis. In the Natural Products Genome Mining group, we aim to aid students and researchers by providing example Jupyter notebooks for processing each output type. This is a work in progress, and we welcome anyone who would like to contribute.
In the next session, I will give an example of exploratory data analysis on BiG-SLiCE query results against the BiG-FAM database.
Key Points
BGCflow adopts the cookiecutter data science directory structure
Output files can be found in the data directory, and are split into three different stages
The processed directory contains most of the output required for downstream analysis