Ants, Bees, Genomes & Evolution @ Queen Mary University London

Project structures for genomics analyses

How do you structure your files and folders for genomics analyses?

One challenge is that many analyses require multiple steps, so keeping every step in one place quickly becomes a mess.

So we should structure our analyses across multiple folders. But how should we name them and keep track of their order?

Another (key) challenge in genomics is that we often have to perform the same analysis multiple times.

So how do we keep track of the different steps and versions of analyses?

The standard approach we use for all projects in the lab is derived from ideas initially proposed by William Noble in "A Quick Guide to Organizing Computational Biology Projects". We have adjusted that initial model based on our experience with dozens of projects over the years, as well as discussions with Julien Roux, Anurag Priyam, and Roddy Pracana.


The easiest way to explain this is with an example of how it works at the simplest level.

Example:

2016-04-14-bombus_variant_calling
├── input
│   ├── 2016-04-14-bombus_raw_28_samples
│   │   ├── sample1.fq    # could link to /data/SBCS-WurmLab/archive/db/genomic/reads/...
│   │   ├── sample2.fq
│   │   ├── sample3.fq
│   │   ├── bombus_genome.fa -> ~/db/genomic/B_terrestris/Bter20110317-genome.fa
│   │   └── WHATIDID.sh  # list of ln -s, cp or wget/curl commands 
│   └── 2016-04-16-cleaned_reads
│       ├── sample1.fq.gz   -> ../../results/2016-04-14-read_cleaning/results/sample1.clean.fq.gz
│       ├── sample2.fq.gz   -> ../../results/2016-04-14-read_cleaning/results/sample2.clean.fq.gz
│       ├── sample3.fq.gz   -> ../../results/2016-04-14-read_cleaning/results/sample3.clean.fq.gz
│       └── WHATIDID.sh  # just the ln -s commands.
├── results
│   ├── 2016-04-14-read_cleaning
│   │   ├── input        -> ../../input/2016-04-14-bombus_raw_28_samples
│   │   ├── results                                # only a few files here
│   │   ├── sratoolkit   -> ../../soft/sratoolkit-2.4.2/bin/
│   │   ├── tmp                                    # use real scratch dir if more appropriate
│   │   ├── ENVIRONMENT.sh                         # if any particular software, modules or containers need to be loaded
│   │   └── WHATIDID.txt                           # or equivalent .sh or .Rmd (or knitr/jupyter)
│   ├── 2016-04-16-mapping_to_reference
│   │   ├── input        -> ../../input/2016-04-16-cleaned_reads
│   │   ├── results                                # only a few files here
│   │   ├── tmp                                    # use real scratch dir if more appropriate
│   │   ├── ENVIRONMENT.sh                         # if any particular software, modules or containers need to be loaded
│   │   └── WHATIDID.txt                           # or equivalent .sh or .Rmd (or knitr/jupyter)
│   └── WHATIDID.txt                               # for overall rationale
└── soft
    ├── sratoolkit-2.4.2                           # if installed locally
    ├── bwa              -> /share/apps/sbcs/bwa/0.6.2/bin/bwa
    └── # links to other software if needed
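
As a rough sketch, the skeleton for the first step of a project like the one above could be created from the shell as follows. These are not necessarily the exact commands we run: the temporary base directory and the variable names (base, project, raw, step) are assumptions for illustration only, and the folder names are simply copied from the example.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for wherever your projects live; replace with a real path.
base=$(mktemp -d)

# Names copied from the example tree above.
project="$base/2016-04-14-bombus_variant_calling"
raw="2016-04-14-bombus_raw_28_samples"
step="2016-04-14-read_cleaning"

# Top-level folders, plus the folders of the first analysis step.
mkdir -p "$project/input/$raw" \
         "$project/soft" \
         "$project/results/$step/results" \
         "$project/results/$step/tmp"

# The step's input is a relative symlink back to the project's input folder,
# so it still resolves if the whole project folder is moved or copied.
ln -s "../../input/$raw" "$project/results/$step/input"

# Empty log files, to be filled in as the work is done.
touch "$project/input/$raw/WHATIDID.sh" \
      "$project/results/$step/WHATIDID.txt" \
      "$project/results/WHATIDID.txt"

# Show what was created for the first step.
ls -l "$project/results/$step"
```

Each later step (for example 2016-04-16-mapping_to_reference) would get its own dated folder under results in the same way, with its input link pointing at the matching folder under input.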

Explicit (partial) conventions

Conventions include:


October 1, 2018