You need to have gone through Part 1: Read cleaning before starting this practical.
Many different pieces of software exist for genome assembly. We will be using SPAdes.
Create a new main directory for today's practical (e.g., 2020-09-xx-assembly) and the input, tmp, and results subdirectories.
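A minimal sketch of that setup (directory names follow the example above; adjust the date to your own):
mkdir ~/2020-09-xx-assembly
cd ~/2020-09-xx-assembly
mkdir input tmp results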
Then link the output (cleaned reads) from yesterday's practical into the input subdirectory:
cd ~/2020-09-xx-assembly
ln -s ~/2020-09-xx-read_cleaning/results/reads.pe*.clean.fq input/
Did you note the use of * in the above command? What do you think it means? (hint: it is called 'globbing')
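If you want to see what the shell turns the pattern into before a command runs, here is a purely illustrative check:
# The shell replaces the pattern with every matching filename before the command is executed:
ls input/reads.pe*.clean.fq
echo input/reads.pe*.clean.fq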
To assemble our cleaned reads with SPAdes, run the following command. THIS WILL TAKE ABOUT 10 MINUTES.
spades.py -o tmp -1 input/reads.pe1.clean.fq -2 input/reads.pe2.clean.fq
Like any other assembler, SPAdes creates many files, including a scaffolds.fasta file that is likely to be used for follow-up analyses. Copy this file to your results directory:
cp tmp/scaffolds.fasta results/
Take a look at the contents of this file (e.g., less results/scaffolds.fasta). Does it contain a lot of NNNN sequences? What do you think might be the reason for that? (Do not worry if your assembly does not contain NNNN sequences.)
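A couple of quick, rough checks (just sketches; less or any text viewer works too):
# Count the scaffolds in the assembly:
grep -c '^>' results/scaffolds.fasta
# Count the sequence lines containing a run of at least four Ns (0 means no NNNN at all):
grep -c 'NNNN' results/scaffolds.fasta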
There are many other genome assembly approaches. While waiting for everyone to make it to this stage, try to understand some of the challenges of de novo genome assembly and the approaches used to overcome them via the following papers:
How do we know if our genome is good?
"... the performance of different de novo* genome assembly algorithms can vary greatly on the same dataset, although it has been repeatedly demonstrated that no single assembler is optimal in every possible quality metric [6, 7, 8]. The most widely used metrics for evaluating an assembly include 1) contiguity statistics such as scaffold and contig N50 size, 2) accuracy statistics such as the number of structural errors found when compared with an available reference genome (GAGE (Genome Assembly Gold Standard Evaluation) evaluation tool [8]), 3) presence of core eukaryotic genes (CEGMA (Core Eukaryotic Genes Mapping Approach) [9]) or, if available, transcript mapping rates, and 4) the concordance of the sequence with remapped paired-end and mate-pair reads (REAPR (Recognizing Errors in Assemblies using Paired Reads) [10], assembly validation [11], or assembly likelihood [12])."* - Wences & Schatz (2015)
Assembly software will generally provide some statistics about what it did, but the output formats differ between assemblers. Quast, the Quality Assessment Tool for Genome Assemblies, creates a standardized report. Run Quast (quast.py) on the scaffolds.fasta file. No special options are needed - just the simple scenario to get some statistics.
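A minimal invocation along these lines should work (the output directory name is only a suggestion):
quast.py results/scaffolds.fasta -o tmp/quast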
Have a look at the report (pdf or html) Quast generated (hint: copy Quast's output directory over to ~/www/tmp).
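For example (assuming you asked Quast to write to tmp/quast, as in the sketch above; otherwise substitute your own output directory):
cp -r tmp/quast ~/www/tmp/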
What do the values in the table mean? For which ones is higher better, and for which is lower better? Why do you think Quast uses the word "contig"?
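One of the headline values is N50. If you want to sanity-check your understanding of it, the rough sketch below (assuming your scaffolds are in results/scaffolds.fasta) computes N50 directly from the FASTA and should agree with the value in Quast's report:
# Length of each sequence in the FASTA, longest first:
awk '/^>/ { if (len) print len; len = 0; next } { len += length($0) } END { if (len) print len }' results/scaffolds.fasta | sort -rn > tmp/scaffold_lengths.txt
# Walk down the sorted lengths until half the total assembly length is covered; that length is the N50:
awk '{ lens[NR] = $1; total += $1 } END { cum = 0; for (i = 1; i <= NR; i++) { cum += lens[i]; if (cum >= total / 2) { print "N50:", lens[i]; exit } } }' tmp/scaffold_lengths.txt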
Perhaps we have prior knowledge about the expected %GC content, number of chromosomes, and total genome size; these expectations can be compared with the statistics in Quast's report.
Unfortunately, with many of the simple metrics, it is difficult to determine whether the assembler did things correctly, or just haphazardly stuck lots of reads together!
We probably have other prior information about what to expect in this genome. For example:
Pushing this idea further, we can expect a genome to contain a single copy of each of the "house-keeping" genes found in related species. This is applied in BUSCO (Benchmarking Universal Single-Copy Orthologs). Note that:
It is very important to understand the concepts underlying these different approaches.
Try to figure out the tradeoffs between de Bruijn graph and overlap-layout-consensus assembly approaches.
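As a concrete starting point, recall that de Bruijn graph assemblers first break every read into k-mers and build their graph from those, whereas overlap-layout-consensus assemblers align whole reads against each other. A toy illustration of k-mer decomposition (sequence and k chosen arbitrarily):
# Print the 4-mers of a short toy sequence; a de Bruijn assembler builds its graph from such k-mers
read_seq=ACGTACGT
k=4
for i in $(seq 0 $(( ${#read_seq} - k ))); do echo "${read_seq:$i:$k}"; done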