
Part 2: Genome assembly

You need to have gone through Part 1: Read cleaning before starting this practical.

Offline exercise

Find (or make) some friends; find a table. In groups of 4 or 5, ask an assistant for an assembly task.

Brief assembly example / concepts

Many different pieces of software exist for genome assembly. We will be using SOAPdenovo.

Create a new input/02-assembly directory and link the output from yesterday’s practical into it. Make a new results/02-assembly directory. Create a link between input/02-assembly and results/02-assembly/input.
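One possible way of setting this up (a sketch: the name of yesterday's read-cleaning directory is an assumption, so adjust the paths to your own layout):

cd ~/2017-09-29-reference_genome
mkdir -p input/02-assembly results/02-assembly
# Link yesterday's cleaned reads (adjust the source paths as needed)
ln -s ~/2017-09-28-read_cleaning/results/reads.pe1.clean.fq input/02-assembly/
ln -s ~/2017-09-28-read_cleaning/results/reads.pe2.clean.fq input/02-assembly/
# Make the input visible from within the results directory
ln -s ../../input/02-assembly results/02-assembly/input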

To assemble our cleaned reads with SOAPdenovo, we create a soap_config.txt file containing the following:

max_rd_len=101          # maximal read length
[LIB]            # One [LIB] section per library
avg_ins=470             # average insert size
reverse_seq=0           # if sequence needs to be reversed
asm_flags=3             # use for contig building and subsequent scaffolding
rank=1                  # in which order the reads are used while scaffolding
q1=input/reads.pe1.clean.fq
q2=input/reads.pe2.clean.fq

Then run the following command. THIS IS RAM-INTENSIVE: your computer may struggle.

soapdenovo2-63mer all -s soap_config.txt -K 63 -R -o assembly

Like any other assembler, SOAPdenovo creates many files, including an assembly.scafSeq file that is likely to be used for follow-up analyses. Why does this file contain so many NNNN sequences?

There are many other genome assembly approaches. While waiting for everyone to make it to this stage, try to understand some of the challenges of de novo genome assembly and the approaches used to overcome them via the following papers:

At home: what are the tradeoffs between a de Bruijn graph assembly approach and an overlap-layout-consensus assembly approach?

Quality assessment

How do we know if our genome is good?

“… the performance of different de novo genome assembly algorithms can vary greatly on the same dataset, although it has been repeatedly demonstrated that no single assembler is optimal in every possible quality metric [6, 7, 8]. The most widely used metrics for evaluating an assembly include 1) contiguity statistics such as scaffold and contig N50 size, 2) accuracy statistics such as the number of structural errors found when compared with an available reference genome (GAGE (Genome Assembly Gold Standard Evaluation) evaluation tool [8]), 3) presence of core eukaryotic genes (CEGMA (Core Eukaryotic Genes Mapping Approach) [9]) or, if available, transcript mapping rates, and 4) the concordance of the sequence with remapped paired-end and mate-pair reads (REAPR (Recognizing Errors in Assemblies using Paired Reads) [10], assembly validation [11], or assembly likelihood [12]).” - Wences & Schatz (2015)

Simple metrics

Assembly software will generally provide some statistics about what it did, but the output formats differ between assemblers. Quast, the Quality Assessment Tool for Genome Assemblies, creates a standardized report. Run Quast (quast.py) on the assembly.scafSeq file. No special options are needed: just the simple scenario to get some statistics.
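A minimal invocation might look like this (the output directory name is just a suggestion):

# No reference genome and no gene annotations: just basic assembly statistics
quast.py assembly.scafSeq -o quast_results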

Have a look at the report (PDF or HTML) that Quast generated.

What do the values in the table mean? For which ones is higher better, and for which ones is smaller better? Is Quast’s use of the word “contig” appropriate?

Perhaps we have prior knowledge about the GC content, the number of chromosomes, and the total genome size to expect; these can inform comparisons with the corresponding statistics in Quast’s report.

Biologically meaningful measures

Unfortunately, with many of the simple metrics, it is difficult to determine whether the assembler did things correctly, or just haphazardly stuck lots of reads together!

We probably have other prior information about what to expect in this genome. For example:

  1. If we have a reference assembly from a not-too-distant relative, we could expect large parts of the genome to be organised in the same order, i.e., synteny.
  2. If we independently created a transcriptome assembly, we can expect the exons making up each transcript to map sequentially onto the genome (see TGNet for an implementation).
  3. We can expect the different scaffolds in the genome to have a unimodal distribution in sequence read coverage. Similarly, one can expect GC% to be unimodally distributed among scaffolds. Using this idea, the Blobology approach determined that evidence of foreign sequences in tardigrades is largely due to extensive contamination rather than extensive horizontal gene transfer (Koutsovoulos et al. 2016).
  4. We can expect different patterns of gene content and structure between eukaryotes and prokaryotes.
  5. Pushing this idea further, we can expect a genome to contain a single copy of each of the “house-keeping” genes found in related species. This is applied in BUSCO (Benchmarking Universal Single-Copy Orthologs); a hedged example run is sketched after this list. Note that:
    • BUSCO is a refined, modernized implementation of CEGMA (Core Eukaryotic Genes Mapping Approach). CEGMA examines a eukaryotic genome assembly for presence and completeness of 248 “core eukaryotic genes”.
    • Quast also includes a “quick and dirty” method of finding genes.
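An example BUSCO run could look like the following. This is a sketch: flag syntax differs between BUSCO versions (older releases use run_BUSCO.py), and the lineage dataset named here is an assumption, so pick the one closest to your species.

# BUSCO v5-style invocation; eukaryota_odb10 is an assumed lineage dataset
busco -i assembly.scafSeq -l eukaryota_odb10 -m genome -o busco_assembly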

It is very important to understand the concepts underlying these different approaches.

Gene prediction

Many tools exist for gene prediction: some are based on ab initio statistical models of what a protein-coding gene should look like, others use similarity with protein-coding genes from other species, and others (such as Augustus and SNAP) use both. There is no perfect tool or approach, so we typically run several gene-finding tools and call a consensus between the different predicted gene models. MAKER and JAMg can do this for us. Let’s use MAKER on a sandbox example.

Start in a new directory (e.g., ~/2017-09-29-reference_genome/results/03-gene_prediction). Pull out the longest few scaffolds from the assembly.scafSeq (e.g., using seqtk seq -L 20000) into their own fasta (e.g., min20000.fa).
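For example (a sketch assuming the assembly from the previous step lives in results/02-assembly; adjust the path to your own layout):

cd ~/2017-09-29-reference_genome/results/03-gene_prediction
# Keep only scaffolds of at least 20,000 bp
seqtk seq -L 20000 ../02-assembly/assembly.scafSeq > min20000.fa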

Running maker -OPTS will generate an empty maker_opts.ctl configuration file (ignore the warning). Edit that file to specify:

For a real project, we would include RepeatMasker (perhaps after creating a new repeat library), we would provide as much relevant information as possible (e.g., RNAseq read mappings, transcriptome assembly – both improve gene prediction performance tremendously), and iteratively train gene prediction algorithms for our data including Augustus and SNAP.

Run maker maker_opts.ctl. This may take a few minutes, depending on how much data you gave it. Once it’s done, the results will be hidden in subdirectories of min20000.maker.output/min20000_datastore (both named after your input fasta). It is perhaps easiest to find the gene predictions using find, then grep for gff or proteins, as sketched below. You can ignore the (temporary) contents of the theVoid directories.
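One way of doing this (a sketch; the output directory name follows MAKER’s input-fasta naming, so it may differ on your machine):

# List GFF and fasta outputs, skipping MAKER's temporary theVoid directories
find min20000.maker.output -type f \( -name '*.gff' -o -name '*fasta' \) | grep -v theVoid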

Quality control of individual genes

So now we have some gene predictions… how can we know if they are any good? The easiest way to get a feel for this is by comparing a few of them (backup examples) to known sequences from other species. For this, launch a local BLAST server to compare a few of your protein-coding gene predictions to the high-quality predictions in SwissProt:

# Download the SwissProt database and format it as a BLAST database
cd PATH_WHERE_I_WANT_TO_HAVE_DATABASES

wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
gunzip uniprot_sprot.fasta.gz
makeblastdb -in uniprot_sprot.fasta -dbtype prot -parse_seqids

# Run the BLAST server
sequenceserver -d PATH_TO_BLAST_DATABASE_DIRECTORY

Do any of the gene predictions have significant similarity to known sequences? For a given gene prediction, do you think it is complete, or can you infer from the BLAST alignments that something may be wrong?


As you can see, gene prediction software is imperfect – this is even the case when using all available evidence. This is potentially costly for analyses that rely on gene predictions, i.e., many of the analyses we might want to do!

“Incorrect annotations [i.e., gene identifications] poison every experiment that makes use of them. Worse still, the poison spreads.” - Yandell & Ence (2012)

The GeneValidator tool can help evaluate the quality of a gene prediction by comparing its features to those of similar database sequences. This approach expects, for example, that similar sequences will be of similar length.

You can simply run genevalidator -d ~/2017-09-BIO721_genome_bioinformatics_input/reference_databases/uniprot/uniprot_sprot.fasta proteins.fasta (on your gene predictions, or these examples), or use the web service for queries of a few sequences. Alternatively, just check the screenshots linked in the next sentence. Try to understand why some gene predictions give no reason for concern (e.g.), while others do (e.g.).
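For readability, the same command laid out as a block (the database path is from this course’s setup; point it at wherever your copy of uniprot_sprot.fasta lives):

genevalidator \
  -d ~/2017-09-BIO721_genome_bioinformatics_input/reference_databases/uniprot/uniprot_sprot.fasta \
  proteins.fasta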

Comparing whole genesets & prioritizing genes for manual curation

GeneValidator’s visual output can be handy when looking at a few genes. But the tool also provides tab-delimited output, which is handy when working on the command line or when running the software on whole proteomes. For example, it can help prioritize the weakest gene models for manual curation, as sketched below:
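The file name and column layout below are assumptions (check the header of your own output file; here we pretend column 1 holds the gene identifier and column 2 GeneValidator’s overall score):

# Hypothetical example: list the 20 lowest-scoring gene predictions
# (assumes column 2 holds the overall score; verify against your file's header)
tail -n +2 genevalidator_results.tsv | sort -t$'\t' -k2,2n | head -n 20 | cut -f1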

Manual curation

Because automated gene predictions aren’t perfect, manual inspection and fixing are often required. The most commonly used software for this is Apollo/WebApollo. In the following practical, we will be using a different, Apollo-like curation tool (Afra) to edit gene predictions, e.g., adding or removing exons, merging or splitting gene models, and adjusting exon boundaries.