Identify problems with predicted genes
This is a NESCent project, funded by Google during Google Summer of Code 2013 and mentored by Anurag Priyam and Yannick Wurm.
Resources
Summary
1) Length validation via clustering
2) Length validation via ranking
3) Duplication
4) Gene merge
5) Multiple alignment - for proteins only
6) BLAST reading frame - for nucleotides only
7) Open reading frame - for nucleotides only
Genome sequencing is now possible at almost no cost. However, obtaining accurate gene predictions remains hard to achieve with existing technology.
We designed a tool that identifies problems with gene predictions, based on similarities with data from public databases (e.g. SwissProt and UniProt). We apply a set of validation tests that provide useful information about the problems that appear in the predictions, in order to give evidence about how gene curation should be done or whether a certain predicted gene should be left out of further analyses.
Our main target users are biologists who have large amounts of genetic data that must be manually curated; they can use our tool to prioritize the genes that need to be checked and to become aware of possible error causes.
In order to highlight the problems that appear in the predictions and suggest possible error causes, we developed seven validation tests:
1) Length validation via clustering
2) Length validation via ranking
3) Duplication
4) Gene merge
5) Multiple alignment
6) BLAST reading frame - applicable only to nucleotide sequences
7) Open reading frame - applicable only to nucleotide sequences
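As an illustration of what the last check conceptually looks for, the sketch below (NOT GeneValidator's actual implementation; the sequence and the scoring are made up for this example) finds the longest stop-free stretch of codons in each of the three forward reading frames of a nucleotide query:

```ruby
# Toy illustration only: longest run of non-stop codons per forward frame.
STOP_CODONS = %w[TAA TAG TGA]

def longest_orf_codons(seq, frame)
  codons = seq[frame..-1].scan(/.{3}/)
  best = run = 0
  codons.each do |codon|
    if STOP_CODONS.include?(codon)
      run = 0                      # a stop codon ends the current ORF
    else
      run += 1
      best = run if run > best
    end
  end
  best
end

seq = 'ATGAAATTTTAAATGCCC'
lengths = (0..2).map { |f| longest_orf_codons(seq, f) }
puts lengths.inspect   # => [3, 4, 5]
```

A prediction whose main ORF does not cover most of the query would raise a flag in a check of this kind.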
Data needed for each validation is retrieved from the BLAST output and used as follows (the aliases are GeneValidator's CLI names; the column tags are those used in BLAST's `-outfmt` argument; x = mandatory parameter):

| Validation | Alias | sseqid | sacc | slen | qstart | qend | sstart | send | length | qseq | sseq | qframe | Query raw seq | Hit raw seq |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Length by clustering | lenc | x |  | x |  |  |  |  |  |  |  |  |  |  |
| Length by rank | lenr | x |  | x |  |  |  |  |  |  |  |  |  |  |
| Reading Frame | frame | x |  |  |  |  |  |  |  |  |  | x |  |  |
| Gene merge | merge | x |  |  | x | x | x | x |  |  |  |  |  |  |
| Duplication | dup | x |  | x |  |  | x | x | x | x | x |  | x | x |
| Open Reading Frame | orf | x |  |  |  |  |  |  |  |  |  |  | x |  |
| Multiple align. based | align | x | x |  |  |  |  |  |  |  |  |  | x | x |
Each validation is described further in this section.
Linux and MacOS are officially supported!
$ git clone git@github.com:monicadragan/GeneValidator.git
$ sudo rake
$ genevalidator [validations] [skip_blast] [start] [tabular] [mafft] [raw_seq] FILE
Example that runs all validations on a set of ant gene predictions:
$ genevalidator -x data/prot_Solenopsis.xml data/prot_Solenopsis.fa
To learn more:
$ genevalidator -h
By running GeneValidator on your dataset you get numbers and plots. The relevant output files are generated alongside the input file. The results are available in 3 formats:
! Note: for the moment, view the HTML output in Firefox only !
(our d3.js plots cannot be displayed by other browsers without a local web server)
5.1 Quickly test the tool or validate a small set of predictions (internet connection required).
$ genevalidator predictions.fasta
What happens behind the scenes: for each query, BLAST hits and raw sequences are retrieved from remote databases and analysed.
5.2 Precalculate the input files
5.2.1 Precalculate the BLAST output
If you have more than ~100 queries and access to a more powerful machine, pre-calculate the BLAST output file there (e.g. overnight). Use blastp for protein queries and blastx for nucleotide queries.
a) If you have time and a really powerful machine, you can use the BLAST XML output. This saves time when running GeneValidator (because the local alignments are precalculated with BLAST).
Example:
# take blast output for protein queries (for mrna queries use blastx)
$ blastp -query predictions.fasta -db /path/to/db/nr/nr -outfmt 5 -max_target_seqs 200 -num_threads 48 -out predictions_blast.xml
# run validations
$ genevalidator -x predictions_blast.xml predictions.fasta
b) Take the BLAST tabular output (customized so that it contains all the necessary columns -- see the table of data needed for each validation in Section B)
Example:
# take blast output for protein queries
$ blastp -query predictions.fasta -db /path/to/db/nr/nr -outfmt "6 qseqid sseqid sacc slen qstart qend sstart send length qframe pident evalue" -max_target_seqs 200 -num_threads 48 -out predictions_blast.tab
# run validations
$ genevalidator -x predictions_blast.tab -t "qseqid sseqid sacc slen qstart qend sstart send length qframe pident evalue" predictions.fasta
5.2.2 Precalculate the raw sequences of the first n hits from a local BLAST database, using the identifiers from your BLAST output
$ get_raw_sequences BLAST_OUTPUT_FILE -o OUTPUT_FILE [-t TABULAR_FORMAT] [-d DATABASES] [-n NR]
Example:
# get raw sequences from tabular BLAST output
$ get_raw_sequences predictions_blast.tab -o predictions.raw_seq -t "qseqid sseqid sacc slen qstart qend sstart send pident length qframe evalue" -d path/to/ncbi/blast/db/nr/nr
# get raw sequences from xml BLAST output
$ get_raw_sequences predictions_blast.xml -o predictions.raw_seq -d path/to/ncbi/blast/db/nr/nr
# run validations
$ genevalidator -x data/all_validations_prot/all_validations_prot.tab data/all_validations_prot/all_validations_prot.fasta -t "qseqid sseqid sacc slen qstart qend sstart send pident length qframe evalue" -r data/all_validations_prot/all_validations_prot.xml.raw_seq
5.3 You are interested in certain validations only (say, the two length validations and the duplication check). Look up the CLI names of these validations in the table in Section B and pass them as an argument, so that GeneValidator runs only those 3:
$ genevalidator -x predictions_blast.xml predictions.fasta -v 'lenc lenr dup'
The code architecture is designed so that it is easy to add new validations. There is a 'template' that the code of a new validation must fit, discussed below:
Extend 2 parent classes
Initialize internal variables (which initially have default values)
Add the new validation to the tool
Add the validation to the validations list, which is further processed for yaml/html/console visualization:

```ruby
# context code
validations = []
validations.push LengthClusterValidation.new(@type, prediction, hits, plot_path, plots)
...
## what you need to add is this line, which appends your validation class to
## the list of validations. Your class must extend ValidationTest.
validations.push YourValidationClass.new(type, prediction, hits, other_arguments)

# context code
# check the class type of the elements in the list;
# this raises an error if YourValidationClass does not extend ValidationTest
validations.map do |v|
  raise ValidationClassError unless v.is_a? ValidationTest
end

# run validations
validations.map { |v| v.run }

# check the class type of the validation reports; this raises an error if the
# run method of YourValidationClass does not return a ValidationReport
validations.map do |v|
  raise ValidationClassError unless v.validation_report.is_a? ValidationReport
end
```
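The type-check pattern above can be exercised standalone. In the sketch below, ValidationTest, ValidationReport and ValidationClassError are stubbed out so that the snippet runs on its own; in GeneValidator these classes are provided by the tool, and AlwaysPassValidation is a made-up toy validation:

```ruby
# Stubs standing in for GeneValidator's classes (illustration only).
class ValidationTest
  attr_reader :validation_report
  def run
    raise NotImplementedError
  end
end

class ValidationReport
  attr_reader :message
  def initialize(message)
    @message = message
  end
end

class ValidationClassError < StandardError; end

# A toy validation that always passes.
class AlwaysPassValidation < ValidationTest
  def run
    @validation_report = ValidationReport.new('yes')
  end
end

validations = [AlwaysPassValidation.new]

# the same checks GeneValidator performs on the validations list
validations.each { |v| raise ValidationClassError unless v.is_a? ValidationTest }
validations.each(&:run)
validations.each do |v|
  raise ValidationClassError unless v.validation_report.is_a? ValidationReport
end

puts validations.first.validation_report.message   # => yes
```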
Add plots
A. Generate a json file according to the type of plot:
1) (clustered) bar plot, main cluster in red:
[[{"key":109,"value":1,"main":false}], [{"key":122,"value":32,"main":true},{"key":123,"value":33,"main":true}], [...]]
2) scatter plot:
[{"x":1,"y":626},{"x":1,"y":626}...]
3) lines:
[{"y":1,"start":0,"stop":356,"color":"black"},{"y":1,"start":40,"stop":200,"color":"red"}...]
B. Create a plot object, according to the plot type:

```ruby
plot1 = Plot.new(filename,
                 plot_type,       # :scatter, :line or :bar
                 plot_title,
                 legend,          # see ** below
                 xTitle,
                 yTitle,
                 aux_parameter1,  # see * below
                 aux_parameter2)  # see * below
```

** legend is a String that _must_ have the following format: "item, color1; item with spaces, color2"
* the auxiliary parameters have so far been used to pass the length of the prediction (length histogram), the slope of the regression line (gene merge validation), or the number of lines (line plots)
C. Add the plot object to the plot_files array:

```ruby
@validation_report.plot_files.push(plot1)
```
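For example, the clustered-bar JSON from step A could be assembled like this (the {length => count} cluster histograms here are made-up illustration data, not output from the tool):

```ruby
require 'json'

# Build the clustered-bar JSON format from simple {length => count}
# histograms, one hash per cluster.
clusters   = [{ 109 => 1 }, { 122 => 32, 123 => 33 }]
main_index = 1   # index of the main cluster (drawn in red)

data = clusters.each_with_index.map do |hist, i|
  hist.map { |len, count| { key: len, value: count, main: i == main_index } }
end

json = JSON.generate(data)
puts json
```

Writing the resulting string to a file gives the input expected by the bar-plot renderer.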
We support the correctness of our approach by comparing the validation results of two releases of a genome project and by comparing validations of genes from a weak and a strong gene database. The results show higher prediction quality in the more recent release (which went through an intensive curation process) and for genes from the stronger database.
Comparison between two versions of Honey Bee [5]
The score differences between corresponding sequences from two versions of the Apis mellifera (honey bee) genome (versions 1.0 and 3.2) are presented in Figure 6. About 70% of the predictions in the most recent version improved after curation / genome reassembly (i.e. have a better validation score than in the initial release).
Figure 6: Score differences Honey Bee 1.0 and 3.2
Comparison between a weak and a strong database
The plot in Figure 7 shows the score quality (maximum score is 100) for 1,000 random genes from the Swiss-Prot database, in red (a strong database, with manually annotated and reviewed sequences), and 1,000 random genes from TrEMBL, in blue (a weak database, whose genes are automatically annotated and not reviewed). Figure 8 presents a histogram of the number of passed validations, broken down by individual validation (same color scheme).
Figure 7: Comparative outputs - overall score Figure 8: Comparative outputs - score by validation
Some validation outputs for protein and mRNA are available here:
http://swarm.cs.pub.ro/~mdragan/gsoc2013/genevalidator