Ants, Bees, Genomes & Evolution @ Queen Mary University London




GeneValidator [ Source code ]   [ Demo server ]   [ Publication ]

GeneValidator is a tool to identify problematic gene predictions based on comparisons between gene predictions and similar sequences in public databases (e.g., SwissProt). Funded by NESCent Google Summer of Code 2013 and BBSRC TRDF.

GeneValidator works locally from the command line or from a web browser (suitable for <100 queries). Either amino acid sequences or nucleotide sequences (e.g., putative CDS) can be used as input. Output formats include an HTML report and plain, parseable text (JSON and CSV). Some examples are shown below.

Description                                                      Input   Output*
Nucleotide sequences (picked from European Nucleotide Archive)   FASTA   HTML, JSON, CSV, Summary CSV
Amino acid sequences (picked from ongoing genome projects)       FASTA   HTML, JSON, CSV, Summary CSV

* Generated using GeneValidator 2.1.5, using SwissProt database downloaded on 17th August, 2018 as reference.

Installation and usage

To install GeneValidator on a Unix-based system (e.g., Linux or macOS), please run the following in the terminal:

sh -c "$(curl -fsSL https://install-genevalidator.wurmlab.com)"

Please see this page for more information on installation and usage.

Alternatively, we host a web server appropriate for < 10 query sequences at a time.

Mini-tutorial for interactive use on one or few gene predictions

Despite recent improvements in genome sequencing and gene prediction technologies, many gene predictions remain problematic. GeneValidator can help assess the quality of a large set of gene predictions, and it can also be used to examine individual sequences. Here we focus on the latter.

  1. Take one or several gene predictions in FASTA format, either protein sequences (e.g., A, B, C) or nucleotide sequences (e.g., D).
  2. Go to the GeneValidator web app.
  3. Paste your gene prediction into the text field and click the Analyse Sequences button.
    GeneValidator will BLAST your gene prediction against a database (default: SwissProt), and perform multiple comparisons between your gene prediction and the sequences in the database. This should take less than 2 minutes.
  4. Examine the output report:
    • GeneValidator will only report results if it identified sufficient similar sequences in the database.
    • Each test result is shown in a different column. The question mark buttons provide details about the test.
    • Each test result is accompanied by an indication of consistency between the gene prediction and the BLAST hits.
    • GeneValidator produces visual graphs to help understand the characteristics of the gene prediction data.
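The idea of checking a gene prediction for consistency with its BLAST hits can be illustrated with a length comparison. The following is a minimal sketch and not GeneValidator's actual implementation: the function name `length_consistency`, the 25% tolerance, and the minimum of five hits are all assumptions chosen for illustration.

```python
from statistics import median


def length_consistency(query_len, hit_lens, tolerance=0.25, min_hits=5):
    """Compare a query length with the median length of its database hits.

    Returns (verdict, deviation), where deviation is the relative difference
    between the query length and the median hit length. The tolerance and
    min_hits defaults are illustrative assumptions, not GeneValidator values.
    """
    # Mirror the idea that nothing is reported without enough similar sequences.
    if len(hit_lens) < min_hits:
        return "not enough hits", None

    med = median(hit_lens)
    deviation = (query_len - med) / med

    if deviation > tolerance:
        verdict = "longer than most hits"
    elif deviation < -tolerance:
        verdict = "shorter than most hits"
    else:
        verdict = "consistent with hits"
    return verdict, deviation


if __name__ == "__main__":
    hits = [400, 420, 450, 430, 410]
    print(length_consistency(900, hits))
    print(length_consistency(425, hits))
```

A query that is far longer than the median hit would be flagged here, which is the kind of signal that the length-based graphs in the report help to visualise.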

What can you conclude regarding your query gene prediction sequences? For the example sequences A to D given above, the following notes (listed in the same order) can help you understand GeneValidator's output:

  1. There appear to be several problems with this gene:
    • It is longer than most BLAST hits (see Length Cluster graph)
    • Each BLAST hit aligns either to the first part or to the second part of the query sequence (see second Gene Merge graph). This (along with the first Gene Merge graph) suggests the query may be a fusion of two genes (this happens occasionally for tandem genes).
  2. There is no evidence of any problems with this gene.
  3. A region of the gene aligns multiple times to a single BLAST hit, as indicated by the duplication result. This suggests that our query gene prediction may include a single exon twice (e.g., as a result of prediction software incorrectly merging tandem (adjacent) duplicated gene copies into a single prediction).
  4. This sequence likely contains a frameshift. This is indicated by BLAST hits not all aligning in a single reading frame and by the presence of two main open reading frames.
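The second frameshift signal mentioned above, the presence of two main open reading frames, can be sketched in a few lines. This is a rough illustration under stated assumptions, not GeneValidator's method: it scans only the three forward frames, treats any stop-free stretch as "open" without requiring a start codon, and uses hypothetical thresholds (`min_fraction`, `max_overlap`). It does not examine the reading frames of BLAST hit alignments.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def longest_open_stretch_per_frame(seq):
    """Return {frame: (start, end)} for the longest stop-free stretch
    in each of the three forward reading frames (end is exclusive)."""
    seq = seq.upper()
    best = {}
    for frame in range(3):
        best_span = (frame, frame)
        run_start = frame
        pos = frame
        while pos + 3 <= len(seq):
            if seq[pos:pos + 3] in STOP_CODONS:
                if pos - run_start > best_span[1] - best_span[0]:
                    best_span = (run_start, pos)
                run_start = pos + 3  # next stretch begins after the stop
            pos += 3
        # The final stretch may run to the end of the sequence.
        if pos - run_start > best_span[1] - best_span[0]:
            best_span = (run_start, pos)
        best[frame] = best_span
    return best


def frameshift_hint(seq, min_fraction=0.3, max_overlap=0.5):
    """Return True if two frames each hold a long open stretch and those
    stretches cover largely different parts of the sequence.
    Both thresholds are illustrative assumptions."""
    if not seq:
        return False
    min_len = min_fraction * len(seq)
    spans = longest_open_stretch_per_frame(seq).values()
    long_spans = [s for s in spans if s[1] - s[0] >= min_len]
    for i, (s1, e1) in enumerate(long_spans):
        for s2, e2 in long_spans[i + 1:]:
            overlap = max(0, min(e1, e2) - max(s1, s2))
            shorter = min(e1 - s1, e2 - s2)
            if overlap <= max_overlap * shorter:
                return True
    return False


if __name__ == "__main__":
    unit = "CTAACTGAC"  # open in frame 0, stops in the other two frames
    intact = unit * 20
    shifted = unit * 10 + "G" + unit * 10  # single-base insertion
    print(frameshift_hint(intact), frameshift_hint(shifted))
```

In the toy example, the single inserted base moves the second half of the sequence into a different frame, so two long open stretches appear in different frames. Real sequences are noisier, and a positive result here is only a hint that warrants closer inspection.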