Genome sequencing is now possible at very low cost. However, obtaining accurate gene predictions remains hard to achieve with existing technology.

We designed a tool that identifies problems with gene predictions based on their similarity to sequences in public databases (e.g. Swiss-Prot and UniProt). It applies a set of validation tests that report the problems found in each prediction, providing evidence for how the gene should be curated or whether a predicted gene should be excluded from further analyses.

Our main target users are biologists who have large amounts of genetic data that must be manually curated. They can use our tool to prioritize the genes that need to be checked and to learn about possible causes of error.

B. Validations

To highlight the problems that appear in the predictions and suggest possible error causes, we developed seven validation tests:

1) Length validation via clustering

2) Length validation via ranking

3) Duplication

4) Gene merge

5) Multiple alignment

6) BLAST reading frame - applicable only to nucleotide sequences

7) Open reading frame - applicable only to nucleotide sequences

The data needed for each validation is retrieved from the BLAST output and used as follows (aliases and tags are those used in the BLAST ‘outfmt’ argument):

x = mandatory parameter

Blast param (table columns): sseqid, sacc, slen, qstart, qend, sstart, send, length, qseq, sseq, qframe, Query raw seq, Hit raw seq

Validation (alias), one table row each:

Length by clustering (lenc)
Length by rank (lenr)
Reading Frame (frame)
Gene merge (merge)
Duplication (dup)
Open Reading Frame (orf)
Multiple align. based (align)
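As an illustration of how these fields could be requested and read, the following is a minimal sketch that assumes BLAST was run with tabular output (format 6). The field names are standard BLAST+ format specifiers, but the column order, the added `qseqid` column and the helper `parse_blast_tabular` are assumptions made for this example, not a description of how the tool itself reads BLAST output. In format 6, `qseq` and `sseq` are the aligned parts of the sequences; the raw query and hit sequences have to be obtained separately.

```python
# Column order is an assumption for this sketch; qseqid is added so that
# each row can be tied back to its query.
FIELDS = ["qseqid", "sseqid", "sacc", "slen", "qstart", "qend",
          "sstart", "send", "length", "qframe", "qseq", "sseq"]
INT_FIELDS = {"slen", "qstart", "qend", "sstart", "send", "length", "qframe"}

# Value for the BLAST command line, e.g.: blastp ... -outfmt "6 qseqid sseqid ..."
OUTFMT = "6 " + " ".join(FIELDS)


def parse_blast_tabular(text):
    """Parse tab-separated BLAST output into a list of dicts, one per row."""
    rows = []
    for line in text.splitlines():
        # Skip blank lines and the comment lines produced by format 7.
        if not line.strip() or line.startswith("#"):
            continue
        values = line.split("\t")
        row = dict(zip(FIELDS, values))
        for name in INT_FIELDS:
            row[name] = int(row[name])
        rows.append(row)
    return rows


if __name__ == "__main__":
    sample = "q1\tsp|P1|X\tP1\t120\t1\t100\t5\t104\t100\t0\tMKV\tMKV\n"
    for hit in parse_blast_tabular(sample):
        print(hit["sacc"], hit["slen"], hit["qstart"], hit["qend"])
```

Each validation would then pick only the columns it needs from these dictionaries.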

Each validation is described further in this section.

1) Length validation via clustering

Error causes

sequencing error: some parts of the gene were lost or added during sequencing, or the gene boundaries were not well estimated
the gene was expressed at a low level
the sequenced mRNA incorrectly contains some introns

Input data

lengths of the hits: retrieved by parsing the BLAST output file
length of the prediction: the length of the current query in the FASTA file. For nucleotide sequences, we are interested in the length of the query translated into protein (the length of the query divided by 3)
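The two inputs can be sketched as follows. The function names and the `seq_type` argument are hypothetical. Rounding down when the nucleotide length is not a multiple of 3 is an assumption, as is taking the hit length from the `slen` field.

```python
def prediction_length(query_seq, seq_type):
    """Return the length of the prediction in amino acids.

    seq_type is "protein" or "nucleotide" (hypothetical labels).
    """
    if seq_type == "nucleotide":
        # Three nucleotides encode one amino acid; floor division is an
        # assumption for lengths that are not a multiple of 3.
        return len(query_seq) // 3
    return len(query_seq)


def hit_lengths(hits):
    """Collect the hit lengths, assuming each hit is a dict with an 'slen' key."""
    return [hit["slen"] for hit in hits]


if __name__ == "__main__":
    print(prediction_length("ATGAAATAG", "nucleotide"))
    print(hit_lengths([{"slen": 120}, {"slen": 118}]))
```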

Class information:

header: Length Cluster
short description: Check whether the prediction length fits most of the BLAST hit lengths, using 1D hierarchical clustering. Meaning of the output displayed: Prediction_len [Main Cluster Length Interval]
alias: lenc

Workflow

Aim: we want to find out whether the length of the predicted sequence belongs to the distribution of the hit lengths (in other words, how close the length of the prediction is to the majority of the hit lengths).

Plotting the histogram of the hit lengths shows that the distribution does not follow a bell curve (see Figure 1); therefore we cannot apply the classical t-test.

Figure 1: histogram of the length distribution of the hits.

Our approach to finding the majority of lengths among the reference lengths uses a typical hierarchical clustering:

First, we assume that each length belongs to a separate cluster. At each step we merge the two closest clusters, until we obtain a cluster that contains more than 50% of the reference sequences. Each cluster is shown in a different color in Figure 2. The one colored in red is called the “main cluster”.

Figure 2: clusters of hit lengths, each in a different color; the main cluster is in red.
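The merging procedure described above can be sketched as follows. The text does not say how the distance between two clusters is measured; this sketch assumes the difference between cluster means, which is only one possible choice. The function name `main_cluster` is hypothetical. Because the data is one-dimensional and kept sorted, only neighbouring clusters need to be compared.

```python
def main_cluster(lengths):
    """Merge the closest clusters until one holds more than 50% of the lengths.

    Returns the lengths in that cluster, sorted in ascending order.
    """
    if not lengths:
        raise ValueError("at least one hit length is required")

    # Start with one cluster per length, in sorted order.
    clusters = [[value] for value in sorted(lengths)]
    total = len(lengths)

    while True:
        biggest = max(clusters, key=len)
        if len(biggest) > total / 2:
            return biggest

        # Distance between neighbours = difference of cluster means (assumption).
        means = [sum(c) / len(c) for c in clusters]
        i = min(range(len(clusters) - 1), key=lambda k: means[k + 1] - means[k])

        # Merge the closest pair of neighbouring clusters.
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]


if __name__ == "__main__":
    hits = [100, 102, 98, 101, 250, 400, 99]
    cluster = main_cluster(hits)
    print(cluster, min(cluster), max(cluster))
```

For the seven lengths in the usage example, the two outliers (250 and 400) stay in their own clusters and the main cluster covers the interval from 98 to 102.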

Finally, we check whether the length of the prediction belongs to the main cluster of lengths. The validation test passes if it does (see Figure 3) and fails otherwise (Figure 4).

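Figure 3: length distribution histogram, prediction passes the validation test.

Figure 4: length distribution histogram, prediction fails the validation test.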
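The final check can be sketched as follows, assuming that "belongs to the main cluster" means that the prediction length falls within the interval spanned by the shortest and longest length in that cluster. The function name and the exact layout of the output string are hypothetical; the string only mirrors the "Prediction_len [Main Cluster Length Interval]" description given above.

```python
def validate_length(prediction_len, main_cluster_lengths):
    """Return (passed, message) for the length-by-clustering test.

    main_cluster_lengths: the hit lengths that form the main cluster.
    """
    low = min(main_cluster_lengths)
    high = max(main_cluster_lengths)
    # Interval membership is an assumed reading of "belongs to the main cluster".
    passed = low <= prediction_len <= high
    message = f"{prediction_len} [{low} - {high}]"
    return passed, message


if __name__ == "__main__":
    print(validate_length(100, [98, 99, 100, 101, 102]))
    print(validate_length(250, [98, 99, 100, 101, 102]))
```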

Plots

length distribution histogram (prediction passes the validation test in Figure 3 and fails in Figure 4)