Ants, Bees, Genomes & Evolution @ Queen Mary University London

BLAST woes: Refined nucleotide blast matrix

A colleague has modified genomic fire ant sequences. For most of them, BLAST will not turn up a homologue.

This is very probably linked to the fact that her sequences are modified: the bisulfite treatment she’d applied to them changes non-methylated Cs to Ts. So her sequences are mainly made up of As, Gs, and Ts, the Ts being way over-represented.

blastn is not good at finding these sequence’s homologues:

Poking around on the internet for alternatives did not turn anything up, so I asked on a bioinformatics.org mailing list. Peter Rice at EBI summed up the workaround hack:

  1. Build your blast database as protein

  2. Give your matrix a name that matches one of the blast protein matrix names (!)

  3. Put in the matrix values you want

  4. Remember that you are now using blastp (protein search) so you can only use a short wordsize - I am guessing you have short sequences anyway so this should not be a problem

  5. Remember that BLAST does local alignment.

  6. Remember that your scores will be making some wrong assumptions about using proteins. You should still find the hits you are looking for.

Contacting NCBI confirmed this… Wayne Matten pointed me towards a METHODS paper describing The Use of BlastP For Nucleic Acid Searches. He also indicated example matrices. So the next step was downloading and compiling NCBI Blast sources, and getting Apple-Genentech’s G5-optimized Blastall. Then for each nucleotide sequence database I wanted to blast against, I had to:

_ Note: -F F eliminates filtering of reduced complexity regions (repeats). -M BLOSUM80 specifies my custom matrix located in the directory specified in my ~/.ncbirc[NCBI] Data=/Users/admin/Documents/utils/blast_data_

For testing purposes, I used a subset of Apis mellifera’s 20050120 release: just one contig, vs. a sequence on which I simulated bi-sulfite treatement. This let me test the custom scoring matrix to give an increased difference in score between my test sequences real genomic and other similar sequences. Changes made included:

I thus used the following custom matrix:

`

Custom nucleotide alignment matrix for bisulfite-treated query sequence

A R N D C Q E G H I L K M F P S T W Y V B Z X * A 5 0 0 0 -4 0 0 -4 0 0 0 0 0 0 0 0 -4 0 0 0 0 0 0 0 R 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 N 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 D 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 C -4 0 0 0 20 0 0 -4 0 0 0 0 0 0 0 0 -4 0 0 0 0 0 0 0 Q 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 E 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 G -4 0 0 0 -4 0 0 5 0 0 0 0 0 0 0 0 -4 0 0 0 0 0 0 0 H 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 I 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 L 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 K 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 M 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 F 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 P 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 S 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 T -4 0 0 0 0 0 0 -4 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 W 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Y 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 V 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 B 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Z 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 X 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

use warnings; use strict; use Bio::Seq; use Bio::SeqIO;

get command-line arguments, or die with a usage statement

my $usage = “reverseComplementOfInputFasta.pl infile outfile\n”; my $infile = shift or die $usage; my $outfile = shift or die $usage;

my $seq_in = Bio::SeqIO->new(‘-file’ => “< $infile”, ‘-format’ => ‘fasta’); my $seq_out = Bio::SeqIO->new(‘-file’ => “>$outfile”, ‘-format’ => ‘fasta’);

write each entry in the input file to the output file

while (my $inseq = $seq_in->next_seq) { my $reversecomplement = $inseq->revcom; my $id = $inseq->display_id() . “revcom”; $reversecomplement->display_id($id);

$seq_out->write_seq($reversecomplement); }

exit;


March 2, 2005