Ants, Bees, Genomes & Evolution

@ Queen Mary University London

Performance and user centric updates to Afra's annotation editor

Written by Hiten Chowdhary, cross-posted from http://www.hiten.io/blog/articles/gsoc-17/

This is a summary of the work done during this summer as part of Google Summer of Code 2017 under the organisation Open Genome Informatics under the guidance of my mentors Yannick Wurm and Anurag Priyam.

About the project

Problem Statement: Performance and user centric improvements to Afra’s annotation editor.

Brief explanation: Gene prediction models are visually inspected and manually corrected for any mistakes. Curation of gene models is carried out on Afra, a crowdsourcing platform. Afra has two components - an annotation editor and a task processor. The annotation editor is built using JBrowse and WebApollo. This project focuses on migrating Afra to the latest JBrowse and getting a unit test suite ready to optimize the annotation editor and ease the learning curve of manual curation.

Project Summary

Before the project started, I was fairly acquainted with JavaScript and Ruby. Initially, I got familiar with JBrowse’s codebase and understood the upgrades and changes made across its versions. We planned to get the test suite ready before migrating Afra. With the test suite in place, the migration could proceed without hassle: one can run the tests to ensure the annotation editing functionality works properly.

For the migration of Afra, we took a look at JBrowse’s and Afra’s codebases. We examined the files added to Afra’s codebase that implement the annotation editor functionality along with Afra’s additional features. These differences were carried over to JBrowse as a plugin, providing annotation editor functionality that can easily be plugged into JBrowse or any other genome browser.

Test Suite

The annotation editor uses Jasmine for unit tests. Jasmine was easy to set up and the tests can be run from a simple web server. I examined the tests already implemented for different parts of Afra, which gave me an insight into how tests are structured in Afra. I then added some tests for the annotation editor functionality (a minimal example follows the list below). These tests were:

  • Flip strand: Checks whether the annotation editor correctly flips the strand of the transcript.
  • Set longest ORF: Checks whether the annotation editor correctly sets the longest open reading frame for a given transcript.
  • Mark non-canonical splice sites: Checks whether the annotation editor duly marks the non-canonical splice sites of the transcript.
  • Mark non-canonical translation start site: Checks whether the annotation editor duly marks the non-canonical translation start sites of the transcript.
  • Mark non-canonical translation stop site: Checks whether the annotation editor duly marks the non-canonical translation stop sites of the transcript.
  • Filter features: Checks whether the annotation editor properly filters a given type of feature (say, exons) from a given set of transcripts.
  • Copy feature: Checks whether the annotation editor makes an exact copy of the features passed to it.
  • Merge exons: Checks whether the annotation editor suitably merges two given exons and returns a new transcript.
  • Create transcript: Checks whether the annotation editor creates a new transcript, using SimpleFeature, from the provided features.
  • Resize exon: Checks whether the annotation editor adequately resizes exons.
  • Get cDNA coordinates: Checks whether the annotation editor returns correct cDNA coordinates for a given transcript.
  • Delete exon: Checks whether the annotation editor properly deletes an exon from a given transcript.
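As an illustration, a minimal Jasmine spec in the spirit of these tests could look as follows. The flipStrand helper and the feature representation are assumptions made for this sketch, not Afra’s actual API.

    // Hypothetical spec: flipping the strand of a transcript-like feature.
    describe('Annotation editor', function () {
      // stand-in for the editor's flip-strand operation (illustrative only)
      function flipStrand(feature) {
        return { start: feature.start, end: feature.end, strand: -feature.strand };
      }

      it('flips the strand without changing the coordinates', function () {
        var transcript = { start: 100, end: 500, strand: 1 };
        var flipped = flipStrand(transcript);
        expect(flipped.strand).toEqual(-1);
        expect(flipped.start).toEqual(100);
        expect(flipped.end).toEqual(500);
      });
    });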

Here is the link to my commits for the test suite.

Migrating Afra to latest JBrowse

We started building a plugin for JBrowse to carry the annotation editor functionality over to JBrowse. This allows Afra’s annotation editing functionality to take advantage of the latest JBrowse features, making the migration process smoother. The plugin development took place in the following steps:

  • Edit track implementation: The first task was adding a new track to JBrowse’s browser for the plugin’s initialization.
  • Adding drag and drop: Next, we included the appropriate jQuery scripts and modified the properties of existing tracks so that features can be dragged along the y axis and dropped onto the edit track for annotation (see the sketch after this list).
  • Edit track drop capabilities: Subsequently, we initialized the edit track and made it capable of accepting incoming features and passing them to the annotation editor for further modification.
  • Accessing the annotation editor’s functionality: Now that we could successfully drag and drop a feature onto the edit track, it was time to use the annotation editor’s functionality to make the required changes. This list of functions is available by right-clicking the feature in the edit track.
  • Getting right-click menu items working: By this stage we had a front end in place to access the annotation editor’s functions. Now it was time to get functions such as get sequence, send to GeneValidator and resize exon working.
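A rough sketch of the drag-and-drop wiring, assuming jQuery UI’s draggable and droppable widgets; the selectors and the addFeatureToEditTrack hook are illustrative assumptions rather than the plugin’s actual code.

    // Make feature elements draggable along the y axis only,
    // and let the edit track accept them on drop.
    $('.feature').draggable({ axis: 'y', helper: 'clone', revert: 'invalid' });

    $('#edit-track').droppable({
      accept: '.feature',
      drop: function (event, ui) {
        // hand the dropped feature to the annotation editor (hypothetical hook)
        addFeatureToEditTrack(ui.draggable.data('feature'));
      }
    });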

Now we had a basic plugin in place that implemented the core annotation editing functionality. (commit)
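For orientation, a minimal JBrowse 1.x plugin skeleton might look like the following; the module path, plugin name and milestone callback body are illustrative assumptions, not the actual Afra plugin code.

    // plugins/AnnotationEditor/js/main.js (illustrative skeleton)
    define(['dojo/_base/declare', 'JBrowse/Plugin'],
    function (declare, JBrowsePlugin) {
      return declare(JBrowsePlugin, {
        constructor: function (args) {
          var browser = args.browser;   // the JBrowse Browser instance
          // once the browser view is ready, add the edit track and
          // wire up drag and drop plus the right-click menu (hypothetical)
          browser.afterMilestone('initView', function () {
            // setup code would go here
          });
        }
      });
    });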

Next, Afra’s additional features had to be carried over. These features are:

  • We implemented triple click on a track to zoom directly to base-pair level (a sketch follows this list). (commit)
  • Following this, we replaced JBrowse’s reference sequence track with Afra’s. Now, when a feature in the edit track is selected, the corresponding region of the reference sequence gets highlighted. (commit)
  • We added a feature to view the residues (the actual genomic sequence, ‘atgc’) of the feature in the edit track when it is selected. (commit)
  • We implemented an additional feature to validate the feature dropped into the edit track. We check for non-canonical translation start and stop sites and mark them with orange exclamation marks. This helps the user spot non-canonical sites without inspecting the sequence of the feature. (commit)
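As an example of how the triple-click behaviour can be detected, browsers report a click count on mouse events; in the sketch below the coordinate conversion helper and the navigateToLocation call are assumptions about the available JBrowse API, not the plugin’s actual code.

    // Zoom to base-pair level on the third click of a triple click (illustrative).
    trackNode.addEventListener('click', function (event) {
      if (event.detail === 3) {                    // click count reported by the browser
        var bp = pixelToBasePair(event.clientX);   // hypothetical coordinate helper
        browser.navigateToLocation({ ref: refSeqName, start: bp - 10, end: bp + 10 });
      }
    });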

Finally, we had successfully implemented Afra’s annotation editor as a JBrowse plugin, along with Afra’s extra features. Further, to test whether all the annotation editing functionality works properly, we implemented the test suite for the plugin. (commit)

August 29, 2017


Emeline and Natalia at Evolution meeting (Portland, Oregon)

The Evolution meeting took place this June in Portland, Oregon. This meeting is organised every year and is one of the largest conferences in our field.

Creating their schedule from an incredibly rich program of up to 13 parallel sessions, Natalia and Emeline caught up with the latest on social systems, genomics and evolutionary stories. They also participated in the conference:

  • Emeline presented the results of her first year of PhD on the genomics of the socially polymorphic ant species Pheidole pallidula
  • Natalia gave a talk on DNA methylation in bees. This was her final presentation before her PhD viva. We are very happy to congratulate her on successfully defending her thesis - congratulations Dr Natalia Araujo!

The meeting was a wonderful occasion to catch up with collaborators and to present the Wurmlab’s research to this (mainly) North American audience. Next year, Evolution will join ESEB in Montpellier (France) for a joint meeting.

July 10, 2017


Joe, Carlos and Priyam present at Insect Genomics meeting

The Royal Entomological Society’s Insect Genomics Interest Group meeting took place on May 16th at the Rothamsted Research Centre.

The speakers at this event included leading researchers from all over the world, covering a wide range of subjects in insect genomics, from pest control to evolutionary biology. Of course the Wurmlab could not miss this chance: Carlos Martinez-Ruiz, Joe Colgan and Anurag Priyam attended the event to present our research:

  • Joe gave a talk on the signatures of positive selection in the genomes of bumblebees (Bombus terrestris) from populations around the UK
  • Carlos presented the preliminary results of his PhD project on the signatures of evolutionary conflict in the transcriptome of the fire ant Solenopsis invicta
  • Priyam presented a poster explaining the benefits of using SequenceServer as a BLAST interface and of using GeneValidator to assess the quality of gene predictions.

The meeting was very fruitful, and offered exciting opportunities for collaborations between the Wurmlab and other researchers from institutions in the UK and abroad.

We look forward to the next meeting!

May 16, 2017


Paper on Social Chromosomes Accepted

Our paper on the genetic diversity of the fire ant social chromosomes has been accepted in Molecular Ecology and is now available online!

The fire ant social chromosomes carry a supergene that controls the number of queens in a colony. We describe a few features of this supergene system:

  • The two variants of the social chromosomes are differentiated from each other over the supergene region, but without any evidence of evolutionary strata.
  • There is a large number of non-synonymous substitutions between the two variants.
  • The never recombining variant Sb is almost fixed in the North American population.

You can check out the press release, which covers some of the details about our work.

The full reference is: R Pracana, A Priyam, I Levantis, RA Nichols and Y Wurm (2017) The fire ant social chromosome supergene variant Sb shows low diversity but high divergence from SB. Molecular Ecology. DOI: 10.1111/mec.14054

February 21, 2017


Brief New Year's update

Just a brief update to:

January 30, 2017


Excellent work by GSoC Bioinformatics students

Google Summer of Code 2016 has just come to an end. Thanks to our host organisations, Open Genome Informatics and the Open Bioinformatics Foundation, we’ve had a productive summer with two excellent students. Both students wrote blog posts summarizing their work.

As the finishing touches are implemented, we look forward to being able to deploy the work of these students into production releases of SequenceServer and Bionode.

September 1, 2016


Blast Visualization Google Summer of Code

Written by Hiten Chowdhary, cross-posted from http://www.hiten.io/blog/articles/gsoc-16/

This post is about my GSoC 2016 project under the Open Genome Informatics organisation, with Anurag Priyam and Yannick Wurm as my mentors.

About the project

Problem statement: BLAST visualizations library for BioRuby and SequenceServer.

Brief explanation: It is now trivial to generate large amounts of DNA sequence data; the challenge lies in making sense of the data. In many cases, researchers can gain important insights by comparing newly obtained sequences to previously known sequences using BLAST (>100,000 citations). This project focuses on creating a reusable visualizations library for BLAST results and integrating it with SequenceServer, a popular BLAST server. BLAST results are text based and lack rich visual representation; visualizations can greatly facilitate the interpretation of the data.

Warming Up

Before the project started, I was fairly acquainted with Ruby and JavaScript, so I began with small bug fixes in order to get acquainted with the SequenceServer code. SequenceServer supports downloading the generated report in XML or TSV format. When one clicks the download button, the file is generated and stored as a temporary file until the download completes. This process was repeated every time the download button was clicked, so we decided to keep the generated files in both formats: if the user needs a file again, the download starts directly instead of regenerating it. I also worked on some error handling issues, just to get comfortable with the Ruby part of the project. I helped improve the XML parsing of files, added checks for integrity issues, and handled the case where the report a user was searching for could not be found. I added checks in various places and raised appropriate error messages to help the user figure out what was going wrong.

Visualizations

By this time I had played around with the SequenceServer code enough to know how it works, and it was time to get down to the real part of the project. I started with the Length Distribution graph. It is a simple histogram representing the length frequency of hit sequences. The rectangles are colored using a grey scale: the darker the shade, the more significant the hit. The graph gives the user an idea of the lengths of all the hit sequences and, when one hovers over the rectangles, of the length of the query sequence. It will also help users with annotation, identifying proteins across species, validating gene predictions, etc. The graph was drawn using d3.js.
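A minimal d3.js sketch in the spirit of that histogram is shown below; the data, bin thresholds and the e-value-to-grey mapping are illustrative assumptions (using the d3 v4 API), not SequenceServer’s actual implementation.

    // Illustrative length-distribution histogram with significance-based shading.
    var hits = [ { hitLength: 220, evalue: 1e-50 },
                 { hitLength: 240, evalue: 1e-10 },
                 { hitLength: 400, evalue: 1e-3 } ];

    var width = 400, height = 150;
    var x = d3.scaleLinear().domain([0, 500]).range([0, width]);
    var y = d3.scaleLinear().domain([0, 3]).range([height, 0]);
    // darker grey for more significant hits (smaller e-values)
    var shade = d3.scaleLog().domain([1e-50, 1]).range(['#222222', '#cccccc']);

    var bins = d3.histogram()
        .value(function (d) { return d.hitLength; })
        .domain(x.domain())
        .thresholds(x.ticks(10))(hits);

    d3.select('body').append('svg')
        .attr('width', width).attr('height', height)
      .selectAll('rect').data(bins).enter().append('rect')
        .attr('x', function (d) { return x(d.x0); })
        .attr('y', function (d) { return y(d.length); })
        .attr('width', function (d) { return Math.max(0, x(d.x1) - x(d.x0) - 1); })
        .attr('height', function (d) { return height - y(d.length); })
        .attr('fill', function (d) {
          // shade by the most significant hit in the bin; empty bins stay light
          var best = d3.min(d, function (h) { return h.evalue; });
          return best === undefined ? '#eeeeee' : shade(best);
        });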

[Screenshots: length distribution histogram, default and hover states]

Next I started with the Circos visualization. Currently SequenceServer has Query against Hit, which shows the alignment mapping between a hit sequence and the query sequence, and Alignment Overview, which shows the alignment mapping between a query sequence and all its hit sequences. The Circos visualization adds alignment mapping of multiple query and hit sequences to this arsenal. It is a simple Circos-based graph, with chords representing the query and hit sequences and ribbons representing the alignment mappings. The chords representing query sequences are green, and those representing hit sequences are blue. The ribbons use a red-yellow-blue color scheme, with red representing the most significant hit for a query and blue the least significant. One can hover over a ribbon to view details such as the region the alignment covers on the query or hit sequence and the probability that the match occurred by chance. One can click on a chord to view its alignments. This was drawn using CircosJs.

[Screenshots: Circos visualization, with chord selection and ribbon hover states]

Later I started refining the graphs that SequenceServer already provided. Now that we have four different visualizations, many of which share a lot of code, we decided to make the code modular: this makes the code cleaner, makes it much easier to add new visualizations in the future, and simplifies changes to the current ones. In Query against Hit, the polygons are now labelled alphabetically to make it easier for the user to see which polygon corresponds to which alignment details provided below the graph.

[Screenshot: Query against Hit graph]

For Alignment Overview, I refactored the code to use ES6 modules, which are now used by all the other visualizations too. I reduced the height of each graph so that the user can see at a glance what options are provided and proceed accordingly. Users can download the graphs in SVG or PNG format.
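As an illustration of this modularization, shared helpers can be pulled into an ES6 module that each visualization imports; the file and function names below are hypothetical, not SequenceServer’s actual code.

    // graph_utils.js - hypothetical shared helpers used by several visualizations
    export function downloadSVG(svgNode, filename) {
      // serialize the SVG element and trigger a download in the browser
      var data = new XMLSerializer().serializeToString(svgNode);
      var blob = new Blob([data], { type: 'image/svg+xml' });
      var link = document.createElement('a');
      link.href = URL.createObjectURL(blob);
      link.download = filename;
      link.click();
    }

    // alignment_overview.js - each visualization imports only what it needs
    import { downloadSVG } from './graph_utils';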

[Screenshots: alignment overview, default and hover states]

My initial proposal was to add four new visualizations, but after detailed discussion with my mentors we decided that, given the level of detail required by each visualization, we should limit ourselves to two.

Here is my list of commits

August 23, 2016


Google Summer of Code 2016

Congratulations to our 2016 Google Summer of Code students! We are proud & excited to host them:

April 25, 2016


Sequenceserver manuscript

Happy to announce that we now have a manuscript describing the rationale and current features of SequenceServer - our easy-to-set-up BLAST frontend. Importantly, the manuscript also provides extensive detail about the sustainable software development and user-centric design approaches we used to build this software. The full bioRxiv reference is:

Priyam, Woodcroft, Rai, Munagala, Moghul, Ter, Gibbins, Moon, Leonard, Rumpf and Wurm (2015) Sequenceserver: a modern graphical user interface for custom BLAST databases. bioRxiv doi: 10.1101/033142 [PDF].

Be sure to check out the interactive figure giving a guided tour of Sequenceserver’s BLAST results.

Finally, I’ll note that Sequenceserver arose from our own needs; these are clearly shared by many as Sequenceserver has already been cited in ≥20 publications and has been downloaded ≥30,000 times! Thanks to all community members who have made this tool successful.

February 1, 2016


Avoid having to retract your genomics analysis.

Please cite the Winnower version of this article

2016 edit: these slides are from a 15-minute talk based on this post, given at Popgroup in December 2015 in Edinburgh:

Biology is a data-science

The dramatic plunge in DNA sequencing costs means that a single MSc or PhD student can now generate data that would have cost $15,000,000 only ten years ago. We are thus leaping from lab-notebook-scale science to research that requires extensive programming, statistics and high performance computing.

This is exciting & empowering – in particular for small teams working on emerging model organisms that lacked genomic resources. But with great powers come great responsibilities… and risks of doing things wrong. These risks are far greater for genome biologists than for, say, physicists or astronomers, who have strong traditions of working with large datasets. In particular:

  • biologist researchers generally learn data handling skills ad hoc with little knowledge of best practices;
  • PIs – having never themselves handled huge datasets – have difficulties critically evaluating the data and approaches;
  • new data are often messy with no standard analysis approach; even so-called “standard” analysis methodologies generally remain young or approximative;
  • analyses intending to identify biologically interesting patterns (e.g., genome scans for positive selection, GO/gene set enrichment analyses) will enrich for technical artifacts and underlying biases in the data;
  • data generation protocols are immature & include hidden biases leading to confounding factors (when things you are comparing differ not only according to the trait of interest but also in how they were prepared) or pseudoreplication (when one independent measurement is considered as multiple measurements).

Invisible mistakes can be costly

Crucially, data analysis problems can be invisible: the analysis runs, the results seem biologically meaningful and are wonderfully interpretable, but they may in fact be completely wrong.

Geoffrey Chang’s story is an emblematic example. By the mid-2000s this young superstar professor crystallographer had won prestigious awards and published high-profile papers providing 3D-structures of important proteins. For example:

  • Science (2001) Chang & Roth. Structure of MsbA from E. coli: a homolog of the multidrug resistance ATP binding cassette (ABC) transporters.
  • Journal of Molecular Biology (2003) Chang. Structure of MsbA from Vibrio cholera: a multidrug resistance ABC transporter homolog in a closed conformation.
  • Science (2005) Reyes & Chang. Structure of the ABC transporter MsbA in complex with ADP vanadate and lipopolysaccharide.
  • Science (2005) Pornillos et al. X-ray structure of the EmrE multidrug transporter in complex with a substrate.
  • PNAS (2004) Ma & Chang. Structure of the multidrug resistance efflux transporter EmrE from Escherichia coli.

But in 2006, others independently obtained the 3D structure of an ortholog to one of those proteins. Surprisingly, the orthologous structure was essentially a mirror-image of Geoffrey Chang’s result.

Rigorously double-checking his scripts, Geoffrey Chang then realized that “an in-house data reduction program introduced a change in sign […]”.

In other words, a simple +/- error led to plausible and highly publishable but dramatically flawed results. He retracted all five papers.

Devastating for him, for his career, for the people working with him, for the hundreds of scientists who based follow-up analyses and experiments on the flawed 3D structures, and for the taxpayers or foundations funding the research. A small but costly mistake.

Approaches to limit the risks

A +/- sign mistake seems like it should be easily detectable. But how do you ensure that experiments requiring complicated data generation and complex analysis pipelines with interdependencies and sophisticated data structures yield correct results?

We can take inspiration from software developers in internet startups: similarly to academic researchers, they form small teams of qualified people to do great things with new technologies. Their approaches for making software robust can help us make our research robust.

An important premise is that humans make mistakes. Thus (almost) all analysis code includes mistakes, at least initially; this includes unix commands, R, perl/python/ruby/node scripts, et cetera. Increasing the robustness of our analyses therefore requires becoming better at detecting mistakes, but also ensuring that we make fewer mistakes in the first place. Many approaches exist for this. For example:

  • Every additional chunk of code can contain additional mistakes. Write less code, you’ll make fewer mistakes. For this we should try to reuse our own code and that of others (e.g., by using bio* libraries).
  • Every subset/function/method of every piece of code should be tested on fake data (edge cases) to ensure that results are as expected (see unit and integration testing). It can even be worthwhile to write the fake datasets and tests before writing the analysis code; a small example follows this list.
  • Continuous integration involves tests being automatically rerun (almost) instantly whenever a change is made anywhere in the analysis. This helps detect errors rapidly before performing full analyses.
  • Style guides define formatting and variable naming conventions (e.g., for ruby or R). Respecting one makes it easier for you to go back over your analysis two years later (e.g., for paper revisions or a new collaboration); and for others to reuse and improve it. Tools can automatically test whether your code is in line with the style guide (e.g., RLint, Rubocop, PyLint).
  • Rigorously tracking data and software versions and sticking to them reduces risks of unseen incompatibilities or inconsistencies. A standardized project structure can help.
  • Code reviews: having others look through your code – by showing it to them in person, or by making it open source – helps to learn how to improve code structure, to detect mistakes and to ensure that our code will be reusable by ourselves and others.
  • There are specialists who have years of experience in preventing and detecting mistakes in code or analyses. We should hire them.
  • Having people independently reproduce analyses using independent laboratory and computational techniques on independently obtained samples might be the best validation overall…
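To make the testing-on-fake-data point concrete, here is a minimal sketch (in JavaScript with Jasmine, as in the Afra post above); the reverseComplement function is purely illustrative, not taken from any of our tools.

    // A tiny analysis function and edge-case tests for it.
    function reverseComplement(seq) {
      var complement = { A: 'T', T: 'A', G: 'C', C: 'G' };
      return seq.split('').reverse().map(function (base) {
        if (!(base in complement)) { throw new Error('Unexpected base: ' + base); }
        return complement[base];
      }).join('');
    }

    describe('reverseComplement', function () {
      it('handles a typical sequence', function () {
        expect(reverseComplement('ATGC')).toEqual('GCAT');
      });
      it('handles the empty-sequence edge case', function () {
        expect(reverseComplement('')).toEqual('');
      });
      it('fails loudly on unexpected characters instead of returning nonsense', function () {
        expect(function () { reverseComplement('ATNN'); }).toThrow();
      });
    });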

This list overlaps at least in part with what has been written elsewhere and with my coursework material. In my lab we do our best to follow best practices for the bioinformatics tools we develop and our research on social evolution.

Additionally, the essentials of experimental design are long established: ensuring sufficient power, avoiding confounding factors & pseudoreplication (see above & elsewhere), and using appropriate statistics. Particular caution should be used with new technologies as they include sources of bias that may not be immediately obvious (e.g. Illumina lane, extraction order…).

There is hope

There is no way around it: analysing large datasets is hard.

When genomics projects involved tens of millions of dollars, much of that money went to teams of dedicated data scientists, statisticians and bioinformaticians who could ensure data quality and analysis rigor. As sequencing has become cheaper, the challenges and costs have shifted even further towards data analysis. For large-scale human resequencing projects this is well understood. Despite the challenges being even greater for organisms with few genomic resources, surprisingly many PIs, researchers and funders focusing on such organisms assume that individual researchers with little formal training will be able to perform all necessary analyses. This is worrying, and suggests that important stakeholders who still have limited experience of large datasets underestimate how easily mistakes with major negative consequences occur and go undetected. We may have to see additional publication retractions before awareness of the risks fully takes hold.

Thankfully, multiple initiatives are improving visibility of the data challenges we face (e.g., 1, 2, 3, 4, 5, 6). Such visibility of the risks – and of how easy it is to implement practices that will improve research robustness – needs to grow among funders, researchers, PIs, journal editors and reviewers. This will ultimately bring more people to do better, more trustworthy science that will never need to be retracted.

Acknowledgements

This post came together thanks to the SSI Collaborations workshop, Bosco K Ho’s post on Geoffrey Chang, discussions in my lab, and interactions with colleagues at the social insect genomics conference and the NESCent Genome Curation group. YW is funded by the Biotechnology and Biological Sciences Research Council [BB/K004204/1] and the Natural Environment Research Council [NE/L00626X/1, EOS Cloud], and is a fellow of the Software Sustainability Institute.

Please cite The Winnower version of this article

June 2, 2015


All Posts >>