Written by Hiten Chowdhary, cross-posted from [http://www.hiten.io/blog/articles/gsoc-17/](http://www.hiten.io/blog/articles/gsoc-17/)
This is a summary of the work done during this summer as part of Google Summer of Code 2017 with the organisation Open Genome Informatics, under the guidance of my mentors Yannick Wurm and Anurag Priyam.
Problem Statement: Performance and user centric improvements to Afra’s annotation editor.
Brief explanation: Gene prediction models are visually inspected and manually corrected for any mistakes. Curation of gene models is carried out on Afra, a crowdsourcing platform. Afra has two components: an annotation editor and a task processor. The annotation editor is built using JBrowse and WebApollo. This project focuses on migrating Afra to the latest JBrowse and getting a unit test suite ready, to optimize the annotation editor and ease the learning curve of manual curation.
To migrate Afra, we took a look at JBrowse's and Afra's codebases. We examined the files added to Afra's codebase that implement the annotation editor functionality along with Afra's additional features. These differences were carried over to JBrowse as a plugin, providing annotation editor functionality that can easily be plugged into JBrowse or any other genome browser.
The annotation editor uses Jasmine for unit tests. Jasmine was easy to set up and can be run using a simple web server. I examined the tests already implemented for different functionalities of Afra, which gave me insight into how Afra's tests are structured. Further, I added some tests for the annotation editor functionality. These tests were:
Here is the link to my commits for the test suite.
We started building a plugin for JBrowse to carry over the annotation editor functionality. This lets Afra's annotation editing functionality use the latest JBrowse features, making the migration process smoother. The plugin development took place in the following steps:
We now had a basic plugin in place that implemented the core annotation editing functionality. (commit)
Next, Afra's additional features had to be carried over. These features are:
Finally, we successfully implemented Afra's annotation editor as a JBrowse plugin, along with Afra's additional features. To verify that all the annotation editing functionalities work properly, we implemented a test suite for the plugin. (commit)
August 29, 2017
The Evolution meeting took place this June in Portland, Oregon. This meeting is organised every year and is one of the largest conferences in our field.
Creating their schedule from an incredibly rich program of up to 13 parallel sessions, Natalia and Emeline caught up with the latest on social systems, genomics and evolutionary stories. They also participated in the conference:
The meeting was a wonderful occasion to catch up with collaborators and present the Wurmlab's research to this (mainly) North American audience. Next year, Evolution will join ESEB in Montpellier (France) for a joint meeting.
July 10, 2017
The Royal Entomological Society’s Insect Genomics Interest Group meeting took place on May 16th at the Rothamsted Research Centre.
The speakers at this event included highly relevant researchers from all over the world, covering a wide range of subjects in insect genomics, from pest control to evolutionary biology. Of course the Wurmlab could not miss this chance: Carlos Martinez-Ruiz, Joe Colgan and Anurag Priyam attended the event to disseminate our research:
The meeting was very fruitful, and offered exciting opportunities for collaborations between the Wurmlab and other researchers from institutions in the UK and abroad.
We look forward to the next meeting!
May 16, 2017
Our paper on the genetic diversity of the fire ant social chromosomes has been accepted in Molecular Ecology and is now available online!
The fire ant social chromosomes carry a supergene that controls the number of queens in a colony. We describe a few features of this supergene system:
You can check out the press release, which covers some of the details about our work.
The full reference is: R Pracana, A Priyam, I Levantis, RA Nichols and Y Wurm (2017) The fire ant social chromosome supergene variant Sb shows low diversity but high divergence from SB. Molecular Ecology. DOI: 10.1111/mec.14054
February 21, 2017
Just a brief update to:
January 30, 2017
Google Summer of Code 2016 has just come to an end. Thanks to our host organisations Open Genome Informatics and Open Bioinformatics Foundation, we've had a productive summer with two excellent students. Both students wrote blog posts summarizing their work.
September 1, 2016
This post is going to be about my GSoC 2016 project under Open Genome Informatics organisation along with Anurag Priyam and Yannick Wurm as my mentors.
Problem statement: BLAST visualizations library for BioRuby and SequenceServer.
Brief explanation: It is now trivial to generate large amounts of DNA sequence data; the challenge lies in making sense of the data. In many cases, researchers can gain important insights by comparing newly obtained sequences to previously known sequences using BLAST (>100,000 citations). This project focuses on creating a reusable visualizations library for BLAST results and integrating it with SequenceServer, a popular BLAST server. BLAST results are text based and lack rich visual representation; having visualizations can greatly facilitate interpretation of the data.
By this time I had played around with the SequenceServer code enough to know how it worked, and it was time to get down to the real part of the project. I started with the Length Distribution graph, a simple histogram of hit sequence lengths. The rectangles are colored using a grey scale: the darker the shade, the more significant the hit. The graph gives the user an overview of the hit sequence lengths, and hovering over a rectangle also shows the length of the query sequence. It can help users with annotation, identifying proteins across species, validating gene predictions, and so on. The graph was drawn using d3.js.
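The binning and grey-scale logic behind such a histogram can be sketched in plain JavaScript. The bin width and the shade scale below are illustrative choices, not SequenceServer's actual implementation, and the SVG drawing itself is left to d3.js:

```javascript
// Group hit lengths into fixed-width bins for the histogram.
function binHitLengths(hits, binWidth) {
  const bins = new Map();
  for (const hit of hits) {
    const bin = Math.floor(hit.length / binWidth) * binWidth;
    if (!bins.has(bin)) bins.set(bin, []);
    bins.get(bin).push(hit);
  }
  return bins;
}

// Map an e-value to a grey shade: lower e-value (more significant hit)
// gives a darker rectangle. An e-value of exactly 0 maps to black.
function greyShade(evalue) {
  const exponent = evalue === 0 ? 200 : -Math.log10(evalue);
  const clamped = Math.max(0, Math.min(exponent, 200));
  const channel = Math.round(200 * (1 - clamped / 200));
  return `rgb(${channel},${channel},${channel})`;
}

const hits = [
  { length: 480, evalue: 1e-50 },
  { length: 512, evalue: 1e-5 },
  { length: 1020, evalue: 2e-80 },
];
const bins = binHitLengths(hits, 100);
// bins groups the hits as: 400 -> [480], 500 -> [512], 1000 -> [1020]
```

In the real graph, d3.js would turn each bin into a rectangle whose height is the bin count, filled with the shade of its most significant hit.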
Next I started with the Circos visualization. SequenceServer already had Query against Hit, which shows the alignment mapping between a hit sequence and the query sequence, and Alignment overview, which shows the alignment mapping between a query sequence and all its hit sequences. The Circos visualization adds the alignment mapping of multiple query and hit sequences to this arsenal. It is a simple circos-style graph in which chords represent the query and hit sequences and ribbons represent the alignment mappings. Chords representing query sequences are green; those representing hit sequences are blue. The ribbons follow a red-yellow-blue color scheme, with red marking the most significant hit for a query and blue the least significant. Hovering over a ribbon shows its details, such as the region the alignment covers on the query or hit sequence and the probability that the match arose by chance. Clicking on a chord shows its alignments. This was drawn using CircosJs.
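The red-yellow-blue ribbon colouring can be sketched as a rank-based mapping. This is a simplified stand-in for illustration, with an assumed three-bucket palette, not the gradient CircosJs actually renders:

```javascript
// Rank a query's hits by significance (ascending e-value) and map each
// ribbon onto the red-yellow-blue scheme: best hit red, worst hit blue.
function ribbonColors(hits) {
  const ranked = [...hits].sort((a, b) => a.evalue - b.evalue);
  const palette = ['red', 'yellow', 'blue'];
  const colors = new Map();
  ranked.forEach((hit, i) => {
    // t runs from 0 (most significant) to 1 (least significant).
    const t = ranked.length === 1 ? 0 : i / (ranked.length - 1);
    colors.set(hit.id, palette[Math.round(t * (palette.length - 1))]);
  });
  return colors;
}

const colors = ribbonColors([
  { id: 'hit1', evalue: 1e-80 },
  { id: 'hit2', evalue: 1e-20 },
  { id: 'hit3', evalue: 1e-3 },
]);
// colors maps: hit1 -> 'red', hit2 -> 'yellow', hit3 -> 'blue'
```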
Later I started refining the graphs SequenceServer already provided. Now that we had four different visualizations sharing a lot of common code, we decided to make the code modular: this keeps the code cleaner, makes adding new visualizations much easier, and simplifies changes to existing ones. In Query against Hit, the polygons were labelled alphabetically to make it easier for the user to see which polygon corresponds to which alignment details shown below the graph.
For Alignment Overview I refactored the code to use ES6 modules, which all the other visualizations now use too. I reduced the height of each graph so that the user can see at a glance what options are provided and proceed accordingly. Users can download the graphs in SVG or PNG format.
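The alphabetical labelling used for the Query against Hit polygons can be generated with a small helper. This is a sketch using spreadsheet-style base-26 labels; SequenceServer's actual scheme for more than 26 alignments may differ:

```javascript
// Convert a zero-based alignment index to an alphabetical label:
// 0 -> 'a', 1 -> 'b', ..., 25 -> 'z', 26 -> 'aa', 27 -> 'ab', ...
function alignmentLabel(index) {
  let label = '';
  let n = index;
  do {
    label = String.fromCharCode(97 + (n % 26)) + label;
    n = Math.floor(n / 26) - 1;
  } while (n >= 0);
  return label;
}

// alignmentLabel(0) -> 'a', alignmentLabel(25) -> 'z', alignmentLabel(26) -> 'aa'
```

Each polygon in the graph then carries the same label as the alignment details listed below it, so the two can be matched at a glance.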
My initial proposal was to add four new visualizations, but after detailed discussion with my mentors we decided that, given the level of detail each visualization required, we should limit ourselves to two.
Here is my list of commits
August 23, 2016
Congratulations to our 2016 Google Summer of Code students! We are proud & excited to host them:
Hiten Chowdhary (Indian Institute of Technology, Kharagpur) will create BLAST result visualization methods for BioRuby and SequenceServer. This work should significantly facilitate the interpretation of results produced with our Sequenceserver custom BLAST-ing tool (see
Julian Mazzitelli (U Toronto) will improve Bionode’s capabilities for performing analyses of streams of biological data in real-time as they are downloaded, computed, or generated. This project is part of the Open Bioinformatics Foundation; supervision by Bruno Vieira, Max Ogden, Mathias Buus & Yannick.
April 25, 2016
Happy to announce that we now have a manuscript describing the rationale and current features of SequenceServer, our easy-to-set-up BLAST frontend. Importantly, the manuscript also provides extensive detail about the sustainable software development and user-centric design approaches we used to build this software. The full bioRxiv reference is:
Sequenceserver: a modern graphical user interface for custom BLAST databases 2015. Priyam, Woodcroft, Rai, Munagala, Moghul, Ter, Gibbins, Moon, Leonard, Rumpf and Wurm. bioRxiv doi: 10.1101/033142 [PDF].
Be sure to check out the interactive figure giving a guided tour of Sequenceserver’s BLAST results.
Finally, I’ll note that Sequenceserver arose from our own needs; these are clearly shared by many as Sequenceserver has already been cited in ≥20 publications and has been downloaded ≥30,000 times! Thanks to all community members who have made this tool successful.
February 1, 2016
The dramatic plunge in DNA sequencing costs means that a single MSc or PhD student can now generate data that would have cost $15,000,000 only ten years ago. We are thus leaping from lab-notebook-scale science to research that requires extensive programming, statistics and high performance computing.
This is exciting & empowering – in particular for small teams working on emerging model organisms that lacked genomic resources. But with great powers come great responsibilities… and risks of doing things wrong. These risks are far greater for genome biologists than for, say, physicists or astronomers, who have strong traditions of working with large datasets. In particular:
Crucially, data analysis problems can be invisible: the analysis runs, the results seem biologically meaningful and are wonderfully interpretable, but they may in fact be completely wrong.
Geoffrey Chang’s story is an emblematic example. By the mid-2000s this young superstar professor crystallographer had won prestigious awards and published high-profile papers providing 3D-structures of important proteins. For example:
But in 2006, others independently obtained the 3D structure of an ortholog to one of those proteins. Surprisingly, the orthologous structure was essentially a mirror-image of Geoffrey Chang’s result.
Rigorously double-checking his scripts, Geoffrey Chang then realized that “an in-house data reduction program introduced a change in sign [...]”.
In other words, a simple +/- error led to plausible and highly publishable but dramatically flawed results. He retracted all five papers.
Devastating for him, for his career, for the people working with him, for the hundreds of scientists who based follow-up analyses and experiments on the flawed 3D structures, and for the taxpayers or foundations funding the research. A small but costly mistake.
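To see how subtle such a sign error can be: flipping the sign of one coordinate axis mirrors a 3D structure while leaving every pairwise distance intact, so distance-based checks still pass. A toy sketch (invented coordinates, nothing to do with the actual crystallographic data):

```javascript
// Four points forming a (toy) chiral arrangement in 3D.
const points = [[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]];

// The sign bug: negate one axis, mirroring the whole structure.
const mirrored = points.map(([x, y, z]) => [-x, y, z]);

const dist = (a, b) => Math.hypot(a[0] - b[0], a[1] - b[1], a[2] - b[2]);

// All pairwise distances are unchanged by the mirror...
for (let i = 0; i < points.length; i++) {
  for (let j = i + 1; j < points.length; j++) {
    console.assert(dist(points[i], points[j]) === dist(mirrored[i], mirrored[j]));
  }
}

// ...but handedness flips: the signed volume of the tetrahedron
// (a scalar triple product) changes sign.
const signedVolume = (p) => {
  const [a, b, c, d] = p;
  const u = [b[0] - a[0], b[1] - a[1], b[2] - a[2]];
  const v = [c[0] - a[0], c[1] - a[1], c[2] - a[2]];
  const w = [d[0] - a[0], d[1] - a[1], d[2] - a[2]];
  return u[0] * (v[1] * w[2] - v[2] * w[1])
       - u[1] * (v[0] * w[2] - v[2] * w[0])
       + u[2] * (v[0] * w[1] - v[1] * w[0]);
};
// signedVolume(points) is 1; signedVolume(mirrored) is -1
```

Every distance-based quality metric is identical for the two structures; only a chirality-aware check catches the flip.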
A +/- sign mistake seems like it should be easily detectable. But how do you ensure that experiments requiring complicated data generation and complex analysis pipelines with interdependencies and sophisticated data structures yield correct results?
We can take inspiration from software developers in internet startups: similarly to academic researchers, they form small teams of qualified people to do great things with new technologies. Their approaches for making software robust can help us make our research robust.
An important premise is that humans make mistakes. Thus (almost) all analysis code includes mistakes, at least initially; this includes unix commands, R, and perl/python/ruby/node scripts, et cetera. Increasing the robustness of our analyses thus requires becoming better at detecting mistakes, but also ensuring that we make fewer mistakes in the first place. Many approaches exist for this. For example:
This list overlaps at least in part with what has been written elsewhere and my coursework material. In my lab we do our best to follow best practices for the bioinformatics tools we develop and our research on social evolution.
Additionally, the essentials of experimental design are long established: ensuring sufficient power, avoiding confounding factors & pseudoreplication (see above & elsewhere), and using appropriate statistics. Particular caution should be used with new technologies as they include sources of bias that may not be immediately obvious (e.g. Illumina lane, extraction order…).
There is no way around it: analysing large datasets is hard.
When genomics projects involved tens of millions of dollars, much of the money went to teams of dedicated data scientists, statisticians and bioinformaticians who could ensure data quality and analysis rigor. As sequencing has become cheaper, the challenges and costs have shifted even further towards data analysis. For large-scale human resequencing projects this is well understood. The challenges are even greater for organisms with only few genomic resources, yet surprisingly many PIs, researchers and funders focusing on such organisms suppose that individual researchers with little formal training will be able to perform all the necessary analyses. This is worrying: it suggests that important stakeholders who still have limited experience of large datasets underestimate how easily mistakes with major negative consequences can occur and go undetected. We may have to see additional publication retractions before awareness of the risks fully takes hold.
Thankfully, multiple initiatives are improving visibility of the data challenges we face (e.g., 1, 2, 3, 4, 5, 6). Such visibility of the risks – and of how easy it is to implement practices that will improve research robustness – needs to grow among funders, researchers, PIs, journal editors and reviewers. This will ultimately bring more people to do better, more trustworthy science that will never need to be retracted.
This post came together thanks to the SSI Collaborations workshop, Bosco K Ho’s post on Geoffrey Chang, discussions in my lab and through interactions with colleagues at the social insect genomics conference and the NESCent Genome Curation group. YW is funded by the Biotechnology and Biological Sciences Research Council [BB/K004204/1], the Natural Environment Research Council [NE/L00626X/1, EOS Cloud] and is a fellow of the Software Sustainability Institute.
June 2, 2015