Ants, Bees, Genomes & Evolution

@ Queen Mary University London


Joe, Carlos and Priyam present at Insect Genomics meeting

The Royal Entomological Society’s Insect Genomics Interest Group meeting took place on May 16th at the Rothamsted Research Centre.

The speakers at this event included prominent researchers from all over the world, covering a wide range of subjects in insect genomics, from pest control to evolutionary biology. Of course the Wurmlab could not miss this chance, and Carlos Martinez-Ruiz, Joe Colgan and Anurag Priyam attended the event to present our research:

  • Joe gave a talk on the signatures of positive selection in the genomes of bumblebees (Bombus terrestris) from populations around the UK.
  • Carlos presented the preliminary results of his PhD project on the signatures of evolutionary conflict in the transcriptome of the fire ant Solenopsis invicta.
  • Priyam presented a poster explaining the benefits of using SequenceServer as a BLAST interface and of using GeneValidator to assess the quality of gene predictions.

The meeting was very fruitful, and offered exciting opportunities for collaborations between the Wurmlab and other researchers from institutions in the UK and abroad.

We look forward to the next meeting!

May 16, 2017



Paper on Social Chromosomes Accepted

Our paper on the genetic diversity of the fire ant social chromosomes has been accepted in Molecular Ecology and is now available online!

The fire ant social chromosomes carry a supergene that controls the number of queens in a colony. We describe a few features of this supergene system:

  • The two variants of the social chromosomes are differentiated from each other over the supergene region, but without any evidence of evolutionary strata.
  • There is a large number of non-synonymous substitutions between the two variants.
  • The never-recombining variant Sb is almost fixed in the North American population.

You can check out the press release, which covers some of the details about our work.

The full reference is: R Pracana, A Priyam, I Levantis, RA Nichols and Y Wurm. (2017) The fire ant social chromosome supergene variant Sb shows low diversity but high divergence from SB. Molecular Ecology. DOI: 10.1111/mec.14054

February 21, 2017



Brief New Year's update

Just a brief update to:

January 30, 2017



Excellent work by GSoC Bioinformatics students

Google Summer of Code 2016 has just come to an end. Thanks to our host organisations Open Genome Informatics and Open Bioinformatics Foundation, we’ve had a productive summer with two excellent students. Both students wrote blog posts summarizing their work.

As the finishing touches are implemented, we look forward to being able to deploy the work of these students into production releases of SequenceServer and Bionode.

September 1, 2016



Blast Visualization Google Summer of Code

Written by Hiten Chowdhary, cross-posted from http://www.hiten.io/blog/articles/gsoc-16/

This post is about my GSoC 2016 project under the Open Genome Informatics organisation, with Anurag Priyam and Yannick Wurm as my mentors.

About Project

Problem statement: BLAST visualizations library for BioRuby and SequenceServer. Brief explanation: It is now trivial to generate large amounts of DNA sequence data; the challenge lies in making sense of the data. In many cases, researchers can gain important insights by comparing newly obtained sequences to previously known sequences using BLAST (>100,000 citations). This project will focus on creating a reusable visualizations library for BLAST results and integrating it with SequenceServer, a popular BLAST server. BLAST results are text based and lack rich visual representation. Having visualizations can greatly facilitate interpretation of the data.

Warming Up

Before the project started I was fairly well acquainted with Ruby and JavaScript, so I began with small bug fixes to get to know the SequenceServer code. SequenceServer supports downloading the generated report in XML or TSV format. When one clicked the download button, it would generate the file and store it as a tmp file until the download completed. But this process was repeated every time the download button was clicked, so we decided to keep the tmp files generated in the two formats: if the user needs a report again, the download starts directly instead of the file being regenerated. I also worked on some error handling issues to get comfortable with the Ruby part of the project. I helped improve the XML parsing of files, adding checks for integrity issues and for cases where the specific report a user was looking for could not be found. I added checks in various places and raised appropriate error messages to help the user figure out what was going wrong.
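The idea of keeping generated reports around can be sketched in a few lines. This is only an illustration in Python, not SequenceServer's Ruby code: the cache directory, the `render_report` stand-in and the `report_path` helper are all hypothetical names chosen for the example.

```python
import tempfile
from pathlib import Path

# Hypothetical cache location; SequenceServer's real layout may differ.
CACHE_DIR = Path(tempfile.gettempdir()) / "report_cache_sketch"


def render_report(job_id: str, fmt: str) -> str:
    """Stand-in for the expensive step that builds the XML or TSV report."""
    return f"report for {job_id} in {fmt} format\n"


def report_path(job_id: str, fmt: str, render=render_report) -> Path:
    """Return the report file, generating it only on the first request."""
    if fmt not in ("xml", "tsv"):
        raise ValueError(f"unsupported format: {fmt}")
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    path = CACHE_DIR / f"{job_id}.{fmt}"
    if not path.exists():
        # Only reached the first time this job/format pair is requested.
        path.write_text(render(job_id, fmt))
    return path
```

A second call with the same job and format returns the existing file without calling `render` again, which is the behaviour described above.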

Visualizations

By this time I had played around with the SequenceServer code enough to know how it worked, and it was time to get down to the real part of the project. I started with the Length Distribution graph. It is a simple histogram representing the frequency of hit sequence lengths. The rectangles are colored using a grey scale, where the darker the shade, the more significant the hit. When one hovers over the rectangles, the graph shows the user the lengths of the hit sequences and the length of the query sequence. It will also help users with annotation, identifying proteins across species, validating gene predictions, etc. The graph is drawn using d3.js.
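The binning behind such a histogram might look roughly like the sketch below. It is written in Python for illustration and is not the d3.js code from the project; the `(length, evalue)` input shape, the `bin_width` default and the function name are assumptions made for the example.

```python
from collections import defaultdict


def length_histogram(hits, bin_width=50):
    """Group hits into length bins.

    `hits` is a list of (length, evalue) pairs. The result maps the start
    of each bin to the e-values of the hits in it, most significant
    (smallest e-value) first, so each rectangle can be shaded accordingly.
    """
    bins = defaultdict(list)
    for length, evalue in hits:
        bin_start = (length // bin_width) * bin_width
        bins[bin_start].append(evalue)
    return {start: sorted(evalues) for start, evalues in sorted(bins.items())}


# Made-up hits: two fall in the 100-149 bin, one in the 400-449 bin.
example = length_histogram([(120, 1e-30), (130, 1e-5), (410, 0.01)])
```

The number of e-values per bin gives the height of the bar, and their order gives the stacking from darkest to lightest shade.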

[Figure: Length Distribution graph, default and hover views]

Next I started on the Circos visualization. SequenceServer currently has Query against Hit, which shows the alignment mapping between a hit sequence and the query sequence, and Alignment Overview, which shows the alignment mapping between a query sequence and all its hit sequences. The Circos visualization adds alignment mapping of multiple query and hit sequences to its arsenal. It is a simple Circos-based graph in which chords represent query and hit sequences and ribbons represent the alignment mappings. The chords representing query sequences are green and those representing hit sequences are blue. The ribbons are colored using a red-yellow-blue color scheme, with red representing the most significant hit for a query and blue the least significant. One can hover over a ribbon to view its details, such as the region this specific alignment covers on the query or hit sequence and the probability that this match occurred by chance. One can click on a chord to view its alignments. This was drawn using CircosJs.

[Figure: Circos visualization, default, selected and hover views]

Later I refined the graphs that SequenceServer already provided. Now that we had four different visualizations sharing a lot of common code, we decided to make the code modular, so that it is cleaner, adding new visualizations in the future is much easier, and the current ones are easier to change. In Query against Hit the polygons are now labelled alphabetically to make it easier for the user to understand which polygon corresponds to which alignment details provided below the graph.

[Figure: Query against Hit]

For Alignment Overview I refactored the code to use ES6 modules, which are used by all the other visualizations too. I reduced the height of each graph so that the user can see at a glance what options are provided and then proceed accordingly. Users can download the graphs in SVG or PNG format.

[Figure: Alignment Overview, default and hover views]

My initial proposal was to add four new visualizations, but after detailed discussion with my mentors we decided that, given the level of detail each visualization requires, we should limit ourselves to two.

Here is my list of commits

August 23, 2016



Google Summer of Code 2016

Congratulations to our 2016 Google Summer of Code students! We are proud & excited to host them:

April 25, 2016



Sequenceserver manuscript

Happy to announce that we now have a manuscript describing the rationale and current features of SequenceServer - our easy-to-set-up BLAST frontend. Importantly, the manuscript also provides extensive detail about the sustainable software development and user-centric design approaches we used to build this software. The full bioRxiv reference is:

Sequenceserver: a modern graphical user interface for custom BLAST databases 2015. Priyam, Woodcroft, Rai, Munagala, Moghul, Ter, Gibbins, Moon, Leonard, Rumpf and Wurm. bioRxiv doi: 10.1101/033142 [PDF].

Be sure to check out the interactive figure giving a guided tour of Sequenceserver’s BLAST results.

Finally, I’ll note that Sequenceserver arose from our own needs; these are clearly shared by many as Sequenceserver has already been cited in ≥20 publications and has been downloaded ≥30,000 times! Thanks to all community members who have made this tool successful.

February 1, 2016



Avoid having to retract your genomics analysis.

Please cite the Winnower version of this article

2016 Edit: slides from a 15-minute talk representative of this blog post, given at Popgroup in December 2015 in Edinburgh:

Biology is a data science

The dramatic plunge in DNA sequencing costs means that a single MSc or PhD student can now generate data that would have cost $15,000,000 only ten years ago. We are thus leaping from lab-notebook-scale science to research that requires extensive programming, statistics and high performance computing.

This is exciting & empowering – in particular for small teams working on emerging model organisms that lacked genomic resources. But with great powers come great responsibilities… and risks of doing things wrong. These risks are far greater for genome biologists than for, say, physicists or astronomers, who have strong traditions of working with large datasets. In particular:

  • biologists generally learn data handling skills ad hoc, with little knowledge of best practices;
  • PIs – having never themselves handled huge datasets – have difficulties critically evaluating the data and approaches;
  • new data are often messy with no standard analysis approach; even so-called “standard” analysis methodologies generally remain young or approximate;
  • analyses intending to identify biologically interesting patterns (e.g., genome scans for positive selection, GO/gene set enrichment analyses) will enrich for technical artifacts and underlying biases in the data;
  • data generation protocols are immature & include hidden biases leading to confounding factors (when things you are comparing differ not only according to the trait of interest but also in how they were prepared) or pseudoreplication (when one independent measurement is considered as multiple measurements).

Invisible mistakes can be costly

Crucially, data analysis problems can be invisible: the analysis runs, the results seem biologically meaningful, and are wonderfully interpretable but they may in fact be completely wrong.

Geoffrey Chang’s story is an emblematic example. By the mid-2000s this young superstar professor crystallographer had won prestigious awards and published high-profile papers providing 3D-structures of important proteins. For example:

  • Science (2001) Chang & Roth. Structure of MsbA from E. coli: a homolog of the multidrug resistance ATP binding cassette (ABC) transporters.
  • Journal of Molecular Biology (2003) Chang. Structure of MsbA from Vibrio cholera: a multidrug resistance ABC transporter homolog in a closed conformation.
  • Science (2005) Reyes & Chang. Structure of the ABC transporter MsbA in complex with ADP vanadate and lipopolysaccharide.
  • Science (2005) Pornillos et al. X-ray structure of the EmrE multidrug transporter in complex with a substrate. 310:1950-1953.
  • PNAS (2004) Ma & Chang. Structure of the multidrug resistance efflux transporter EmrE from Escherichia coli.

But in 2006, others independently obtained the 3D structure of an ortholog to one of those proteins. Surprisingly, the orthologous structure was essentially a mirror-image of Geoffrey Chang’s result.

Rigorously double-checking his scripts, Geoffrey Chang then realized that: “an in-house data reduction program introduced a change in sign [...]”.

In other words, a simple +/- error led to plausible and highly publishable but dramatically flawed results. He retracted all five papers.
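To see why such an error can stay invisible, consider a toy illustration. It is not Chang's data reduction program, which is not shown here; the four points and the function names are made up for the example. Flipping the sign of one coordinate leaves every pairwise distance unchanged, so many internal consistency checks would still pass, yet the handedness of the structure is inverted.

```python
import math


def pairwise_distances(points):
    """All pairwise Euclidean distances, in a fixed order."""
    return [math.dist(p, q)
            for i, p in enumerate(points)
            for q in points[i + 1:]]


def signed_volume(a, b, c, d):
    """Six times the signed volume of tetrahedron abcd.

    The sign encodes handedness: it flips under a mirror reflection.
    """
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    w = [d[i] - a[i] for i in range(3)]
    return (u[0] * (v[1] * w[2] - v[2] * w[1])
            - u[1] * (v[0] * w[2] - v[2] * w[0])
            + u[2] * (v[0] * w[1] - v[1] * w[0]))


# Made-up coordinates for four "atoms".
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
# A single sign error on the z coordinate.
mirrored = [(x, y, -z) for x, y, z in points]

same_distances = pairwise_distances(points) == pairwise_distances(mirrored)
handedness_flipped = signed_volume(*points) == -signed_volume(*mirrored)
```

Both `same_distances` and `handedness_flipped` are true here: a check based only on distances cannot tell the two structures apart, whereas a check that is sensitive to handedness can.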

Devastating for him, for his career, for the people working with him, for the hundreds of scientists who based follow-up analyses and experiments on the flawed 3D structures, and for the taxpayers or foundations funding the research. A small but costly mistake.

Approaches to limit the risks

A +/- sign mistake seems like it should be easily detectable. But how do you ensure that experiments requiring complicated data generation and complex analysis pipelines with interdependencies and sophisticated data structures yield correct results?

We can take inspiration from software developers in internet startups: similarly to academic researchers, they form small teams of qualified people to do great things with new technologies. Their approaches for making software robust can help us make our research robust.

An important premise is that humans make mistakes. Thus (almost) all analysis code includes mistakes (at least initially; this includes unix commands, R, perl/python/ruby/node scripts, et cetera). Increasing robustness of our analyses thus requires becoming better at detecting mistakes – but also ensuring that we make fewer mistakes in the first place. Many approaches exist for this. For example:

  • Every additional chunk of code can contain additional mistakes. Write less code, you’ll make fewer mistakes. For this we should try to reuse our own code and that of others (e.g., by using bio* libraries).
  • Every subset/function/method of every piece of code should be tested on fake data (edge cases) to ensure that results are as expected (see unit and integration testing). It can be defensible to write the fake datasets and tests even before writing analysis code.
  • Continuous integration involves tests being automatically rerun (almost) instantly whenever a change is made anywhere in the analysis. This helps detect errors rapidly before performing full analyses.
  • Style guides define formatting and variable naming conventions (e.g., for ruby or R). Respecting one makes it easier for you to go back over your analysis two years later (e.g., for paper revisions or a new collaboration); and for others to reuse and improve it. Tools can automatically test whether your code is in line with the style guide (e.g., RLint, Rubocop, PyLint).
  • Rigorously tracking data and software versions and sticking to them reduces risks of unseen incompatibilities or inconsistencies. A standardized project structure can help.
  • Code reviews: having others look through your code – by showing it to them in person, or by making it open source – helps to learn how to improve code structure, to detect mistakes and to ensure that our code will be reusable by ourselves and others.
  • There are specialists who have years of experience in preventing and detecting mistakes in code or analyses. We should hire them.
  • Having people independently reproduce analyses using independent laboratory and computational techniques on independently obtained samples might be the best validation overall…
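As a concrete example of testing on fake data, here is a minimal sketch in Python. The `gc_content` function and its handling of ambiguous bases are assumptions chosen for illustration, not code from any particular pipeline; the point is that the edge cases (lowercase input, ambiguous bases, empty sequences) are written down as explicit expectations.

```python
def gc_content(sequence: str) -> float:
    """Fraction of G/C among unambiguous bases (A, C, G, T), ignoring case."""
    counted = [base for base in sequence.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("no unambiguous bases in sequence")
    return sum(base in "GC" for base in counted) / len(counted)


def test_gc_content():
    """Fake data covering the ordinary case and several edge cases."""
    assert gc_content("GGCC") == 1.0
    assert gc_content("ATAT") == 0.0
    assert gc_content("acgt") == 0.5      # lowercase input
    assert gc_content("ACGTNN") == 0.5    # ambiguous bases are ignored
    for bad in ("", "NNNN"):              # nothing to count: must fail loudly
        try:
            gc_content(bad)
        except ValueError:
            pass
        else:
            raise AssertionError(f"expected ValueError for {bad!r}")


test_gc_content()
```

A test runner such as pytest would pick up `test_gc_content` automatically, and a continuous integration service could rerun it on every change.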

This list overlaps at least in part with what has been written elsewhere and with my coursework material. In my lab we do our best to follow best practices for the bioinformatics tools we develop and our research on social evolution.
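Tracking data versions, one of the points in the list above, can start very simply. The sketch below (Python; the manifest format and function names are assumptions for illustration) records a checksum for each input file and later reports any file whose content has changed or gone missing.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """SHA-256 hex digest of a file, read in chunks to cope with large files."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(manifest):
    """Return the paths in `manifest` that are missing or differ from the record.

    `manifest` maps a file path to the hex digest recorded when the
    analysis was first run.
    """
    return sorted(
        path for path, expected in manifest.items()
        if not Path(path).is_file() or sha256_of(path) != expected
    )
```

Running `changed_files` at the start of an analysis gives an early warning when an input file has been silently replaced or edited.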

Additionally, the essentials of experimental design are long established: ensuring sufficient power, avoiding confounding factors & pseudoreplication (see above & elsewhere), and using appropriate statistics. Particular caution should be used with new technologies as they include sources of bias that may not be immediately obvious (e.g. Illumina lane, extraction order…).
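A simple check for one kind of confounding can be automated before any sequencing is done. The sketch below (Python; the sample sheet layout, the column names `condition` and `lane`, and the function name are assumptions for the example) lists batches that contain samples from only one condition.

```python
from collections import defaultdict


def single_condition_batches(samples, factor="condition", batch="lane"):
    """Return the batches that contain samples from only one level of `factor`."""
    seen = defaultdict(set)
    for sample in samples:
        seen[sample[batch]].add(sample[factor])
    return sorted(b for b, levels in seen.items() if len(levels) == 1)


# Made-up sample sheets: in the first, condition and lane are fully confounded.
confounded = [
    {"id": "s1", "condition": "control", "lane": "lane1"},
    {"id": "s2", "condition": "control", "lane": "lane1"},
    {"id": "s3", "condition": "treated", "lane": "lane2"},
    {"id": "s4", "condition": "treated", "lane": "lane2"},
]
balanced = [
    {"id": "s1", "condition": "control", "lane": "lane1"},
    {"id": "s2", "condition": "treated", "lane": "lane1"},
    {"id": "s3", "condition": "control", "lane": "lane2"},
    {"id": "s4", "condition": "treated", "lane": "lane2"},
]
```

Here `single_condition_batches(confounded)` flags both lanes, while the balanced design returns an empty list. When every batch holds a single condition, any difference between conditions cannot be separated from a difference between batches.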

There is hope

There is no way around it: analysing large datasets is hard.

When genomics projects involved tens of millions of $, much of this went to teams of dedicated data scientists, statisticians and bioinformaticians who could ensure data quality and analysis rigor. As sequencing has gotten cheaper the challenges and costs have shifted even further towards data analysis. For large scale human resequencing projects this is well understood. Despite the challenges being even greater for organisms with only few genomic resources, surprisingly many PIs, researchers and funders focusing on such organisms suppose that individual researchers with little formal training will be able to perform all necessary analysis. This is worrying and suggests that important stakeholders who still have limited experience of large datasets underestimate how easily mistakes with major negative consequences occur and go undetected. We may have to see additional publication retractions for awareness of the risks to fully take hold.

Thankfully, multiple initiatives are improving visibility of the data challenges we face (e.g., 1, 2, 3, 4, 5, 6). Such visibility of the risks – and of how easy it is to implement practices that will improve research robustness – needs to grow among funders, researchers, PIs, journal editors and reviewers. This will ultimately bring more people to do better, more trustworthy science that will never need to be retracted.

Acknowledgements

This post came together thanks to the SSI Collaborations workshop, Bosco K Ho’s post on Geoffrey Chang, discussions in my lab and through interactions with colleagues at the social insect genomics conference and the NESCent Genome Curation group. YW is funded by the Biotechnology and Biological Sciences Research Council [BB/K004204/1], the Natural Environment Research Council [NE/L00626X/1, EOS Cloud] and is a fellow of the Software Sustainability Institute.

Please cite The Winnower version of this article

June 2, 2015



Recruiting genome hacker/bioinformatician.

Our department is recruiting a bioinformatician/genomicist. Apply by March 29th.

Bioinformatician animation ad

March 16, 2015



Scientific writing

Scientific writing isn’t poetry. Our aim is to communicate information clearly - not to be the next Proust.

Below are some thoughts about scientific writing that have helped students & myself. The list isn’t exhaustive:

Structure & Content:

  • In scientific writing your introduction should begin with the general context of the topic and end by “announcing” the structure of what will come next.
  • In scientific writing you want one idea/point/aim per paragraph.
  • Within each paragraph, use a simple and clear structure. For example: Four lines of evidence suggest that bblueblablamyaimis. First, blablabla. Second, blablablablabla. Third, blablablbalbala. Finally, blabalbalbalbalbla.
  • If consecutive sentences are logically related and can have similar structure - use similar structure. No need.
  • In any writing you need to guide the reader through a clear thought process.
  • No matter how interesting an anecdote/definition/fact may be, take care not to stray far from the aim of the point you’re making/topic of the essay.
  • If you’re unsure how to structure your article/essay… download some that others have already written!

Implementation:

  • Please respect the style guidelines given by Strunk & White’s “The Elements of Style”. Keep it concise.
  • Pay attention to detail (punctuation, spelling, consistency of formatting, species in italics…). Neglect of such simple things makes the reader think you do nothing properly.
  • Don’t Randomly capitalize Words that Aren’t normally Capitalized.
  • Most word processors have spell-checkers. And grammar-checkers. Use them. (Make sure that all rules are activated - e.g. in MS Word you need to manually activate the strict grammar rules). Neglecting to use such tools can feel insulting to the reader.
  • KISS: Keep It Simple Stupid. If a word, sentence or piece of information is unnecessary, remove it! Otherwise, you complicate things for the reader.
  • Again, use simple words, simple clear sentences. Scientific writing isn’t Shakespeare or Voltaire.
  • If you decide to use funky formatting (columns, photos, sophisticated fonts, etc.) - don’t mess it up! Better to be plain and functional (Helvetica, single column…) than have something that prints incorrectly or is illegible.
  • When handing something in electronically, do it as a PDF (appearance shouldn’t change between computers…).

Iterating:

  • When you think you’re done, print it out & let it sit for 36 hours. Then read it very critically as if it were the work of someone else. Consider each word, each sentence, each paragraph, the overall structure. What is the aim of each element? Is it working to its full effect? If you identify issues, fix them. Repeat until you think it’s perfect. Now ask your roommate/parent/partner/friend to read it critically.
  • You may disagree with the exact feedback/suggestion someone gives you. But the fact they highlighted/raised an issue is sufficient to indicate that you need to change something. Do something about it.
  • Any writing requires writing, editing and rewriting many times before obtaining a satisfactory result (the need for this reduces as you get older/more experienced).

New Scientist-type articles have additional challenges:

  • For example articles, check the New Scientist website. You can only understand how such articles should be written if you critically read a bunch.
  • Who is the audience for such an article?
  • Keep in mind that you’ll want an appropriate catchy title.
  • Use an appropriate writing style (this may be less formal than normal scientific writing).
  • How do you make your article attractive? Interesting to read?

Additional resources:

February 5, 2015

