Ants, Bees, Genomes & Evolution

@ Queen Mary University London

IUSSI conference talk: Better analyses for social insect genomics

October 9, 2018

Social insect biology is now a data science!

I (Yannick) spent the week of August 5th at the 18th Congress of the International Society for the Study of Social Insects in Guarujá, Brazil. This is a big quadrennial conference uniting researchers from around the world who study ants, bees, wasps, termites and a few other animals.

Part of my trip was funded by the Software Sustainability Institute, which lobbies for and helps people do better research through improving software. Hence this blog post.

The study of social insects has traditionally used approaches including behavioral observation and taxonomic sampling, with genetic analyses becoming more common since the mid 2000s. A pleasant surprise at the conference was the recent increase in highly molecular, genome-wide approaches where whole or partial genomes or transcriptome sequences of many individuals are obtained in order to make specific comparisons within species, or sometimes also between species.

This disruptive shift is largely due to the 50,000-fold drop in DNA sequencing costs over the past 10 years. See Émeline’s recent review on the genes and processes underpinning evolution of social behavior in ants.

With great power comes great responsibility.

A major challenge for small research labs now wielding large genomic datasets is that it is easy to make a small mistake that has high costs.

In light of this, as part of a workshop on genomics approaches organised with Tim Linksvayer and Alex Mikheyev, I gave an overview of some of the lessons we can transfer from the worlds of “other” data sciences to our expanding world of social insect genomics. This includes:

  • writing analysis code for humans;
  • respecting style guides for code (e.g., R style guide), and for how to structure a genomic analysis;
  • benefits of peer-reviewing code, and of peer-coding sessions;
  • using specific tools that increase productivity while decreasing risks (rmarkdown, fat machines, snakemake/nextflow);
  • benefits of visualising data in many different manners. Typically when people learn to do basic linear models they learn the importance of visually inspecting some plots (e.g. qqplot, residuals). But when we end up performing tens of thousands of such analyses (e.g. one for each gene or one for each SNP), many forgo doing this.

My slides are here:

It is worth highlighting three additional, important points raised during the congress that have more to do with interpretation, vocabulary and experimental design than anything technical:

  1. There is an occasional misconception (or mislabeling) that extant species may be representative of species that lived in the past. No: just as much time has passed between the most recent common ancestor of all ants and present-day Pheidole pallidula ants as between that ancestor and any present-day Harpegnathos saltator. Similarly, no current species of great ape is “more similar” to any ancestor of humans - all are equally similar to their shared common ancestor.
  2. The definition of eusocial has become too fuzzy to be useful. Superorganismality is a much more precise and relevant concept that clearly identifies irreversible evolutionary transitions from context‐dependent reproductive altruism to unconditional differentiation of permanently unmated castes. See also Koos’ paper Superorganismality and caste differentiation as points of no return: how the major evolutionary transitions were lost in translation.
  3. Comparisons (e.g. of genome content) between two species are often confounded by many differences other than the first two that come to mind (ecology, lifespan, environment, demographic history etc…).

A fun and highly stimulating conference.

Project structures for genomics analyses

October 1, 2018

How do you structure your files and folders for genomics analyses?

One challenge is that many analyses actually require multiple steps, so having all steps in one place becomes a mess.

So we should structure our analyses across multiple folders. But how should we name them and keep track of their order?

Another (key) challenge in performing genomics analyses is that we often have to perform analyses multiple times.

  • we need to try three different approaches because we don’t know which will perform best;
  • or we want to try a new version of the analysis software;
  • or we want to start with a small “test” dataset before scaling up to the full data;
  • or we want to redo everything on a completely different dataset;
  • or a reviewer asks for a minor adjustment in analysis or an additional plot on the data we analyzed months/years ago.

So how do we keep track of the different steps and versions of analyses?

The standard approach we use for all projects in the lab is derived from ideas initially proposed by William Noble in A Quick Guide to Organizing Computational Biology Projects. That initial model has been adjusted based on our experience of dozens of projects over the years, as well as discussions with Julien Roux, Anurag Priyam, and Roddy Pracana.

Stable link here.

Best to just illustrate with an example of how this works at the simplest level.


├── input
│   ├── 2016-04-14-bombus_raw_28_samples
│   │   ├── sample1.fq    #  could link to /data/SBCS-WurmLab/archive/db/genomic/reads/...                 
│   │   ├── sample2.fq 
│   │   ├── sample3.fq
│   │   ├── bombus_genome.fa -> ~/db/genomic/B_terrestris/Bter20110317-genome.fa
│   │   └──  # list of ln -s, cp or wget/curl commands 
│   └── 2016-04-16-cleaned_reads
│       ├── sample1.fq.gz   -> ../../results/2016-04-14-read_cleaning/results/sample1.clean.fq.gz
│       ├── sample2.fq.gz   -> ../../results/2016-04-14-read_cleaning/results/sample2.clean.fq.gz
│       ├── sample3.fq.gz   -> ../../results/2016-04-14-read_cleaning/results/sample3.clean.fq.gz
│       └──  # just the ln -s commands.
├── results
│   ├── 2016-04-14-read_cleaning
│   │   ├── input        -> ../../input/2016-04-14-bombus_raw_28_samples
│   │   ├── results                                # only few files here
│   │   ├── sratoolkit   -> ../../soft/sratoolkit-2.4.2/bin/
│   │   ├── tmp                                    # use real scratch dir if more appropriate
│   │   ├──                         # if any particular software, modules or containers need to be loaded
│   │   └── WHATIDID.txt                           # or equivalent .sh or .Rmd (or knitr/jupyter)
│   ├── 2016-04-16-mapping_to_reference
│   │   ├── input        -> ../../input/2016-04-16-cleaned_reads
│   │   ├── results                                # only few files here
│   │   ├── tmp                                    # use real scratch dir if more appropriate
│   │   ├──                         # if any particular software, modules or containers need to be loaded
│   │   └── WHATIDID.txt                           # or equivalent .sh or .Rmd (or knitr/jupyter)
│   └── WHATIDID.txt                               # for overall rationale
└── soft
    ├── sratoolkit-2.4.2                           # if installed locally
    ├── bwa              -> /share/apps/sbcs/bwa/0.6.2/bin/bwa
    └── # links to other software if needed
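
The layout above can be set up with a handful of shell commands. The following is a minimal sketch, assuming a POSIX shell; the temporary project root and the empty stand-in files are placeholders for illustration, whereas the directory names are taken from the example tree.

```shell
#!/bin/sh
set -eu

# Placeholder project root: a temporary directory keeps the sketch self-contained.
project=$(mktemp -d)
cd "$project"

# Step 1: raw input, plus a results directory that links back to it.
mkdir -p input/2016-04-14-bombus_raw_28_samples
mkdir -p results/2016-04-14-read_cleaning/results results/2016-04-14-read_cleaning/tmp
: > input/2016-04-14-bombus_raw_28_samples/sample1.fq    # empty stand-in for real reads
ln -s ../../input/2016-04-14-bombus_raw_28_samples results/2016-04-14-read_cleaning/input
: > results/2016-04-14-read_cleaning/WHATIDID.txt

# Pretend the cleaning step ran and produced one output file.
: > results/2016-04-14-read_cleaning/results/sample1.clean.fq.gz

# Step 2: the output of step 1 becomes a new, separately named input.
mkdir -p input/2016-04-16-cleaned_reads
ln -s ../../results/2016-04-14-read_cleaning/results/sample1.clean.fq.gz \
      input/2016-04-16-cleaned_reads/sample1.fq.gz
mkdir -p results/2016-04-16-mapping_to_reference/results results/2016-04-16-mapping_to_reference/tmp
ln -s ../../input/2016-04-16-cleaned_reads results/2016-04-16-mapping_to_reference/input
: > results/2016-04-16-mapping_to_reference/WHATIDID.txt

# Overall rationale for the whole analysis.
: > results/WHATIDID.txt
```

Because all links inside the project are relative, the whole project folder can be moved or copied without breaking them.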

Explicit (partial) conventions

Conventions include:

  • key directory names begin with a YYYY-MM-DD date, followed by an underscore_delimited description. For example, a new project starting today should begin as follows: 2018-10-10-a_self_explanatory_name;
  • all subdirectory names should be self-explanatory;
  • link to files when appropriate. This can save tons of space AND reduce ambiguity/risks;
  • every results dir should contain a link named input to an input directory with a self-explanatory name;
  • every directory in which you did something should contain a WHATIDID.txt (or an equivalent ruby/perl/jupyter/R/knitR/Sweave/Rmarkdown script) that contains all relevant commands required to get from input to results;
  • once you have created an “input” (i.e. “data”) folder, make it read-only because you don’t want any accidental edits while you are running your analysis.
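
Two of these conventions (date-prefixed names and read-only inputs) can be sketched as follows, assuming a POSIX shell. The temporary project root and the placeholder file are illustrative only.

```shell
#!/bin/sh
set -eu

project=$(mktemp -d)    # placeholder project root
cd "$project"

# Date-prefixed, underscore-delimited name, e.g. 2018-10-10-a_self_explanatory_name
new_dir="$(date +%Y-%m-%d)-a_self_explanatory_name"
mkdir -p "input/$new_dir"
: > "input/$new_dir/sample1.fq"    # empty stand-in for real data

# Once the input folder is populated, remove write permission for everyone.
chmod -R a-w "input/$new_dir"
ls -ld "input/$new_dir"
```

If the input ever needs to change, write permission can be restored deliberately with chmod -R u+w, which makes such edits a conscious decision rather than an accident.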

Open PhD studentship: Data science & machine learning for genomic analysis

July 5, 2018

Interested in supercharging the productivity of genome biologist researchers?

We have an exciting 4-year bioinformatics PhD position open through the London BBSRC LiDO Doctoral Training Programme.

Apply by 5pm July 20th here at LiDO to start in September.

A description of the project is below. It is highly interdisciplinary - no need to already be able to understand all the details today.

Great candidates fulfill 3 of the following four criteria: smart, hard working, understands genomes, and not scared of data analysis or coding.

If you have any questions regarding scope or nature of the project, or whether your skills are potentially sufficient, please don’t hesitate to get in touch with me (Yannick).

(Standard UKRI eligibility criteria apply, i.e. I think one must be UK resident; the LiDO people can explain this better.)

Project summary

(apologies for the use of domain-specific jargon!)

Inferring gene function for emerging model organisms

The first generation of molecular-genetic research focused on traditional model organisms including mouse, yeast, zebrafish, Drosophila, and C. elegans. Genetic research increasingly uses diverse organisms that are much more relevant models for specific questions. For example, some such emerging organisms exhibit unique phenotypes, including 100-fold intra-specific variation in lifespan or resistance to harsh environmental conditions; others represent novel animal models for disease or development, provide crucial ecosystem services, or are key to food security because they are crops or may pollinate them.

A major challenge when working with such “emerging” model organisms is making sense of the “gene lists” that result from genome-wide analyses (e.g., of gene expression or genome-wide associations).

Here, we will develop a bioinformatics tool that takes a list of genes or genomic locations from a new species as input, and transparently produces relevant functional information describing this list of loci. When presented with data for which no direct information exists, the tool will in the first instance identify relationships of orthology to regions of other species. This will create a trail of links to databases in which functional information for orthologous regions does exist. These databases will be interrogated following a hierarchical set of rules (initially defined based on human-curated examples). Using cutting-edge “learning to rank” machine learning techniques, the rulesets will be refined over time by tracking user behaviour (based on logs of which relationships/trails users retain) as well as explicitly allowing users to flag issues. The tool thereby makes it possible to extract significant value from large-scale datasets that would otherwise require laborious case-by-case engineering efforts to connect. Summary data will be returned to the user using visualisations, statistics and tables in a manner that facilitates interpretation. Inferences and relationship calculations taking seconds will be available immediately; those taking minutes (e.g., distant orthology) will appear asynchronously as they complete; and those taking longer will result in email notification.

We will package our work in a manner that makes it accessible to biologists working with new or existing genomes. This builds on our extensive success, including with the SequenceServer and OMA software. Overall, our approach will substantially improve the ability of genome biologists to generate meaningful biological insight when working with new organisms.

This project is in collaboration with Christophe Dessimoz at UCL/Lausanne.

Posted to bioRxiv: Degenerative Expansion of a Young Supergene

May 23, 2018

We have just posted a new manuscript to bioRxiv, where we describe the structural differences between the SB and Sb versions of the fire ant social chromosome pair.

We find that Sb is larger than SB and discuss how the suppression of recombination of Sb would lead to this type of ‘degenerative expansion’, as hypothesised for Y chromosomes and other non-recombining chromosomal regions. Read the manuscript, and tell us what you think!

The Wurmlab at the Tower Hamlets Festival of Communities

May 18, 2018

Fantastic Minibeasts and where to find them

This year the Wurm Lab were out in force at the annual Festival of Communities held in Stepney Green Park on Saturday 12 May. We brought some of our lab colonies of social invertebrates to demonstrate differences in social complexity including: Messor barbarus ants, Bombus terrestris bumblebees, and Stegodyphus dumicola social spiders. We had a fantastic time talking to the public about our research, taking groups around the park on minibeast safaris, and running a kids’ craft table where solitary bee hotels made from recycled drinks cartons were beautifully decorated. It was an immensely enjoyable day meeting people from the local community and some very enthusiastic and inspiring future scientists.

Keeping up with reading newly published articles

February 20, 2018

What is it? We call it the Index: it is our monthly reading review, tailored to our areas of interest and research.

Why do you do it? Because there are so many articles being published it can be quite difficult to keep up with reading new material. We want to increase our general knowledge of important questions, techniques and discoveries in our wider research areas while decreasing the likelihood of missing any relevant “key papers”. We are also keen to efficiently help each other out by sharing our readings. Finally, we are getting to read more broadly - including topics outside our comfort zone.

How does it work? Each month, each Wurmlab member gets a list of three journals to review from our own journal generator. The script is on Github here. Three articles that are relevant to the group or a particular lab member are picked from each journal’s table of contents. These articles are added to a document that anyone in the lab group can read at any time.
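
The assignment step can be sketched in a few lines of shell. This is not the lab's actual script; the member names, the journal names and the shuffle-then-deal approach are all hypothetical, for illustration only.

```shell
#!/bin/sh
set -eu
export LC_ALL=C

# Hypothetical member and journal lists (the real lists are not given here).
members="member_a member_b"
journals=$(printf '%s\n' "Journal 1" "Journal 2" "Journal 3" \
                         "Journal 4" "Journal 5" "Journal 6")

# Shuffle once: prefix each journal with a random key, sort on it, drop the key.
shuffled=$(printf '%s\n' "$journals" |
  awk 'BEGIN { srand() } { printf "%f\t%s\n", rand(), $0 }' |
  sort -n | cut -f2-)

# Deal three journals to each member in turn, so no journal is assigned twice.
assignments=$(
  i=0
  for member in $members; do
    start=$((i * 3 + 1))
    end=$((i * 3 + 3))
    printf '%s\n' "$shuffled" | sed -n "${start},${end}p" | sed "s/^/$member: /"
    i=$((i + 1))
  done
)
printf '%s\n' "$assignments"
```

Dealing from a single shuffled list, rather than sampling separately for each person, is one simple way to spread the journals across the group without overlap in a given month.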

Great, where can I see an example? See the last instalment of the Index here.

Brazil! IUSSI symposium on the evolution of social organization

February 15, 2018

Join us in Guarujá!

We (Emeline, Carlos & Yannick) are excited to host a symposium on the evolution of social organisation at the upcoming IUSSI conference.

We welcome a diversity of approaches and study systems. If you’re unsure about the relevance of your work, don’t hesitate to get in touch.

Full symposium title and abstract below:

Evolution of social organization

How an insect society is organized varies tremendously between but also within species and populations. Such diversity includes variation in numbers of reproductive individuals, modes of reproduction and of dispersal, relationships with neighboring colonies, degrees of morphological and behavioral caste specialization, and interactions (mutualistic, parasitic, predatory…) with closely or distantly related species.

Understanding how and when changes in social lifestyle occur is central to the study of social evolution. More specifically, can we measure the evolutionary pressures involved in changes of social organisation? Are particular ecological conditions involved? Can molecular, genetic or physiological features constrain or facilitate social evolution? What are the effects of a change in social interactions on how natural selection can act?

Encompassing the complexities of such multifaceted topics requires interdisciplinary discussion. This symposium will thus include both theoretical and empirical research addressing the topic from a variety of scales and angles.

Easy mistake comparing numbers in R

January 3, 2018

It shouldn’t really be necessary to share this. But it keeps popping up (in particular when students are learning to program in R).

It is nonsensical to compare text to numbers. But R will let you. For example:

> "a" > 1
[1] TRUE
> "A" > 1
[1] TRUE
> "bob" > 99
[1] TRUE

I suspect that this is a remnant of the desire to be able to compare letters (e.g., for sorting ASCII characters alphabetically “A” < “D”).

Problem comparing numbers and “text as numbers” in R

When new to programming in R, you might ask the user to input a number. For example with

input_number <- readline(prompt="Enter a number: ")

This comes into R as text, not as a number. Unless you run as.numeric() or strtoi() on the input, it will remain text. Comparisons are then equivalent to the following

> 1 == "10"
[1] FALSE
> 10 == "10"
[1] TRUE
> 100 == "10"
[1] FALSE
> 11 < "10"
[1] FALSE
> 1 < "10"
[1] TRUE
> 9 < "10"   # <- uh-oh
[1] FALSE
> 3 < "10"
[1] FALSE

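Note that some of these comparisons yield the result that you would expect if only comparing numbers. But some of them (e.g., the last 2) give you an incorrect response, because R converts the number to text and then compares the two pieces of text character by character, so that "9" sorts after "10". This is dangerous.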

R does not show an error message; it acts as if everything is ok, and gives you a response that looks reasonable (i.e., TRUE or FALSE). It would thus be easy for this type of mistake to go undetected.

How can this type of number comparison problem in R be prevented or detected?

  • Always check/force the type. If you run as.numeric() or strtoi() on text that is not unambiguously a number, R will return NA (as.numeric() also shows a warning).
  • Use a testing framework such as testthat to ensure that even the small pieces of code you write behave as you would expect.

Brief end of year update - publications and presentations

December 27, 2017

Despite the lack of updates, things have been happening!

Analyzing the genomes of all the ants

After our discovery of extremely low genetic diversity in the Sb variant of the fire ant social chromosome, we asked what is going on with Gp-9? This is the original odorant binding protein that Ken Ross found - using starch gel electrophoresis - to be fully associated with single vs multiple-queen colonies.

The patterns described back then still hold, but surprisingly we find that nine additional Gp-9-like odorant binding protein genes are in the social chromosome region of suppressed recombination. The descriptions of all 23 fire ant Odorant Binding Proteins, their expression profiles, and the differences between social chromosome alleles are detailed in Evolution Letters:

In a different paper, we review findings from the most recent analyses of the 23 ant genomes and 38 ant transcriptomes.

Conference presentations

Lab members had the opportunity to present on many occasions, including:

  • Roddy Pracana presenting at the London Epigenetics Club meeting on Gene-Environment interactions in non-traditional model organisms.
  • Recently finished MSc student Gino Brignoli presenting his Lasius flavus work at the Northwest Europe IUSSI meeting in York
  • Current BBSRC LiDO rotation student Isabel Fletcher showing the impact of pesticide on gene expression in bumblebees at the Northwest Europe IUSSI meeting in York
  • Yannick giving a keynote at the annual meeting of the French section of the society for the study of social insects, as well as presenting at ENS Lyon, the London Next Generation Sequencing Congress and the joint Genome 10k and Genome Sciences conference in Norwich. The latter was elegantly illustrated by Sanger’s Alex Cagan:

Degrees degrees degrees

Finally, we congratulate Gino Brignoli & Abdoulie Kanteh for the high marks achieved for their MSc dissertations, and Roddy Pracana for submitting his PhD:

Performance and user centric updates to Afra's annotation editor

August 29, 2017

Written by Hiten Chowdhary (cross-posted).

This is a summary of the work done during this summer as part of Google Summer of Code 2017 under the organisation Open Genome Informatics under the guidance of my mentors Yannick Wurm and Anurag Priyam.

About the project

Problem Statement: Performance and user centric improvements to Afra’s annotation editor.

Brief explanation: Gene prediction models are visually inspected and manually corrected for any mistakes. Curation of gene models is carried out on Afra, a crowdsourcing platform. Afra has two models - an annotation editor and a task processor. The annotation editor is built using JBrowse and WebApollo. This project focuses on migrating Afra to the latest JBrowse and getting a unit test suite ready to optimize the annotation editor and ease the learning curve of manual curation.

Project Summary

Before the project started, I was fairly acquainted with JavaScript and Ruby. Initially, I got familiar with JBrowse’s codebase and understood the upgrades along with changes made in different versions. We planned to get the test suite ready before migrating Afra, so that the migration could proceed without any hassle: one can run the tests to ensure annotation editing functionality is working properly.

For migration of Afra, we took a look at JBrowse and Afra’s codebase. We examined the files added to Afra’s codebase, which exhibited the annotation editor functionality along with additional features of Afra. These differences in codebase were carried over to JBrowse as a plugin. This provided the annotation editor functionality which could be easily plugged into JBrowse or any other Genome browser.

Test Suite

The annotation editor uses Jasmine for unit tests. Jasmine was easy to set up and can be executed using a simple web server. I examined the already implemented tests for different functionalities of Afra. This gave me an insight into how tests are implemented in Afra. Further, I added some tests for the annotation editor functionality. These tests were:

  • Flip strand : Checks whether the annotation editor is correctly changing the genome’s strand.
  • Set longest ORF: Checks whether annotation editor adequately implements the longest open reading frame for a given transcript.
  • Mark non canonical splice sites: Checks whether the annotation editor is duly marking the non canonical splice sites for the transcript.
  • Mark non canonical translation start site: Checks whether the annotation editor is duly marking the non canonical start translation sites for the transcript.
  • Mark non canonical translation stop site: Checks whether the annotation editor is duly marking the non canonical stop translation sites for the transcript.
  • Filter features: Checks whether the annotation editor is properly filtering a given type of transcript (say exons) from a given set of transcripts.
  • Copy feature: Checks whether annotation editor is making an exact copy of features passed to it.
  • Merge exon: Checks whether annotation editor suitably merges two given exons and returns a new transcript.
  • Create transcript: Checks whether annotation editor creates a new transcript using Simple feature, for the provided features.
  • Resize exon: Checks whether annotation editor adequately resizes the exons.
  • Get CDNA coordinates: Checks whether the annotation editor provides correct cDNA coordinates for a given transcript.
  • Delete exon: Checks whether annotation editor properly deletes an exon from a given transcript.

Here is the link to my commits for the test suite.

Migrating Afra to latest JBrowse

We started building a plugin for JBrowse to carry over the annotation editor functionality. This would allow Afra’s annotation editing functionality to use the latest JBrowse features, making the migration process smoother. The plugin development took place in the following steps:

  • Edit track implementations : First task was adding a new track to JBrowse’s browser for plugin’s initialization.
  • Adding drag and drop feature : Next task was including proper jQuery scripts and modifying the properties of existing tracks, so that they can be dragged along the y axis and dropped to edit track for annotating them.
  • Edit track drop capabilities : Subsequently, we initialized the edit track and made it capable of accepting the incoming features and making them available to the annotation editor for further modifications.
  • Accessing annotation editor’s functionality : Now that we could successfully drag and drop a feature in edit track, it was time to use the annotation editor functionality to make required changes to the track. This list of functionality is available upon right clicking the feature in edit track.
  • Getting right click menu items working: By this stage we had got a front end in place to access the annotation editor’s functions. Now it was time to get that functionality, such as get sequence, send to Gene Validator, resize exon, etc., ready.

Now we had a basic plugin in place that had implemented the basic annotation editing functionality. (commit)

Now additional features of Afra had to be carried over. These features are:

  • We implemented triple click on track to zoom to base pair level directly. (commit)
  • Following this, we replaced the JBrowse’s reference sequence track with Afra’s. Now, when the feature in edit track is selected, the corresponding track in reference sequence gets highlighted. (commit)
  • We added a feature to view the residues (actual genomic sequence ‘atgc’) of the feature in edit track when it is selected. (commit)
  • We implemented an additional feature to validate the feature dropped in the edit track. We checked for non canonical translation start and stop sites and marked them with orange exclamation marks. This will help the user easily look for non canonical splice sites without looking into the sequence of the feature. (commit)

Finally, we had successfully implemented the annotation editor of Afra as a plugin of JBrowse, along with extra features of Afra too. Further, to test whether all the annotation editing functionalities are working properly, we implemented the test suite for the plugin. (commit)
