Using a git-based datastore for community curated phylogenies



Emily Jane McTavish and Mark T. Holder
University of Kansas
iEvoBio, June 2014

Open Tree of Life

The current (as of May) data set

Community contributed phylogenies

  • 6745 trees from 2914 published studies
  • 1188 trees from 991 studies partly curated
  • 335 trees from 327 studies completely curated and included in the synthetic tree

Smith, Cranston et al. Submitted

The problem:

  • Large data set: thousands of phylogenies, and always growing (hopefully!)
  • Each phylogeny requires some hand curation, often by multiple people
  • Data need to be readily accessible to, and editable by, interested researchers


Smith, Cranston et al. Submitted

Curation

Potential data store options:

  • SQL database
  • MongoDB, CouchDB
  • Git/GitHub

We chose git!

  • Trees and annotations are stored by study in NexSON format
    (a JSON serialization of NeXML)
  • The whole datastore is a git repo!

Curation

  • A work-in-progress branch is created when curation begins
  • If the study hasn't been edited by anyone else in the meantime, changes are merged automatically
  • Otherwise, the merged changes are returned to the curator to accept or reject
  • Updates are pushed to GitHub after each commit
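
The merge rule above can be sketched with a toy in-memory model. This is only an illustration under stated assumptions, not the OpenTree implementation: `StudyStore`, `start_curation`, and `submit` are hypothetical names, and plain version counters stand in for git commits and work-in-progress branches.

```python
from dataclasses import dataclass, field


@dataclass
class StudyStore:
    """Toy stand-in for the git repo: maps a study id to its saved versions."""
    versions: dict = field(default_factory=dict)

    def head(self, study_id):
        # Number of saved versions; plays the role of the commit a branch starts from.
        return len(self.versions.get(study_id, []))

    def start_curation(self, study_id):
        # Analogous to creating a work-in-progress branch: remember the starting point.
        return {"study_id": study_id, "base": self.head(study_id)}

    def submit(self, wip, content):
        study_id = wip["study_id"]
        if self.head(study_id) == wip["base"]:
            # Nobody else edited the study in the meantime: merge automatically.
            self.versions.setdefault(study_id, []).append(content)
            return "merged"
        # Someone else saved first: hand the decision back to the curator.
        return "needs_review"


store = StudyStore()
first = store.start_curation("study_1")
second = store.start_curation("study_1")
print(store.submit(first, {"note": "fixed tip labels"}))   # merged
print(store.submit(second, {"note": "rerooted tree 2"}))   # needs_review
```

In a real git workflow the "needs_review" case would involve an actual merge whose result the curator inspects; the sketch only captures the decision of when that review is needed.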

Features

  • Curation attribution is tracked
  • Curation involves some subjective choices, and edits are made by many people in the community over time

Curation

Features

  • These trees are the backend for the OpenTree showpiece: the synthetic tree!
  • But they are also a useful datastore for other researchers
  • The repo is hosted on GitHub, so the entire datastore can easily be cloned and updated
  • Anyone can easily download all the data!
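
As a rough illustration of what "cloned and updated" can look like in practice, the sketch below picks between `git clone` and `git pull` depending on whether a local copy already exists. The function names and the URL are placeholders invented for this example; the real repository address is not given here.

```python
import subprocess
from pathlib import Path


def sync_command(repo_url, dest):
    """Return the git command that brings `dest` up to date with `repo_url`."""
    dest = Path(dest)
    if (dest / ".git").is_dir():
        # Existing clone: fetch and fast-forward to the latest commits.
        return ["git", "-C", str(dest), "pull", "--ff-only"]
    # No local copy yet: clone the whole datastore, history included.
    return ["git", "clone", repo_url, str(dest)]


def sync(repo_url, dest):
    # Requires git on the PATH and network access; raises if git fails.
    subprocess.run(sync_command(repo_url, dest), check=True)


# Placeholder URL, not the real repository address.
EXAMPLE_URL = "https://github.com/example-org/example-datastore.git"
print(sync_command(EXAMPLE_URL, "datastore-copy"))
```

Calling `sync(url, "datastore-copy")` repeatedly keeps the local copy current, since only the first call performs a full clone.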

Features

  • Hosting on GitHub
  • Free
  • Familiar to many in the field

Potential issues:

  • Phylogenies are hard to diff, e.g. rerooting changes everything!
  • NexSON is not a line-based format
  • Repo size limits on GitHub
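
One common way to soften the line-based problem for JSON in general is to write files with sorted keys and one value per line, so that a small edit touches only a few lines. The sketch below shows the idea with Python's standard `json` and `difflib` modules. The document keys are made up for illustration and are not the NexSON schema, and this is not necessarily how the OpenTree datastore serializes its files.

```python
import difflib
import json


def canonical_lines(doc):
    # Sorted keys and indentation give a stable, one-value-per-line layout.
    return json.dumps(doc, indent=2, sort_keys=True).splitlines()


def changed_lines(old, new):
    # Keep only the added and removed lines from a line-by-line comparison.
    diff = difflib.ndiff(canonical_lines(old), canonical_lines(new))
    return [line for line in diff if line.startswith(("+ ", "- "))]


# Hypothetical documents: only the root of one tree differs.
old = {"curator": "a", "study": {"id": "s1", "trees": [{"id": "t1", "root": "n1"}]}}
new = {"curator": "a", "study": {"id": "s1", "trees": [{"id": "t1", "root": "n2"}]}}

for line in changed_lines(old, new):
    print(line)
```

With a compact single-line serialization the same edit would show up as a change to the entire file, which is much harder to review.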

In the future:

  • Semantic diffs
  • Pull requests
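
One possible reading of "semantic diffs" is to compare the groupings a tree implies rather than its text. The sketch below is an assumption-laden illustration, not a planned OpenTree design: trees are nested tuples of leaf names, and `splits` and `semantic_diff` are hypothetical helpers that compare the non-trivial bipartitions of the leaf set. A rerooted tree has different text but the same bipartitions.

```python
def leaves(tree):
    # A leaf is a string; an internal node is a tuple of subtrees.
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Return the non-trivial leaf bipartitions implied by the tree's edges."""
    all_leaves = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        other = all_leaves - below
        if len(below) > 1 and len(other) > 1:
            # Store the split as an unordered pair so rooting does not matter.
            found.add(frozenset([below, other]))
        return below

    visit(tree)
    return found


def semantic_diff(old, new):
    # Splits lost and splits gained when going from old to new.
    return splits(old) - splits(new), splits(new) - splits(old)


original = (("A", "B"), ("C", "D"))
rerooted = ("A", ("B", ("C", "D")))
regrouped = (("A", "C"), ("B", "D"))

print(str(original) == str(rerooted))                 # False
print(semantic_diff(original, rerooted))              # (set(), set())
print(len(semantic_diff(original, regrouped)[0]))     # 1
```

This ignores branch lengths, annotations, and the root position itself, all of which a full semantic diff would need to handle.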

Is a git-based datastore right for your project?



  • Maybe! Anyone on the OpenTree software team is happy to chat about pros and cons.

Thank you

Mark Holder
University of Kansas
OpenTree of Life project
Especially the Software team
NSF AVATOL #1208809
