Updates, taxonomy, and post-publication review

I’m starting this hoping it will be quick, but thinking about it this morning, I realize there is a lot of ground I could be covering here.  So here goes…

Updates: Last week I tweaked the ethogram (taxonomy) view so that entering the name of a higher level taxon will retrieve behaviors for all included (subsumed) taxa.  This is implemented in the simple, inelegant way: crawl the tree and retrieve the annotations, using SPARQL for navigation, with all the control logic implemented in Java.  Of course, traversing the tree has one advantage over a reasoner query to retrieve all included taxa: the results are guaranteed to come back in some sort of tree traversal order.  It works (try ‘Tetragnatha’), but it is a bit slow.  I’ve also configured a more capable server, but haven’t deployed it yet, so be patient with these queries (some seem to require 2-3 minutes to complete; I’ll let you figure out which).
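To make the ordering point concrete, here is a minimal sketch of a pre-order crawl, not the site's actual code.  The class name TaxonCrawler and the in-memory child map are assumptions for illustration; in the real view, the lookup of direct children would be a SPARQL query per taxon.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class TaxonCrawler {
    // Hypothetical stand-in for the SPARQL lookup of a taxon's direct children.
    private final Map<String, List<String>> childrenOf;

    TaxonCrawler(Map<String, List<String>> childrenOf) {
        this.childrenOf = childrenOf;
    }

    /** Returns the taxon followed by all subsumed taxa, parents before children. */
    List<String> subsumedTaxa(String root) {
        List<String> result = new ArrayList<>();
        visit(root, result);
        return result;
    }

    // Depth-first, pre-order walk; assumes the taxonomy has no cycles.
    private void visit(String taxon, List<String> result) {
        result.add(taxon);
        for (String child : childrenOf.getOrDefault(taxon, List.of())) {
            visit(child, result);
        }
    }

    public static void main(String[] args) {
        // Tiny, simplified slice of the tree for demonstration.
        Map<String, List<String>> tree = Map.of(
            "Tetragnathidae", List.of("Tetragnatha", "Leucauge"),
            "Tetragnatha", List.of("Tetragnatha straminea", "Tetragnatha versicolor"));
        System.out.println(new TaxonCrawler(tree).subsumedTaxa("Tetragnathidae"));
    }
}
```

Because each taxon is emitted before its children, annotations retrieved in this order can be displayed grouped by taxon without a separate sorting step.  The cost is one lookup per taxon, which is consistent with the slowness noted above.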

Taxonomy: There’s not a lot new to report here; OpenTree has been keeping me busy these past few weeks.  I have been doing some more curation tool work to support taxa outside of NCBI and, thanks to Chris Mungall and James Overton, there will soon be a new OWL rendering of the NCBI taxonomy, which should make its way into the backend database soon.  I’m still tracking the addition of Arachnid taxa into NCBI: the majority of updates seem to be sample records, which won’t help with behavior, though new species of ticks and spiders are trickling in as well.

Also, yesterday was Taxonomist Appreciation Day.  Although I have dabbled in taxonomy informatics (TDWG, VTO, a bit in OpenTree, as well as the taxonomy work here) I would never consider myself to be a taxonomist.  I do, as should any biologist, appreciate and thank the generations of taxonomists in the 250+ years since Linnaeus who have brought order and names for the millions of species we share this planet with.

Curation and Post Publication Review: A couple of items I found on Twitter over the past few days have sparked an interesting thought.  The first was a paper discussing how curators of the UniProtKB database responded to a changing understanding of the activity of the SiRT-5 protein.  Initially this protein was understood to exhibit deacetylase activity, based primarily on documented activity of other members of the family and some in vitro assays that demonstrated the deacetylase activity.  More recent papers have documented that the in vivo activity of this protein is more likely to be succinylation.  The paper describes how annotations in the UniProtKB were modified to incorporate both classes of activity in the appropriate contexts, providing a review process for the earlier reports in light of later results.  Thus the curation process provides a post-publication, albeit specialized, peer review.

This is relevant in light of this post I saw this morning on the likely limits of post-publication peer review.  Now, the particular papers discussed in the UniProt example were published in high profile journals such as Cell and Science, so the particular case does not speak against the 1% notion mentioned in the Dynamic Ecology post.  But not all curation is focused on the sort of topics that make it into the elite 1% of published papers.  My publication database does have a few papers from Science, Nature and one or two other high profile publications.  But the majority come from places such as the Journal of Arachnology, Animal Behavior, or lesser known journals from Japan or Latin America.  This leads me to a somewhat more optimistic conclusion about the future of post-publication peer review than Jeremy Fox’s.

Next week, I hope to discuss some of the papers I’m in the process of adding and possibly return to the issue of front-end data stores.

NBO and all its pieces

This morning I tested the code that loads ontologies that Arachnolingua references in assertions.  These are currently the spider biology ontology (SPD), the NCBI taxonomy (NCBITAXON), the Evidence Codes Ontology (ECO), and of course the NeuroBehavior Ontology (NBO).  I’ve wrestled with the NCBI taxonomy for years (e.g., as the basis of several parts of Phenoscape’s Vertebrate Taxonomy Ontology) and know that it is both large and filled with terms that will be either out of scope (e.g., environmental samples and probably most bacteria, viruses, and marine organisms) or simply unusable (e.g., children of ‘unclassified taxa’ such as Salticidae sp. RBCH101-04, a child of ‘unclassified Salticidae’), which exist only as genetic samples, never to exhibit behavior.  In the frontend, I’ve attacked the scope of NCBI by parsing the RDF/XML OWL rendering (thanks to Chris Mungall), building a tree just long enough to trim it down to the Arachnid clade, and trimming any children of taxa named ‘unclassified XXX’.  This (incorrect) solution gives a list of about 11,000 taxa, small enough not to overload the drop-down list.  Of course many non-arachnid taxa will be needed for curation, whether as prey, predators, parasites, hosts, or simply substrates (e.g., plants).  I expect these will be introduced using the mechanism I’ve planned for Arachnids not currently in NCBI: the ability to add names, list an authority (e.g., World Catalog of Spiders), and provide enough parentage to link to a higher level group known to NCBI.
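The trimming step can be sketched roughly as follows.  This is an illustration under stated assumptions, not the arachadmin code: the class TaxonTrimmer is hypothetical, taxa are keyed by name rather than by NCBI identifier, and the child-to-parent map stands in for the subclass links read from the RDF/XML file.  A taxon is kept if walking up its parents reaches the target clade without passing through an ‘unclassified …’ ancestor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

final class TaxonTrimmer {
    /** True if taxon is the clade or sits under it with no 'unclassified ...' ancestor. */
    static boolean inScope(String taxon, Map<String, String> parentOf, String clade) {
        String current = taxon;
        while (current != null) {
            if (current.equals(clade)) {
                return true;
            }
            current = parentOf.get(current); // null once we run off the top of the tree
            if (current != null && current.startsWith("unclassified ")) {
                return false; // child of an 'unclassified XXX' taxon: sample-only record
            }
        }
        return false; // never reached the clade, so out of scope
    }

    /** Returns the in-scope taxon names in sorted order, ready for a drop-down list. */
    static List<String> trim(Map<String, String> parentOf, String clade) {
        List<String> kept = new ArrayList<>();
        for (String taxon : new TreeSet<>(parentOf.keySet())) {
            if (inScope(taxon, parentOf, clade)) {
                kept.add(taxon);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Simplified lineage; the real NCBI hierarchy has more intermediate ranks.
        Map<String, String> parentOf = Map.of(
            "Arachnida", "Arthropoda",
            "Araneae", "Arachnida",
            "Salticidae", "Araneae",
            "unclassified Salticidae", "Salticidae",
            "Salticidae sp. RBCH101-04", "unclassified Salticidae",
            "Insecta", "Arthropoda");
        System.out.println(trim(parentOf, "Arachnida"));
    }
}
```

In this sketch the ‘unclassified Salticidae’ node itself survives and only its children are dropped, which follows the description above literally; dropping the node as well would be a one-line change.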

So NCBI is big, but a relatively well understood problem.  This morning NCBI took almost a minute to load from a local copy, which seems slow, but this was in owlbuilder, which was reading with the OWLAPI.  It was not only slow but memory hungry: I only got good performance after bumping the virtual size of the JVM to 4 GB.  After NCBI, owlbuilder attempted to load NBO and this is where things got messy.  Unlike the simple ‘build and filter a list of terms’ approach that the arachadmin editor uses, OWLAPI honors all the import statements, which means, as it turns out, that NBO triggers loading of the Chemical Entities of Biological Interest (CHEBI), the Gene Ontology (GO), the Phenotypic Quality Ontology (PATO), and Uberon, an umbrella anatomy ontology that is intended to cover all Bilateria, but is currently more vertebrate than arthropod focused.  Of these, CHEBI will likely be of interest given the number of chemical communication papers I’ve identified, and GO is likely to be useful as an alternative source of behavior terms.  Uberon may be more useful in the future, but currently there is no integration with the spider ontology.  I have been remiss up to now in not including the tick gross anatomy ontology (TADS), but that should be added to the frontend before I start curating relevant papers.  Perhaps Uberon will cover Chelicerata by the time serious curation of Acari papers starts.  PATO is another question altogether: it’s an ontology of phenotypic qualities, which includes a lot of behavior qualities, but currently Arachnolingua is not focussed on capturing entity-quality annotations (as Phenoscape is).  Loading PATO is actually where the load died with a network timeout this morning.  It looks like I will need to support local caching for the backend, and I’m considering using a common cache directory for both the backend and frontend for the time being (I’m in no rush to move the backend processing to AWS).
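A shared cache directory could work along these lines.  This is a sketch using only the Java standard library, not owlbuilder code: the class OntologyCache, its method names, and the file-naming scheme are all assumptions.  Each ontology IRI maps to a predictable file name, the loader is pointed at the local file when it exists, and it falls back to the remote IRI otherwise.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class OntologyCache {
    private final Path cacheDir;

    OntologyCache(Path cacheDir) {
        this.cacheDir = cacheDir;
    }

    /** Maps an ontology IRI to a file in the cache directory (hypothetical naming scheme). */
    Path cachedPath(String ontologyIri) {
        return cacheDir.resolve(ontologyIri.replaceAll("[^A-Za-z0-9._-]", "_"));
    }

    /** Returns the local file URI if a cached copy exists, otherwise the remote IRI. */
    URI resolve(String ontologyIri) {
        Path local = cachedPath(ontologyIri);
        return Files.isRegularFile(local) ? local.toUri() : URI.create(ontologyIri);
    }

    /** Downloads the ontology into the cache, replacing any earlier copy. */
    Path refresh(String ontologyIri) throws IOException {
        Files.createDirectories(cacheDir);
        // Download to a temporary file first so a timeout never leaves a truncated cache entry.
        Path tmp = Files.createTempFile(cacheDir, "download", ".tmp");
        try (InputStream in = URI.create(ontologyIri).toURL().openStream()) {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        }
        return Files.move(tmp, cachedPath(ontologyIri), StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) {
        OntologyCache cache = new OntologyCache(Path.of("ontology-cache"));
        // Prints the remote IRI until refresh() has stored a local copy.
        System.out.println(cache.resolve("http://purl.obolibrary.org/obo/pato.owl"));
    }
}
```

With the OWLAPI, the URI returned by resolve would be the document location handed to the loader, so imports such as PATO would be read from disk rather than over the network.  How the imports get redirected in practice (for example, through the OWLAPI's IRI mapper mechanism) is left out of this sketch.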

So, it looks like I’ll be adding a local cache location to the source_ontology table in arachadmin.