Updates, taxonomy, and post-publication review

I’m starting this hoping it will be quick, but thinking about it this morning, I realize there is a lot of ground I could be covering here.  So here goes…

Updates: Last week I tweaked the ethogram (taxonomy) view so that entering the name of a higher level taxon will retrieve behaviors for all included (subsumed) taxa.  This is implemented in the simple, non-elegant way: the code crawls the tree and retrieves the annotations, using SPARQL for navigating, while the control is all implemented in java.  Of course traversing the tree has one advantage over a reasoner query to retrieve all included taxa: the results are guaranteed to come back in some sort of tree traversal order.  It works (try ‘Tetragnatha’), but it is a bit slow.  I’ve also configured a more capable server, but haven’t deployed it yet, so be patient with these queries (there are some that seem to require 2-3 minutes to complete; I’ll let you figure out which).
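To make the crawl concrete, here is a minimal sketch of the idea rather than the actual Arachnolingua code. The class name TaxonCrawler, the two in-memory maps (which stand in for the SPARQL lookups), and the placeholder species and behavior names are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the maps below stand in for SPARQL queries against the KB.
final class TaxonCrawler {
    private final Map<String, List<String>> children;   // taxon -> direct child taxa
    private final Map<String, List<String>> behaviors;  // taxon -> annotated behaviors

    TaxonCrawler(Map<String, List<String>> children, Map<String, List<String>> behaviors) {
        this.children = children;
        this.behaviors = behaviors;
    }

    // Returns "taxon: behavior" rows for the taxon and everything it subsumes.
    List<String> collect(String taxon) {
        List<String> rows = new ArrayList<>();
        walk(taxon, rows);
        return rows;
    }

    // Pre-order walk: rows for the taxon itself first, then each subtree in child order,
    // which is why the results come back in a tree traversal order.
    private void walk(String taxon, List<String> rows) {
        for (String behavior : behaviors.getOrDefault(taxon, Collections.<String>emptyList())) {
            rows.add(taxon + ": " + behavior);
        }
        for (String child : children.getOrDefault(taxon, Collections.<String>emptyList())) {
            walk(child, rows);
        }
    }

    // Made-up tree with placeholder names, for illustration only.
    static TaxonCrawler example() {
        Map<String, List<String>> children = new HashMap<>();
        children.put("Tetragnatha", Arrays.asList("Tetragnatha sp. A", "Tetragnatha sp. B"));
        Map<String, List<String>> behaviors = new HashMap<>();
        behaviors.put("Tetragnatha", Arrays.asList("behavior 1"));
        behaviors.put("Tetragnatha sp. A", Arrays.asList("behavior 2", "behavior 3"));
        behaviors.put("Tetragnatha sp. B", Arrays.asList("behavior 4"));
        return new TaxonCrawler(children, behaviors);
    }
}
```

Calling TaxonCrawler.example().collect("Tetragnatha") lists the genus-level row first, followed by the rows for each placeholder species in child order.  In the real system each map lookup would be a round trip to the store, which is one plausible reason a crawl like this is slow on large taxa.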

Taxonomy: There’s not a lot new to report here; OpenTree has been keeping me busy these past few weeks.  I have been doing some more curation tool work to support taxa outside of NCBI and, thanks to Chris Mungall and James Overton, there will soon be a new OWL rendering of the NCBI taxonomy, which should then make its way into the backend database.  I’m still tracking the addition of Arachnid taxa into NCBI.  The majority of updates seem to be sample records, which won’t help with behavior, but new species for ticks and spiders are trickling in as well.

Also, yesterday was Taxonomist Appreciation Day.  Although I have dabbled in taxonomy informatics (TDWG, VTO, a bit in OpenTree, as well as the taxonomy work here) I would never consider myself to be a taxonomist.  I do, as should any biologist, appreciate and thank the generations of taxonomists in the 250+ years since Linnaeus who have brought order and names to the millions of species we share this planet with.

Curation and Post Publication Review: A couple of items I found on Twitter over the past few days have sparked an interesting thought.  The first was a paper on how curators of the UniProtKB database responded to a changing understanding of the activity of the SiRT-5 protein.  Initially this protein was understood to exhibit deacetylase activity, based primarily on documented activity of other members of the family and some in vitro assays that demonstrated the deacetylase activity.  More recent papers have documented that the in vivo activity of this protein is more likely to be desuccinylation.  The paper describes how annotations in the UniProtKB were modified to incorporate both classes of activity in the appropriate contexts, providing a review process for the earlier reports in light of later results.  Thus the curation process provides a post-publication, albeit specialized, peer review.

This is relevant in light of this post I saw this morning on the likely limits of post-publication peer review.  Now, the particular papers discussed in the UniProt example were published in high profile journals such as Cell and Science, so the particular case does not speak against the 1% notion mentioned in the Dynamic Ecology post.  But not all curation is focused on the sort of topics that make it into the elite 1% of published papers.  My publication database does have a few papers from Science, Nature and one or two other high profile publications.  But the majority come from places such as the Journal of Arachnology, Animal Behavior, or lesser known journals from Japan or Latin America.  This leads me to a somewhat more optimistic conclusion about the future of post-publication peer review than the one Jeremy Fox reaches.

Next week, I hope to discuss some of the papers I’m in the process of adding and possibly return to the issue of front-end data stores.

taxonomy and curation

Over the weekend I spent some time on the taxa that, although mentioned in curated papers, haven’t yet made their way into NCBI.  You can see a list on the taxonomy status page.  Although there already was a taxon table in the admin database, it was never filled after my decision to put all the terms (including taxonomy) in one common table, tagged with domain and authority codes.  So I spent some time adding fields and implementing infrastructure so these names and their identifiers and authority can be captured, and eventually have terms generated.  These need to be maintained in a separate table so they can be regenerated when the term table is reloaded from the set of support ontologies.  The merge remains to be implemented.  Happily I have been able to resolve all the names so far using the World Spider Catalog, which seems to be authoritative.  

The taxonomy status page started last year as a simple list of the taxa I couldn’t find in NCBI.  A little over a month ago, I cleaned it up and found urn identifiers for each of the missing names in the spider catalog.  In some cases these were synonyms.  On Sunday I discovered that I had missed a synonym (the name contained an alternative, unsupported, spelling of the genus), which has been cleaned up now as well.

I also cleaned up a couple of issues that prevented people coming from arachnolingua.net from having full access to the site.

Meanwhile I am waiting for the answer to a few modeling questions that will hopefully allow me to display some text showing the original context of the behavior terms.  This should make the modeling and granularity issues I’m facing a little clearer.  Arachnolingua is fundamentally a database of usages, not simply terms, with the primary intent of supporting comparative analysis of narrative description and other ‘pre-character’ data relevant to behavior.  Part of this is allowing conflicting assertions to be highlighted, if not resolved.

Curation on a project like this involves growth in both breadth and depth.  Most of the work at the start will be with depth so that something can be said about anything.  Arachnolingua is still at that stage, but it is possible to start building the set of annotations, at least at the granularity of a first pass.  The total number of annotations in the KB is up to 15, and there are things, even in the existing annotations, that the system is not capable of displaying.

RCN summit and some updates

I haven’t posted in a while because I was at the Phenotype RCN summit the week before last.  The week started with two workshops covering the Population and Community Ecology Ontology (PCO) and the Biological Collections Ontology (BCO).  I have been interested in the PCO for several years, both because my PhD project involved population level processes (‘culture’ in a population of birds) and because of population related processes in spiders (social spiders, but also predation and predator defense which, by involving multiple individuals, are at the boundary of behavior and population/community processes).

One of the high points of the RCN summit was seeing Martin Ramirez again.   He has been continuing to work with the spider anatomy ontology and showed an interesting comparative analysis of complexity of reproductive structures using derived species-level anatomy ontologies.  I also showed him a bit of arachnolingua and we left open the possibility of more collaboration in the coming year.

I also got to have some brief discussions with Chris Mungall about NCBI taxonomy and the possibility of more frequent updates to the generated OWL from the source.  I expect that other projects would benefit from more frequent updates since not many projects are in the position to support a separate taxonomy ontology.

There has also been progress on followup workshops for the NBO.  More here and in the other blog as updates warrant.

Since I returned, I’ve had a couple of hours to clean up a number of issues on the web page: most significantly, I’ve added javascript code to turn the urls in the ethogram results table (currently just the publications) into links.  I’ve also generated simple publication labels (author list + publication year) so tracking the results doesn’t require decoding a doi.  The publication column now shows the label and links to the doi (which works for both publications with annotations in the KB).  Eventually I will change the link to point to a publication summary page which will link to the doi when it is available, but that will remain a low priority until I start annotating publications without doi’s.
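As a rough illustration of the label generation, here is a small sketch.  The class name PublicationLabels, the comma separator, and the placeholder author names are assumptions; the only thing taken from the description above is that a label is the author list plus the publication year.

```java
import java.util.List;

// Hypothetical sketch of a simple publication label: author list + publication year.
final class PublicationLabels {
    // Joins the author surnames with ", " and appends the year, e.g. "Smith, Jones 2010".
    static String label(List<String> authorSurnames, int year) {
        return String.join(", ", authorSurnames) + " " + year;
    }
}
```

A label built this way can serve as the link text in the publication column, with the doi as the link target.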

An unexpected, but reasonable side effect is the change in behavior of the display of taxa without annotations.  They now treat the purls as links, which unfortunately don’t resolve.  This doesn’t seem to be something to fight, so I’ll convert them into links back to the NCBI taxon page on the fly (just extract the taxon id and generate a new URL).
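The on-the-fly rewrite might look something like the following sketch, written here in Java rather than the page’s javascript.  The class name TaxonLinks is hypothetical, and both the purl pattern (an OBO-style NCBITaxon_ identifier) and the NCBI Taxonomy Browser address are my assumptions about the URL shapes involved, so check them against the real data before relying on this.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: rewrite an NCBITaxon purl into a link to the NCBI taxon page.
final class TaxonLinks {
    // Assumed purl shape: http://purl.obolibrary.org/obo/NCBITaxon_<numeric id>
    private static final Pattern NCBI_PURL =
        Pattern.compile("^https?://purl\\.obolibrary\\.org/obo/NCBITaxon_(\\d+)$");

    // Assumed target: the NCBI Taxonomy Browser, addressed by taxon id.
    private static final String NCBI_BROWSER =
        "https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=";

    // Extracts the taxon id and generates the new URL; anything else is returned unchanged.
    static String toNcbiUrl(String purl) {
        Matcher m = NCBI_PURL.matcher(purl);
        return m.matches() ? NCBI_BROWSER + m.group(1) : purl;
    }
}
```

Returning unmatched input unchanged means that purls for taxa outside NCBI pass through untouched rather than turning into broken links of a different kind.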

A more immediate issue is getting the descriptions of the behavior to display.  This isn’t hard, but I’m taking a day or two to give some thought to the best way to model this (the simplest would be an rdfs:label or rdfs:comment on the annotation, but it would be better to use some IAO vocabulary to attach it to the node that represents the individual part of the publication being annotated).

I’ve pulled out the hardcopies of publications I made a year or so back; getting ready for some more sustained curation activity (as well as more changes to arachadmin).