Summer Ontologizing

I’ve been quiet since April, but I’ve also been pretty busy.  I am still fighting with the reimplementation of claim editing (a big, messy web page that I would simplify further if I could figure out how).  I have also been focusing my efforts on a new ontology specifically for spider behavior (something to fit in between the NBO/ABO and the data in arachnolingua).  I gave a talk about it at the 20th International Congress of Arachnology a few weeks ago.  The slides, rendered as PDF, are available here.

There is a link to the work in progress on the arachnolingua home page.   It is currently just the initial google doc sheet I used to collect usages across the two source texts.  I am finishing the first cleanup pass over the data and will provide an updated (and cleaned and better organized) sheet linked from a proper landing page in the coming week.



Why Arachnolingua will not use dc:terms to describe publications

Along with all the buzz and excitement (and overloaded servers) that accompanied the release of the OpenTree PNAS paper, there was a question about why OpenTree was not publishing RDF.  Jonathan Rees ably answered the question here (I refuse to quibble about the implication that I don’t know RDF).  The gist was that it could have been done, but RDF is easy to do badly and hard to do right.  This is very true, and led me to reconsider, not whether to generate OWL, but how well I was living up to best practices.  Arachnolingua is not really an ontology and certainly doesn’t aspire to OBO foundry status, but many of the practices they recommend are worth following.  All the code and data are available, standard vocabularies are used whenever possible, and I will open this up to wider collaboration once things have settled down.

About the same time that the OpenTree paper came out, I was considering the implementation of landing pages for publications, as well as terms in the ARACHB namespace.  I am implementing a service that will provide a page for all terms of the form  These include generated identifiers for claims and individuals (more about that in a later post) as well as taxa without an NCBI id and publications without a doi.   The landing pages will be minimal to start and will be available as json as well as html.   There will be landing pages for publications with a doi as well, simply because there are things that Arachnolingua has to say about a publication apart from what is on the publisher’s page that a doi generally resolves to (e.g., number of curated claims, individuals, etc.).

It is reasonable to expect that a landing page for publications would include the standard bibliographic information, especially for publications that lack a stable identifier (doi).  The information is available in the arachadmin database.  I originally thought to just generate pages, later to generate json files for each identifier with an html template, then realized that there was no compelling reason not to add this to the individual representing each publication.

Of course at this point, it would seem obvious to use the Dublin Core vocabulary.  But my experience on the Dryad project, as well as a discussion with Jonathan about two years ago, made me suspicious.  I was already using the OBO IAO (Information Artifact Ontology), so I started looking at IAO as an alternative to Dublin Core (DC).  Not a lot has been written about this, but Barry Smith has had something to say in this powerpoint.  Smith’s presentation makes it clear that while IAO is an ontology of information artifacts, DC is a vocabulary for tagging artifacts.  But looking at the set of properties provided by DC, it is less a vocabulary about how to find an artifact and more about rights management.  It looks like the DC vocabulary is used in places where it can be applied (e.g., dc:title, dc:contributor, dc:creator) but for location, it merely provides a place to put a link, not a bibliographic reference.  This matches my experience with both Dryad and OpenTree – in the former we used DC terms where possible and other vocabularies (e.g., prism) for other properties, while in OpenTree the DC vocabulary is not used at all, in favor of leaving the publication citation as an unparsed string, breaking out only the publication year, and using the ot namespace defined for the jsonized NeXML that studies are stored in.

In my searches, I found one project that, instead of using dc terms in the rdf, transformed all the dc properties into rdfs:comment strings.  I plan to implement something similar, but more focused.  The bibliographic information will simply be turned into a json string that will appear as an rdfs:comment on the publication individual.  This avoids needing to import the DC vocabulary or define a namespace for what are ultimately annotation strings.  The json will make page generation easy (the json page will simply be the untransformed comment string) and its meaning will be clear to anyone inspecting it.
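
A minimal sketch of that idea in Python, not the actual OWL generator code.  The field names, the example record and the publication IRI are all hypothetical; the only fixed piece is the standard rdfs:comment IRI.  The bibliographic record is serialized to json once, then wrapped as a string literal in a single Turtle statement.

```python
import json

RDFS_COMMENT = "http://www.w3.org/2000/01/rdf-schema#comment"


def bibliographic_comment(record):
    """Serialize a bibliographic record to a compact, stable json string."""
    return json.dumps(record, sort_keys=True, ensure_ascii=False)


def comment_triple(publication_iri, record):
    """Render one Turtle statement attaching the json as an rdfs:comment.

    json.dumps is applied a second time so that the quotes and backslashes
    inside the json are escaped within the Turtle string literal.
    """
    literal = json.dumps(bibliographic_comment(record), ensure_ascii=False)
    return "<{}> <{}> {} .".format(publication_iri, RDFS_COMMENT, literal)


# Hypothetical record and IRI, for illustration only.
record = {
    "authors": "Example, A. and Sample, B.",
    "year": 1998,
    "title": "An example title",
    "journal": "Example Journal",
}
triple = comment_triple("http://example.org/publication/1", record)
print(triple)
```

A landing page service could then return the comment string unchanged as the json view and feed the parsed dictionary to an html template.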

Announcement: There is an update of Arachnolingua coming.  Preview soon at

Curation Update

As promised a few weeks back, I recently (more or less) completed some pretty substantial changes in the curation tool.  Specifically, I have changed the process of specifying ‘participants’ (actors, objects) associated with behavior events.  Participants typically include a spider or its anatomical parts and possibly another organism (mate, prey) or a substrate (silk, plant parts, a surface).  The old version simply allowed choice of a term for (spider) taxon and anatomy as the primary participant and a secondary substrate participant.  This allowed capturing simple events, but was completely inadequate for capturing social or predation interactions.  The old version did have the virtue of simplicity, and may return in some form of ‘wizard tool’ in the future.

The new version runs rather closer to the OWL that ultimately gets generated.  Claims consist of a ‘root’ element that captures the behavior term and a set of participants, each a tree structure (potentially, though most commonly a chain) of participant ‘elements’ that correspond to OWL classes, linked by arcs corresponding to OWL predicates (aka OBO relations).  The tree for a participant more or less corresponds to an OWL class or individual expression.  The interface still requires a great deal of polish, but I’ve added a simple d3 graph display that should make the structure of what’s being created a little clearer.
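
To make the shape of a claim concrete, here is a rough Python sketch of the structure described above.  It is not the arachadmin schema: the class names, the predicate labels and the example terms are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Element:
    """One participant element: a term plus arcs to child elements."""
    term: str
    # Each arc is a (predicate, child element) pair.
    arcs: List[Tuple[str, "Element"]] = field(default_factory=list)

    def link(self, predicate, child):
        """Attach a child element through the given predicate."""
        self.arcs.append((predicate, child))
        return child


@dataclass
class Claim:
    """Root of a claim: a behavior term and its participants."""
    behavior: str
    # Each participant is a (predicate, head element) pair.
    participants: List[Tuple[str, Element]] = field(default_factory=list)


def describe(element):
    """Render an element tree as a nested, pidgin-English expression."""
    text = element.term
    for predicate, child in element.arcs:
        text += " ({} {})".format(predicate, describe(child))
    return text


# Hypothetical claim: a leg of a Tetragnatha acting on an unidentified prey item.
leg = Element("leg")
leg.link("part of", Element("Tetragnatha"))
prey = Element("prey")
claim = Claim("prey wrapping", [("active participant", leg), ("participant", prey)])

for predicate, head in claim.participants:
    print(predicate, ":", describe(head))
```

Most participants are chains, so each element usually has at most one arc, but nothing in the structure prevents branching.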


The ‘table of associated participants’ lists each participant using pidgin English and indicates its level of participation: ‘active participant’ or (passive) ‘participant’.  The graph below the table shows the relations between the participant elements (squares) and the root with predicates indicated by the color of connecting lines (red = active participant, orange = participant, blue = part of).   The link table on the right will show more details about the ‘active’ participant element.  This table, along with a color change when an element is selected, remains to be implemented, though the selection mechanism is there.  Like many d3 graphs, node placement is ‘force-driven’ and can be tweaked by dragging with your pointing device.  The graph display is largely a transformation and simplification of this molecule example.

The corresponding upper part of the page looks like this:


This has been changed relatively little, except that the drop-down selection fields for taxon, anatomy, and environment terms have disappeared; they now appear in the element creation dialog, which is an ajax form in the participant element section.  The text fields for publication taxon, anatomy and substrate remain.  These text metadata fields are associated with the participant as a whole, rather than the corresponding elements.

I’ve started writing some tests for the curation tool.  Adding Selenium to walk through the screens was relatively straightforward.  Unit tests have been rather more of a challenge, due, in part, to the eccentricities of the web2py framework.  But the issues I’ve encountered with web2py are large enough to deserve a post of their own.  Suffice it to say, there are some doctests in the code, but traditional unit tests remain in the future.

This is what I had been working on, as time permitted, over the second half of 2014.  I’ve now switched back to the owl generation side, which, as you may remember, is a java tool that pulls from the database and builds the OWL using the OWLAPI.  There is a lot of work to do there as well.  Since I switched I’ve started using the EclEmma test coverage module to try to improve the rather pathetic test harness I had for the owl generation.  At least unit testing has been fairly straightforward for now.

Quick update

It’s been a couple of months, but if you look over at github, you’ll see a merge that represents the (more or less for now) completion of a major overhaul to the claim editor in arachadmin.  Since the claim editor is the page that allows the curator to construct annotations for behavioral events or assertions, it’s effectively the heart of arachnolingua.  In brief, the interface has changed from a ‘wizard like’ form where the curator would specify a taxon, a body part and optionally a substrate, to something closer to a graph that represents the final OWL class expression.  I think this was the right way to handle getting individuals into claim expressions properly.  It also should make for a better curation experience, and allowed me to play with displaying graphics with d3.

This is not the final word on this update – there will be pictures in a follow-up post.  Meanwhile, it’s time to review the state of the backend and what needs to be added to capture the redesign that’s occurred in the database.

Lots of little things

The pace of Arachnolingua development has picked up a bit.  After adding the first set of annotations for a non-NCBI taxon (Leucauge mariana in a courtship description in Eberhard and Huber (1998) – a rich source of annotatable material), the need to support these curator-added taxa more completely became obvious.   The curation side needed to fill in the details for such taxa – the hook was already there to make a term, but the term was not identified as a taxonomic name, nor was its source authority recorded (in this case the World Spider Catalog).  That would suffice to make it available as a taxonomy term, but I also needed to add a mechanism to add one or more levels of parentage to connect the term to something already in NCBI.   Fortunately in the case of L. mariana, the parent genus was in NCBI so chaining wasn’t required.  Actually linking via parents was not strictly necessary for the curation process – it just needs to capture the parentage to pass to the OWL generation tool.  Before I finished those changes, the OWL generator created a number of knowledge base files that attached L. mariana to the root OWL term.  This has been finished and L. mariana and a small subset of the behavior patterns documented by Eberhard and Huber are now annotated in Arachnolingua.

I also followed through on a long overdue change that rippled throughout the implementation.  I have been using the term ‘assertion’ throughout the code to refer to a conjunction of taxon, behavior, body part, publication, and eventually evidence.  The problem with calling these ‘assertions’ is simply that the OWL specification talks about assertions that are much smaller – an Arachnolingua assertion can be decomposed into about a dozen OWL assertions or RDF triples (formally an RDF triple is one, probably the most common, way of implementing an OWL assertion).  So, to avoid confusion with OWL assertions (and likewise OWL has a subset of assertions referred to as annotations, so switching the Arachnolingua terminology to annotation would merely change the problem), I started looking for synonyms.

The best synonym for ‘assertion’ that I found seemed to be ‘claim’ – it has the advantage of not sounding definitive and suggesting the need for provenance (talking about a claim leads easily to the question ‘who’s making this claim?’).  Observation was a tempting second choice – most of the ‘assertions’ in arachnolingua are observations, but not all, and if I were to use the term observation in Arachnolingua, I would prefer it to refer to claims at the level of individuals – individual spiders doing a particular something on a particular substrate.  So, in the software taxonomy, I could look at observations as a subclass of claims – though I haven’t yet seen the need to implement such a subclass.

After going back and forth on whether the switch made sense, I went ahead and implemented it everywhere – curation, database, and owl generator now refer to claims.   In the curation screen, it seemed worthwhile to include observation as that was such a strong alternative:


Changes like this are ugly, but only get uglier the longer they are put off.

Finally, I’ve spent some time on the query component of the website.  As I’ve described before, I’ve implemented an API that supports queries for the implemented search pages (currently just taxon and behavior).  The API generates SPARQL to query the in-memory Sesame triple store and returns the unprocessed JSON that Sesame generates.  Turning this JSON into results tables is pretty straightforward (though it represents the only ‘production’ JavaScript I’ve written to this point).  The SPARQL query is rather slow, particularly for queries to higher taxonomic groups: queries to Tetragnatha straminea are fast, queries to Arachnida are much slower and definitely not recommended at this time.  I took a long hard look at the SPARQL I had written and found a couple of things I thought would really speed it up – net result was only a 20% speedup for the API call (measured on the browser side using firebug).  I’m pretty sure I need to be traversing the taxonomy more intelligently, maybe using some server side inference.
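
For readers curious about the JSON step, here is a small sketch, written in Python rather than the JavaScript used on the site.  It assumes the store returns the standard SPARQL 1.1 JSON results layout (a head listing the variables and a list of bindings); the variable names and IRIs in the sample are made up.

```python
import json


def bindings_to_rows(sparql_json):
    """Flatten SPARQL JSON results into a header list and a list of rows."""
    result = json.loads(sparql_json)
    header = result["head"]["vars"]
    rows = []
    for binding in result["results"]["bindings"]:
        # A variable left unbound by the query is simply absent from the binding.
        rows.append([binding.get(var, {}).get("value", "") for var in header])
    return header, rows


# Hypothetical response with two result rows; the second has no behavior bound.
sample = json.dumps({
    "head": {"vars": ["taxon", "behavior"]},
    "results": {"bindings": [
        {"taxon": {"type": "uri", "value": "http://example.org/taxon/1"},
         "behavior": {"type": "literal", "value": "example behavior"}},
        {"taxon": {"type": "uri", "value": "http://example.org/taxon/2"}},
    ]},
})

header, rows = bindings_to_rows(sample)
print(header)
for row in rows:
    print(row)
```

Each row lines up with the header, so rendering an html table is a pair of nested loops.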

I also added lexical filtering to terms from NCBI.  There are a lot of ‘taxa’ in NCBI that reflect either samples or unidentified species.   As samples are individuals (or collections of individuals) their identifiers don’t belong in the taxonomy for Arachnolingua’s purposes.  This is completely aside from whether any behavior data was ever collected from these individuals – I tracked down some underlying Genbank submissions and some were from old collaborators and friends in the Maddison lab so there might be some video data for some of these submissions.  The worst offender in sheer quantity was the Barcode of Life and its BOLD submissions.  I know NCBI is planning to move samples to a separate database, but I don’t know whether this only applies to new records or, if existing records are moved, how quickly this will happen.  So it is best to filter these out consistently in both curation and OWL generation.  Removing these terms did provide some query speedup for the worst cases.
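
A lexical filter of this kind might look like the following Python sketch.  The patterns are my guesses at what marks a sample or an unidentified species in an NCBI name, not the rules actually used in arachadmin or the OWL generator.

```python
import re

# Assumed patterns for names that denote samples or unidentified species.
SAMPLE_PATTERNS = [
    re.compile(r"\bsp\.(\s|$)"),             # e.g. "Salticidae sp. RBCH101-04"
    re.compile(r"^unclassified\b"),          # e.g. "unclassified Salticidae"
    re.compile(r"\bBOLD:"),                  # names carrying a Barcode of Life identifier
    re.compile(r"\benvironmental sample"),
]


def is_sample_like(name):
    """True when a name matches any of the assumed sample patterns."""
    return any(pattern.search(name) for pattern in SAMPLE_PATTERNS)


def filter_taxa(names):
    """Keep only the names that look like real taxa."""
    return [name for name in names if not is_sample_like(name)]


names = [
    "Tetragnatha straminea",
    "Salticidae sp. RBCH101-04",
    "unclassified Salticidae",
    "Leucauge mariana",
]
kept = filter_taxa(names)
print(kept)
```

Keeping the patterns in one list makes it easier to apply the same filter on both the curation and the OWL generation side.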

Lots of interesting spider behavior posted to the web in the past few weeks, but this post is already long enough.  I hope my northern hemisphere readers are enjoying the spring as I have.



Slow spell

Not as much to report recently, though there has been some Arachnolingua work.  The promised update of NCBI taxonomy hasn’t made it to release yet, so I’ve been concentrating on supporting non-NCBI taxa on both the curation side and the OWL generation side.  This will probably also be the mechanism I use for introducing taxa that aren’t Arachnids but are relevant to Arachnid behavior (predator, prey, substrate, etc.) since I don’t really want to manage all of NCBI (or even NCBI pruned of all the non-organismal taxonomic terms).  The changes to Arachadmin are partially done and working, though merging these into the set of available terms at least needs some testing with actual curation of behavior.

I’ve also added the Environmental Ontology (EnvO) and the Population and Community Ontology (PCO) to the set of support ontologies.  The former will, hopefully, provide some substrates as well as conditions that might affect behavior.  The PCO provides terms for populations (and groups of e.g., social spiders) and will provide a link between population level processes (e.g., juvenile dispersal) and individual behaviors (e.g., ballooning).  I have been somewhat involved in the latter, and had the pleasure of meeting the main developer of EnvO, Pier Luigi Buttigieg, at the February RCN summit.

The other task I’ve been working on is getting the OwlTools incorporated into the OwlBuilder generator.  I seem to have resolved my maven issue, at least for local builds.  I still need to figure out if I can define the project to allow maven to find the OwlBuilder repository somewhere when I try building on Travis.  That still remains, as does integrating the ontology release tool with my own code.  Although arachb will never actually be an OBO ontology, many of the OBO foundry standards are relevant and appropriate and if I can automate compliance with as many of their standards as possible, it will be a good thing.

Finally, not much progress on the annotation semantics work.

Arachnolingua was not affected by the HeartBleed bug we all heard so much about this past week.  Arguably this is for all the wrong reasons.   Arachnolingua doesn’t use SSL directly or even have certificates;  if the resource were actually authoritative about something, it would be worthwhile to support secure communications to prevent man-in-the-middle spoofing.  Likewise, Arachadmin could be set up as a real web tool with authorized curators and the like, but the need for such has not presented itself, though if the right people were interested…  So, because SSL isn’t used, Arachnolingua wasn’t affected.  Nevertheless, I did update the server to the patched version of the SSL library.  Meanwhile, it looks like I should update my other passwords (AWS, Namecheap, here, etc.).  Some of these have already been fixed.

I’m already looking forward to the summer – Evolution2014/iEvoBio and Animal Behavior 2014.  The plan is definitely to do a lightning talk for iEvoBio and either a different lightning talk (different audience and emphasis) for Animal Behavior, or maybe a poster.  Arachnolingua is alive and well, even if the last few weeks have been a bit of hibernation.






Updates, taxonomy, and post-publication review

I’m starting this hoping it will be quick, but thinking about it this morning, I realize there is a lot of ground I could be covering here.  So here goes…

Updates: Last week I tweaked the ethogram (taxonomy) view so that entering the name of a higher level taxon will retrieve behaviors for all included (subsumed) taxa.  This is implemented in the simple, non-elegant way – crawl the tree and retrieve the annotations, using SPARQL for navigating, with the control all implemented in java.  Of course traversing the tree has one advantage over a reasoner query to retrieve all included taxa – the results are guaranteed to come back in some sort of tree traversal order.  It works (try ‘Tetragnatha‘), but it is a bit slow.  I’ve also configured a more capable server, but haven’t deployed it yet, so be patient with these queries (there are some that seem to require 2-3 minutes to complete, I’ll let you figure out which).
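
The crawl itself is nothing more than a pre-order tree walk.  Here is a sketch in Python (the real control code is java), with plain dictionaries standing in for the SPARQL lookups of child taxa and annotations; the behavior labels are placeholders, not curated data.

```python
def collect_behaviors(taxon, children, annotations):
    """Pre-order walk: report the taxon's own annotations, then each child subtree."""
    results = [(taxon, behavior) for behavior in annotations.get(taxon, [])]
    for child in children.get(taxon, []):
        results.extend(collect_behaviors(child, children, annotations))
    return results


# Stand-ins for the two kinds of lookup; in the real tool each would be a SPARQL query.
children = {
    "Tetragnatha": ["Tetragnatha straminea", "Tetragnatha versicolor"],
}
annotations = {
    "Tetragnatha straminea": ["behavior A", "behavior B"],
    "Tetragnatha versicolor": ["behavior C"],
}

ethogram = collect_behaviors("Tetragnatha", children, annotations)
for taxon, behavior in ethogram:
    print(taxon, "-", behavior)
```

Because every taxon costs at least one lookup, the running time grows with the size of the subtree, which is consistent with the slow responses for large groups.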

Taxonomy: There’s not a lot new to report here – OpenTree has been keeping me busy these past few weeks.  I have been doing some more curation tool work to support taxa outside of NCBI and, thanks to Chris Mungall and James Overton, there will soon be a new OWL rendering of the NCBI taxonomy, which should make its way into the backend database soon.  I’m still tracking the addition of Arachnid taxa into NCBI – the majority of updates seem to be sample records, which won’t help with behavior, though new species for ticks and spiders are trickling in as well.

Also, yesterday was Taxonomist Appreciation Day.  Although I have dabbled in taxonomy informatics (TDWG, VTO, a bit in OpenTree, as well as the taxonomy work here) I would never consider myself to be a taxonomist.  I do, as should any biologist, appreciate and thank the generations of taxonomists in the 250+ years since Linnaeus who have brought order and names for the millions of species we share this planet with.

Curation and Post Publication Review: A couple of items I found on Twitter over the past few days have sparked an interesting thought.  The first was a paper that looked at how curators of the UniProtKB database responded to a changing understanding of the activity of the SiRT-5 protein.  Initially this protein was understood to exhibit deacetylase activity, based primarily on documented activity of other members of the family and some in vitro assays that demonstrated the deacetylase activity.  More recent papers have documented that the in vivo activity of this protein is more likely to be desuccinylation.   The paper describes how annotations in the UniProtKB were modified to incorporate both classes of activity in the appropriate contexts, providing a review process for the earlier reports in light of later results.  Thus the curation process provides a post-publication, albeit specialized, peer review.

This is relevant in light of this post I saw this morning on the likely limits of post-publication peer review.  Now, the particular papers discussed in the UniProt example were published in high profile journals such as Cell and Science, so the particular case does not speak against the 1% notion mentioned in the Dynamic Ecology post.  But not all curation is focused on the sort of topics that make it into the elite 1% of published papers.  My publication database does have a few papers from Science, Nature and one or two other high profile publications.  But the majority come from places such as the Journal of Arachnology, Animal Behavior, or lesser known journals from Japan or Latin America.  This leads me to a somewhat more optimistic conclusion about the future of post-publication peer review than Jeremy Fox.

Next week, I hope to discuss some of the papers I’m in the process of adding and possibly return to the issue of front-end data stores.

A bit of curation

Over the weekend I spent some time messing with publications and merging author strings and made some progress, but decided it wasn’t as much of a priority as shaking down the existing system by adding some more annotations.  So last night I added 7 more annotations from Aiken and Coyle’s 2000 Tetragnatha survey.  This added variety in behavior, taxonomy and anatomy as well as forcing me to confront the lack of vocabulary for substrates.  In particular there are annotations for prey handling and wrapping.  Since the paper doesn’t explicitly identify the prey involved, and assuming anything beyond arthropod would be a stretch (even arthropod isn’t 100% certain), it may be best to identify a term for prey from an ecological ontology (maybe something to pursue at the PCO workshop in Tucson I’ll be attending in two weeks).

I ran into a couple of minor problems, one of which appeared as a crash caused by a publication that had an empty string rather than a NULL in its database DOI field.  The more interesting fix was to add the code for pulling in parents and annotations from anatomy terms (I missed that in the first pass and just got lucky with the two anatomy terms I used).  This revealed itself when I loaded the OWL file into Protege and saw that ‘whole organism’ was no longer labeled (it appeared by its OBO identifier).  In retrospect this is a little strange, since ‘whole organism’ was one of two anatomy terms that were used with the first two annotations; however, adding the parent and annotation (e.g., rdfs:label) extraction seems to have resolved the problem.
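
The empty-string case is easy to guard against by normalizing the field once, at the point where it is read.  A sketch in Python, with hypothetical function names and a placeholder local identifier; it is not the fix as actually committed.

```python
def normalize_doi(value):
    """Return a stripped DOI, or None when the field is NULL, empty, or blank."""
    if value is None:
        return None
    value = value.strip()
    return value or None


def publication_identifier(doi, local_id):
    """Prefer the DOI when one is present; otherwise fall back to a local identifier."""
    doi = normalize_doi(doi)
    if doi is not None:
        return "doi:" + doi
    return local_id


# "local-id-1" is a placeholder, not the format of a generated arachb identifier.
print(publication_identifier("", "local-id-1"))
print(publication_identifier(" 10.1000/example ", "local-id-2"))
```

Treating NULL, the empty string and whitespace alike means downstream code only ever has to test for None.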

Yesterday I got a query on this blog about my use of the ELK reasoner.  I paid a bit more attention to what it was reporting last night.  It is currently run over the merge of all the support ontologies (which constitutes the import closure of the ontologies that actually get used in annotation) and allows querying of the subsuming (parent) classes and, in the case of taxonomy, the subsumed (child) classes.  These queries determine what will be pulled into the target, which is represented as a separate OWL ontology.  Now this fairly large collection of 11 ontologies covers a range of expressiveness from AL (what Protege calls the base attributive language) through to a number of ontologies that Protege reports as SIQ.  Apparently these are complex enough to cause ELK to complain and ignore a number of axioms.  For the present purpose the reasoning is sufficient even if it may be incomplete.

What I haven’t done yet, and probably should, is run ELK across the generated target KB and see whether there are problems.  I noticed that running FACT++ within Protege did reduce the number of root concepts when I reviewed the output of OWLBuilder, so there is something to be gained.

In any case, there are now short ethogram listings for the genera Tetragnatha and Deinopis, as well as some new annotations for T. straminea.  I’m not sure whether to tackle improved display for the ethogram table or try a behavior catalog (behavior hierarchy with taxon counts) or an anatomical catalog yet.

Also, the response time is starting to be noticeable (generally 1.5-2 sec for ethogram queries).  It may be that speed will bump me up to the next AWS tier, though I had expected memory to be the constraint, but I’m not ready to make that move yet.  Stay tuned.



Where are the spiders?

Some readers may be wondering why a blog titled arachnolingua has so little about spiders, to say nothing about the numerous other arachnids (you know: mites, ticks,  scorpions, harvestmen, etc.)   It’s true, this has been all about databases and OWL and website building – where are the behavior and the beasties?  Well, those will have to wait until I have the infrastructure put together enough to start serious curation.  I have a nice collection of literature already collected and there is some curated data that is making its way into the database, though I’ve been cautious in case something breaks before things settle down.  There is some curated information up on the website – if you go to the testing page, and hit the ‘list all publications’ button, it will display a table with identifiers for the 559 publications already in the database.  If you scroll down, about half-way through the list, you’ll start seeing doi’s – feel free to check them out.  The other publications need arachb identifiers as they don’t seem to have doi’s.

Meanwhile, if you’re looking for spider behavior, have a look at Spiderbytes, a new blog by Catherine Scott, which so far is turning out to be a nice mix of natural history and taxonomy.

NBO and all its pieces

This morning I tested the code that loads ontologies that arachnolingua references in assertions.  These are currently the spider biology ontology (SPD), the NCBI taxonomy (NCBITAXON), the Evidence Codes Ontology (ECO), and of course the NeuroBehavior Ontology (NBO).  I’ve wrestled with the NCBI taxonomy for years (e.g., as the basis of several parts of Phenoscape’s Vertebrate Taxonomy Ontology) and know that it is both large and filled with terms that will be either out of scope (e.g., environmental samples and probably most bacteria, viruses, and marine organisms) or simply unusable (e.g. children of ‘unclassified taxa’ such as Salticidae sp. RBCH101-04, a child of ‘unclassified Salticidae’), which exist only as a genetic sample, never to exhibit behavior.  In the frontend, I’ve attacked the scope of NCBI by parsing the RDF/XML OWL rendering (thanks to Chris Mungall), building a tree just long enough to trim down to the Arachnid clade, and trimming any children of taxa named ‘unclassified XXX’.  This (incorrect) solution gives a list of about 11,000 taxa, small enough not to overload the drop-down list.  Of course many non-arachnid taxa will be needed for curation, whether as prey, predators, parasites, hosts, or simply substrates (e.g., plants).  I expect these will be introduced using the mechanism I’ve planned for Arachnids not currently in NCBI – the ability to add names, list an authority (e.g., World Catalog of Spiders), and provide enough parentage to link to a higher level group known to NCBI.
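
The trimming step can be sketched in a few lines of Python.  This is an illustration rather than the arachadmin loader: it assumes the parsed taxonomy is already available as a dictionary from each taxon to its children, and it drops each ‘unclassified’ node along with its subtree.

```python
def arachnid_terms(children, root="Arachnida"):
    """List the taxa under root, skipping 'unclassified ...' nodes and their subtrees."""
    kept = []
    stack = [root]
    while stack:
        taxon = stack.pop()
        if taxon.startswith("unclassified "):
            # Its children are never pushed, so the whole subtree is dropped.
            continue
        kept.append(taxon)
        # Reversed so that children are visited in their listed order.
        stack.extend(reversed(children.get(taxon, [])))
    return kept


# A toy fragment of the taxonomy; only the Salticidae names come from NCBI examples above.
children = {
    "root": ["Arachnida", "Insecta"],
    "Arachnida": ["Araneae"],
    "Araneae": ["Salticidae", "Tetragnathidae"],
    "Salticidae": ["unclassified Salticidae"],
    "unclassified Salticidae": ["Salticidae sp. RBCH101-04"],
}

terms = arachnid_terms(children)
print(terms)
```

Starting the walk at the Arachnid clade is what keeps everything outside it, such as insects, off the list.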

So NCBI is big, but a relatively well understood problem.  This morning NCBI took almost a minute to load from a local copy – this seems slow, but this was in owlbuilder, which was reading with the OWLAPI.  Not only slow, but memory hungry – I only got good performance after bumping the virtual size of the jvm to 4 GB.  After NCBI, owlbuilder attempted to load NBO and this is where things got messy.  Unlike the simple ‘build and filter a list of terms’ that the arachadmin editor uses, OWLAPI honors all the import statements, which means, as it turns out, that NBO triggers loading of the Chemical Entities of Biological Interest (CHEBI), the GeneOntology (GO), the Phenotypic Quality Ontology (PATO), and Uberon, an umbrella anatomy ontology that is intended to cover all Bilateria, but is currently more vertebrate than arthropod focused.  Of these, CHEBI will likely be of interest given the number of chemical communication papers I’ve identified, and GO is likely to be useful as an alternative source of behavior terms.  Uberon may be more useful in the future, but currently there is no integration with the spider ontology.  I have been remiss up to now in not including the tick gross anatomy ontology (TADS), but that should be added to the frontend before I start curating relevant papers.  Perhaps Uberon will cover Chelicerata by the time serious curation of Acari papers starts.  PATO is another question altogether – it’s an ontology of phenotypic qualities, which includes a lot of behavior qualities, but currently Arachnolingua is not focussed on capturing entity-quality annotations (as, e.g., Phenoscape is).  Loading PATO is actually where the load died with a network timeout this morning.  Looks like I will need to support local caching for the backend, and I’m considering using a common cache directory for both the back and frontend for the time being (I’m in no rush to move the backend processing to AWS).

So, it looks like I’ll be adding a local cache location to the source_ontology table in arachadmin.
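
One way such a cache column might be used is sketched below in Python (owlbuilder itself is java).  The function name, the example URL and the file names are hypothetical; the idea is simply to prefer a local copy when one exists and fall back to the remote source otherwise.

```python
import tempfile
from pathlib import Path


def ontology_location(source_url, cache_path):
    """Prefer an existing local copy; fall back to the remote URL."""
    if cache_path:
        cached = Path(cache_path)
        if cached.is_file():
            return cached.resolve().as_uri()
    return source_url


# Demonstration with a throwaway cache directory and a made-up source URL.
with tempfile.TemporaryDirectory() as cache_dir:
    cached_copy = Path(cache_dir) / "example.owl"
    cached_copy.write_text("<!-- a cached ontology would go here -->")
    hit = ontology_location("http://example.org/example.owl", str(cached_copy))
    miss = ontology_location("http://example.org/example.owl",
                             str(Path(cache_dir) / "missing.owl"))

print(hit)
print(miss)
```

A loader that resolves every source this way never touches the network for an ontology that has already been cached, which would avoid the kind of timeout seen with PATO.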