Summer Ontologizing

I’ve been quiet since April, but I’ve also been pretty busy. I’m still fighting with the reimplementation of claim editing (a big, messy web page that I would simplify further if I could figure out how). I have also been focusing my efforts on a new ontology specifically for spider behavior (something to fit in between the NBO/ABO and the data in arachnolingua). I gave a talk about it at the 20th International Congress of Arachnology a few weeks ago. The slides, rendered as PDF, are available here.

There is a link to the work in progress on the arachnolingua home page.   It is currently just the initial google doc sheet I used to collect usages across the two source texts.  I am finishing the first cleanup pass over the data and will provide an updated (and cleaned and better organized) sheet linked from a proper landing page in the coming week.

 

Post ABS activities

The ontology workshop preceding ABS (August 7-8) was quite successful and the three other organizers and I should soon be planning the next meeting, which will be the four of us meeting with other ontology types from the OBO community. The poster at ABS wasn’t a huge success, but several people came by and most seemed to understand what I was trying to do, even if they hadn’t heard of NBO previously. The take home for me was that having a relatively strong anatomy ontology helped overcome some of the annotation problems caused by NBO’s (lack of) coverage for spider related behaviors.

So what have I been up to since ABS? Most of my arachnolingua focus has been on attempting to properly support participant expressions, especially to be able to represent individuals properly. There are numerous challenges here. For example, how to represent a set (one or more) of legs that are part_of an individual spider. ‘Some leg part_of Spider001’ doesn’t seem to fly, at least when I try to construct it in Protege, so I’m leery that even if the OWLAPI lets me create such an expression, other tools and reasoners will either ignore it or choke on it. For the moment I am punting on this issue – specify an individual leg; if there are others explicitly mentioned, they can be specified as additional individuals, otherwise saying one leg was involved in an act is an error of omission, which seems to be the lesser evil.

For the moment, this support is focusing on the arachadmin database and editor.  I will worry about OWL generation once the representation is stable in the database, which will, in turn, depend on what is required to sensibly build statements in the editor.

 

Lots of little things

The pace of Arachnolingua development has picked up a bit. After adding the first set of annotations for a non-NCBI taxon (Leucauge mariana in a courtship description in Eberhard and Huber (1998) – a rich source of annotatable material), the need to support these curator-added taxa more completely became obvious. The curation side was a matter of filling in the details for such taxa – the hook was already there to make a term, but nothing identified it as a taxonomic name or recorded its source authority (in this case the World Spider Catalog). That would suffice to make it available as a taxonomy term, but I also needed to add a mechanism to add one or more levels of parentage to connect the term to something already in NCBI. Fortunately in the case of L. mariana, the parent genus was in NCBI so chaining wasn’t required. Actually linking via parents was not strictly necessary for the curation process – it just needs to capture the parentage to pass to the OWL generation tool. Before I finished those changes, the OWL generator created a number of knowledge base files that attached L. mariana to the root OWL term. The changes are now finished, and L. mariana and a small subset of the behavior patterns documented by Eberhard and Huber are now annotated in Arachnolingua.

I also followed through on a long overdue change that rippled throughout the implementation. I have been using the term ‘assertion’ throughout the code to refer to a conjunction of taxon, behavior, body part, publication, and eventually evidence. The problem with calling these ‘assertions’ is simply that the OWL specification talks about assertions that are much smaller – an Arachnolingua assertion can be decomposed into about a dozen OWL assertions or RDF triples (formally, an RDF triple is one (probably the most common) way of implementing an OWL assertion). So, to avoid confusion with the OWL sense of assertion (and likewise OWL has a subset of assertions referred to as annotations, so switching the Arachnolingua terminology to annotation would merely change the problem), I started looking for synonyms.
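To make the size difference concrete, here is a minimal sketch of how one Arachnolingua-level record might expand into several triples. The namespace, the property names, and the Claim record are all hypothetical, made up for illustration; they are not the vocabulary the OWL generator actually emits, and a real expansion would be larger.

```python
from typing import NamedTuple

# Hypothetical namespace and property names, for illustration only.
EX = "http://example.org/arachb/"


class Claim(NamedTuple):
    """One curated record: taxon, behavior, body part, publication."""
    claim_id: str
    taxon: str
    behavior: str
    body_part: str
    publication: str


def claim_to_triples(claim: Claim) -> list[tuple[str, str, str]]:
    """Expand one record into (subject, predicate, object) triples."""
    node = EX + claim.claim_id
    return [
        (node, "rdf:type", EX + "Claim"),
        (node, EX + "about_taxon", claim.taxon),
        (node, EX + "about_behavior", claim.behavior),
        (node, EX + "about_body_part", claim.body_part),
        (node, EX + "stated_in", claim.publication),
    ]


example = Claim("claim_1", "ex:taxon", "ex:behavior", "ex:part", "ex:pub")
for triple in claim_to_triples(example):
    print(triple)
```

Even this stripped-down version turns one record into five statements, which is why reusing the OWL word for the larger unit invites confusion.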

The best synonym for ‘assertion’ that I found seemed to be ‘claim’ – it has the advantage of not sounding definitive and suggesting the need for provenance (talking about a claim leads easily to the question ‘who’s making this claim?’). Observation was a tempting second choice – most of the ‘assertions’ in arachnolingua are observations, but not all, and if I were to use the term observation in Arachnolingua, I would prefer it to refer to claims at the level of individuals – individual spiders doing a particular something on a particular substrate. So, in the software taxonomy, I could look at observations as a subclass of claims – though I haven’t yet seen the need to implement such a subclass.

After going back and forth on whether the switch made sense, I went ahead and implemented it everywhere – curation, database, and owl generator now refer to claims.   In the curation screen, it seemed worthwhile to include observation as that was such a strong alternative:

Image

Changes like this are ugly, but only get uglier the longer they are put off.

Finally, I’ve spent some time on the query component of the website. As I’ve described before, I’ve implemented an API that supports queries for the implemented search pages (currently just taxon and behavior). The API generates SPARQL to query the in-memory Sesame triple store and returns the unprocessed JSON that Sesame generates. Turning this JSON into results tables is pretty straightforward (though it represents the only ‘production’ JavaScript I’ve written to this point). The SPARQL query is rather slow, particularly for queries to higher taxonomic groups: queries to Tetragnatha straminea are fast, queries to Arachnida are much slower and definitely not recommended at this time. I took a long hard look at the SPARQL I had written and found a couple of things I thought would really speed it up – net result was only a 20% speedup for the API call (measured on the browser side using Firebug). I’m pretty sure I need to be traversing the taxonomy more intelligently, maybe using some server side inference.
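For readers curious about the JSON-to-table step, here is a minimal sketch. It assumes the store returns the standard SPARQL JSON results layout (a ‘head’ with ‘vars’, and ‘results’ with ‘bindings’); the site does this in JavaScript, so the Python below and the function name bindings_to_rows are only illustrative.

```python
import json


def bindings_to_rows(sparql_json: str) -> tuple[list[str], list[list[str]]]:
    """Turn a SPARQL JSON result document into column names and table rows.

    Variables that are unbound in a given solution become empty strings.
    """
    doc = json.loads(sparql_json)
    columns = doc["head"]["vars"]
    rows = []
    for binding in doc["results"]["bindings"]:
        rows.append([binding.get(col, {}).get("value", "") for col in columns])
    return columns, rows


sample = json.dumps({
    "head": {"vars": ["taxon", "behavior"]},
    "results": {"bindings": [
        {"taxon": {"type": "literal", "value": "Tetragnatha straminea"}},
    ]},
})
columns, rows = bindings_to_rows(sample)
print(columns)
print(rows)
```

Each row lines up with the column list, so rendering a table is just a loop over the rows.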

I also added lexical filtering to terms from NCBI. There are a lot of ‘taxa’ in NCBI that reflect either samples or unidentified species. As samples are individuals (or collections of individuals) their identifiers don’t belong in the taxonomy for Arachnolingua’s purposes. This is completely aside from whether any behavior data was ever collected from these individuals – I tracked down some underlying GenBank submissions and some were from old collaborators and friends in the Maddison lab so there might be some video data for some of these submissions. The worst offender in sheer quantity was the Barcode of Life and its BOLD submissions. I know NCBI is planning to move samples to a separate database, but I don’t know whether this only applies to new records or, if existing records are moved, how quickly this will happen. So best to filter these out consistently in both curation and OWL generation. Removing these terms did provide some query speedup for the worst cases.
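A minimal sketch of what name-based filtering of this kind can look like. The patterns are guesses at the sort of names involved (unidentified species, ‘unclassified’ bins, BOLD-derived labels), not the rules the curation tool or the OWL generator actually uses, and is_usable_taxon_name is a hypothetical helper.

```python
import re

# Illustrative patterns only; the real filter rules are not listed in this post.
SAMPLE_PATTERNS = [
    re.compile(r"^unclassified\b", re.IGNORECASE),  # e.g. 'unclassified Salticidae'
    re.compile(r"\bsp\.\s"),                        # e.g. 'Salticidae sp. RBCH101-04'
    re.compile(r"\bBOLD:"),                         # assumed form of BOLD-derived names
]


def is_usable_taxon_name(name: str) -> bool:
    """Return False for names that look like samples or unidentified species."""
    return not any(pattern.search(name) for pattern in SAMPLE_PATTERNS)


names = [
    "Tetragnatha straminea",
    "Salticidae sp. RBCH101-04",
    "unclassified Salticidae",
]
print([name for name in names if is_usable_taxon_name(name)])
```

Keeping the patterns in one list makes it easier to apply the same filter in both curation and OWL generation.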

Lots of interesting spider behavior posted to the web in the past few weeks, but this post is already long enough.  I hope my northern hemisphere readers are enjoying the spring as I have.

 

 

Slow spell

Not as much to report recently, though there has been some Arachnolingua work. The promised update of NCBI taxonomy hasn’t made it to release yet, so I’ve been concentrating on supporting non-NCBI taxa on both the curation side and the OWL generation side. This will probably also be the mechanism I use for introducing taxa that aren’t Arachnids but are relevant to Arachnid behavior (predator, prey, substrate, etc.) since I don’t really want to manage all of NCBI (or even NCBI pruned of all the non-organismal taxonomic terms). The changes to Arachadmin are partially done and working, though merging these into the set of available terms at least needs some testing with actual curation of behavior.

I’ve also added the Environmental Ontology (EnvO) and the Population and Community Ontology (PCO) to the set of support ontologies. The former will, hopefully, provide some substrates as well as conditions that might affect behavior. The PCO provides terms for populations (and groups of, e.g., social spiders) and will provide a link between population level processes (e.g., juvenile dispersal) and individual behaviors (e.g., ballooning). I have been somewhat involved in the latter, and had the pleasure of meeting the main developer of EnvO, Pier Luigi Buttigieg, at the February RCN summit.

The other task I’ve been working on is getting the OwlTools incorporated into the OwlBuilder generator.  I seem to have resolved my maven issue, at least for local builds.  I still need to figure out if I can define the project to allow maven to find the OwlBuilder repository somewhere when I try building on Travis.  That still remains, as does integrating the ontology release tool with my own code.  Although arachb will never actually be an OBO ontology, many of the OBO foundry standards are relevant and appropriate and if I can automate compliance with as many of their standards as possible, it will be a good thing.

Finally, not much progress on the annotation semantics work.

Arachnolingua was not affected by the Heartbleed bug we all heard so much about this past week. Arguably this is for all the wrong reasons. Arachnolingua doesn’t use SSL directly or even have certificates; if the resource were actually authoritative about something, it would be worthwhile to support secure communications to prevent man-in-the-middle spoofing. Likewise, Arachadmin could be set up as a real web tool with authorized curators and the like, but the need for such has not presented itself, though if the right people were interested… So, because SSL isn’t used, Arachnolingua wasn’t affected. Nevertheless, I did update the server to the patched version of the SSL library. Meanwhile, it looks like I should update my other passwords (AWS, Namecheap, here, etc.). Some of these have already been changed.

I’m already looking forward to the summer – Evolution2014/iEvoBio and Animal Behavior 2014. The plan is definitely to do a lightning talk for iEvoBio and either a different lightning talk (different audience and emphasis) for Animal Behavior, or maybe a poster. Arachnolingua is alive and well, even if the last few weeks have been a bit of hibernation.

 

 

 

 

 

RCN summit and some updates

I haven’t posted in a while because I was at the Phenotype RCN summit the week before last. The week started with two workshops covering the Population and Community Ontology (PCO) and the Biological Collections Ontology (BCO). I have been interested in the PCO for several years, both because my PhD project involved population level processes (‘culture’ in a population of birds) and because of population related processes in spiders (social spiders, but also predation and predator defense which, by involving multiple individuals, are at the boundary of behavior and population/community processes).

One of the high points of the RCN summit was seeing Martin Ramirez again.   He has been continuing to work with the spider anatomy ontology and showed an interesting comparative analysis of complexity of reproductive structures using derived species-level anatomy ontologies.  I also showed him a bit of arachnolingua and we left open the possibility of more collaboration in the coming year.

I also got to have some brief discussions with Chris Mungall about NCBI taxonomy and the possibility of more frequent updates to the generated OWL from the source.  I expect that other projects would benefit from more frequent updates since not many projects are in the position to support a separate taxonomy ontology.

There has also been progress on followup workshops for the NBO.  More here and in the other blog as updates warrant.

Since I returned, I’ve had a couple of hours to clean up a number of issues on the web page: most significantly, I’ve added JavaScript code to turn the URLs in the ethogram results table (currently just the publications) into links. I’ve also generated simple publication labels (author list + publication year) so tracking the results doesn’t require decoding a DOI. The publication column now shows the label and links to the DOI (which works for both publications with annotations in the KB). Eventually I will change the link to point to a publication summary page which will link to the DOI when it is available, but that will remain a low priority until I start annotating publications without DOIs.

An unexpected, but reasonable side effect is the change in behavior of the display of taxa without annotations.  They now treat the purls as links, which unfortunately don’t resolve.  This doesn’t seem to be something to fight, so I’ll convert them into links back to the NCBI taxon page on the fly (just extract the taxon id and generate a new URL).
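A minimal sketch of that conversion. It assumes the purls follow the usual OBO pattern ending in ‘NCBITaxon_’ plus a numeric id, and that the NCBI Taxonomy Browser URL takes that id as a query parameter; both patterns, and the function name purl_to_ncbi_url, are assumptions here rather than the code running on the site (which would be JavaScript).

```python
import re
from typing import Optional

# Assumed URL shapes; adjust if the KB uses a different purl form.
NCBI_PURL = re.compile(r"^http://purl\.obolibrary\.org/obo/NCBITaxon_(\d+)$")
NCBI_PAGE = "https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id={}"


def purl_to_ncbi_url(purl: str) -> Optional[str]:
    """Map an NCBITaxon purl to an NCBI taxon page URL, or None if it doesn't match."""
    match = NCBI_PURL.match(purl)
    if match is None:
        return None
    return NCBI_PAGE.format(match.group(1))


# 12345 is an arbitrary id used only to show the rewrite.
print(purl_to_ncbi_url("http://purl.obolibrary.org/obo/NCBITaxon_12345"))
```

Returning None for anything that doesn’t match leaves non-NCBI terms (such as curator-added taxa) to be handled separately.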

A more immediate issue is getting the descriptions of the behavior to display.  This isn’t hard, but I’m taking a day or two to give some thought to the best way to model this (the simplest would be an rdfs:label or rdfs:comment on the annotation, but it would be better to use some IAO vocabulary to attach it to the node that represents the individual part of the publication being annotated).

I’ve pulled out the hardcopies of publications I made a year or so back; getting ready for some more sustained curation activity (as well as more changes to arachadmin).

 

 

 

It’s Darwin Day – I’ll celebrate by updating the site

I’ve spent a couple of hours cleaning up the arachnolingua site. In particular, the link to the old ontology file is gone from the top menu bar – that file hasn’t been valid in nearly a year and even the current file doesn’t deserve that level of billing. If you want to see the OWL file, there is a link on the project page (available from the ‘about’ menu item). The new OWL file is, I’m happy to announce, a reasoned-over version, with inferred subclass, equivalent class, and class assertion axioms added to the ontology. ELK didn’t complain (unlike when it reasons over the mashup of supporting ontologies) and a quick inspection in Protege showed no apparent changes when the resulting ontology was re-reasoned with FaCT++ (unlike when the unreasoned ontology was loaded – several top level classes disappeared after reasoning). The lack of improvement resulting from reasoning in Protege doesn’t say much about FaCT++ vs. ELK, simply that the KB is expressively simple enough that it is covered by the portion of OWL-EL that ELK reasons over. It’s also still quite small, so the speed advantage of ELK isn’t very obvious in the second pass. Hopefully arachb will keep growing to the point where this is no longer the case.

The project page has also been updated to link to the source ontologies directly rather than the MIREOT’ed owl files that went with the old KB file.  Perhaps these and other questionable links have something to do with all the hits on the projects page (and only the projects page) from Russian sources that my logs the past week or so have been showing.

The other big change is the attention I’ve given to the taxonomy status page. First, a word of explanation: the taxonomy status page is a list of taxa I’ve encountered during curation (so the publications listed on this page have undergone some review and behavior annotation and will likely be the next ones to appear in the KB) that do not appear in the NCBI taxonomy. The list here is over a year old, but none of the names listed here have appeared in NCBI in the intervening months. In some cases (e.g., synonymy) this is to be expected. In any case, I’m using the World Spider Catalog as my authority for cases where names don’t appear in NCBI. Because the WSC text pages don’t contain full species names and lists of genera are split by family, I’ve used Wikipedia and EoL to get family names for these taxa. Looks like this may trigger some EoL contributions on my part.

In any case, enjoy and feel free to comment on what you see, suggest papers to review (or bump up the queue), or make any other suggestions.

 

 

 

A bit of curation

Over the weekend I spent some time messing with publications and merging author strings and made some progress but decided it wasn’t as much of a priority as shaking down the existing system by adding some more annotations. So last night I added 7 more annotations from Aiken and Coyle’s 2000 Tetragnatha survey. This added variety in behavior, taxonomy and anatomy as well as forcing me to confront the lack of vocabulary for substrates. In particular there are annotations for prey handling and wrapping. Since the paper doesn’t explicitly identify the prey involved, and assuming anything beyond arthropod (which isn’t even 100% certain) would be unwarranted, it may be best to identify a term for prey from an ecological ontology (maybe something to pursue at the PCO workshop in Tucson I’ll be attending in two weeks).

I ran into a couple of minor problems, one of which appeared as a crash caused by a publication that had an empty string rather than a NULL in its database DOI field. The more interesting fix was to add the code for pulling in parents and annotations from anatomy terms (I missed that in the first pass and just got lucky with the two anatomy terms I used). This revealed itself when I loaded the OWL file into Protege and saw that ‘whole organism’ was no longer labeled (it appeared by its OBO identifier). In retrospect this is a little strange, since ‘whole organism’ was one of two anatomy terms that were used with the first two annotations; however, adding the parent and annotation (e.g., rdfs:label) extraction seems to have resolved the problem.
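The empty-string DOI problem is the kind of thing a small normalizing step guards against. A minimal sketch, using a hypothetical helper named normalize_doi rather than the actual arachadmin or owlbuilder code:

```python
from typing import Optional


def normalize_doi(raw: Optional[str]) -> Optional[str]:
    """Treat NULL, empty, and whitespace-only DOI fields the same way: as missing."""
    if raw is None:
        return None
    cleaned = raw.strip()
    return cleaned or None


for value in (None, "", "   ", " 10.1000/example "):
    print(repr(value), "->", repr(normalize_doi(value)))
```

Downstream code then only has to check for None before building a DOI link. The DOI shown is a made-up placeholder.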

Yesterday I got a query on this blog about my use of the ELK reasoner. I paid a bit more attention to what it was reporting last night. It is currently run over the merge of all the support ontologies (which constitutes the import closure of the ontologies that actually get used in annotation) and allows querying of the subsuming (parent) classes and, in the case of taxonomy, the subsumed (child) classes. These queries determine what will be pulled into the target, which is represented as a separate OWL ontology. Now this fairly large collection of 11 ontologies covers a range of expressiveness from AL (what Protege calls the base attributive language) through to a number of ontologies that Protege reports as SIQ. Apparently these are complex enough to cause ELK to complain about, but ignore, a number of axioms. For the present purpose the reasoning is sufficient even if it may be incomplete.

What I haven’t done yet, and probably should, is run ELK across the generated target KB and see whether there are problems. I noticed that running FaCT++ within Protege did reduce the number of root concepts when I reviewed the output of OWLBuilder, so there is something to be gained.

In any case, there are now short ethogram listings for the genera Tetragnatha and Deinopis, as well as some new annotations for T. straminea. I’m not sure whether to tackle improved display for the ethogram table or try a behavior catalog (behavior hierarchy with taxon counts) or an anatomical catalog yet.

Also, the response time is starting to be noticeable (generally 1.5-2 sec for ethogram queries). It may be that speed will bump me up to the next AWS tier, although I had expected memory to be the constraint, but I’m not ready to make that move yet. Stay tuned.

 

 

Playing with ELK (the reasoner)

It’s been a long slog the past few weeks. I’ve been pulling IRIs for taxon, anatomy, substrates (entities that aren’t part_of an actor) that are associated with participants in a behavior, as well as IRIs for behavior and publication for assertions. Yesterday, I finally had all (at least most) of the pieces together – I was now ready to start matching term IRIs against the support ontologies and copying OWL ‘entities’ from the support ontologies into the target ontology that will become the OWL file that’s loaded onto the server.

The first step was to merge all the support ontologies and run a classifier over them – primarily to determine the class hierarchy. The merging went smoothly and didn’t take too long, but running the OWLAPI’s structural reasoner on the 7436381 assertions that resulted from merging the 8 support ontologies seemed a bit too much for it. After 45 minutes on a 4-core i7, I decided it was time to try something else.

ELK has been attracting some interest in the biological ontologies community in the past couple of years as a very fast way to do reasoning for ontologies that can stay within the limits of the OWL-EL language profile. As it turns out, the current version of ELK implements only a subset of that profile, but that subset is more than sufficient for my very limited immediate needs.

The first task for the reasoner was simply to extract the superclass closure (all the classes above) of Arachnida in the NCBI taxonomy hierarchy, followed by all subclasses of the same taxon. Those, along with Arachnida itself (which is a taxonomic class as well as an OWL class), are copied into the target ontology, along with the axioms specifying their super/sub class relations and their labels (= Linnean names).
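For anyone unsure what the two closures amount to, here is a minimal sketch over a toy, single-parent hierarchy. This is plain graph traversal over asserted parent links, not what ELK does (a reasoner computes the hierarchy from the axioms), and the toy taxonomy is heavily abbreviated, with intermediate ranks left out.

```python
from collections import defaultdict

# Toy child -> parent map; a stand-in for the classified hierarchy.
parent_of = {
    "Arthropoda": "cellular organisms",
    "Arachnida": "Arthropoda",
    "Araneae": "Arachnida",
    "Acari": "Arachnida",
    "Tetragnatha": "Araneae",
}


def build_children(parents: dict[str, str]) -> dict[str, set[str]]:
    """Invert the child -> parent map into parent -> children."""
    children: dict[str, set[str]] = defaultdict(set)
    for child, parent in parents.items():
        children[parent].add(child)
    return children


def superclass_closure(term: str, parents: dict[str, str]) -> list[str]:
    """All classes above the term, nearest first."""
    result = []
    while term in parents:
        term = parents[term]
        result.append(term)
    return result


def subclass_closure(term: str, children: dict[str, set[str]]) -> set[str]:
    """All classes below the term, at any depth."""
    found: set[str] = set()
    stack = [term]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


children_of = build_children(parent_of)
print(superclass_closure("Arachnida", parent_of))
print(sorted(subclass_closure("Arachnida", children_of)))
```

The target ontology then gets the term itself plus everything in both closures.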

It all works – owlbuilder is generating an owl file that loads in Protege (after making sure the DOI cleanup was getting called in the right place), and contains a couple of other classes pertaining to an as yet incomplete representation of a posture in Tetragnatha straminea.  Nothing special about this behavior or species (surprised the first species wasn’t a jumping spider?), just the first publication that came up in the literature search all those years ago was about a couple of Tetragnatha species.  You’ll be hearing more about this behavior and a couple of other behaviors in this species and some congeners as I fill in the pieces and start pushing real data to the server.

NBO and all its pieces

This morning I tested the code that loads ontologies that arachnolingua references in assertions. These are currently the spider biology ontology (SPD), the NCBI taxonomy (NCBITAXON), the Evidence Codes Ontology (ECO), and of course the NeuroBehavior Ontology (NBO). I’ve wrestled with the NCBI taxonomy for years (e.g., as the basis of several parts of Phenoscape’s Vertebrate Taxonomy Ontology) and know that it is both large and filled with terms that will be either out of scope (e.g., environmental samples and probably most bacteria, viruses, and marine organisms) or simply unusable (e.g. children of ‘unclassified taxa’ such as Salticidae sp. RBCH101-04, a child of ‘unclassified Salticidae’), which exist only as a genetic sample, never to exhibit behavior. In the frontend, I’ve attacked the scope of NCBI by parsing the RDF/XML OWL rendering (thanks to Chris Mungall), building a tree just long enough to trim down to the Arachnid clade, and trimming any children of taxa named ‘unclassified XXX’. This (incorrect) solution gives a list of about 11,000 taxa, small enough not to overload the drop-down list. Of course many non-arachnid taxa will be needed for curation, whether as prey, predators, parasites, hosts, or simply substrates (e.g., plants). I expect these will be introduced using the mechanism I’ve planned for Arachnids not currently in NCBI – the ability to add names, list an authority (e.g., World Spider Catalog), and provide enough parentage to link to a higher level group known to NCBI.

So NCBI is big, but a relatively well understood problem. This morning NCBI took almost a minute to load from a local copy – this seems slow, but this was in owlbuilder, which was reading with the OWLAPI. Not only slow, but memory hungry – I only got good performance after bumping the virtual size of the JVM to 4 GB. After NCBI, owlbuilder attempted to load NBO and this is where things got messy. Unlike the simple ‘build and filter a list of terms’ that the arachadmin editor uses, OWLAPI honors all the import statements, which means, as it turns out, that NBO triggers loading of the Chemical Entities of Biological Interest (CHEBI), the Gene Ontology (GO), the Phenotypic Quality Ontology (PATO), and Uberon, an umbrella anatomy ontology that is intended to cover all Bilateria, but is currently more vertebrate than arthropod focused. Of these, CHEBI will likely be of interest given the number of chemical communication papers I’ve identified, and GO is likely to be useful as an alternative source of behavior terms. Uberon may be more useful in the future, but currently there is no integration with the spider ontology. I have been remiss up to now in not including the tick gross anatomy ontology (TADS), but that should be added to the frontend before I start curating relevant papers. Perhaps Uberon will cover Chelicerata by the time serious curation of Acari papers starts. PATO is another question altogether – it’s an ontology of phenotypical qualities, which includes a lot of behavior qualities, but currently Arachnolingua is not focussed on capturing entity-quality annotations (as, e.g., Phenoscape is). Loading PATO is actually where the load died with a network timeout this morning. Looks like I will need to support local caching for the backend, and I’m considering using a common cache directory for both the back and frontend for the time being (I’m in no rush to move the backend processing to AWS).

So, it looks like I’ll be adding a local cache location to the source_ontology table in arachadmin.
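A minimal sketch of the caching idea, assuming one file per source URL in a shared cache directory. The function name cached_ontology_path, the hashed file naming, and the pluggable fetch argument are all made up for illustration; the actual backend is Java and would do this through the OWLAPI.

```python
import hashlib
import urllib.request
from pathlib import Path
from typing import Callable


def _download(url: str) -> bytes:
    """Fetch the ontology over the network (the step that timed out)."""
    with urllib.request.urlopen(url, timeout=60) as response:
        return response.read()


def cached_ontology_path(
    url: str,
    cache_dir: Path,
    fetch: Callable[[str], bytes] = _download,
) -> Path:
    """Return a local copy of the ontology, downloading it only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so any source maps to a safe, stable file name.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".owl"
    target = cache_dir / name
    if not target.exists():
        target.write_bytes(fetch(url))
    return target
```

With the path stored alongside each source ontology, the loader can read from disk and only touch the network when a file is missing; deciding when a cached copy is stale is a separate question this sketch ignores.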