Note to self and others

Arachnolingua focuses its OWL expressions on the claim (= assertion, statement) of an individual behavior event or class expression, and properties should start there and work outward.  Thus: courtship_event –has_participant–>palp–part_of–>male–has_type–>Habronattus sp.  There may be consequences for this decision (especially for class level statements), but it is better to be consistent and document the design decision here for now.
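
As a rough illustration of "start at the event and work out", the chain above can be written as a list of triples whose first subject is always the behavior event. This is only a sketch in Python: the function name and the plain-string node labels are hypothetical stand-ins, not the IRIs or code that Arachnolingua actually uses.

```python
# Hypothetical sketch: plain strings stand in for OWL individuals/classes and properties.
def chain_to_triples(root, steps):
    """Turn a root node and a list of (property, node) steps into triples,
    starting at the behavior event and working outward."""
    triples = []
    subject = root
    for prop, obj in steps:
        triples.append((subject, prop, obj))
        subject = obj  # the next property hangs off the node just reached
    return triples


COURTSHIP_CHAIN = chain_to_triples(
    "courtship_event",
    [
        ("has_participant", "palp"),
        ("part_of", "male"),
        ("has_type", "Habronattus sp."),
    ],
)

for s, p, o in COURTSHIP_CHAIN:
    print(f"{s} --{p}--> {o}")
```

Every triple's subject is either the event itself or a node already reached from it, which is the consistency the design decision is meant to guarantee.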

This should eventually make it into the ‘primer’ documents for the curation tool and the database schema as well.  I wonder if there are any tools in Protege for looking at graphs of individuals – maybe lego?

ABS 2015

Still traveling – left for Alaska on June 3, attended the Animal Behavior Society 2015 meeting 10-14 June, stopped over in California, and am now on my way to the American Arachnological Society meeting starting on the 19th.  As usual, Schizocosa was somewhat overrepresented (not just the Uetz and Hebets labs), but widows and tarantulas (African) were also represented.  I also gave a talk, which reviewed the poster results from last year and provided (I hope) motivation for the implementation course I have taken for individuals and narratives.  You can find it here.

Meanwhile, lots of work on OwlBuilder, but things aren’t ready to move on to an update to the web presence.  I do have a new AWS server in reserve for this however.

Curation Update

As promised a few weeks back, I recently (more or less) completed some pretty substantial changes in the curation tool.  Specifically, I have changed the process of specifying ‘participants’ (actors, objects) associated with behavior events.  Participants typically include a spider or its anatomical parts and possibly another organism (mate, prey) or a substrate (silk, plant parts, a surface).  The old version simply allowed choice of a term for (spider) taxon and anatomy as the primary participant and a secondary substrate participant.  This allowed capturing simple events, but was completely inadequate for capturing social or predation interactions.  The old version did have the virtue of simplicity, and may return in some form of ‘wizard tool’ in the future.

The new version runs rather closer to the OWL that ultimately gets generated.  Claims consist of a ‘root’ element that captures the behavior term and a set of participants, each a tree structure (potentially, though most commonly a chain) of participant ‘elements’ that correspond to OWL classes, linked by arcs corresponding to OWL predicates (aka OBO relations).  The tree for a participant more or less corresponds to an OWL class or individual expression.  The interface still requires a great deal of polish, but I’ve added a simple d3 graph display that should make the structure of what’s being created a little clearer.
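
To make that structure concrete, here is a minimal Python sketch of a claim with one participant tree. The class and field names (`Claim`, `Participant`, `Element`, `Arc`) are my shorthand for this post, not the curation tool's actual tables or classes, and `describe` only produces a rough Manchester-like string rather than real OWL.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    term: str  # label of the OWL class this element stands for
    children: List["Arc"] = field(default_factory=list)


@dataclass
class Arc:
    predicate: str  # label of the OWL predicate / OBO relation, e.g. "part_of"
    target: Element


@dataclass
class Participant:
    predicate: str  # arc from the claim root, e.g. "active participant"
    head: Element   # first element of this participant's tree


@dataclass
class Claim:
    behavior: str  # the root element's behavior term
    participants: List[Participant] = field(default_factory=list)


def describe(element):
    """Render an element tree as a rough class-expression-like string."""
    text = element.term
    for arc in element.children:
        text += f" and ({arc.predicate} some {describe(arc.target)})"
    return text


# A chain is just a tree in which every element has at most one child.
palp = Element("palp", [Arc("part_of", Element("male"))])
EXAMPLE = Claim("courtship", [Participant("active participant", palp)])
print(describe(EXAMPLE.participants[0].head))
```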

[Screenshot: claim_page_bottom]

The ‘table of associated participants’ lists each participant using pidgin English and indicates its level of participation: ‘active participant’ or (passive) ‘participant’.  The graph below the table shows the relations between the participant elements (squares) and the root, with predicates indicated by the color of the connecting lines (red = active participant, orange = participant, blue = part of).  The link table on the right will (when I’ve finished implementing it) show more details about the ‘active’ participant element.  This table, along with color changing when an element is selected, remains to be implemented, though the selection mechanism is there.  Like many d3 graphs, node placement is ‘force-driven’ and can be tweaked by dragging with your pointing device.  The graph display is largely a transformation and simplification of this molecule example.
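
For the curious, the data handed to a force-driven d3 graph is typically a list of nodes plus a list of links that refer to nodes by index. The Python sketch below shows one way the server side could build that from a claim's edges. The color mapping follows the legend above; the function name, the edge tuples, and the gray fallback for other predicates are assumptions of mine rather than the tool's actual code.

```python
import json

# Colors as described in the legend; any other predicate falls back to gray (assumed).
PREDICATE_COLORS = {
    "active participant": "red",
    "participant": "orange",
    "part of": "blue",
}


def to_d3_graph(edges):
    """Convert (source_label, predicate, target_label) edges into
    the nodes/links structure used by a force layout."""
    nodes = []
    index = {}  # label -> position in nodes
    links = []
    for source, predicate, target in edges:
        for label in (source, target):
            if label not in index:
                index[label] = len(nodes)
                nodes.append({"name": label})
        links.append({
            "source": index[source],
            "target": index[target],
            "color": PREDICATE_COLORS.get(predicate, "gray"),
        })
    return {"nodes": nodes, "links": links}


GRAPH = to_d3_graph([
    ("courtship", "active participant", "palp"),
    ("palp", "part of", "male"),
    ("courtship", "participant", "female"),
])
print(json.dumps(GRAPH, indent=2))
```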

The corresponding upper part of the page looks like this:

[Screenshot: claim_page_top]

This has been changed relatively little, except that the drop-down selection fields for taxon, anatomy, and environment terms have disappeared; they now appear in the element creation dialog, which is an ajax form in the participant element section.  The text fields for publication taxon, anatomy and substrate remain.  These text metadata fields are associated with the participant as a whole, rather than with the corresponding elements.

I’ve started writing some tests for the curation tool.  Adding Selenium to walk through the screens was relatively straightforward.  Unit tests have been rather more of a challenge, due, in part, to the eccentricities of the web2py framework.  But the issues I’ve encountered with web2py are large enough to deserve a post of their own.  Suffice it to say, there are some doctests in the code, but traditional unit tests remain in the future.

This is what I had been working on, as time permitted, over the second half of 2014.  I’ve now switched back to the OWL generation side, which, as you may remember, is a Java tool that pulls from the database and builds the OWL using the OWLAPI.  There is a lot of work to do there as well.  Since I switched, I’ve started using the EclEmma test coverage module to try to improve the rather pathetic test harness I had for the OWL generation.  At least unit testing has been fairly straightforward for now.

Lots of little things

The pace of Arachnolingua development has picked up a bit.  After adding the first set of annotations for a non-NCBI taxon (Leucauge mariana in a courtship description in Eberhard and Huber (1998) – a rich source of annotatable material), the need to support these curator-added taxa more completely became obvious.  On the curation side, this meant filling in the details for such taxa – the hook to make a term was already there, but the term was not identified as a taxonomic name, nor was its source authority recorded (in this case the World Spider Catalog).  Filling these in would suffice to make it available as a taxonomy term, but I also needed a mechanism to add one or more levels of parentage to connect the term to something already in NCBI.  Fortunately, in the case of L. mariana, the parent genus was in NCBI, so chaining wasn’t required.  Actually linking via parents was not strictly necessary for the curation process – it just needs to capture the parentage to pass to the OWL generation tool.  Before I finished those changes, the OWL generator created a number of knowledge base files that attached L. mariana to the root OWL term.  The changes are now finished, and L. mariana and a small subset of the behavior patterns documented by Eberhard and Huber are now annotated in Arachnolingua.
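
The parentage idea can be sketched as a walk up curator-supplied parents until a name already known to NCBI is reached. This Python sketch is illustrative only: the function name, the dictionary of parents, and the set of NCBI names are hypothetical, and the real tools work from database records rather than plain strings.

```python
def link_to_ncbi(name, parents, ncbi_names):
    """Walk up curator-supplied parents until a name already in NCBI is reached.

    Returns the chain of names from the new taxon up to and including
    the NCBI anchor. Raises ValueError rather than silently attaching
    the taxon to the root when no anchor can be found.
    """
    chain = [name]
    current = name
    while current not in ncbi_names:
        if current not in parents:
            raise ValueError(f"{current} has no recorded parent and is not in NCBI")
        current = parents[current]
        if current in chain:
            raise ValueError(f"cycle in recorded parentage at {current}")
        chain.append(current)
    return chain


# One level of parentage is enough here because the genus is already in NCBI.
CURATOR_PARENTS = {"Leucauge mariana": "Leucauge"}
NCBI_NAMES = {"Leucauge", "Tetragnathidae"}
print(link_to_ncbi("Leucauge mariana", CURATOR_PARENTS, NCBI_NAMES))
```

Raising an error when the chain runs out is a deliberate choice in this sketch: it makes the "attached to the root OWL term" failure visible instead of writing it into the knowledge base.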

I also followed through on a long overdue change that rippled throughout the implementation.  I have been using the term ‘assertion’ throughout the code to refer to a conjunction of taxon, behavior, body part, publication, and eventually evidence.  The problem with calling these ‘assertions’ is simply that the OWL specification talks about assertions that are much smaller – an Arachnolingua assertion can be decomposed into about a dozen OWL assertions or RDF triples (formally, an RDF triple is one (probably the most common) way of implementing an OWL assertion).  So, to avoid confusion with OWL assertions (and likewise OWL has a subset of assertions referred to as annotations, so switching the Arachnolingua terminology to annotation would merely change the problem), I started looking for synonyms.

The best synonym for ‘assertion’ that I found seemed to be ‘claim’ – it has the advantage of not sounding definitive and suggesting the need for provenance (talking about a claim leads easily to the question ‘who’s making this claim?’).  Observation was a tempting second choice – most of the ‘assertions’ in Arachnolingua are observations, but not all, and if I were to use the term observation in Arachnolingua, I would prefer it to refer to claims at the level of individuals – individual spiders doing a particular something on a particular substrate.  So, in the software taxonomy, I could look at observations as a subclass of claims – though I haven’t yet seen the need to implement such a subclass.

After going back and forth on whether the switch made sense, I went ahead and implemented it everywhere – curation, database, and owl generator now refer to claims.   In the curation screen, it seemed worthwhile to include observation as that was such a strong alternative:

Image

Changes like this are ugly, but only get uglier the longer they are put off.

Finally, I’ve spent some time on the query component of the website.  As I’ve described before, I’ve implemented an API that supports queries for the implemented search pages (currently just taxon and behavior).  The API generates SPARQL to query the in-memory Sesame triple store and returns the unprocessed JSON that Sesame generates.  Turning this JSON into results tables is pretty straightforward (though it represents the only ‘production’ JavaScript I’ve written to this point).  The SPARQL query is rather slow, particularly for queries to higher taxonomic groups: queries to Tetragnatha straminea are fast, queries to Arachnida are much slower and definitely not recommended at this time.  I took a long hard look at the SPARQL I had written and found a couple of things I thought would really speed it up – the net result was only a 20% speedup for the API call (measured on the browser side using Firebug).  I’m pretty sure I need to be traversing the taxonomy more intelligently, maybe using some server side inference.
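
For reference, the JSON that comes back follows the standard SPARQL query results layout: a `head` with the variable names and a `results.bindings` list with one object per row. The site does this step in JavaScript; the sketch below shows the same transformation in Python, and the variable names and values in the sample are made up for illustration, not taken from the real API.

```python
def bindings_to_rows(result):
    """Flatten SPARQL JSON results into (column names, list of rows).

    Variables that are unbound in a given row become empty strings.
    """
    columns = result["head"]["vars"]
    rows = []
    for binding in result["results"]["bindings"]:
        rows.append([binding.get(col, {}).get("value", "") for col in columns])
    return columns, rows


# Made-up sample in the standard SPARQL JSON results layout.
SAMPLE = {
    "head": {"vars": ["taxon", "behavior"]},
    "results": {
        "bindings": [
            {
                "taxon": {"type": "literal", "value": "Tetragnatha straminea"},
                "behavior": {"type": "literal", "value": "courtship"},
            },
            {"taxon": {"type": "literal", "value": "Leucauge mariana"}},
        ]
    },
}

COLUMNS, ROWS = bindings_to_rows(SAMPLE)
for row in ROWS:
    print(" | ".join(row))
```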

I also added lexical filtering to terms from NCBI.  There are a lot of ‘taxa’ in NCBI that reflect either samples or unidentified species.  As samples are individuals (or collections of individuals), their identifiers don’t belong in the taxonomy for Arachnolingua’s purposes.  This is completely aside from whether any behavior data was ever collected from these individuals – I tracked down some underlying Genbank submissions and some were from old collaborators and friends in the Maddison lab, so there might be some video data for some of these submissions.  The worst offender in sheer quantity was the Barcode of Life and its BOLD submissions.  I know NCBI is planning to move samples to a separate database, but I don’t know whether this applies only to new records or, if existing records are moved, how quickly this will happen.  So it is best to filter these out consistently in both curation and OWL generation.  Removing these terms did provide some query speedup for the worst cases.
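
A lexical filter of this kind can be as simple as a regular expression applied to each name. The Python sketch below is only a guess at plausible rules (abbreviations such as "sp.", "cf." and "aff.", BOLD identifiers, digits, and the words "unidentified" or "environmental"); the actual patterns used in the curation tool and OWL generator are not listed in this post, and the sample names are invented.

```python
import re

# Assumed patterns for sample-like or unidentified names; not the real rule set.
SAMPLE_LIKE = re.compile(
    r"\b(sp|cf|aff)\.|BOLD:|\d|\bunidentified\b|\benvironmental\b",
    re.IGNORECASE,
)


def is_sample_like(name):
    """True if the name looks like a sample or an unidentified species."""
    return bool(SAMPLE_LIKE.search(name))


def filter_taxa(names):
    """Keep only names that look like proper taxon names."""
    return [name for name in names if not is_sample_like(name)]


CANDIDATES = [
    "Tetragnatha straminea",
    "Habronattus sp. BOLD:AAB1234",  # invented identifier
    "Leucauge mariana",
    "Araneae sp. 3",
]
print(filter_taxa(CANDIDATES))
```

Keeping the rule in one function makes it easier to apply the same filter on both the curation side and the OWL generation side.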

Lots of interesting spider behavior posted to the web in the past few weeks, but this post is already long enough.  I hope my northern hemisphere readers are enjoying the spring as I have.

Slow spell

Not as much to report recently, though there has been some Arachnolingua work.  The promised update of NCBI taxonomy hasn’t made it to release yet, so I’ve been concentrating on supporting non-NCBI taxa on both the curation side and the OWL generation side.  This will probably also be the mechanism I use for introducing taxa that aren’t Arachnids but are relevant to Arachnid behavior (predator, prey, substrate, etc.), since I don’t really want to manage all of NCBI (or even NCBI pruned of all the non-organismal taxonomic terms).  The changes to Arachadmin are partially done and working, though merging these into the set of available terms at least needs some testing with actual curation of behavior.

I’ve also added the Environmental Ontology (EnvO) and the Population and Community Ontology (PCO) to the set of support ontologies.  The former will, hopefully, provide some substrates as well as conditions that might affect behavior.  The PCO provides terms for populations (and groups of e.g., social spiders) and will provide a link between population level processes (e.g., juvenile dispersal) and individual behaviors (e.g., ballooning).  I have been somewhat involved in the latter, and had the pleasure of meeting the main developer of EnvO, Pier Luigi Buttigieg, at the February RCN summit.

The other task I’ve been working on is getting the OwlTools incorporated into the OwlBuilder generator.  I seem to have resolved my maven issue, at least for local builds.  I still need to figure out if I can define the project to allow maven to find the OwlBuilder repository somewhere when I try building on Travis.  That still remains, as does integrating the ontology release tool with my own code.  Although arachb will never actually be an OBO ontology, many of the OBO foundry standards are relevant and appropriate and if I can automate compliance with as many of their standards as possible, it will be a good thing.

Finally, not much progress on the annotation semantics work.

Arachnolingua was not affected by the HeartBleed bug we all heard so much about this past week.  Arguably this is for all the wrong reasons.   Arachnolingua doesn’t use SSL directly or even have certificates;  if the resource were actually authoritative about something, it would be worthwhile to support secure communications to prevent man-in-the-middle spoofing.  Likewise, Arachadmin could be set up as a real web tool with authorized curators and the like, but the need for such has not presented itself, though if the right people were interested…  So, because SSL isn’t used, Arachnolingua wasn’t affected.  Nevertheless, I did update the server to the patched version of the SSL library.  Meanwhile, it looks like I should update my other passwords (AWS, Namecheap, here, etc.).  Some of these have already been fixed.

I’m already looking forward to the summer – Evolution2014/iEvoBio and Animal Behavior 2014.  The plan is definitely to do a lightning talk for iEvoBio and either a different lightning talk (different audience and emphasis) for Animal Behavior, or maybe a poster.  Arachnolingua is alive and well, even if the last few weeks have been a bit of hibernation.

RCN summit and some updates

I haven’t posted in a while because I was at the Phenotype RCN summit the week before last.  The week started with two workshops covering the Population and Community Ontology (PCO) and the Biological Collections Ontology (BCO).  I have been interested in the PCO for several years, both because my PhD project involved population level processes (‘culture’ in a population of birds) and because of population related processes in spiders (social spiders, but also predation and predator defense which, by involving multiple individuals, are at the boundary of behavior and population/community processes).

One of the high points of the RCN summit was seeing Martin Ramirez again.   He has been continuing to work with the spider anatomy ontology and showed an interesting comparative analysis of complexity of reproductive structures using derived species-level anatomy ontologies.  I also showed him a bit of arachnolingua and we left open the possibility of more collaboration in the coming year.

I also got to have some brief discussions with Chris Mungall about NCBI taxonomy and the possibility of more frequent updates to the generated OWL from the source.  I expect that other projects would benefit from more frequent updates since not many projects are in the position to support a separate taxonomy ontology.

There has also been progress on followup workshops for the NBO.  More here and in the other blog as updates warrant.

Since I returned, I’ve had a couple of hours to clean up a number of issues on the web page: most significantly, I’ve added javascript code to turn the urls in the ethogram results table (currently just the publications) into links.  I’ve also generated simple publication labels (author list + publication year) so tracking the results doesn’t require decoding a doi.  The publication column now shows the label and links to the doi (which works for both publications with annotations in the KB).  Eventually I will change the link to point to a publication summary page which will link to the doi when it is available, but that will remain a low priority until I start annotating publications without doi’s.
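
The label generation is simple enough to sketch in a few lines. This Python version is an assumption about the format (all authors joined with commas and a final "and", followed by the year in parentheses); the real labels may abbreviate long author lists differently, and the function name is hypothetical.

```python
def publication_label(authors, year):
    """Build a short display label from an author list and a publication year."""
    if not authors:
        raise ValueError("at least one author is required")
    if len(authors) == 1:
        names = authors[0]
    else:
        # "A, B and C" style; the exact punctuation is an assumed convention.
        names = ", ".join(authors[:-1]) + " and " + authors[-1]
    return f"{names} ({year})"


print(publication_label(["Eberhard", "Huber"], 1998))
```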

An unexpected, but reasonable side effect is the change in behavior of the display of taxa without annotations.  They now treat the purls as links, which unfortunately don’t resolve.  This doesn’t seem to be something to fight, so I’ll convert them into links back to the NCBI taxon page on the fly (just extract the taxon id and generate a new URL).
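
The on-the-fly conversion amounts to pulling the numeric id off the end of the purl and dropping it into a taxonomy browser URL. The site itself would do this in JavaScript; the sketch below shows the same logic in Python. It assumes the purls end in `NCBITaxon_<id>` and that the NCBI taxonomy browser address shown is the desired target, and the id in the example is arbitrary.

```python
import re

# Assumed purl shape: .../NCBITaxon_<numeric id>
PURL_PATTERN = re.compile(r"NCBITaxon_(\d+)$")
# Assumed link target: the NCBI taxonomy browser page for a given id.
NCBI_BROWSER = "https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id={}"


def purl_to_ncbi_link(purl):
    """Return an NCBI taxonomy page URL for a taxon purl, or the purl unchanged
    if it does not look like an NCBI taxon identifier."""
    match = PURL_PATTERN.search(purl)
    if match is None:
        return purl
    return NCBI_BROWSER.format(match.group(1))


print(purl_to_ncbi_link("http://purl.obolibrary.org/obo/NCBITaxon_6893"))
```

Leaving unrecognized URLs untouched means curator-added taxa, which have no NCBI id, are not turned into broken NCBI links.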

A more immediate issue is getting the descriptions of the behavior to display.  This isn’t hard, but I’m taking a day or two to give some thought to the best way to model this (the simplest would be an rdfs:label or rdfs:comment on the annotation, but it would be better to use some IAO vocabulary to attach it to the node that represents the individual part of the publication being annotated).

I’ve pulled out the hardcopies of publications I made a year or so back; getting ready for some more sustained curation activity (as well as more changes to arachadmin).

It’s Darwin Day – I’ll celebrate by updating the site

I’ve spent a couple of hours cleaning up the arachnolingua site.  In particular, the link to the old ontology file is gone from the top menu bar – that file hasn’t been valid in nearly a year and even the current file doesn’t deserve that level of billing.  If you want to see the OWL file, there is a link on the project page (available from the ‘about’ menu item).  The new OWL file is, I’m happy to announce, a reasoned over version, with inferred axioms covering inferences of subclass, equivalent class, and class assertion being added to the ontology.  ELK didn’t complain (unlike when it reasons over the mashup of supporting ontologies) and a quick inspection in Protege showed no apparent changes when the resulting ontology was re-reasoned with FaCT++ (unlike when the unreasoned ontology was loaded – several top level classes disappeared after reasoning).  The lack of improvement resulting from reasoning in Protege doesn’t say much about FaCT++ vs. ELK, simply that the KB is expressively simple enough that it is covered by the portion of OWL-EL that ELK reasons over.  It’s also still quite small, so the speed advantage of ELK isn’t very obvious in the second pass.  Hopefully arachb will keep growing to the point where this is no longer the case.

The project page has also been updated to link to the source ontologies directly rather than the MIREOT’ed owl files that went with the old KB file.  Perhaps these and other questionable links have something to do with all the hits on the projects page (and only the projects page) from Russian sources that my logs have been showing for the past week or so.

The other big change is the attention I’ve given to the taxonomy status page.  First, a word of explanation: the taxonomy status page is a list of taxa I’ve encountered during curation (so the publications listed on this page have undergone some review and behavior annotation and will likely be the next ones to appear in the KB) that do not appear in the NCBI taxonomy.  The list here is over a year old, but none of the names listed here have appeared in the intervening months.  In some cases (e.g., synonymy) this is to be expected.  In any case, I’m using the World Spider Catalog as my authority for cases where names don’t appear in NCBI.  Because the WSC text pages don’t contain full species names and lists of genera are split by family, I’ve used wikipedia and EoL to get family names for these taxa.  Looks like this may trigger some EoL contributions on my part.

In any case, enjoy, and feel free to comment on what you see, suggest papers to review (or bump up the queue), or make any other suggestions.