Note to self and others

Arachnolingua focuses its OWL expressions on the claim (= assertion, statement) about an individual behavior event; class expressions and properties should start there and work outward.  Thus: courtship_event –has_participant–> palp –part_of–> male –has_type–> Habronattus sp.  There may be consequences for this decision (especially for class-level statements), but it is better to be consistent and to document the design decision here for now.
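To make the claim-centred pattern concrete, here is a minimal sketch using plain tuples as stand-in triples. The identifiers and property names are illustrative only, not the actual Arachnolingua vocabulary:

```python
# The chain from the behavior event outward, as described above; each tuple
# is a (subject, property, object) triple anchored on the event individual.
claim_triples = [
    ("courtship_event_1", "has_participant", "palp_1"),
    ("palp_1", "part_of", "male_1"),
    ("male_1", "has_type", "Habronattus_sp"),
]

def walk_outward(event, triples):
    """Follow the property chain starting at the claim's event individual,
    collecting each (property, object) step along the way."""
    chain = [event]
    current = event
    for subj, prop, obj in triples:
        if subj == current:
            chain.append((prop, obj))
            current = obj
    return chain
```

Starting from the event and working outward like this is the design decision being recorded: every other statement hangs off the behavior event rather than off the taxon.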

This should eventually make it into the ‘primer’ documents for the curation tool and the database schema as well.  I wonder if there are any tools in Protege for looking at graphs of individuals – maybe lego?

Lots of little things

The pace of Arachnolingua development has picked up a bit.  After adding the first set of annotations for a non-NCBI taxon (Leucauge mariana, in a courtship description in Eberhard and Huber (1998) – a rich source of annotatable material), the need to support these curator-added taxa more completely became obvious.  On the curation side, this meant filling in the details for such taxa – the hook to create a term was already there, but the term was not identified as a taxonomic name or linked to its source authority (in this case the World Spider Catalog).  That would suffice to make it available as a taxonomy term, but I also needed a mechanism for adding one or more levels of parentage to connect the term to something already in NCBI.  Fortunately, in the case of L. mariana, the parent genus was in NCBI, so chaining wasn’t required.  Actually linking via parents was not strictly necessary for the curation process – it just needs to capture the parentage to pass to the OWL generation tool.  Before I finished those changes, the OWL generator created a number of knowledge base files that attached L. mariana to the root OWL term.  These changes are now finished, and L. mariana and a small subset of the behavior patterns documented by Eberhard and Huber are annotated in Arachnolingua.
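The parentage mechanism can be sketched as a walk up curator-supplied parent links until a term already in NCBI is reached. This is a guess at the shape of the logic; the table layout, the `in_ncbi` flag, and the field names are all illustrative:

```python
# Hypothetical curation records: curator-added taxa carry an authority and a
# parent link; terms already in NCBI anchor the chain.
taxa = {
    "Leucauge mariana": {"authority": "World Spider Catalog",
                         "parent": "Leucauge", "in_ncbi": False},
    "Leucauge": {"authority": "NCBI", "parent": None, "in_ncbi": True},
}

def parent_chain_to_ncbi(name, taxa):
    """Return the chain of parents from a curated taxon up to (and
    including) the first term already present in NCBI."""
    chain = []
    current = taxa[name]["parent"]
    while current is not None and not taxa[current]["in_ncbi"]:
        chain.append(current)
        current = taxa[current]["parent"]
    if current is not None:
        chain.append(current)  # the NCBI anchor term
    return chain
```

For L. mariana the chain is a single step (the genus is already in NCBI), which is why no multi-level chaining was required in that case.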

I also followed through on a long-overdue change that rippled throughout the implementation.  I had been using the term ‘assertion’ throughout the code to refer to a conjunction of taxon, behavior, body part, publication, and eventually evidence.  The problem with calling these ‘assertions’ is simply that the OWL specification talks about assertions that are much smaller – an Arachnolingua assertion can be decomposed into about a dozen OWL assertions or RDF triples (formally, an RDF triple is one way – probably the most common – of implementing an OWL assertion).  So, to avoid confusion with OWL assertions (and likewise, OWL has a subset of assertions referred to as annotations, so switching the Arachnolingua terminology to ‘annotation’ would merely relocate the problem), I started looking for synonyms.

The best synonym for ‘assertion’ that I found seemed to be ‘claim’ – it has the advantage of not sounding definitive and of suggesting the need for provenance (talking about a claim leads easily to the question ‘who’s making this claim?’).  ‘Observation’ was a tempting second choice – most of the ‘assertions’ in Arachnolingua are observations, but not all, and if I were to use the term observation in Arachnolingua, I would prefer it to refer to claims at the level of individuals – individual spiders doing a particular something on a particular substrate.  So, in the software taxonomy, I could treat observations as a subclass of claims – though I haven’t yet seen the need to implement such a subclass.

After going back and forth on whether the switch made sense, I went ahead and implemented it everywhere – curation, database, and OWL generator now refer to claims.  In the curation screen, it seemed worthwhile to include observation, as that was such a strong alternative:

[screenshot of the curation screen]

Changes like this are ugly, but only get uglier the longer they are put off.

Finally, I’ve spent some time on the query component of the website.  As I’ve described before, I’ve implemented an API that supports queries for the implemented search pages (currently just taxon and behavior).  The API generates SPARQL to query the in-memory Sesame triple store and returns the unprocessed JSON that Sesame generates.  Turning this JSON into results tables is pretty straightforward (though it represents the only ‘production’ JavaScript I’ve written to this point).  The SPARQL query is rather slow, particularly for queries against higher taxonomic groups: queries for Tetragnatha straminea are fast, while queries for Arachnida are much slower and definitely not recommended at this time.  I took a long hard look at the SPARQL I had written and found a couple of things I thought would really speed it up – the net result was only a 20% speedup for the API call (measured on the browser side using Firebug).  I’m pretty sure I need to be traversing the taxonomy more intelligently, maybe using some server-side inference.
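The reason higher taxa are expensive is easiest to see in the query shape itself. Here is a sketch of the kind of SPARQL the API might assemble; the predicate URIs are placeholders, not the real Arachnolingua vocabulary, but the `rdfs:subClassOf*` property path is the part that blows up for a group like Arachnida, since the store must walk the whole taxonomic subtree:

```python
def claims_for_taxon(taxon_uri):
    """Assemble a SPARQL query for claims about a taxon or any subtaxon.
    Predicates here are illustrative placeholders."""
    return f"""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?claim ?behavior WHERE {{
  ?taxon rdfs:subClassOf* <{taxon_uri}> .
  ?claim <http://example.org/about_taxon> ?taxon .
  ?claim <http://example.org/has_behavior> ?behavior .
}}"""
```

For a leaf like Tetragnatha straminea the path expression matches almost nothing, while for Arachnida it enumerates thousands of terms before the claim patterns even apply – which is why smarter traversal or materialized (server-side) inference looks necessary.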

I also added lexical filtering to terms from NCBI.  There are a lot of ‘taxa’ in NCBI that reflect either samples or unidentified species.  As samples are individuals (or collections of individuals), their identifiers don’t belong in the taxonomy for Arachnolingua’s purposes.  This is completely aside from whether any behavior data was ever collected from these individuals – I tracked down some underlying GenBank submissions, and some were from old collaborators and friends in the Maddison lab, so there might be video data for some of these submissions.  The worst offender in sheer quantity was the Barcode of Life and its BOLD submissions.  I know NCBI is planning to move samples to a separate database, but I don’t know whether this will apply only to new records or whether existing records will be moved, or how quickly it will happen.  So it is best to filter these out consistently in both curation and OWL generation.  Removing these terms did provide some query speedup for the worst cases.
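A lexical filter of this kind comes down to a handful of patterns applied to the name string. The exact rule list below is a guess at the sort of filtering involved, not the actual implementation:

```python
import re

# Patterns marking NCBI entries as samples or unidentified species rather
# than proper taxa; the list is illustrative, not the real filter.
SAMPLE_PATTERNS = [
    re.compile(r"\bsp\."),                # unidentified species, e.g. 'Habronattus sp.'
    re.compile(r"\bcf\."),                # uncertain identification
    re.compile(r"BOLD[:\s]"),             # Barcode of Life sample identifiers
    re.compile(r"environmental sample", re.IGNORECASE),
]

def is_proper_taxon(name):
    """True if the name looks like a real taxon rather than a sample record."""
    return not any(p.search(name) for p in SAMPLE_PATTERNS)
```

Applying the same predicate in both arachadmin and owlbuilder is what keeps curation and OWL generation consistent.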

Lots of interesting spider behavior posted to the web in the past few weeks, but this post is already long enough.  I hope my northern hemisphere readers are enjoying the spring as I have.

Ontologies managed (for now); Java takes me back to test-driven development

Although I was quiet last weekend, I was quite busy, switching ontologies over to local loading and copying in arachadmin.  I discovered that, although NCBITaxon was the largest, GO and ChEBI were sizable ontologies as well.  I should be able to trim both of them down, especially on the arachadmin side, since most of their terms won’t be relevant to spider behavior, though that hasn’t happened yet.  After adding the four additional ontologies to the ontology_source table, I found that a full download and update takes about 15 minutes, which is slow, but since it should happen rarely, that’s manageable.

Then I switched over to owlbuilder and implemented ontology loading that uses the ontology_source table for locating files.  After committing this, I was surprised to receive a build-failure email from Travis-CI.  I had forgotten that I had created an account linked to the owlbuilder GitHub repository – I did this out of curiosity when people discussed using Travis on a project call for my day job.  Since I hadn’t made any owlbuilder changes for a couple of months, both the mail and the failure were unexpected.

Setting up Travis had actually been very easy – I just added a simple description file to the root of the project repository and linked my GitHub account, and Travis could pull the files, work out that the build was Maven-based, and attempt to build and test.  Since running the tests locally generally worked (I like to commit only when the tests pass, whenever possible), I wondered for a moment why they weren’t passing, then remembered that Travis didn’t have access to the MySQL database where everything, including the test database, resided.
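The ‘simple description file’ for a Maven-based Java project is tiny; something like the following is all Travis needs (the JDK version here is illustrative, not necessarily what owlbuilder used):

```yaml
# .travis.yml – Travis detects the Maven build from the pom.xml
language: java
jdk:
  - openjdk7
```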

So the question was whether there was a way to make the tests pass in a meaningful way in the absence of the test database.  I pretty quickly came up with a solution involving mock objects and abstractions for the database connection, as well as for the ResultSet objects that JDBC (the library connecting Java to SQL databases) uses to return query results.  Briefly, since I had a DBConnection class that wrapped the JDBC connection object, I built an abstract interface that supported everything I had defined for my connection class and then defined a mock object that answered to the same interface.  When the static method of DBConnection was called, it returned a DBConnection if the attempt to connect succeeded and a MockConnection if it failed.  The MockConnection simply returned MockResults objects that implemented the same methods as DBResults, a very thin wrapper around JDBC’s ResultSet.  Not really that complicated, but (as is typical for Java) it meant a lot of updating method signatures, as well as defining the methods in MockConnection that return MockResults.

Fortunately, MockResults objects are only used to fill ‘bean’ objects that represent each of the kinds of things stored as rows in the most important database tables.  I got a full set of test cases running with the mock objects by Tuesday, and I’m finishing up the updates to the test database so it passes exactly the same test methods as the mock data (which I actually implemented more completely than the test database, which had been lying fallow for a few months).  In some ways, the mock-data test methods have driven the database test methods, which will in turn affect the ‘real’ code used to build things.
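The shape of the fallback pattern is worth pinning down. The real implementation is Java; here it is sketched in Python for brevity, with all class and method names illustrative rather than the actual owlbuilder signatures:

```python
class MockResults:
    """Stands in for the thin DBResults wrapper around a JDBC ResultSet."""
    def __init__(self, rows):
        self._rows = rows

    def fetch_all(self):
        return self._rows


class MockConnection:
    """Answers the same interface as the real connection, serving canned rows."""
    def query(self, sql):
        return MockResults([{"id": 1, "name": "example"}])


class DBConnection:
    """Real connection wrapper; the static factory falls back to a mock
    when the database is unreachable (as it is on Travis-CI)."""
    @staticmethod
    def connect(opener):
        try:
            return opener()          # a real connection in production
        except Exception:
            return MockConnection()  # tests still run without the database
```

Because callers only ever see the shared interface, the same test methods exercise mock data locally on Travis and the real test database at home.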

Although there is still a taxonomy table in the database, I’ve determined that there is no reason for owlbuilder to actually read it – its present purpose is simply to hold names that don’t currently exist in NCBI, which can then be merged in as terms with Arachnolingua identifiers.  Besides the identifiers, the rows for these terms will also specify a different authority (I assume the World Spider Catalog).  Using authority fields like this will allow me to generate a taxonomy status page without resorting to treating term identifiers in a non-opaque manner (e.g., looking for the domain specified in a URI).  So I’ve taken reading the taxa out of the backend.  The code for loading terms by id is implemented, as are assertions and their associated participants.  There are separate methods for reading the primary participant (the spider, most of the time) and secondary participants (e.g., prey or substrate).
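Driving a status page from an explicit authority column, rather than parsing URIs, reduces to a simple aggregation. The field names and sample rows below are illustrative:

```python
from collections import Counter

def authority_counts(term_rows):
    """Count taxonomy terms per source authority – the identifier itself
    stays opaque; only the authority column is consulted."""
    return Counter(row["authority"] for row in term_rows)

# Hypothetical term rows of the kind the taxonomy table might hold.
terms = [
    {"name": "Leucauge mariana", "authority": "World Spider Catalog"},
    {"name": "Leucauge", "authority": "NCBI"},
    {"name": "Tetragnatha straminea", "authority": "NCBI"},
]
```

Keeping identifiers opaque this way means the status page keeps working even if term URIs are later minted under a different domain.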

If it is eventually determined that BFO-compliant ontologies will not support the notion of process qualities, I assume that qualities of some sort will be attached to the primary participant (e.g., fast movement of legs vs. fast legs moving).  This is part of the reason I’m not rushing to add qualities here.  Another reason is that my focus has been more on the structure (topography) of behavior (e.g., the relation of the component pieces) than on qualities of the whole behavior.  This also gives PATO (the phenotype quality ontology) some more time to mature.