The pace of Arachnolingua development has picked up a bit. After adding the first set of annotations for a non-NCBI taxon (Leucauge mariana, in a courtship description in Eberhard and Huber (1998) – a rich source of annotatable material), the need to support these curator-added taxa more completely became obvious. On the curation side, this meant filling in the details for such taxa – the hook to create a term was already there, but the term was not identified as a taxonomic name, and its source authority (in this case the World Spider Catalog) was not recorded. Recording those details would suffice to make it available as a taxonomy term, but I also needed a mechanism for adding one or more levels of parentage to connect the term to something already in NCBI. Fortunately, in the case of L. mariana, the parent genus was in NCBI, so chaining wasn’t required. Actually linking via parents was not strictly necessary for the curation process – it just needs to capture the parentage to pass to the OWL generation tool. Before I finished those changes, the OWL generator created a number of knowledge base files that attached L. mariana to the root OWL term. The changes are now finished, and L. mariana and a small subset of the behavior patterns documented by Eberhard and Huber are now annotated in Arachnolingua.
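The parentage idea can be illustrated with a small sketch. This is not the Arachnolingua code: the `Taxon` class, the `chain_to_ncbi` function, and the authority strings are hypothetical names chosen for illustration, and the sketch assumes each curator-added taxon records at most one parent.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Taxon:
    # Hypothetical record for a taxonomy term; field names are assumptions.
    name: str
    authority: str  # e.g. "NCBI" or "World Spider Catalog"
    parent: Optional["Taxon"] = None


def chain_to_ncbi(taxon: Taxon) -> List[Taxon]:
    """Walk up the recorded parents until a term from NCBI is reached.

    Returns the chain from the starting taxon up to and including the
    first NCBI ancestor. Raises ValueError if the parents run out first,
    which is the situation that would leave a term attached to the root.
    """
    chain = [taxon]
    current = taxon
    while current.authority != "NCBI":
        if current.parent is None:
            raise ValueError(f"{taxon.name} has no path to an NCBI taxon")
        current = current.parent
        chain.append(current)
    return chain


# L. mariana comes from the World Spider Catalog; its genus is already in NCBI,
# so a single level of parentage is enough.
leucauge = Taxon("Leucauge", "NCBI")
l_mariana = Taxon("Leucauge mariana", "World Spider Catalog", parent=leucauge)
chain_names = [t.name for t in chain_to_ncbi(l_mariana)]
```

Raising an error when no NCBI ancestor is found is a design choice for this sketch: it surfaces incomplete parentage at curation time instead of letting the term fall through to the root.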
I also followed through on a long overdue change that rippled throughout the implementation. I have been using the term ‘assertion’ throughout the code to refer to a conjunction of taxon, behavior, body part, publication, and eventually evidence. The problem with calling these ‘assertions’ is simply that the OWL specification talks about assertions that are much smaller – an Arachnolingua assertion can be decomposed into about a dozen OWL assertions or RDF triples (formally, an RDF triple is one way, probably the most common, of implementing an OWL assertion). So, to avoid confusion with OWL assertions (and since OWL likewise has a subset of assertions referred to as annotations, switching the Arachnolingua terminology to ‘annotation’ would merely move the problem), I started looking for synonyms.
The best synonym for ‘assertion’ that I found seemed to be ‘claim’ – it has the advantage of not sounding definitive and of suggesting the need for provenance (talking about a claim leads easily to the question ‘who’s making this claim?’). ‘Observation’ was a tempting second choice – most of the ‘assertions’ in Arachnolingua are observations, but not all, and if I were to use the term observation in Arachnolingua, I would prefer it to refer to claims at the level of individuals – individual spiders doing a particular something on a particular substrate. So, in the software taxonomy, I could look at observations as a subclass of claims – though I haven’t yet seen the need to implement such a subclass.
After going back and forth on whether the switch made sense, I went ahead and implemented it everywhere – curation, database, and OWL generator now refer to claims. In the curation screen, it seemed worthwhile to include ‘observation’ as well, since that was such a strong alternative.
Changes like this are ugly, but only get uglier the longer they are put off.
Finally, I’ve spent some time on the query component of the website. As I’ve described before, I’ve implemented an API that supports queries for the implemented search pages (currently just taxon and behavior). The API generates SPARQL to query the in-memory Sesame triple store and returns the unprocessed JSON that Sesame generates. Turning this JSON into results tables is pretty straightforward (though it represents the only ‘production’ JavaScript I’ve written to this point). The SPARQL query is rather slow, particularly for queries on higher taxonomic groups: queries for Tetragnatha straminea are fast, while queries for Arachnida are much slower and definitely not recommended at this time. I took a long hard look at the SPARQL I had written and found a couple of things I thought would really speed it up – the net result was only a 20% speedup for the API call (measured on the browser side using Firebug). I’m pretty sure I need to be traversing the taxonomy more intelligently, maybe using some server-side inference.
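To show why turning the JSON into a table is straightforward, here is a minimal sketch of the conversion. It assumes the response follows the standard SPARQL JSON results layout (`head.vars` plus `results.bindings`). It is written in Python for brevity, whereas the site itself does this step in JavaScript. The variable names `taxon` and `behavior` and the sample values are made up for illustration and are not taken from the real API.

```python
import json
from typing import List, Tuple


def bindings_to_rows(sparql_json: str) -> Tuple[List[str], List[List[str]]]:
    """Flatten a SPARQL JSON result document into column names and rows.

    A variable that is unbound in a given solution is simply absent from
    that binding, so it is rendered here as an empty string.
    """
    doc = json.loads(sparql_json)
    columns = doc["head"]["vars"]
    rows = []
    for binding in doc["results"]["bindings"]:
        rows.append([binding.get(var, {}).get("value", "") for var in columns])
    return columns, rows


# Illustrative response only; the variable names and values are invented.
sample = json.dumps({
    "head": {"vars": ["taxon", "behavior"]},
    "results": {"bindings": [
        {"taxon": {"type": "literal", "value": "Tetragnatha straminea"},
         "behavior": {"type": "literal", "value": "example behavior"}},
        {"taxon": {"type": "literal", "value": "Leucauge mariana"}},
    ]},
})

columns, rows = bindings_to_rows(sample)
```

Each row is a list aligned with `columns`, so it can be written straight into table cells.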
I also added lexical filtering to terms from NCBI. There are a lot of ‘taxa’ in NCBI that reflect either samples or unidentified species. As samples are individuals (or collections of individuals), their identifiers don’t belong in the taxonomy for Arachnolingua’s purposes. This is completely aside from whether any behavior data was ever collected from these individuals – I tracked down some underlying GenBank submissions, and some were from old collaborators and friends in the Maddison lab, so there might be some video data for some of these submissions. The worst offender in sheer quantity was the Barcode of Life and its BOLD submissions. I know NCBI is planning to move samples to a separate database, but I don’t know whether this applies only to new records or, if existing records are moved, how quickly this will happen. So it is best to filter these out consistently in both curation and OWL generation. Removing these terms did provide some query speedup for the worst cases.
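A lexical filter of this kind might look something like the sketch below. The patterns are my guesses at what marks a sample or unidentified-species name (an `sp.` placeholder, a `BOLD:` identifier, and the words ‘unidentified’ or ‘environmental sample’). They are assumptions for illustration, not the actual rules used in Arachnolingua, and the flagged example names are invented.

```python
import re

# Assumed markers of sample or unidentified-species names; not the real rule set.
EXCLUDE_PATTERNS = [
    re.compile(r"\bsp\.(\s|$)"),                       # "Genus sp. ..." placeholders
    re.compile(r"BOLD:", re.IGNORECASE),               # Barcode of Life identifiers
    re.compile(r"\bunidentified\b", re.IGNORECASE),
    re.compile(r"\benvironmental sample\b", re.IGNORECASE),
]


def is_taxonomic_name(name: str) -> bool:
    """Return True if the name matches none of the exclusion patterns."""
    return not any(pattern.search(name) for pattern in EXCLUDE_PATTERNS)


# The first two names appear in this post; the rest are invented examples.
candidates = [
    "Tetragnatha straminea",
    "Leucauge mariana",
    "Tetragnatha sp. ABC-0000",
    "Araneae sp. BOLD:XXX0000",
    "unidentified spider",
]
kept = [name for name in candidates if is_taxonomic_name(name)]
```

Keeping the patterns in a single list means the curation tool and the OWL generator could share one definition, which fits the goal of filtering consistently in both places.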
Lots of interesting spider behavior posted to the web in the past few weeks, but this post is already long enough. I hope my northern hemisphere readers are enjoying the spring as I have.