NBO and all its pieces

This morning I tested the code that loads the ontologies that Arachnolingua references in assertions.  These are currently the Spider Biology Ontology (SPD), the NCBI taxonomy (NCBITAXON), the Evidence Codes Ontology (ECO), and of course the NeuroBehavior Ontology (NBO).  I’ve wrestled with the NCBI taxonomy for years (e.g., as the basis of several parts of Phenoscape’s Vertebrate Taxonomy Ontology) and know that it is both large and filled with terms that will be either out of scope (e.g., environmental samples and probably most bacteria, viruses, and marine organisms) or simply unusable (e.g., children of ‘unclassified’ taxa such as Salticidae sp. RBCH101-04, a child of ‘unclassified Salticidae’), which exist only as genetic samples, never to exhibit behavior.  In the frontend, I’ve attacked the scope of NCBI by parsing the RDF/XML OWL rendering (thanks to Chris Mungall), building just enough of a tree to trim it down to the arachnid clade and to drop any children of taxa named ‘unclassified XXX’.  This (admittedly incorrect) solution gives a list of about 11,000 taxa, small enough not to overload the drop-down list.  Of course many non-arachnid taxa will be needed for curation, whether as prey, predators, parasites, hosts, or simply substrates (e.g., plants).  I expect these will be introduced using the mechanism I’ve planned for arachnids not currently in NCBI – the ability to add names, list an authority (e.g., the World Spider Catalog), and provide enough parentage to link to a higher-level group known to NCBI.
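The pruning step described above can be sketched roughly as follows. This is an illustrative reconstruction, not Arachnolingua’s actual code: the function name and the parent-map representation are invented, and it implements the rule as stated – keep a node labeled ‘unclassified X’ but never descend into its children.

```python
def prune_unclassified(parent_of, labels, root):
    """Return the set of taxon ids reachable from root, without descending
    into any node whose label starts with 'unclassified'.

    parent_of: dict mapping child id -> parent id
    labels:    dict mapping id -> taxon label
    """
    # invert the child->parent map into parent->children lists
    children = {}
    for child, parent in parent_of.items():
        children.setdefault(parent, []).append(child)

    keep = set()
    stack = [root]
    while stack:
        node = stack.pop()
        keep.add(node)
        if labels.get(node, "").startswith("unclassified"):
            continue  # keep the node itself, but drop everything below it
        stack.extend(children.get(node, []))
    return keep
```

With a toy tree where ‘Salticidae sp. RBCH101-04’ sits under ‘unclassified Salticidae’, the sample term is excluded while the rest of the lineage survives.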

So NCBI is big, but a relatively well understood problem.  This morning NCBI took almost a minute to load from a local copy – this seems slow, but this was in owlbuilder, which was reading it with the OWLAPI.  Not only slow, but memory hungry – I only got good performance after bumping the virtual size of the JVM to 4 GB.  After NCBI, owlbuilder attempted to load NBO, and this is where things got messy.  Unlike the simple ‘build and filter a list of terms’ approach that the arachadmin editor uses, the OWLAPI honors all the import statements, which means, as it turns out, that NBO triggers loading of the Chemical Entities of Biological Interest ontology (CHEBI), the Gene Ontology (GO), the Phenotypic Quality Ontology (PATO), and Uberon, an umbrella anatomy ontology that is intended to cover all Bilateria but is currently more vertebrate- than arthropod-focused.  Of these, CHEBI will likely be of interest given the number of chemical communication papers I’ve identified, and GO is likely to be useful as an alternative source of behavior terms.  Uberon may be more useful in the future, but currently there is no integration with the spider ontology.  I have been remiss up to now in not including the tick gross anatomy ontology (TADS), but that should be added to the frontend before I start curating relevant papers.  Perhaps Uberon will cover Chelicerata by the time serious curation of Acari papers starts.  PATO is another question altogether – it’s an ontology of phenotypic qualities, which includes a lot of behavior qualities, but currently Arachnolingua is not focused on capturing entity–quality annotations (as Phenoscape does).  Loading PATO is actually where the load died with a network timeout this morning.  It looks like I will need to support local caching for the backend, and I’m considering using a common cache directory for both the backend and frontend for the time being (I’m in no rush to move the backend processing to AWS).
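A shared cache directory could be as simple as naming files after the last segment of each ontology IRI, so both sides resolve the same path (on the OWLAPI side, an IRI mapper pointed at the same directory would serve the equivalent role).  A minimal sketch – the directory location and function names here are assumptions, not anything Arachnolingua has settled on:

```python
import os
from urllib.parse import urlsplit

# assumed cache location, shared by backend and frontend
CACHE_DIR = os.path.expanduser("~/.arachnolingua/ontology_cache")

def cached_path(ontology_iri):
    """Map an ontology IRI (e.g. .../nbo.owl) to its local cache file,
    using the last path segment of the IRI as the filename."""
    name = urlsplit(ontology_iri).path.rsplit("/", 1)[-1]
    return os.path.join(CACHE_DIR, name)

def load_source(ontology_iri):
    """Prefer the cached copy if it exists; otherwise fall back to the
    remote IRI (leaving it to the caller to populate the cache)."""
    local = cached_path(ontology_iri)
    return local if os.path.exists(local) else ontology_iri
```

The point of keeping the naming scheme this dumb is that either codebase can recompute the path without consulting the other.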

So, it looks like I’ll be adding a local cache location to the source_ontology table in arachadmin.
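For illustration, the schema change might look like the following – shown here with plain sqlite3, though arachadmin actually defines its tables through web2py’s DAL, and the existing column names below are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# a stand-in for the existing source_ontology table (columns are guesses)
conn.execute("""CREATE TABLE source_ontology (
                    id INTEGER PRIMARY KEY,
                    name TEXT,
                    source_url TEXT)""")
# the planned addition: where the locally cached copy lives
conn.execute("ALTER TABLE source_ontology ADD COLUMN local_cache TEXT")
conn.execute("INSERT INTO source_ontology (name, source_url, local_cache) "
             "VALUES ('NBO', 'http://purl.obolibrary.org/obo/nbo.owl', "
             "'/var/cache/arachnolingua/nbo.owl')")
```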





Assertions, participants, and other artifacts

I’ve been quiet for over a month now, though not inactive on Arachnolingua.  I spent the first couple of weeks working through the forms chapter in the web2py book, which was helpful both in the context of arachadmin and for a couple of issues that came up at the day job.  This was all in support of making a reasonably useful page for entering assertions, which are the primary records for generating behavior instances in Arachnolingua.  Along with the web2py review, there was a fair amount of database redesign as I worked through the relations among participants (animals, their parts, and environmental substrates).  Participants might be individuals or quantified (e.g., some Habronattus californicus, portion of substance granite).  While taxon and anatomical part terms come easily from existing taxonomy and anatomy ontologies, environmental participants will require pulling terms from one or more environment-related ontologies.  Since it looks like I have an invite to the next Phenotype RCN meeting in February, which will focus on environment, I should have an opportunity to size up the options.
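The individual-vs-quantified distinction above can be made concrete with a small sketch.  The class and field names here are invented for illustration and are not arachadmin’s actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Participant:
    """One participant in an assertion: a taxon, a part, or a substrate."""
    taxon_term: Optional[str] = None      # e.g. an NCBI taxon
    anatomy_term: Optional[str] = None    # e.g. an SPD part term
    substrate_term: Optional[str] = None  # e.g. an environment-ontology term
    # how the participant is quantified: a named individual, 'some'
    # members of a class, or a 'portion of substance'
    quantification: str = "individual"

spider = Participant(taxon_term="Habronattus californicus",
                     quantification="some")
rock = Participant(substrate_term="granite",
                   quantification="portion of substance")
```

Keeping the quantification separate from the term reference means the same ontology term can back either an individual or a quantified participant.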

Once I had a basic version of the assertion page up (no screenshots yet, it’s very much a work in progress), I started realizing just how long the drop-down list for taxa would be.  I’ve not done anything fancy with text completion (haven’t had any success with the text completion widget in web2py), so making taxon selection more manageable has focused on reducing the length of the list by filtering out irrelevant terms.  The NCBI taxonomy, as anyone who has worked with term exports knows, contains, in addition to Linnean terms, identifiers for incompletely identified samples (e.g., Lycosidae sp.), which will often include lab identifiers.  Since these deposits are very unlikely to form the basis of any behavior observations, I’ve implemented filtering, currently just removing any children of nodes with labels of the form ‘unclassified x’.  This does not remove all the problem terms; the messier stage of filtering out terms by regex matching against labels will wait for another day.
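That deferred regex stage might start along these lines – a sketch only, where the patterns are guesses at typical sample-identifier labels rather than a vetted list:

```python
import re

# candidate patterns for labels that denote samples rather than taxa
SAMPLE_PATTERNS = [
    re.compile(r"\bsp\.(\s|$)"),   # 'Lycosidae sp.', 'Salticidae sp. RBCH101-04'
    re.compile(r"\bcf\.\s"),       # tentative identifications
    re.compile(r"environmental sample", re.IGNORECASE),
]

def looks_like_sample(label):
    """True if a taxon label matches any of the sample-identifier patterns."""
    return any(p.search(label) for p in SAMPLE_PATTERNS)
```

A filter like this would run over labels after the ‘unclassified x’ pruning, catching sample terms whose parents are otherwise legitimate.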