Where are the spiders?

Some readers may be wondering why a blog titled arachnolingua has so little about spiders, to say nothing of the numerous other arachnids (you know: mites, ticks, scorpions, harvestmen, etc.). It’s true, this has been all about databases and OWL and website building – where are the behavior and the beasties? Well, those will have to wait until I have the infrastructure put together enough to start serious curation. I already have a nice collection of literature, and some curated data is making its way into the database, though I’ve been cautious in case something breaks before things settle down. There is some curated information up on the website – if you go to the testing page and hit the ‘list all publications’ button, it will display a table with identifiers for the 559 publications already in the database. If you scroll about halfway down the list, you’ll start seeing DOIs – feel free to check them out. The other publications need arachb identifiers, as they don’t seem to have DOIs.

Meanwhile, if you’re looking for spider behavior, have a look at Spiderbytes, a new blog by Catherine Scott, which so far is turning out to be a nice mix of natural history and taxonomy.


Ontologies managed (for now); java takes me back to test driven development

Although I was quiet last weekend, I was quite busy, switching ontologies over to local loading and copying in arachadmin. I discovered that, although NCBITaxon was the largest, GO and CHEBI were sizable ontologies as well. I should be able to trim both of them down, especially on the arachadmin side, since most of their terms won’t be relevant to spider behavior, though that hasn’t happened yet. After adding the four additional ontologies to the ontology_source table, I found that a full download and update takes about 15 minutes, which is slow, but since it should happen rarely, that’s manageable.

Then I switched over to owlbuilder and implemented ontology loading that used the ontology_source table for locating files.  After committing this, I was surprised to receive a build-failure email from Travis-CI.  I had forgotten that I had created an account linked to the owlbuilder github repository.  I did this out of curiosity when people discussed using it on a project call for my day job.   Since I hadn’t made any owlbuilder changes for a couple of months, both the mail and the failure were unexpected. 

Setting up Travis had actually been very easy – I just added a simple description file to the root of the project repository, linked my github account, and Travis could pull the files, find out that the build was maven-based, and attempt to build and test. Since running the tests locally generally worked (whenever possible, I like to leave/commit things only when the tests pass), I wondered for a moment why they weren’t working, then remembered that Travis didn’t have access to the mysql database where everything, including a test database, resided.

So then the question was whether there was a way to make the tests pass in a meaningful way in the absence of the test database. I pretty quickly came up with a solution involving mock objects and abstractions for the database connection as well as for the ResultSet objects that JDBC (the library connecting java to SQL databases) uses to return the results of queries. Briefly, since I had a DBConnection class that wrapped the JDBC connection object, I just built an abstract interface that supported everything I had defined for my connection class and then defined a mock object that answered to the same interface. When the static method of DBConnection was called, it returned a DBConnection when the attempt to connect succeeded and a MockConnection when the attempt failed. The MockConnection simply returned MockResults objects that implemented the same methods as a very thin wrapper (DBResults) around the ResultSet from JDBC. Not really that complicated, but (as is typical for java) it meant a lot of updating method signatures, as well as defining the methods in the MockConnection that returned MockResults. Fortunately, MockResults are only used to fill ‘bean’ objects that represent each of the kinds of things represented by rows in the most important database tables. I got a full set of test cases running with the mock objects by Tuesday, and I’m finishing up the updates to the test database so it passes exactly the same test methods as the mock data (which I actually implemented more completely than the test database – which had been lying fallow for a few months now). In some ways, the mock data test methods have driven the database test methods, which will in turn affect the ‘real’ code used to build things.
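A minimal sketch of the fallback pattern described above may help. DBConnection, MockConnection, MockResults and DBResults are the names used in the post; the two interfaces (AbstractConnection, AbstractResults), the isMock and getPublications methods, the table and column names, and the placeholder DOIs are all assumptions made up for illustration, not necessarily what owlbuilder contains.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical interface: the only ResultSet operations the beans need. */
interface AbstractResults {
    boolean next() throws SQLException;
    String getString(String column) throws SQLException;
}

/** Hypothetical interface answered by both the real and the mock connection. */
interface AbstractConnection {
    AbstractResults getPublications() throws SQLException;
    boolean isMock();
}

/** Thin wrapper around a JDBC ResultSet. */
class DBResults implements AbstractResults {
    private final ResultSet rs;
    DBResults(ResultSet rs) { this.rs = rs; }
    @Override public boolean next() throws SQLException { return rs.next(); }
    @Override public String getString(String column) throws SQLException {
        return rs.getString(column);
    }
}

/** Canned rows standing in for a ResultSet; only a "doi" column is mocked here. */
class MockResults implements AbstractResults {
    private final Iterator<String> dois;
    private String current;
    MockResults(List<String> dois) { this.dois = dois.iterator(); }
    @Override public boolean next() {
        if (!dois.hasNext()) return false;
        current = dois.next();
        return true;
    }
    @Override public String getString(String column) {
        return "doi".equals(column) ? current : null;
    }
}

class MockConnection implements AbstractConnection {
    @Override public AbstractResults getPublications() {
        // Placeholder values, not real DOIs.
        return new MockResults(Arrays.asList("10.0000/example.1", "10.0000/example.2"));
    }
    @Override public boolean isMock() { return true; }
}

class DBConnection implements AbstractConnection {
    private final Connection c;
    private DBConnection(Connection c) { this.c = c; }

    /** Returns a real connection if the database is reachable, otherwise a mock. */
    static AbstractConnection getConnection(String jdbcUrl, String user, String password) {
        try {
            return new DBConnection(DriverManager.getConnection(jdbcUrl, user, password));
        } catch (SQLException e) {
            return new MockConnection();
        }
    }

    @Override public AbstractResults getPublications() throws SQLException {
        // Table and column names are assumptions; statement cleanup is omitted in this sketch.
        return new DBResults(c.createStatement().executeQuery("SELECT doi FROM publication"));
    }
    @Override public boolean isMock() { return false; }
}
```

Test code written against AbstractConnection then runs unchanged on a CI machine with no database, because the factory quietly hands back the mock when the connection attempt throws.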

Although there is still a taxonomy table in the database, I’ve determined that there is no reason for owlbuilder to actually read it – its present purpose is simply to hold names that don’t currently exist in NCBI, which can then be merged in as terms with arachnolingua identifiers. Besides the identifiers, the rows for these terms will also specify a different authority (I assume the World Spider Catalog). Using authority fields like this will allow me to generate a taxonomy status page without resorting to treating term identifiers in a non-opaque manner (e.g., looking for the domain specified in a URI). So I’ve taken the code for reading the taxa out of the backend. The code for loading terms by id is implemented, as is the code for loading assertions and their associated participants. There are separate methods for reading the primary participant (the spider most of the time) and secondary participants (e.g., prey or substrate).
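To illustrate the point about authority fields, here is a small sketch of how a status page could tally terms by the authority stored in each row rather than by inspecting identifiers. The TermRow and TaxonomyStatus classes and their fields are hypothetical stand-ins, not the actual owlbuilder beans.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical, cut-down term bean: an opaque identifier plus the authority from its row. */
class TermRow {
    final String iri;
    final String authority;
    TermRow(String iri, String authority) {
        this.iri = iri;
        this.authority = authority;
    }
}

class TaxonomyStatus {
    /** Counts terms per authority using only the authority field; the IRI is never parsed. */
    static Map<String, Integer> countByAuthority(List<TermRow> terms) {
        Map<String, Integer> counts = new TreeMap<>();
        for (TermRow t : terms) {
            counts.merge(t.authority, 1, Integer::sum);
        }
        return counts;
    }
}
```

Because the identifier stays opaque, the count does not change if terms later move to a different URI scheme.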

If it is eventually determined that BFO compliant ontologies will not support the notion of process qualities, I assume that qualities of some sort will be attached to the primary participant (e.g. fast movement of legs vs. fast legs moving). This is part of the reason I’m not rushing to add qualities here. Another reason is that my focus has been more on the structure (topography) of behavior (e.g., the relations among its component pieces) than on qualities of the whole behavior. This also gives PATO (the phenotype qualities ontology) some more time to mature.

NBO and all its pieces

This morning I tested the code that loads the ontologies that arachnolingua references in assertions. These are currently the spider biology ontology (SPD), the NCBI taxonomy (NCBITAXON), the Evidence Codes Ontology (ECO), and of course the NeuroBehavior Ontology (NBO). I’ve wrestled with the NCBI taxonomy for years (e.g., as the basis of several parts of Phenoscape’s Vertebrate Taxonomy Ontology) and know that it is both large and filled with terms that will be either out of scope (e.g., environmental samples and probably most bacteria, viruses, and marine organisms) or simply unusable (e.g. children of ‘unclassified taxa’ such as Salticidae sp. RBCH101-04, a child of ‘unclassified Salticidae’), which exist only as genetic samples, never to exhibit behavior. In the frontend, I’ve attacked the scope of NCBI by parsing the RDF/XML OWL rendering (thanks to Chris Mungall), building a tree just long enough to trim it down to the Arachnid clade, and trimming any children of taxa named ‘unclassified XXX’. This (incorrect) solution gives a list of about 11,000 taxa, small enough not to overload the drop-down list. Of course many non-arachnid taxa will be needed for curation, whether as prey, predators, parasites, hosts, or simply substrates (e.g., plants). I expect these will be introduced using the mechanism I’ve planned for Arachnids not currently in NCBI – the ability to add names, list an authority (e.g., World Spider Catalog), and provide enough parentage to link to a higher level group known to NCBI.
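The trimming step can be sketched on a toy in-memory tree. This is only an illustration, written in Java to match the owlbuilder side: the Taxon and TaxonTrimmer classes are hypothetical, parsing of the RDF/XML file is left out, and keeping the ‘unclassified XXX’ node itself while dropping its children is just one reading of the description above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical minimal taxon node; the real names would come from the NCBI OWL file. */
class Taxon {
    final String name;
    final List<Taxon> children = new ArrayList<>();
    Taxon(String name) { this.name = name; }
    /** Adds a child and returns this node, so small trees can be built by chaining. */
    Taxon add(Taxon child) {
        children.add(child);
        return this;
    }
}

class TaxonTrimmer {
    /** Depth-first search for the node whose name matches the clade root. */
    static Taxon findClade(Taxon node, String cladeName) {
        if (node.name.equals(cladeName)) return node;
        for (Taxon child : node.children) {
            Taxon found = findClade(child, cladeName);
            if (found != null) return found;
        }
        return null;
    }

    /** Collects names under a node, skipping the children of any 'unclassified ...' taxon. */
    static void collect(Taxon node, List<String> out) {
        out.add(node.name);
        if (node.name.startsWith("unclassified ")) {
            return; // keep the node itself (an assumption), drop everything below it
        }
        for (Taxon child : node.children) {
            collect(child, out);
        }
    }

    /** Names for the drop-down list: the clade and its descendants, minus unclassified samples. */
    static List<String> trimmedNames(Taxon root, String cladeName) {
        List<String> out = new ArrayList<>();
        Taxon clade = findClade(root, cladeName);
        if (clade != null) collect(clade, out);
        return out;
    }
}
```

Calling trimmedNames(root, "Arachnida") on a tree that contains ‘unclassified Salticidae’ would list that node but not Salticidae sp. RBCH101-04 beneath it, and nothing outside the clade.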

So NCBI is big, but a relatively well understood problem. This morning NCBI took almost a minute to load from a local copy – this seems slow, but this was in owlbuilder, which was reading with the OWLAPI. Not only slow, but memory hungry – I only got good performance after bumping the virtual size of the jvm to 4 GB. After NCBI, owlbuilder attempted to load NBO, and this is where things got messy. Unlike the simple ‘build and filter a list of terms’ that the arachadmin editor uses, OWLAPI honors all the import statements, which means, as it turns out, that NBO triggers loading of the Chemical Entities of Biological Interest (CHEBI), the Gene Ontology (GO), the Phenotypic Quality Ontology (PATO), and Uberon, an umbrella anatomy ontology that is intended to cover all Bilateria, but is currently more vertebrate than arthropod focused. Of these, CHEBI will likely be of interest given the number of chemical communication papers I’ve identified, and GO is likely to be useful as an alternative source of behavior terms. Uberon may be more useful in the future, but currently there is no integration with the spider ontology. I have been remiss up to now in not including the tick gross anatomy ontology (TADS), but that should be added to the frontend before I start curating relevant papers. Perhaps Uberon will cover Chelicerata by the time serious curation of Acari papers starts. PATO is another question altogether – it’s an ontology of phenotypic qualities, which includes a lot of behavior qualities, but currently Arachnolingua is not focussed on capturing entity-quality annotations (as, e.g., Phenoscape is). Loading PATO is actually where the load died with a network timeout this morning. Looks like I will need to support local caching for the backend, and I’m considering using a common cache directory for both the back and frontend for the time being (I’m in no rush to move the backend processing to AWS).
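One way such a cache could work is sketched below using only the Java standard library: look for the file in a shared cache directory first and go to the network only when it is missing. The OntologyCache class, the Fetcher hook, and the cacheFileName parameter are hypothetical names for illustration; wiring the cached file into the OWLAPI loader is not shown.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical hook so the download step can be swapped out (e.g., in tests). */
interface Fetcher {
    InputStream open(String sourceUrl) throws IOException;
}

class OntologyCache {
    private final Path cacheDir;
    private final Fetcher fetcher;

    OntologyCache(Path cacheDir, Fetcher fetcher) {
        this.cacheDir = cacheDir;
        this.fetcher = fetcher;
    }

    /** Default fetcher: a plain download from the ontology's source URL. */
    static Fetcher network() {
        return sourceUrl -> URI.create(sourceUrl).toURL().openStream();
    }

    /** Returns the cached file, downloading it first only if it is missing. */
    Path localCopy(String sourceUrl, String cacheFileName) {
        Path target = cacheDir.resolve(cacheFileName);
        if (Files.exists(target)) {
            return target; // cache hit: no network access needed
        }
        try {
            Files.createDirectories(cacheDir);
            // Download to a temporary file so a timeout never leaves a partial cache entry.
            Path tmp = Files.createTempFile(cacheDir, "download", ".tmp");
            try (InputStream in = fetcher.open(sourceUrl)) {
                Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
                Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
            } finally {
                Files.deleteIfExists(tmp);
            }
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Downloading to a temporary file and renaming it afterwards means an interrupted transfer, like the PATO timeout, would not be mistaken for a valid cached copy on the next run.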

So, it looks like I’ll be adding a local cache location to the source_ontology table in arachadmin.

It’s Spider Monday; no pretty pictures, but lots of database work

I’ve been focusing on the frontend Arachadmin webapp for several months, and now that I can, albeit slowly, add assertions that annotate behavior descriptions, it’s time to pay some more attention to generating OWL files. Although there are still several housekeeping tasks (e.g., updating the OWLAPI and other libraries), I focused on how the database has changed recently. The most important changes were adding code for loading Term beans (objects) from the term table as well as starting support for loading ontology files into the OWL tool. Loading the support ontologies into the backend will allow full access to a reasoner to support filtering terms and axioms for inclusion in the final knowledgebase. A number of support tables, those that define the domains and authorities needed to load terms meaningfully, are now loaded as well. One of the drivers for adding support for these secondary tables so quickly was getting the testing framework set up for terms and support ontologies, both of which refer to the set of domains (e.g., taxonomy, anatomy, behavior) defined in the database and assigned to each term and support ontology. This also means that the test database needed to be updated to more closely reflect the schema structure in the working arachadmin database. The updated test database has been uploaded to github. As usual, I will upload updates of the full working database to dropbox and figshare.

Note: I’ve just published the arachadmin export on figshare.  If I understand how this works, this data now has a permanent (across future revisions) identifier, namely this.

 

Needless to say, I did enjoy the many #SpiderMonday photos people tweeted today. My arthropod photography skills are gradually improving; maybe after another season I’ll have something to share.