Updating Sesame, starting the move to pyramid/postgresql

All technical stuff this week. I’ve got the Spider-Behavior server code connected to travis-ci. In the process of doing so, I ran into a problem with Sesame, which turned out to be related to the maven setup. It’s working now and updated to the most recent 2.8 release. This will allow some more changes, particularly some more formatting work on the narratives page.

While waiting for a response from the Sesame-users group, I started taking a serious look at moving arachadmin off web2py. It looks like I’ll be able to move to a python3.5-pyramid-postgresql stack. I have a first pass at the new table structure working with sqlalchemy, and will soon start coding the transfer from mysql. The experience is a lot closer to standard python development than the somewhat cushioned ride offered by web2py.
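To give a flavor of the new stack, here’s a minimal sqlalchemy sketch of the sort of thing that first pass looks like; the table, columns, and connection URL are illustrative stand-ins, not the actual arachadmin schema.

# A minimal declarative model; names are illustrative, not the real schema.
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Publication(Base):
    __tablename__ = 'publication'
    id = Column(Integer, primary_key=True)
    doi = Column(String, unique=True)        # external id where one exists
    citation = Column(String, nullable=False)

# hypothetical connection string; postgresql via psycopg2
engine = create_engine('postgresql://localhost/arachadmin')
Base.metadata.create_all(engine)             # create any missing tables
Session = sessionmaker(bind=engine)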

Meanwhile, I’ve been spending some time reviewing the opentree taxonomy against the World Spider Catalog. So far nothing horrible, mostly groups added since 2009 and some unstable family splits.


No big news for Arachtober

[Photo: unidentified spider constructing a rolled leaf refuge.]

Things weren’t really cleaned up enough for a big release announcement in October, though almost every page has had an update in the last month. In addition to the narrative ‘drill-down’ pages, all the other arachb.org/arachb/ARACHB_xxxxxxx identifiers that correspond to something in the database now resolve to simple pages. The narrative drill-down pages (such as this) generate a table of their component events. They can generate both html and json; the latter is (in many cases) simply the ‘tunneled’ json that Owlbuilder serializes into an rdfs:comment annotation (as discussed here). These still need some tweaking, and the information for ordering the events within a narrative still needs to be included. This, table pagination, and some javascript cleanup remain before I could consider a release. Then it will be time for some more curation and thoughts of updating the curation tool (python 3.5 and pyramid and mysql -> postgres).
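For the curious, here is a hedged sketch of what consuming that tunneled json might look like with rdflib; the file name and event IRI are hypothetical, and this is not the actual server code.

# Sketch: recover the 'tunneled' json from an rdfs:comment annotation.
import json
from rdflib import Graph, URIRef
from rdflib.namespace import RDFS

g = Graph()
g.parse('arachnolingua.owl')  # hypothetical local copy of the knowledge base

event = URIRef('http://arachb.org/arachb/ARACHB_0000001')  # hypothetical id
for comment in g.objects(event, RDFS.comment):
    try:
        payload = json.loads(str(comment))  # the serialized event description
    except ValueError:
        continue  # an ordinary free-text comment, not tunneled json
    print(payload)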

Owlbuilder and server backend code are gradually migrating to Java 1.8.  Meanwhile Oracle’s cloud-related noises are making me nervous about the future of Java on non-Oracle environments.   Events conspire to make me nervous about Java every couple of years, so maybe it’s not time to worry yet.

Whither arachadmin and web2py?


As some of you know, my ‘day job’ is developing for the Open Tree of Life project. I started development on arachadmin about the time I was starting with opentree. My first opentree assignment was to add functionality to Rick Ree’s phylografter tool. Briefly, phylografter provides a database of trees and tools for displaying and grafting portions of trees together. The important part for now is that the tool was implemented using web2py and used mysql for the database. At that point, the arachnolingua database consisted of a spreadsheet containing details of around 530 publications. I had given thought to moving the publication spreadsheet to a proper database, but I didn’t see any obvious lightweight java solutions for building a GUI application to manage this data. In theory I could have used an existing tool via an API, but I wasn’t sure where to look. Meanwhile, phylografter presented me with a working model of a web-based GUI interface to a relational database. NoSQL didn’t seem like the right place to begin, since I was starting with a (fairly) well defined table in the spreadsheet. It also meant I would be learning python and web2py on both projects, which seemed (and still seems) like a good way to ease task switching.

Starting with the publication table, I built up arachadmin using web2py, learning python and some more bits of javascript along the way. The web2py environment is a very good way to get started with python-based web applications. It provides lots of pieces for working with databases and web page building, and it enforces a model-view-controller architecture. Even now, I wouldn’t recommend against it for certain types of projects. Likewise, the rest of the opentree software team used web2py in multiple applications that were assembled to support the opentree website and supporting tools.

As time went by, I started to notice some problems with the way web2py did things. Likewise, other opentree people, whose python expertise I greatly respect, started raising issues with the quirks of the web2py environment relative to straight python. It turns out that our issues don’t exactly overlap (maybe because I’ve wrestled more with web2py’s database abstraction layer (DAL)). One of the aforementioned opentree investigators actually recoded a piece of the system using pyramid rather than web2py, so that would be a way to go. Of course flask and django would also be contenders, though at present I’m liking pyramid as the ideal middle ground – flask can be built up to larger sites, but I would be converting a medium-sized application to start with, so including a few more ‘batteries’ may help the process along. Django would have everything I need but would also include things I don’t (it seems focused on CMS-type applications, not multifaceted data editors), and might make sharing the database with the owlbuilder backend tool more difficult.

So here are the three issues that will eventually force arachadmin onto a new platform.

1. Testing at the unit and larger scales is an integral part of modern software development. Unit testing isn’t impossible with web2py, but there are serious limitations stemming from the way the environment for code is set up. Python provides two popular unit test frameworks – doctest and unittest. The latter is a pretty standard implementation of the xUnit testing framework found in multiple languages. Doctest is a simplified system that allows inclusion of test cases in a python function’s (or method’s) documentation string. When doctests are run against a module, each function’s doc string is scanned for patterns that indicate a test followed by its expected result. This works well, especially for code in the functional style, which can be tested with minimal or no context. Testing code requiring set-up context (state) is more difficult, since putting code in the doc string doesn’t provide much opportunity to set up state. Unfortunately, web2py functions depend on a lot of state (which also makes it difficult to use tools like pylint). Most of the state is implicit, so you don’t know what to specify as imports to make a particular variable make sense to testing or code analysis tools. My impression, which I expect to verify one way or another, is that other web environments make things more explicit, allowing unit tests and code analysis to run more smoothly.
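As a concrete illustration (a toy, not arachadmin code), this is the sort of functional-style function that doctest handles gracefully; a web2py controller, which expects request, db, and friends to appear by magic in its environment, is much harder to test this way.

# Run with: python -m doctest this_file.py -v
def normalize_name(name):
    """Collapse whitespace and title-case an author name.

    >>> normalize_name('  eberhard,   w. ')
    'Eberhard, W.'
    """
    return ' '.join(name.split()).title()

if __name__ == '__main__':
    import doctest
    doctest.testmod()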

2. The Database Abstraction Layer (DAL) is perhaps the most important component of web2py. It provides uniform access to a range of SQL and a few NoSQL database managers. You define your tables and relations in model files (which are python source, but with a lot of the environment predefined). The system is fairly robust to table modifications (and would probably be even more so if I were using postgres instead of mysql), and there are usually ways to dig yourself out if things break. In particular, if the system complains, you can dump and reload your database, and if the table specifications in your model file agree with the schema in your SQL file, things almost always clear up. The DAL is well integrated with the web page formatting, providing predefined templates for tables and individual column values, which are fairly easily customizable either through constructor arguments or by tweaking the provided javascript and CSS files, which include the (once?) popular Bootstrap.
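For readers who haven’t seen web2py, a model file looks roughly like this; DAL and Field are injected by the framework rather than imported, and the table here is illustrative rather than the actual arachadmin model.

# db.py -- a web2py model file; DAL and Field are predefined by the framework.
db = DAL('mysql://user:password@localhost/arachadmin')

db.define_table('publication',
                Field('doi', 'string'),
                Field('citation', 'text'),
                Field('year', 'integer'))

# queries return row objects that plug straight into the page templates
recent = db(db.publication.year > 2009).select(orderby=db.publication.year)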

So what’s the problem? I rather like that the DAL is not a full object-relational mapping. The DAL uses objects, but they fairly transparently represent rows, sets of rows, and queries, rather than hiding the SQL behind user objects. This also makes escaping to raw SQL easy to integrate with pure DAL code, which is sometimes necessary for speed if you’re doing big database exports or table updates. The real problem with the DAL is its approach to indexing. Every table must include an integer field as its primary key. This probably makes a lot of sense to beginners, but it is less natural when you consider real data that comes with its own identifiers rather than auto-incremented integers. For example, every named individual in OWL is identified by an IRI, so the string version of the IRI is the natural identifier for OWL objects. Likewise, many publications have DOIs, and academic authors are signing up for ORCID ids. These external identifiers are frequently the best choice for indexing (despite being strings; there isn’t a substantial cost to making your primary key a string). Things are slightly more complicated in the case of arachnolingua because several tables contain a mix of rows that have external IRIs and rows that lack one and need a unique IRI generated. Since it is best to let the database handle uniqueness checking, I currently need to maintain all the generated identifiers in a separate table to enforce cross-table uniqueness. Regardless, and certainly for imported terms, there is no intrinsic need for an integer index.
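Here is that cross-table uniqueness pattern in miniature, sketched with sqlalchemy for brevity (arachadmin itself is still on web2py’s DAL, and the table names are illustrative):

# One identifier table owns every IRI; entity tables borrow their primary
# keys from it, so the database enforces uniqueness across all of them.
from sqlalchemy import Column, String, ForeignKey
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Identifier(Base):
    __tablename__ = 'identifier'
    iri = Column(String, primary_key=True)  # uniqueness enforced here

class Term(Base):
    __tablename__ = 'term'
    iri = Column(String, ForeignKey('identifier.iri'), primary_key=True)

class Claim(Base):
    __tablename__ = 'claim'
    iri = Column(String, ForeignKey('identifier.iri'), primary_key=True)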

In addition, the integer index appears in other tables, such as those supporting many-to-many mappings, where it would be completely useless were it not for the DAL’s need for it. The DAL also makes it far easier to retrieve rows via the integer primary key than via the select statements required for any other field. This makes it very easy to let unnecessary dependence on the integer key creep into your code.

Overall, this dependence on integer primary key columns may make sense for small systems and databases that aren’t accessed by multiple applications, but I expect more experienced database analysts would cringe. I don’t consider myself any sort of serious database analyst, but the problem here is becoming more obvious with time.

3. Python 3 is the future of python. This is no longer particularly controversial. The official end-of-life date for python 2.7 is 2020. Although there was an effort to move web2py to python 3 (along with other changes), that effort, called web3py, has apparently stalled or been terminated completely. This has consequences both in 2020 and now, as all development of new features is focused on python 3; only important bug fixes (e.g., security) are being backported to python 2.7. The other major python web frameworks already support python 3. They furthermore seem to inhabit more of a shared software ecosystem, where improvements and widgets are more easily shared. Web2py seems to be a bit of an outlier, and although it has a sizeable community at this point, the decline of python 2 will likely lead to the decline of the web2py community.

If the web2py developer(s) decide to build a python 3 release, it would seem the sooner they start the better. This is both because python development will be a moving target and because a debugged, reliable python 3 version will be a prerequisite for many web2py users to consider the upgrade to python 3. The sooner web2py is upgraded, the more time web2py users will have before 2020.

Given these problems, it’s time to start making plans to move arachadmin to pyramid (unless I discover a compelling reason to go with a different platform).  Currently I don’t plan to release a packaged arachadmin prior to the conversion, since the installation process will surely change along the way.  If there is interest, I could put up an installation on a small AWS instance, connected to a copy of the database that people could look at and explore (if it gets trashed, I can always upload a fresh copy).

I’ve been thinking about this for six months, and I am happy to finally have it off my chest. If you have read this far, thank you. Comments are most welcome.

Lots of little things

The pace of Arachnolingua development has picked up a bit. After adding the first set of annotations for a non-NCBI taxon (Leucauge mariana in a courtship description in Eberhard and Huber (1998) – a rich source of annotatable material), the need to support these curator-added taxa more completely became obvious. On the curation side, this meant filling in the details for such taxa – the hook was already there to make a term, but the term wasn’t identified as being a taxonomic name, nor linked to its source authority (in this case the World Spider Catalog). That would suffice to make it available as a taxonomy term, but I also needed a mechanism for adding one or more levels of parentage to connect the term to something already in NCBI. Fortunately, in the case of L. mariana, the parent genus was in NCBI, so chaining wasn’t required. Actually linking via parents was not strictly necessary for the curation process – it just needs to capture the parentage to pass along to the OWL generation tool. Before I finished those changes, the OWL generator created a number of knowledge base files that attached L. mariana to the root OWL term. That work is now finished, and L. mariana and a small subset of the behavior patterns documented by Eberhard and Huber are now annotated in Arachnolingua.

I also followed through on a long-overdue change that rippled throughout the implementation. I have been using the term ‘assertion’ throughout the code to refer to a conjunction of taxon, behavior, body part, publication, and eventually evidence. The problem with calling these ‘assertions’ is simply that the OWL specification talks about assertions that are much smaller – an Arachnolingua assertion decomposes into about a dozen OWL assertions or RDF triples (formally, an RDF triple is one way, probably the most common, of implementing an OWL assertion). So, to avoid confusion with OWL’s assertions (and since OWL also has a subset of assertions referred to as annotations, switching the Arachnolingua terminology to ‘annotation’ would merely relocate the problem), I started looking for synonyms.

The best synonym for ‘assertion’ that I found seemed to be ‘claim’ – it has the advantage of not sounding definitive and of suggesting the need for provenance (talking about a claim leads easily to the question ‘who’s making this claim?’). ‘Observation’ was a tempting second choice – most of the ‘assertions’ in arachnolingua are observations, but not all, and if I were to use the term in Arachnolingua, I would prefer it to refer to claims at the level of individuals – individual spiders doing a particular something on a particular substrate. So, in the software taxonomy, I could treat observations as a subclass of claims – though I haven’t yet seen the need to implement such a subclass.

After going back and forth on whether the switch made sense, I went ahead and implemented it everywhere – curation, database, and owl generator now refer to claims. In the curation screen, it seemed worthwhile to include ‘observation’ as well, since it was such a strong alternative:

[Screenshot of the curation screen]

Changes like this are ugly, but only get uglier the longer they are put off.

Finally, I’ve spent some time on the query component of the website. As I’ve described before, I’ve implemented an API that supports queries for the implemented search pages (currently just taxon and behavior). The API generates SPARQL to query the in-memory Sesame triple store and returns the unprocessed JSON that Sesame generates. Turning this JSON into results tables is pretty straightforward (though it represents the only ‘production’ JavaScript I’ve written to this point). The SPARQL query is rather slow, particularly for queries on higher taxonomic groups: queries on Tetragnatha straminea are fast; queries on Arachnida are much slower and definitely not recommended at this time. I took a long hard look at the SPARQL I had written and found a couple of things I thought would really speed it up – the net result was only a 20% speedup for the API call (measured on the browser side using firebug). I’m pretty sure I need to be traversing the taxonomy more intelligently, maybe using some server-side inference.
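For context, the API call boils down to something like the following, sketched here with the requests library; the repository URL and the query itself are illustrative, not the production code.

# POST-style SPARQL protocol query against a Sesame repository endpoint;
# the JSON results are passed back to the browser more or less as-is.
import requests

SESAME_ENDPOINT = 'http://localhost:8080/openrdf-sesame/repositories/arachnolingua'

QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?behavior ?label WHERE {
    ?behavior rdfs:label ?label .
} LIMIT 25
"""

response = requests.get(
    SESAME_ENDPOINT,
    params={'query': QUERY},
    headers={'Accept': 'application/sparql-results+json'})
results = response.json()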

I also added lexical filtering of terms from NCBI. There are a lot of ‘taxa’ in NCBI that reflect either samples or unidentified species. As samples are individuals (or collections of individuals), their identifiers don’t belong in the taxonomy for Arachnolingua’s purposes. This is completely aside from whether any behavior data was ever collected from these individuals – I tracked down some underlying Genbank submissions, and some were from old collaborators and friends in the Maddison lab, so there might be some video data for a few of them. The worst offender in sheer quantity was the Barcode of Life and its BOLD submissions. I know NCBI is planning to move samples to a separate database; I don’t know whether this will apply only to new records or to existing records as well, or how quickly it will happen. So it’s best to filter these out consistently in both curation and OWL generation. Removing these terms did provide some query speedup for the worst cases.
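The filtering itself is simple lexical pattern-matching, along these lines; the patterns are illustrative of the BOLD and ‘unclassified’ cases, not the exact production rules.

# Drop NCBI names that look like samples or bins rather than taxa.
import re

SAMPLE_PATTERNS = [
    re.compile(r'\bsp\.\s+[A-Z0-9-]+$'),   # e.g. 'Salticidae sp. RBCH101-04'
    re.compile(r'^unclassified\b', re.I),  # 'unclassified Salticidae' bins
    re.compile(r'\bBOLD[:\s]'),            # Barcode of Life sample ids
]

def is_real_taxon(name):
    """Return False for names that look like samples, not taxa."""
    return not any(p.search(name) for p in SAMPLE_PATTERNS)

names = ['Tetragnatha straminea', 'Salticidae sp. RBCH101-04',
         'unclassified Salticidae']
print([n for n in names if is_real_taxon(n)])  # ['Tetragnatha straminea']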

Lots of interesting spider behavior posted to the web in the past few weeks, but this post is already long enough.  I hope my northern hemisphere readers are enjoying the spring as I have.


Back to curation

So, now that I’ve built the skeleton of a pipeline, I’m going back to the start to work on building it up. First stop is the curation tool (arachadmin) and processing publications. Up until now, all the author information has been stored as an undecoded string taken more or less directly from the citation (e.g., “Last, first; Last, first; Last, initials”). The tool does chew this up a bit in the publication list display, but none of that is getting stored or passed down the pipeline. During the past week I finished implementing an author table that will include assigned ids (intended to be ORCIDs for authors that have them) or generated ids. For the record, I do have an ORCID – it’s 0000-0001-6512-3296 – though the information there is a bit out of date (try me on Impact Story, which I find easier to update).
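Decoding those strings is mostly mechanical; a toy version of the approach (not the arachadmin code itself) looks like this:

def parse_authors(raw):
    """Split a citation-style author string into (surname, given) pairs.

    >>> parse_authors('Eberhard, W.; Huber, B.')
    [('Eberhard', 'W.'), ('Huber', 'B.')]
    """
    authors = []
    for chunk in raw.split(';'):
        surname, _, given = chunk.partition(',')
        authors.append((surname.strip(), given.strip()))
    return authors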

A perennial problem with maintaining citations is the multiple ways an author’s name can appear. To manage this, I’ve been implementing a simple system that allows me to merge the various forms of an author’s name. I’ve tried to design something flexible – in particular, merges can be reversed (for example, if two authors share the same last name and first initial, they can be separated if this is discovered after a merge).

The underlying database changes to support this involve tables for authors, for merge records (allowing the original usages to be extracted), and for authorship – authors, publications, and listing orders (e.g., first author = 1, etc.). The tables are there, but currently only the authors table is filled. The merge information can be generated, and after a number of hours of fiddling with forms and tables-within-forms, there is now a page for displaying candidate merges (generated on the fly, since the table has not been populated), and the results from the form are being thrown at a placeholder page that ignores them for now.
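The reversibility hinges on the merge records never destroying the original usage strings; in miniature (names and structures illustrative, not the actual tables), the idea is:

# A merge repoints a usage string at a surviving author record and logs the
# old link, so a bad merge can be unpicked later.
class AuthorMerges:
    def __init__(self):
        self.usage_to_author = {}  # 'Eberhard, W.' -> author id
        self.merge_log = []        # (usage, old_author_id, new_author_id)

    def merge(self, usage, new_author):
        old_author = self.usage_to_author.get(usage)
        self.merge_log.append((usage, old_author, new_author))
        self.usage_to_author[usage] = new_author

    def undo_last(self, usage):
        for i in range(len(self.merge_log) - 1, -1, -1):
            logged, old, new = self.merge_log[i]
            if logged == usage:
                self.usage_to_author[usage] = old  # restore the old link
                del self.merge_log[i]
                return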

Once the merging is taken care of, the next step will be to push these to owlbuilder with the promise of a proper publication page for the main webapp.

On the horizon is PostgreSQL – the day job is considering switching over from mysql to postgresql, and I would be happy to follow. I got a fair amount of postgresql experience from my time with Phenoscape, and the future of mysql has held a bit of uncertainty ever since Oracle bought it along with the rest of Sun. The experience will help, since administration of postgresql is a bit more involved than mysql, whose administration is much closer to ‘simply working.’

NBO and all its pieces

This morning I tested the code that loads the ontologies that arachnolingua references in assertions. These are currently the spider biology ontology (SPD), the NCBI taxonomy (NCBITAXON), the Evidence Codes Ontology (ECO), and of course the NeuroBehavior Ontology (NBO). I’ve wrestled with the NCBI taxonomy for years (e.g., as the basis of several parts of Phenoscape’s Vertebrate Taxonomy Ontology) and know that it is both large and filled with terms that will be either out of scope (e.g., environmental samples and probably most bacteria, viruses, and marine organisms) or simply unusable (e.g., children of ‘unclassified’ taxa, such as Salticidae sp. RBCH101-04, a child of ‘unclassified Salticidae’, which exists only as a genetic sample, never to exhibit behavior). In the frontend, I’ve attacked the scope of NCBI by parsing the RDF/XML OWL rendering (thanks to Chris Mungall), building just enough of a tree to trim down to the arachnid clade and to drop any children of taxa named ‘unclassified XXX’. This (incorrect) solution gives a list of about 11,000 taxa, small enough not to overload the drop-down list. Of course many non-arachnid taxa will be needed for curation, whether as prey, predators, parasites, hosts, or simply substrates (e.g., plants). I expect these will be introduced using the mechanism I’ve planned for arachnids not currently in NCBI – the ability to add names, list an authority (e.g., the World Spider Catalog), and provide enough parentage to link to a higher-level group known to NCBI.
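The trimming logic amounts to a walk up the parent links; here it is in miniature, with the parent and label maps standing in for the tree parsed out of the RDF/XML.

# Keep a taxon only if its ancestry reaches Arachnida without passing
# through an 'unclassified' bin. `parents` maps child id -> parent id and
# `labels` maps id -> name.
ARACHNIDA = 'NCBITaxon_6854'  # Arachnida's id in the NCBI taxonomy

def keep_taxon(taxon, parents, labels):
    while taxon is not None:
        if labels.get(taxon, '').startswith('unclassified'):
            return False  # child of an 'unclassified XXX' bin
        if taxon == ARACHNIDA:
            return True
        taxon = parents.get(taxon)
    return False  # never reached Arachnida: out of scope for the list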

So NCBI is big, but a relatively well understood problem. This morning NCBI took almost a minute to load from a local copy – this seems slow, but this was in owlbuilder, which was reading it with the OWLAPI. Not only slow, but memory hungry – I only got good performance after bumping the virtual size of the jvm to 4 GB. After NCBI, owlbuilder attempted to load NBO, and this is where things got messy. Unlike the simple ‘build and filter a list of terms’ approach that the arachadmin editor uses, the OWLAPI honors all the import statements, which means, as it turns out, that NBO triggers loading of the Chemical Entities of Biological Interest (CHEBI) ontology, the Gene Ontology (GO), the Phenotypic Quality Ontology (PATO), and Uberon, an umbrella anatomy ontology that is intended to cover all Bilateria but is currently more vertebrate than arthropod focused. Of these, CHEBI will likely be of interest given the number of chemical communication papers I’ve identified, and GO is likely to be useful as an alternative source of behavior terms. Uberon may be more useful in the future, but currently there is no integration with the spider ontology. I have been remiss up to now in not including the tick gross anatomy ontology (TADS), but that should be added to the frontend before I start curating relevant papers. Perhaps Uberon will cover Chelicerata by the time serious curation of Acari papers starts. PATO is another question altogether – it’s an ontology of phenotypic qualities, which includes a lot of behavior qualities, but currently Arachnolingua is not focused on capturing entity-quality annotations (e.g., Phenoscape-style). Loading PATO is actually where the load died with a network timeout this morning. It looks like I will need to support local caching for the backend, and I’m considering using a common cache directory for both the backend and frontend for the time being (I’m in no rush to move the backend processing to AWS).
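A minimal sketch of the caching idea, assuming a shared directory and PURLs to fetch from (the Java side would additionally need the OWLAPI pointed at the cached files, e.g. via an IRI mapper):

# Download each ontology once into a shared cache; later loads read from disk.
import os
import urllib.request

CACHE_DIR = os.path.expanduser('~/.arachnolingua/ontology-cache')  # illustrative

def cached_path(url):
    """Return a local path for url, downloading on first use."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    local = os.path.join(CACHE_DIR, url.rsplit('/', 1)[-1])
    if not os.path.exists(local):
        urllib.request.urlretrieve(url, local)
    return local

nbo_file = cached_path('http://purl.obolibrary.org/obo/nbo.owl')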

So it looks like I’ll be adding a local cache location to the source_ontology table in arachadmin.


It’s Spider Monday; no pretty pictures, but lots of database work

I’ve been focusing on the frontend Arachadmin webapp for several months, and now that I can, albeit slowly, add assertions that annotate behavior descriptions, it’s time to pay some more attention to generating OWL files. Although there are still several housekeeping tasks (e.g., updating the OWLAPI and other libraries), I focused on how the database has changed recently. The most important changes were adding code for loading Term beans (objects) from the term table, as well as starting support for loading ontology files into the OWL tool. Loading the support ontologies into the backend will allow full access to a reasoner to support filtering terms and axioms for inclusion in the final knowledge base. A number of supporting tables – those that define the domains and authorities needed to load terms meaningfully – were loaded as well. One of the drivers for adding support for these secondary tables so quickly was getting the testing framework set up for terms and support ontologies, both of which refer to the set of domains (e.g., taxonomy, anatomy, behavior) defined in the database and assigned to each term and support ontology. This also meant that the test database needed to be updated to reflect the schema structure of the working arachadmin database more closely. The updated test database has been uploaded to github. As usual, I will upload updates of the full working database to dropbox and figshare.

Note: I’ve just published the arachadmin export on figshare.  If I understand how this works, this data now has a permanent (across future revisions) identifier, namely this.


Needless to say, I did enjoy the many #SpiderMonday photos people tweeted today. My arthropod photography skills are gradually improving – maybe in another season I’ll have something to share.