Note to self and others

Arachnolingua focuses its OWL expressions on the claim (= assertion, statement) about an individual behavior event or class expression, and properties should start there and work outward.  Thus: courtship_event –has_participant–> palp –part_of–> male –has_type–> Habronattus sp.  There may be consequences to this decision (especially for class-level statements), but it is better to be consistent and to document the design decision here for now.
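
To make the shape of that expression concrete, here is a minimal sketch of the chain as RDF triples using rdflib; the namespace and term names are placeholder stand-ins for illustration, not the actual Arachnolingua or OBO IRIs (the real knowledge base is built by the java owlbuilder tool through the OWLAPI).

```python
from rdflib import Graph, Namespace, BNode
from rdflib.namespace import RDF

# placeholder namespace; real terms come from NBO, RO, BFO, and the taxonomy sources
EX = Namespace("http://example.org/arachnolingua/")

g = Graph()
event, palp, male = BNode(), BNode(), BNode()

g.add((event, RDF.type, EX.courtship_event))   # the individual behavior event
g.add((event, EX.has_participant, palp))       # courtship_event -has_participant-> palp
g.add((palp, EX.part_of, male))                # palp -part_of-> male
g.add((male, RDF.type, EX.Habronattus_sp))     # male -has_type-> Habronattus sp.

print(g.serialize(format="turtle"))
```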

This should eventually make it into the ‘primer’ documents for the curation tool and the database schema as well.  I wonder if there are any tools in Protege for looking at graphs of individuals – maybe lego?

Exploring Alternatives

One of the nice parts about the switch to pyramid is that it supports a range of alternatives.  For example, at the moment I am trying out forms in WTForms after converting the templates to jinja2.  In some ways the arachcurator tool has become a hybrid of technologies from pylons and from flask.  I guess this means that if pylons disappears tomorrow, I won’t be starting from scratch.  It was a gut-level instinct call that led me to prefer pyramid over flask in the first place.
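
For the curious, the WTForms experiment looks roughly like this – a form class plus validation in the view.  The field names are made up for illustration; they are not the actual arachcurator fields.

```python
from wtforms import Form, SelectField, StringField, validators

class ClaimForm(Form):
    # illustrative fields only
    behavior = SelectField('Behavior term',
                           choices=[('1', 'courtship'), ('2', 'predation')])
    publication_text = StringField('Text from publication',
                                   [validators.Length(max=512)])

# In a pyramid view, the form is typically bound to the request:
#     form = ClaimForm(request.POST)
#     if request.method == 'POST' and form.validate():
#         ...save the claim and redirect...
```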

Meanwhile, all the browsing code is built, tested, and now back in a state of flux as I shift forms technology.  There are d3.js graphs for both individuals and their parts, as well as expanded display trees on the claim page.  I have added users (mostly for change tracking) and three levels of authorization, with the hope of putting an arachcurator server up on AWS at some point (sharing more than code and demonstrating replicability, I hope).  I even went so far as to add password hashing with passlib.
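
The passlib piece is small; here is a minimal sketch (the hashing scheme is my assumption, not necessarily what arachcurator is configured to use):

```python
from passlib.context import CryptContext

pwd_context = CryptContext(schemes=["pbkdf2_sha256"], deprecated="auto")

def hash_password(plaintext):
    """Return a salted hash suitable for storing in the user table."""
    return pwd_context.hash(plaintext)

def check_password(plaintext, stored_hash):
    """Verify a login attempt against the stored hash."""
    return pwd_context.verify(plaintext, stored_hash)
```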

Meanwhile, as discussed here, I’ve started building a T-box vocabulary for spider behavior.  This will extend what can be said in annotations, in a way that the changes described here will only simplify.

Updating Sesame, starting the move to pyramid/postgresql

All technical stuff this week.  I’ve got the Spider-Behavior server code connected to travis-ci.  In the process of doing so, I ran into a problem with Sesame, which turned out to be related to the maven setup.  It’s working now and updated to the most recent 2.8 release.  This will allow some more changes, particularly some more formatting work on the narratives page.

While waiting for a response from the Sesame-users group, I started taking a serious look at moving arachadmin off web2py.  It looks like I’ll be able to move to a python3.5–pyramid–postgresql stack.  I have a first pass at the new table structure working with sqlalchemy, and will soon start coding the transfer from mysql.  The experience is a lot closer to standard python development than the somewhat cushioned ride offered by web2py.
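
To give a flavor of the new stack, here is a stripped-down version of what a table looks like under sqlalchemy’s declarative layer.  The table and column names are illustrative guesses, not the actual arachadmin schema, and the connection string is a placeholder.

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Publication(Base):
    __tablename__ = 'publication'
    id = Column(Integer, primary_key=True)
    doi = Column(String(255), unique=True, nullable=True)
    title = Column(String(1024))
    publication_year = Column(Integer)

# placeholder connection to a local postgresql instance
engine = create_engine('postgresql://localhost/arachadmin')
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
```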

Meanwhile, I’m spending some time reviewing the opentree taxonomy against the World Spider Catalog.  So far nothing horrible, mostly groups added since 2009 and some unstable family splits.

Whither arachadmin and web2py?

As some of you know, my ‘day job’ is developing for the Open Tree of Life project.  I started development on arachadmin about the time I was starting with opentree.  My first opentree assignment was to add functionality to Rick Ree’s phylografter tool.  Briefly, phylografter provides a database of trees and tools for displaying and grafting portions of trees together.  The important part for now is that the tool was implemented using web2py and used mysql for the database.  At that point, the arachnolingua database consisted of a spreadsheet containing details of around 530 publications.  I had given thought to moving the publication spreadsheet to a proper database, but I didn’t see any obvious lightweight java solutions for building a gui application to manage this data.  In theory I could have used an existing tool via an api, but I wasn’t sure where to look.  Meanwhile, phylografter presented me with a working model of a web-based gui interface to a relational database.  NoSQL didn’t seem like the right place to begin, since I was starting with a (fairly) well-defined table in the spreadsheet.  It also meant I would be learning python and web2py on both projects, which seemed (and still seems) like a good way to ease task switching.

Starting with the publication table, I built up arachadmin using web2py, learning python and some more bits of javascript along the way.  The web2py environment is a very good way to get started with python-based web applications.  It provides lots of pieces for working with databases and building web pages, and it enforces a model-view-controller architecture.  Even now, I wouldn’t recommend against it for certain types of projects.  Likewise, the rest of the opentree software team used web2py in multiple applications that were assembled to support the opentree website and support tools.

As time went by, I started to notice some problems with the way web2py did things.  Likewise, other opentree people, whose python expertise I greatly respect, started raising issues with the quirks of the web2py environment relative to straight python.  It turns out that our issues don’t exactly overlap (maybe because I’ve wrestled more with web2py’s database access layer (DAL)).  The aforementioned opentree investigator actually recoded a piece of the web2py system using pyramid rather than web2py, so that would be one way to go.  Of course flask and django would also be contenders, though at present I’m liking pyramid as the ideal middle ground – flask can be built up to larger sites, but I would be converting a medium-sized application to start, so including a few more ‘batteries’ may help the process along.  Django would have everything I would need but would also include things I don’t (it seems focused on CMS-type applications, not multifaceted data editors), and might make sharing the database with the owlbuilder backend tool more difficult.

So here are my three issues that will eventually force arachadmin to a new platform.

1. Testing at the unit and larger scales is an integral part of modern software development.  Unit testing isn’t impossible with web2py, but there are serious limitations stemming from the way the environment for code is set up.  Python provides two popular unit test frameworks – doctest and unittest.  The latter is a pretty standard implementation of the xunit testing framework found in multiple languages.  Doctest is a simplified system that allows inclusion of test cases in a python function’s (or method’s) documentation string.  When doctests are run against a module, each function’s doc string is scanned for patterns that indicate a test followed by its expected result.  This works well, especially for code in the functional style, which can be tested with minimal or no context.  Testing code requiring set-up context (state) is more difficult, since putting code in the doc string doesn’t provide much opportunity to set up state.  Unfortunately, web2py functions depend on a lot of state (which also makes it difficult to use tools like pylint).  Most of the state is implicit, so you don’t know what to specify as imports to make a particular variable make sense to testing or code analysis tools.  My impression, which I expect to verify one way or another, is that other web environments make things more explicit, allowing unit tests and code analysis to be implemented and run more smoothly.
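
To make the contrast concrete, here is what the doctest style looks like for a context-free helper.  The function is a made-up example, not actual arachadmin code.

```python
def participant_label(taxon, part=None):
    """Build a short display label for a participant.

    >>> participant_label('Leucauge mariana', 'palp')
    'palp of Leucauge mariana'
    >>> participant_label('Leucauge mariana')
    'Leucauge mariana'
    """
    return '%s of %s' % (part, taxon) if part else taxon

if __name__ == '__main__':
    import doctest
    doctest.testmod()
```

This runs under plain python with no web2py state at all; a controller function that implicitly depends on request, response, and db has no comparable way to set that state up inside a doc string.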

2. The Database Access Layer (DAL) is perhaps the most important component of web2py.  It provides uniform access to a range of SQL and a few NOSQL database managers.  You define your tables and relations in model files (which are python source, but with a lot of the environment predefined).  The system is fairly robust to table modifications (and would probably be even more so if I were using postgres instead of mysql), and there are usually ways to dig yourself out if things break.  In particular, if the system complains, you can dump and reload your database, and if the table specifications in your model file agree with the schema in your SQL dump, things almost always clear up.  The DAL is well integrated with the web page formatting, providing predefined templates for tables and individual column values, which are fairly easily customizable either through constructor arguments or by tweaking the provided javascript and CSS files, which include the (once?) popular Bootstrap.
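
For readers who haven’t seen web2py, a model file looks something like this (DAL and Field are injected into the execution environment, so there are no imports; the tables shown are illustrative, not the real arachadmin schema):

```python
# db.py -- a web2py model file (sketch)
db = DAL('mysql://user:password@localhost/arachadmin')

db.define_table('publication',
                Field('doi', 'string'),
                Field('title', 'string'),
                Field('publication_year', 'integer'))

db.define_table('claim',
                Field('publication', 'reference publication'),
                Field('behavior_term', 'string'))

# every table silently gains an auto-incrementing integer 'id' primary key
```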

So what’s the problem?  I rather like that the DAL is not a full object-relational mapping.  The DAL uses objects, but they fairly transparently represent rows, sets of rows, and queries, rather than hiding the SQL behind user objects.  This also makes escaping to raw SQL easy to integrate with pure DAL code, which is sometimes necessary for speed if you’re doing big database exports or table updates.  The real problem with the DAL is its approach to indexing.  Every table must include an auto-incrementing integer field as its primary key.  This probably makes a lot of sense to beginners, but it is less natural when you consider real data that comes with its own identifiers rather than auto-incremented integers.  For example, every named individual in OWL is identified by an IRI, so the string version of the IRI is the natural identifier for OWL objects.  Likewise, many publications have DOIs, and academic authors are signing up for ORCID ids.  These external identifiers are frequently the best choice for indexing (despite being strings, there isn’t substantial cost to making your primary key a string).  Things are slightly more complicated in the case of arachnolingua because several tables hold rows that may have an external IRI or may lack one and need a unique identifier generated.  Since it is best to let the database handle uniqueness checking, I currently need to maintain all the generated identifiers in a separate table to enforce cross-table uniqueness.  Regardless, and certainly for imported terms, there is no intrinsic need for an integer index.
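
Under sqlalchemy (or plain SQL), using the external identifier directly is unremarkable; a sketch, with illustrative names rather than the planned arachadmin schema:

```python
from sqlalchemy import Column, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Term(Base):
    """An imported OWL/OBO term keyed directly on its IRI."""
    __tablename__ = 'term'
    iri = Column(String(512), primary_key=True)   # the IRI string itself is the key
    label = Column(String(255))
    source_ontology = Column(String(64))
```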

In addition, the integer index appears in other tables, such as those supporting many-to-many mappings, where it would be completely useless if the DAL didn’t require it.  The DAL also makes it very easy to retrieve rows via the integer primary key, relative to the select statements required for any other field.  This makes it very easy to let unnecessary dependence on the integer key creep into your code.

Overall, this dependence on integer primary key columns may make sense for small systems and databases that aren’t accessed by multiple applications, but I expect that more experienced database analysts would cringe.  I don’t consider myself to be any sort of serious database analyst, but the problem here is becoming more obvious with time.

3. Python 3 is the future of python.  This is no longer particularly controversial.  The official end-of-life date for python 2.7 is 2020.  Although there was an effort to move web2py to python 3 (along with other changes), that effort, called web3py, has apparently stalled or been abandoned.  This has consequences both in 2020 and now, as continued development of new features is focused on python 3; only important bug fixes (e.g., security) are being backported to python 2.7.  The other major python web frameworks already support python 3.  They furthermore seem to inhabit more of a shared software ecosystem, where improvements and widgets are more easily shared.  Web2py seems to be a bit of an outlier, and although there is a sizeable community at this point, the decline of python 2 will likely lead to the decline of the web2py community.

If the web2py developer(s) decide to build a python 3 release, it would seem the sooner they start the better.  This is both because python development will be a moving target and because getting a debugged, reliable python 3 version will be a prerequisite for many web2py users to consider the upgrade to python 3.  The sooner web2py is upgraded, the more time web2py users will have before 2020.

Given these problems, it’s time to start making plans to move arachadmin to pyramid (unless I discover a compelling reason to go with a different platform).  Currently I don’t plan to release a packaged arachadmin prior to the conversion, since the installation process will surely change along the way.  If there is interest, I could put up an installation on a small AWS instance, connected to a copy of the database that people could look at and explore (if it gets trashed, I can always upload a fresh copy).

I’ve been thinking about this for six months, and I am happy to have it off my chest.  If you have read this far, thank you.  Comments are most welcome.

Curation Update

As promised a few weeks back, I recently (more or less) completed some pretty substantial changes to the curation tool.  Specifically, I have changed the process of specifying ‘participants’ (actors, objects) associated with behavior events.  Participants typically include a spider or its anatomical parts, and possibly another organism (mate, prey) or a substrate (silk, plant parts, a surface).  The old version simply allowed choosing a term for (spider) taxon and anatomy as the primary participant, plus a secondary substrate participant.  This allowed capturing simple events, but was completely inadequate for capturing social or predation interactions.  The old version did have the virtue of simplicity, and may return in some form of ‘wizard tool’ in the future.

The new version runs rather closer to the OWL that ultimately gets generated.  Claims consist of a ‘root’ element that captures the behavior term and a set of participants, each a tree (potentially, though most commonly just a chain) of participant ‘elements’ that correspond to OWL classes, linked by arcs corresponding to OWL predicates (aka OBO relations).  The tree for a participant more or less corresponds to an OWL class or individual expression.  The interface still requires a great deal of polish, but I’ve added a simple d3 graph display that should make the structure of what’s being created a little clearer.
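
In code, the structure the editor builds up is roughly the following; the class and attribute names are illustrative, not the actual arachadmin implementation.

```python
class ParticipantElement(object):
    """One node in a participant tree; corresponds to an OWL class term."""
    def __init__(self, term):
        self.term = term
        self.children = []   # list of (predicate, ParticipantElement) arcs

class Claim(object):
    """Root element: the behavior term plus its participants."""
    def __init__(self, behavior_term):
        self.behavior_term = behavior_term
        self.participants = []   # list of (predicate, ParticipantElement) arcs

# the courtship chain used as an example elsewhere in these notes
male = ParticipantElement('Habronattus sp.')
palp = ParticipantElement('palp')
palp.children.append(('part_of', male))

claim = Claim('courtship behavior')
claim.participants.append(('active_participant', palp))
```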

[Screenshot: lower portion of the claim page]

The ‘table of associated participants’ lists each participant using pidgin English and indicates its level of participation: ‘active participant’ or (passive) ‘participant’.  The graph below the table shows the relations between the participant elements (squares) and the root, with predicates indicated by the color of the connecting lines (red = active participant, orange = participant, blue = part of).  The link table on the right will (when I’ve finished implementing it) show more details about the ‘active’ participant element.  This table, along with color changes when an element is selected, remains to be implemented, though the selection mechanism is there.  Like many d3 graphs, node placement is ‘force-driven’ and can be tweaked by dragging with your pointing device.  The graph display is largely a transformation and simplification of this molecule example.

The corresponding upper part of the page looks like this:

[Screenshot: upper portion of the claim page]

This part has changed relatively little, except that the drop-down selection fields for taxon, anatomy, and environment terms have disappeared; they now appear in the element creation dialog, which opens as an ajax form in the participant element section.  The text fields for publication taxon, anatomy, and substrate remain.  These text metadata fields are associated with the participant as a whole, rather than with the corresponding elements.

I’ve started writing some tests for the curation tool.  Adding Selenium to walk through the screens was relatively straightforward.  Unit tests have been rather more of a challenge, due, in part, to the eccentricities of the web2py framework.  But the issues I’ve encountered with web2py are large enough to deserve a post of their own.  Suffice it to say, there are some doctests in the code, but traditional unit tests remain in the future.
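
The Selenium walk-throughs are straightforward; a minimal sketch (the local URL and link text are assumptions, not the actual arachadmin routes):

```python
from selenium import webdriver

driver = webdriver.Firefox()
try:
    # walk to the claim list and open the first claim-related link
    driver.get('http://127.0.0.1:8000/arachadmin/claim/index')
    assert 'claim' in driver.title.lower()
    driver.find_element_by_partial_link_text('claim').click()
finally:
    driver.quit()
```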

This is what I had been working on, as time permitted, over the second half of 2014.  I’ve now switched back to the owl generation side which, as you may remember, is a java tool that pulls from the database and builds the knowledge base using the OWLAPI.  There is a lot of work to do there as well.  Since switching, I’ve started using the EclEmma test coverage tool to try to improve the rather pathetic test harness I had for the owl generation.  At least unit testing has been fairly straightforward so far.

Quick update

It’s been a couple of months, but if you look over at github, you’ll see a merge that represents the (more or less, for now) completion of a major overhaul of the claim editor in arachadmin.  Since the claim editor is the page that allows the curator to construct annotations for behavioral events or assertions, it’s effectively the heart of arachnolingua.  In brief, the interface has changed from a ‘wizard-like’ form, where the curator would specify a taxon, a body part, and optionally a substrate, to something closer to a graph that represents the final OWL class expression.  I think this was the right way to handle getting individuals into claim expressions properly.  It should also make for a better curation experience, and it allowed me to play with displaying graphics with d3.

This is not the final word on this update – there will be pictures in a follow-up post.  Meanwhile, it’s time to review the state of the backend and what needs to be added to capture the redesign that’s occurred in the database.

Lots of little things

The pace of Arachnolingua development has picked up a bit.  After adding the first set of annotations for a non-NCBI taxon (Leucauge mariana, in a courtship description in Eberhard and Huber (1998) – a rich source of annotatable material), the need to support these curator-added taxa more completely became obvious.  On the curation side, this meant filling in the details for such taxa – the hook was already there to make a term, but the term was not identified as a taxonomic name, nor was its source authority recorded (in this case the World Spider Catalog).  That would suffice to make it available as a taxonomy term, but I also needed to add a mechanism for adding one or more levels of parentage to connect the term to something already in NCBI.  Fortunately, in the case of L. mariana, the parent genus was in NCBI, so chaining wasn’t required.  Actually linking via parents was not strictly necessary for the curation process – it just needs to capture the parentage to pass to the OWL generation tool.  Before I finished those changes, the OWL generator was creating knowledge base files that attached L. mariana to the root OWL term.  These changes have now been finished, and L. mariana and a small subset of the behavior patterns documented by Eberhard and Huber are now annotated in Arachnolingua.

I also followed through on a long-overdue change that rippled throughout the implementation.  I have been using the term ‘assertion’ throughout the code to refer to a conjunction of taxon, behavior, body part, publication, and eventually evidence.  The problem with calling these ‘assertions’ is simply that the OWL specification talks about assertions that are much smaller – an Arachnolingua assertion can be decomposed into about a dozen OWL assertions or RDF triples (formally, an RDF triple is one way, probably the most common, of implementing an OWL assertion).  So, to avoid confusion with OWL assertions (and since OWL also has a subset of assertions referred to as annotations, switching the Arachnolingua terminology to ‘annotation’ would merely relocate the problem), I started looking for synonyms.

The best synonym for ‘assertion’ that I found seemed to be ‘claim’ – it has the advantage of not sounding definitive and of suggesting the need for provenance (talking about a claim leads easily to the question ‘who’s making this claim?’).  Observation was a tempting second choice – most of the ‘assertions’ in arachnolingua are observations, but not all, and if I were to use the term observation in Arachnolingua, I would prefer it to refer to claims at the level of individuals – individual spiders doing a particular something on a particular substrate.  So, in the software taxonomy, I could treat observations as a subclass of claims – though I haven’t yet seen the need to implement such a subclass.

After going back and forth on whether the switch made sense, I went ahead and implemented it everywhere – curation, database, and owl generator now refer to claims.   In the curation screen, it seemed worthwhile to include observation as that was such a strong alternative:

[Screenshot: curation screen]

Changes like this are ugly, but only get uglier the longer they are put off.

Finally, I’ve spent some time on the query component of the website.  As I’ve described before, I’ve implemented an API that supports queries for the implemented search pages (currently just taxon and behavior).  The API generates SPARQL to query the in-memory Sesame triple store and returns the unprocessed JSON that Sesame generates.  Turning this JSON into results tables is pretty straightforward (though it represents the only ‘production’ JavaScript I’ve written to this point).  The SPARQL query is rather slow, particularly for queries against higher taxonomic groups: queries for Tetragnatha straminea are fast, queries for Arachnida are much slower and definitely not recommended at this time.  I took a long, hard look at the SPARQL I had written and found a couple of things I thought would really speed it up – the net result was only a 20% speedup for the API call (measured on the browser side using firebug).  I’m pretty sure I need to be traversing the taxonomy more intelligently, maybe using some server-side inference.
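
For reference, the API essentially does the equivalent of the following against the Sesame repository’s HTTP endpoint; the repository URL and the query itself are placeholders, not the production arachnolingua setup.

```python
import requests

REPO = 'http://localhost:8080/openrdf-sesame/repositories/arachnolingua'

query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?claim ?label
WHERE { ?claim rdfs:label ?label . }
LIMIT 10
"""

resp = requests.get(REPO,
                    params={'query': query},
                    headers={'Accept': 'application/sparql-results+json'})
resp.raise_for_status()

# Sesame returns standard SPARQL JSON results: head/vars plus results/bindings
for binding in resp.json()['results']['bindings']:
    print(binding['claim']['value'], binding['label']['value'])
```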

I also added lexical filtering of terms from NCBI.  There are a lot of ‘taxa’ in NCBI that reflect either samples or unidentified species.  As samples are individuals (or collections of individuals), their identifiers don’t belong in the taxonomy for Arachnolingua’s purposes.  This is completely aside from whether any behavior data was ever collected from these individuals – I tracked down some of the underlying Genbank submissions, and some were from old collaborators and friends in the Maddison lab, so there might be some video data for some of these submissions.  The worst offender in sheer quantity was the Barcode of Life project and its BOLD submissions.  I know NCBI is planning to move samples to a separate database; I don’t know whether this applies only to new records or whether existing records will be moved, nor how quickly it will happen.  So it’s best to filter these out consistently in both curation and OWL generation.  Removing these terms did provide some query speedup for the worst cases.
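
The filtering itself is just lexical pattern matching; here is a sketch of the idea (the exact patterns Arachnolingua uses may differ):

```python
import re

SUSPECT_PATTERNS = [
    re.compile(r'\bsp\.'),                      # unidentified species
    re.compile(r'\bcf\.'),                      # uncertain determinations
    re.compile(r'\bBOLD\b'),                    # Barcode of Life sample identifiers
    re.compile(r'environmental sample', re.I),  # bulk environmental sequences
]

def is_usable_taxon_name(name):
    """True if the NCBI name looks like a real taxon rather than a sample."""
    return not any(p.search(name) for p in SUSPECT_PATTERNS)

assert is_usable_taxon_name('Leucauge mariana')
assert not is_usable_taxon_name('Araneidae sp. BOLD:AAA0001')
```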

Lots of interesting spider behavior posted to the web in the past few weeks, but this post is already long enough.  I hope my northern hemisphere readers are enjoying the spring as I have.