Note to self and others

Arachnolingua focuses its OWL expressions on the claim (= assertion, statement) of an individual behavior event or class expression and properties should start there and work out.  Thus: courtship_event –has_participant–>palp–part_of–>male–has_type–>Habronattus sp.  There may be consequences for this decision (especially for class level statements), but it is better to be consistent and document the design decision here for now.

This should eventually make it into the ‘primer’ documents for the curation tool and the database schema as well.  I wonder if there are any tools in Protege for looking at graphs of individuals – maybe lego?

Advertisements

Last few weeks

Writing lots of unit tests and associated refactoring.  Part of this was inspired by reading though Martin’s (2008) Clean Code,  which had been sitting on my shelf for a couple of years.  Most useful thing I found was Martin’s admission that even he writes big ugly functions on the first pass.  Definitely a lot of cleanup in the new arachcurator editor.  Also triggering some simplification in the database – I’ve removed a many-to-many mapping table (participant2claim) since the relation is really many-to-one.  I think there are a couple of other tables that will suffer the same fate.

I should get back to cleaning up my list of terms as well.

img_20160615_102730841_hdr-1

Exploring Alternatives

One of the nice parts about the switch to pyramid is that it supports a range of alternatives.  For example, at the moment I am trying out forms in WTForms after converting the templates to jinja2.  In some ways the arachcurator tool has become a hybrid of technologies from pylons and from flask.  I guess this means if pylons disappears tomorrow, I won’t be starting from scratch.  It was a gut-level instinct call that lead me to prefer pyramid over flask in the first place.

Meanwhile, all the browsing code is built, tested, and now back in a state of flux as I shift forms technology.  There are 3djs graphs for both individuals and their part as well as expanded display trees on the claim page.  I have added users (mostly for change tracking) and three levels of authorization with the hope to put an arachcurator server up on AWS at some point (sharing more than code and demonstrating replicability I hope).  I even went so far as to add password encoding with passlib.

Meanwhile, as discussed here, I’ve started building a T-box vocabulary for spider behavior.  This will extend what can be said in annotations, in a way that changes here will only simplify.

 

 

Updating Sesame, starting the move to pyramid/postgresql

All technical stuff this week.  I’ve got the Spider-Behavior server code connected to travis-ci.  In the process of doing so, I ran into problem with Sesame, which turned out to be related to the maven setup.   It’s working and updated to the most recent 2.8 release.  This will allow some more changes, particularly some more formatting work on the narratives page.

While waiting for a response from the Sesame-users group, I started taking a serious look at moving arachadmin from web2py.  It looks like I’ll be able to move to a python3.5-pyramid-postgresql stack.  I have a first pass at the new table structure working with sqlalchemy, and will soon start the process of coding the transfer from mysql.  The experience is a lot closer to standard python development, rather than the somewhat cushioned ride offered by web2py.

Meanwhile, spending some time reviewing the opentree taxonomy against the world spider catalog.  So far nothing horrible, mostly groups added since 2009 and some unstable family splits.

No big news for Arachtober

Spider in leaf refuge

Unidentified spider constructing rolled leaf refuge.

Things weren’t really cleaned up enough for a big release announcement in October, though almost every page has had an update in the last month.  In addition to narrative ‘drill-down’ pages, all the other arachb.org/arachb/ARACHB_xxxxxxx identifiers that correspond to something in the database now resolve to simple pages.  The narrative drill-down pages (such as this) generate a table of their component events.  They can generate both html and json, the later is (in many cases) simply the ‘tunneled’ json that Owlbuilder serializes into an rdfs:comment annotation (as discussed here). These still need some tweaking and there is still information for ordering the events in the narrative.  This, table pagination, and some javascript cleanup remain before I could consider a release.  Then it will be time for some more curation and thoughts of updating the curation tool (python 3.5 and pyramid and mysql -> postgres).

Owlbuilder and server backend code are gradually migrating to Java 1.8.  Meanwhile Oracle’s cloud-related noises are making me nervous about the future of Java on non-Oracle environments.   Events conspire to make me nervous about Java every couple of years, so maybe it’s not time to worry yet.

Whither arachadmin and web2py?

IMG_20150612_174911730

As some of you know, my ‘day job’ is developing for the Open Tree of Life project.  I started development on arachadmin about the time I was starting with opentree.  My first opentree assignment was to add functionality to Rick Ree’s phylografter tool.  Briefly, phylografter provides a database of trees and tools for displaying and grafting portions of trees together.  The important part for now is that the tool was implemented using web2py and used mysql for the database.  At that point, the arachnolingua database consisted of a spreadsheet containing details of around 530 publications.  I had given thought to moving the publication spreadsheet to a proper database, but I didn’t see any obvious lightweight java solutions for building a gui-application to manage this data.  In theory I could have used an existing tool via an api, but I wasn’t sure where to look.  Meanwhile, phylografter presented me with a working model of a web-based gui interface to a relational database.  NOSQL didn’t seem like the right place to begin with since I was starting with a (fairly) well defined table in the spreadsheet.  It also meant I would be learning python and web2py on both projects, which seemed (and still seems) like a good way to ease task switching.

Starting with the publication table, I built up arachadmin using web2py, learning python and some more bits of javascript along the way.  The web2py environment is a very good way to get started with python-based web applications.  It provides lots of pieces for working with databases and web page building and enforces a model-view-controller architecture.  Even now, I wouldn’t recommend against it for certain types of projects.  Likewise, the rest of the opentree software team used web2py in multiple applications that where assembled to support the opentree website and support tools.

As time went by, I started to notice some problems with the way web2py did things.  Likewise, other opentree people, whose python expertise I greatly respect, started raising issues with the quirks of the web2py environment relative to straight python.  It turns out that our issues don’t exactly overlap (maybe because I’ve wrestled more with web2py’s database access layer (DAL)).  The Aforementioned opentree investigator actually recoded a piece of the web2py system using pyramid rather than web2py, so this would be a way to go.  Of course flask and django  would also be contenders, though at the present I’m liking pyramid as the ideal middle – flask can be built up to larger sites, but I would be converting a medium-sized application to start, so including a few more ‘batteries’ may help the process along.  Django would have everything I would need but would also include things I don’t (it seems focused on CRM type applications, not multifaceted data editors), and might make sharing the database with the owlbuilder backend tool more difficult.

So here are my three issues that will eventually force arachadmin to a new platform.

1. Testing at the unit and larger scales are an integral part of modern software development.  Unit testing isn’t impossible with web2py, but there are serious limited stemming from the way the environment for code is set up.  Python provides two popular unit test frameworks – doctest and unittest.  The later is a pretty standard implementation of the standard xunit testing framework implemented in multiple languages.  Doctest is simplified system that allows inclusion of test cases in a python function’s (or method’s) documentation string.  When doctests are run against a module, each function’s doc string is scanned for pattterns that indicate test followed by expected result.  This works, especially for code in the functional style, which can be tested with minimal or no context.  Testing code requiring set up context (state) is more difficult since putting code in the doc string doesn’t provide much opportunity to setup state.  Unfortunately, web2py functions depend on a lot of state (which also makes it difficult to use tools like pylint).  Most of the state is implicit, so you don’t know what to specify as imports to make a particular variable make sense to testing or code analysis tools.  My impression, which I expect to verify one way or another, is that other web environments make things more explicit, allowing unit tests and code analysis to be implemented and run more smoothly.

2. The Database Access Layer (DAL) is perhaps the most important component of web2py.  It provides a uniform access to a range of SQL and a few NOSQL database managers.  You define your tables and relations in model files (which are python source but with a lot of environment predefined).  The system is fairly robust to table modifications (and would probably be even more so if I were using postgres instead of mysql), and there are usually ways to dig your self out if things break.  In particular, if the system complains, you can dump and reload your database, and if the table specification in your model file agree with the schema in your SQL file, things almost always clear up.  The DAL is well integrated with the web page formatting, providing predefined templates for tables and individual column values, which are fairly easily customizable either through constructor arguments or tweaking the provided javascript and CSS files, which include the (once?) popular Bootstrap.

So what’s the problem?  I rather like that the DAL is not a full object-relational mapping.  The DAL uses objects, but they fairly transparently represent rows, sets of rows, and queries, rather than hiding the SQL behind user objects.  This also makes escaping to raw SQL easy to integrate with pure DAL code, which is sometimes necessary for speed if you’re doing big database exports or table updates.  The real problem with the DAL is its approach to indexing.  Every table must include an integer field as its primary key.  This probably makes a lot of sense to beginners, but it is less natural when you consider real data that comes with its own identifiers, rather than auto incremented integers.   For example, every named individual in OWL is identified by an IRI, so the string version of the IRI is the natural identifier for OWL objects.  Likewise, many publications have doi’s and academic authors are signing up for ORCID ids.  These external identifiers are frequently the best choice for indexing (despite being strings, there isn’t substantial cost to making your primary key a string).  Things are slightly more complicated in the case of arachnolingua because several tables may have either external IRIs or lack an external IRI and need a unique one generated.  Since it is best to let the database handle uniqueness checking, I currently need to maintain all the generated identifiers in a separate table to enforce the cross-table uniqueness.  Regardless, and certainly for imported terms, there is no intrinsic need for an integer index.

In addition, the integer index appears in other tables, such as those supporting many-to-many mappings where they would be completely useless for the DAL’s need for them.  The DAL also makes it very easy to retrieve rows via the integer primary key, relative to the select statements required for any other field.  This makes it very easy to let unnecessary dependence on the integer key into your code.

Overall, this dependence on integer primary key columns may make sense for small systems and databases that aren’t accessed by multiple applications, but I expect that more experienced database analysts would cringe.  I don’t consider myself to be any sort of serious database analyst, but the problem here is becoming more obvious with time.

3. Python 3 is the future of python.  This is no longer particularly controversial.  The official end of life date for python 2.7 is 2020. Although there was an effort to move web2py to python 3 (along with other changed), the effort, called web3py, has apparently stopped or terminated completely.  This has consequences both in 2020 and now, as all continued development of new features is focused on python 3.  Only important bug fixes (e.g., security) are being back ported to python 2.7.  The other major python web frameworks already support python 3.  They furthermore seem to be inhabiting more of a shared software ecosystem where improvements and widgets are more easily shared.  Web2py seems to be a bit of an outlier, and although there is a sizeable community at this point, the decline of python 2 will likely lead to the decline of the web2py community.

If the web2py developer(s) decide to build a python 3 release, it would seem the sooner they start the better.  This is both because python development will be a moving target, but also because getting a debugged, reliable python 3 version will be a prerequisite for many web2py users to consider the upgrade to python3.  The sooner web2py is upgraded, the more time before 2020 web2py user will have.

Given these problems, it’s time to start making plans to move arachadmin to pyramid (unless I discover a compelling reason to go with a different platform).  Currently I don’t plan to release a packaged arachadmin prior to the conversion, since the installation process will surely change along the way.  If there is interest, I could put up an installation on a small AWS instance, connected to a copy of the database that people could look at and explore (if it gets trashed, I can always upload a fresh copy).

I’ve been thinking about this for six months.  I am happy to have this off my chest.  If you have read this far, thank you for reading through this.  Comments are most welcome.

Curation Update

As promised a few weeks back, I recently (more or less) completed some pretty substantial changes in the curation tool.  Specifically, I have changed the process of specifying ‘participants’ (actors, objects) associated with behavior events.  Participants typically include a spider or its anatomical parts and possibly another organism (mate, prey) or a substrate (silk, plant parts, a surface).  The old version simply allowed choice of a term for (spider) taxon and anatomy as the primary participant and a secondary substrate participant.  This allowed capturing simple events, but was completely inadequate for capturing social or predation interactions.  The old version did have the virtue of simplicity, and may return in some form of ‘wizard tool’ in the future.

The new version runs rather closer to the OWL that ultimately gets generated.  Claims consist of a ‘root’ element that captures the behavior term and a set of participants, each a tree structure (potentially, though most commonly a chain) of participant ‘elements’ that correspond to OWL classes, linked by arcs corresponding to OWL predicates (aka OBO relations).  The tree for a participant more or less corresponds to an OWL class or individual expression.  The interface still requires a great deal of polish, but I’ve added a simple d3 graph display that should make the structure of what’s being created a little clearer.

claim_page_bottom

The ‘table of associated participants’ lists each participant using pidgin English and indicates its level of participation: ‘active participant’ or (passive) ‘participant’.  The graph below the table shows the relations between the participant elements (squares) and the root with predicates indicated by the color of connecting lines (red = active participant, orange = participant, blue = part of).   The link table on the right will (when I’ve finished implementing it) show more details about the ‘active’ participant element.  This table, along with color changing when an element is selected, remain to be implemented, though the selection mechanism is there.  Like many d3 graphs, node placement is ‘force-driven’ and can be tweaked by dragging with your pointing device.  The graph display is largely a transformation and simplification of this molecule example.

The corresponding upper part of the page looks like this:

claim_page_top

This has been changed relatively little, except that drop selection fields for terms from taxon, anatomy, and environment terms have disappeared, they now appear during the element creation dialog that appears as an ajax form in the participant element section.  The text fields for publication taxon, anatomy and substrate remain.  These text metadata fields are associated with the participant as a whole, rather than the corresponding elements.

I’ve started writing some tests for the curation tool.  Adding Selenium to walk through the screens was relatively straightforward.  Unit tests have been rather more of a challenge, due, in part, to the eccentricities of the web2py framework.  But the issues I’ve encountered with web2py are large enough to deserve a post of their own.  Suffice it to say, there are some doctests in the code, but traditional unit tests remain in the future.

This is what I had been working on, as time permitted, over the second half of 2014.  I’ve now switched back to the owl generation side, which, as you may remember is a java tool that pulls from the database to build using the OWLAPI.  There is a lot of work to do there as well.  Since I switched I’ve started using the EclEmma test coverage module to try to improve the rather pathetic test harness I had for the owl generation.  At least unit testing has been fairly straightforward for now.