Whither arachadmin and web2py?


As some of you know, my ‘day job’ is developing for the Open Tree of Life project.  I started development on arachadmin about the time I was starting with opentree.  My first opentree assignment was to add functionality to Rick Ree’s phylografter tool.  Briefly, phylografter provides a database of trees and tools for displaying and grafting portions of trees together.  The important part for now is that the tool was implemented using web2py and used mysql for the database.  At that point, the arachnolingua database consisted of a spreadsheet containing details of around 530 publications.  I had given thought to moving the publication spreadsheet to a proper database, but I didn’t see any obvious lightweight java solutions for building a GUI application to manage this data.  In theory I could have used an existing tool via an API, but I wasn’t sure where to look.  Meanwhile, phylografter presented me with a working model of a web-based GUI for a relational database.  NOSQL didn’t seem like the right place to begin, since I was starting with a (fairly) well-defined table in the spreadsheet.  Following phylografter also meant I would be learning python and web2py on both projects, which seemed (and still seems) like a good way to ease task switching.

Starting with the publication table, I built up arachadmin using web2py, learning python and some more bits of javascript along the way.  The web2py environment is a very good way to get started with python-based web applications.  It provides lots of pieces for working with databases and web page building and enforces a model-view-controller architecture.  Even now, I wouldn’t recommend against it for certain types of projects.  Likewise, the rest of the opentree software team used web2py in multiple applications that were assembled to support the opentree website and support tools.

As time went by, I started to notice some problems with the way web2py did things.  Likewise, other opentree people, whose python expertise I greatly respect, started raising issues with the quirks of the web2py environment relative to straight python.  It turns out that our issues don’t exactly overlap (maybe because I’ve wrestled more with web2py’s database access layer (DAL)).  One of those opentree developers actually recoded a piece of the web2py-based system using pyramid, so that would be one way to go.  Of course flask and django would also be contenders, though at present I’m liking pyramid as the ideal middle: flask can be built up to larger sites, but I would be converting a medium-sized application to start, so including a few more ‘batteries’ may help the process along.  Django would have everything I would need but would also include things I don’t (it seems focused on CRM type applications, not multifaceted data editors), and might make sharing the database with the owlbuilder backend tool more difficult.

So here are my three issues that will eventually force arachadmin to a new platform.

1. Testing at the unit and larger scales is an integral part of modern software development.  Unit testing isn’t impossible with web2py, but there are serious limitations stemming from the way the environment for code is set up.  Python provides two popular unit test frameworks: doctest and unittest.  The latter is a pretty standard implementation of the xunit testing framework that exists in multiple languages.  Doctest is a simplified system that allows inclusion of test cases in a python function’s (or method’s) documentation string.  When doctests are run against a module, each function’s doc string is scanned for patterns that indicate a test followed by its expected result.  This works, especially for code in the functional style, which can be tested with minimal or no context.  Testing code that requires set-up context (state) is more difficult, since putting code in the doc string doesn’t provide much opportunity to set up state.  Unfortunately, web2py functions depend on a lot of state (which also makes it difficult to use tools like pylint).  Most of the state is implicit, so you don’t know what to specify as imports to make a particular variable make sense to testing or code analysis tools.  My impression, which I expect to verify one way or another, is that other web environments make things more explicit, allowing unit tests and code analysis to be implemented and run more smoothly.
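To make the doctest/unittest contrast concrete, here is a minimal sketch in plain python, outside of web2py.  The function `normalize_doi` and the class `PublicationStore` are hypothetical names invented for this illustration; they are not part of arachadmin.  The pure function is easy to cover in its doc string, while the stateful class gets its context built in a unittest `setUp` method.

```python
import doctest
import unittest


def normalize_doi(raw):
    """Strip whitespace and a resolver prefix from a DOI string.

    >>> normalize_doi(" https://doi.org/10.1000/xyz123 ")
    '10.1000/xyz123'
    >>> normalize_doi("10.1000/xyz123")
    '10.1000/xyz123'
    """
    doi = raw.strip()
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            return doi[len(prefix):]
    return doi


class PublicationStore:
    """Hypothetical stateful object: lookups only make sense after set-up."""

    def __init__(self):
        self._titles = {}

    def add(self, doi, title):
        self._titles[normalize_doi(doi)] = title

    def title_for(self, doi):
        return self._titles.get(normalize_doi(doi))


class PublicationStoreTest(unittest.TestCase):
    def setUp(self):
        # The state each test needs is built explicitly before it runs.
        self.store = PublicationStore()
        self.store.add("doi:10.1000/xyz123", "A made-up spider paper")

    def test_lookup_ignores_prefix(self):
        title = self.store.title_for("https://doi.org/10.1000/xyz123")
        self.assertEqual(title, "A made-up spider paper")

    def test_unknown_doi_gives_none(self):
        self.assertIsNone(self.store.title_for("10.9999/absent"))


def run_doctests():
    """Run the doc string examples above; return (failed, attempted)."""
    test = doctest.DocTestParser().get_doctest(
        normalize_doi.__doc__, {"normalize_doi": normalize_doi},
        "normalize_doi", None, 0)
    return tuple(doctest.DocTestRunner(verbose=False).run(test))


def run_unit_tests():
    """Run the unittest case above and return its TestResult."""
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(PublicationStoreTest)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Everything both tests depend on is imported or constructed in the file itself, which is the explicitness that a web2py controller, with its implicitly injected globals, does not offer.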

2. The Database Access Layer (DAL) is perhaps the most important component of web2py.  It provides uniform access to a range of SQL and a few NOSQL database managers.  You define your tables and relations in model files (which are python source but with a lot of environment predefined).  The system is fairly robust to table modifications (and would probably be even more so if I were using postgres instead of mysql), and there are usually ways to dig yourself out if things break.  In particular, if the system complains, you can dump and reload your database, and if the table specifications in your model file agree with the schema in your SQL file, things almost always clear up.  The DAL is well integrated with the web page formatting, providing predefined templates for tables and individual column values, which are fairly easily customizable either through constructor arguments or by tweaking the provided javascript and CSS files, which include the (once?) popular Bootstrap.

So what’s the problem?  I rather like that the DAL is not a full object-relational mapping.  The DAL uses objects, but they fairly transparently represent rows, sets of rows, and queries, rather than hiding the SQL behind user objects.  This also makes escaping to raw SQL easy to integrate with pure DAL code, which is sometimes necessary for speed if you’re doing big database exports or table updates.  The real problem with the DAL is its approach to indexing.  Every table must include an integer field as its primary key.  This probably makes a lot of sense to beginners, but it is less natural when you consider real data that comes with its own identifiers, rather than auto-incremented integers.  For example, every named individual in OWL is identified by an IRI, so the string version of the IRI is the natural identifier for OWL objects.  Likewise, many publications have DOIs and academic authors are signing up for ORCID ids.  These external identifiers are frequently the best choice for indexing (even though they are strings, there isn’t a substantial cost to making your primary key a string).  Things are slightly more complicated in the case of arachnolingua because rows in several tables may either have an external IRI or lack one and need a unique one generated.  Since it is best to let the database handle uniqueness checking, I currently need to maintain all the generated identifiers in a separate table to enforce the cross-table uniqueness.  Regardless, and certainly for imported terms, there is no intrinsic need for an integer index.
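As a rough sketch of what string primary keys plus a shared identifier table can look like in plain SQL, here is an example using python’s built-in sqlite3 module rather than web2py’s DAL or mysql.  The table and column names (`identifier`, `publication`, `taxon`, `label`) and the DOI are made up for the illustration and are not arachadmin’s actual schema.

```python
import sqlite3

# Hypothetical schema: "identifier" holds every IRI in use, so its primary
# key gives cross-table uniqueness; the other tables are keyed by IRI text.
SCHEMA = """
CREATE TABLE identifier (iri TEXT PRIMARY KEY);
CREATE TABLE publication (
    iri   TEXT PRIMARY KEY REFERENCES identifier(iri),
    label TEXT NOT NULL
);
CREATE TABLE taxon (
    iri   TEXT PRIMARY KEY REFERENCES identifier(iri),
    label TEXT NOT NULL
);
"""


def open_db():
    """Create an in-memory database with foreign key checks switched on."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def add_row(conn, table, iri, label):
    """Claim the IRI in the shared table, then insert the row itself.

    Both inserts run in one transaction, so a clash on the IRI raises
    sqlite3.IntegrityError and leaves neither table changed.
    """
    if table not in ("publication", "taxon"):
        raise ValueError("unknown table: " + table)
    with conn:
        conn.execute("INSERT INTO identifier (iri) VALUES (?)", (iri,))
        conn.execute(
            "INSERT INTO " + table + " (iri, label) VALUES (?, ?)",
            (iri, label))


def label_for(conn, table, iri):
    """Look a row up directly by its string key; return None if absent."""
    if table not in ("publication", "taxon"):
        raise ValueError("unknown table: " + table)
    row = conn.execute(
        "SELECT label FROM " + table + " WHERE iri = ?", (iri,)).fetchone()
    return row[0] if row else None
```

After `add_row(conn, "publication", "https://doi.org/10.1000/xyz123", "A made-up paper")`, a second `add_row` that tries to reuse the same IRI for a taxon is rejected by the database itself, with no integer key involved anywhere.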

In addition, the integer index appears in other tables, such as those supporting many-to-many mappings, where it would be completely useless were it not for the DAL’s need for it.  The DAL also makes it very easy to retrieve rows via the integer primary key, relative to the select statements required for any other field.  This makes it very easy to let unnecessary dependence on the integer key creep into your code.
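For comparison, a many-to-many mapping table in plain SQL can be keyed on the pair of foreign keys alone.  The following is only a sketch with sqlite3 and invented names (`author`, `publication`, `authorship`); the ORCID and DOI values used with it are placeholders, not real identifiers.

```python
import sqlite3

# Hypothetical schema: the mapping table's primary key is the (orcid, doi)
# pair, so no surrogate integer column is needed.
SCHEMA = """
CREATE TABLE author (orcid TEXT PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE publication (doi TEXT PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE authorship (
    orcid TEXT NOT NULL REFERENCES author(orcid),
    doi   TEXT NOT NULL REFERENCES publication(doi),
    PRIMARY KEY (orcid, doi)
);
"""


def open_db():
    """Create an in-memory database with foreign key checks switched on."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def link(conn, orcid, doi):
    """Record that an author wrote a publication; duplicates are rejected."""
    with conn:
        conn.execute(
            "INSERT INTO authorship (orcid, doi) VALUES (?, ?)", (orcid, doi))


def titles_by(conn, orcid):
    """Return the titles linked to one author, in alphabetical order."""
    rows = conn.execute(
        "SELECT p.title FROM authorship a "
        "JOIN publication p ON p.doi = a.doi "
        "WHERE a.orcid = ? ORDER BY p.title", (orcid,))
    return [row[0] for row in rows]
```

The composite key also does the uniqueness checking: linking the same author and publication twice raises `sqlite3.IntegrityError`.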

Overall, this dependence on integer primary key columns may make sense for small systems and databases that aren’t accessed by multiple applications, but I expect that more experienced database analysts would cringe.  I don’t consider myself to be any sort of serious database analyst, but the problem here is becoming more obvious with time.

3. Python 3 is the future of python.  This is no longer particularly controversial.  The official end of life date for python 2.7 is 2020.  Although there was an effort to move web2py to python 3 (along with other changes), the effort, called web3py, has apparently stalled or been abandoned completely.  This has consequences both in 2020 and now, as all continued development of new features is focused on python 3.  Only important bug fixes (e.g., security) are being back ported to python 2.7.  The other major python web frameworks already support python 3.  They furthermore seem to be inhabiting more of a shared software ecosystem where improvements and widgets are more easily shared.  Web2py seems to be a bit of an outlier, and although there is a sizeable community at this point, the decline of python 2 will likely lead to the decline of the web2py community.

If the web2py developer(s) decide to build a python 3 release, it would seem the sooner they start the better.  This is both because python development will be a moving target and because getting a debugged, reliable python 3 version will be a prerequisite for many web2py users to consider the upgrade to python 3.  The sooner web2py is upgraded, the more time before 2020 web2py users will have.

Given these problems, it’s time to start making plans to move arachadmin to pyramid (unless I discover a compelling reason to go with a different platform).  Currently I don’t plan to release a packaged arachadmin prior to the conversion, since the installation process will surely change along the way.  If there is interest, I could put up an installation on a small AWS instance, connected to a copy of the database that people could look at and explore (if it gets trashed, I can always upload a fresh copy).

I’ve been thinking about this for six months.  I am happy to have this off my chest.  If you have read this far, thank you for reading through this.  Comments are most welcome.


A few thoughts on narrative_objects

Phidippus sp. from the AAS 2015 after-meeting trip.

Arachnolingua uses the notion of a narrative as a container of claims relating to a set of ordered events and their associated individual organisms and substrates.  In simpler words, a set of actions that make a story because they have an order.  It is important to be able to talk about narratives in the natural history/ethogram context because narrative descriptions are common.  They are different than statements about types (e.g., colors of body parts of particular species) because they are about individuals and the events that those individuals participate in.
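To make the idea of ordered events and their participants a little more concrete, here is a small python sketch.  It is only an illustration: arachnolingua represents narratives in OWL, and the class names (`Event`, `Narrative`), fields, and sample values used with it are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Event:
    """One action in a narrative, with its position and participants."""
    order: int
    behavior: str
    participants: Tuple[str, ...]


@dataclass
class Narrative:
    """A container of events whose order is what makes them a story."""
    label: str
    events: List[Event] = field(default_factory=list)

    def add_event(self, behavior, participants):
        """Append an event, numbering it by its position in the story."""
        event = Event(len(self.events) + 1, behavior, tuple(participants))
        self.events.append(event)
        return event

    def individuals(self):
        """Return each participant once, in order of first appearance."""
        seen = []
        for event in self.events:
            for participant in event.participants:
                if participant not in seen:
                    seen.append(participant)
        return seen
```

A courtship description, for example, might become a `Narrative` with one event in which a particular male waves his legs and a second in which a particular female approaches; the same two individuals then recur across events, which is what separates this from a statement about a type.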

There is surprisingly little in the large set of OBO vocabularies that captures this sort of data.  There are classes for ordered series of actions when considered as plans (e.g., plan specification and planned process).  The first of these is a directive information content entity that specifies a plan and has subclasses such as algorithm, software, programming language and study design.  As a directive, it might denote or describe a class of processes that would realize the outcome of the plan, but that’s different from describing what actually happened in particular circumstances at a particular time with particular actors.  The second is a process that results from the execution of a plan specification.  The second class is something that might be denoted by a narrative, for example a description of what happened when a plan failed.

There is a class ‘narrative_object’ in the information artifact ontology.  It is a subclass of information content entity and is defined as a set of propositions.  That’s well and good, but there is no notion of participants or, more importantly, that these propositions have any sort of (temporal) ordering.  However, I am inclined to go with this as it is at least the right sort of thing and will serve as an over-abstracted umbrella class for narrative individuals as they are defined in arachnolingua.  It would be better to subclass this to include ordering and the idea of individuals filling roles, but it will do for now.  I really don’t want to start introducing new classes for this sort of thing if I can possibly avoid it.  Ideally, Arachnolingua wouldn’t have to introduce any classes, but the process of finding OBO homes for needed classes, especially in situations where term brokerage support or even active curation is unavailable, will eventually force my hand on this.  But narrative_object will do for now.

Final note: If you look at the editor notes for this term, Larry Hunter thinks this term isn’t useful and should be replaced by textual entity and figure.  He’s half right – it doesn’t provide much more than textual entity, but the way to fix it is to reduce the granularity, not scrap it.  Textual entity may be a more proper superclass for arachnolingua claims than information content entity, but as such, it fails to distinguish claims at the class level from claims about individuals and their events.