What’s changed

Well, I got the website back up in under a week. I had to build from scratch, since I wasn't able to salvage anything from the AWS instance. Of course the data and code are all sitting on my laptop, so it might have just been a matter of reconfiguration and reloading. However, time marches on, and since I was installing from scratch, I took the opportunity to update the OS (Ubuntu 16.04), Java (1.8), Apache (2.4), and Tomcat (minor update), and to upgrade from Sesame 2.8 to rdf4j, its replacement. One thing that didn't get upgraded was CORS-filter: although it was available (and seems to still be maintained, having served me well), I discovered that Tomcat has equivalent functionality built in, just waiting to be configured.

Once I had everything configured, queries seemed somewhat faster, which I assume is thanks to rdf4j and possibly the Java upgrade.

A few days later, I took advantage of some posts I had found while figuring out how to reconnect Apache and Tomcat to set things up for https, so now all requests to arachb.org and arachnolingua should be redirected to an https connection. Switching over to https wasn't much more difficult than configuring Apache in the first place. I went with a commercial certificate provider, since I had paid for a certificate I never activated a few years ago. Hopefully the renewal will be easy.

On the inside, Apache and Tomcat are speaking over the Apache JServ Protocol (ajp13), which may help with performance as well.

Work on the curation tool is continuing, and I think I have found the sweet-spot mix of methods, url redirects, and jinja2 templates to get the claim editor to finally lie down. The story is all about switching between individual and class expression participants, but I'll save it for another time, as I do need to finish the post I'm writing on my other blog about the role of machine learning in the study of behavior.


Website down

The AWS server I host on died on Wednesday. It did seem to be a slow death, though it was completely unavailable within hours of Amazon notifying me. I tried to capture images from the EBS volume, but the image I created didn't boot. So I'm rebuilding from scratch. I have all the data stored locally, but I am taking the opportunity to update some of the software (Linux to Ubuntu 16.04, Apache to 2.4, Java to 8, Sesame to rdf4j 2). I'm sticking with Tomcat 7 on the backend, though I should probably do some experimenting with Tomcat 8, since it claims to be easier to manage. I will definitely make an image once everything is in place. There's a first time for everything, and this is my first catastrophic failure on AWS. Live and learn.

I’m working toward having things back up in the coming week.


Note to self and others

Arachnolingua focuses its OWL expressions on the claim (= assertion, statement) about an individual behavior event or class expression, and properties should start there and work outward. Thus: courtship_event –has_participant–> palp –part_of–> male –has_type–> Habronattus sp. There may be consequences to this decision (especially for class-level statements), but it is better to be consistent and to document the design decision here for now.
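
To make the pattern concrete, here is a minimal sketch in Python with rdflib. The OBO relation IRIs (RO_0000057 for has_participant, BFO_0000050 for part_of) are the standard ones, but the behavior and taxon classes here are placeholders, not actual Arachnolingua terms:

```python
from rdflib import Graph, Namespace, BNode, RDF

OBO = Namespace("http://purl.obolibrary.org/obo/")
ARACHB = Namespace("http://arachb.org/arachb/")

g = Graph()
event, palp, male = BNode(), BNode(), BNode()  # anonymous individuals

g.add((event, RDF.type, ARACHB.courtship_event))  # placeholder behavior class
g.add((event, OBO.RO_0000057, palp))              # has_participant
g.add((palp, OBO.BFO_0000050, male))              # part_of
g.add((male, RDF.type, ARACHB.Habronattus_sp))    # has_type -> placeholder taxon

print(g.serialize(format="turtle"))
```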

This should eventually make it into the ‘primer’ documents for the curation tool and the database schema as well.  I wonder if there are any tools in Protege for looking at graphs of individuals – maybe lego?

Last few weeks

Writing lots of unit tests and associated refactoring. Part of this was inspired by reading through Martin's (2008) Clean Code, which had been sitting on my shelf for a couple of years. The most useful thing I found was Martin's admission that even he writes big ugly functions on the first pass. Definitely a lot of cleanup in the new arachcurator editor. Also triggering some simplification in the database – I've removed a many-to-many mapping table (participant2claim), since the relation is really many-to-one. I think there are a couple of other tables that will suffer the same fate.
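
For the record, the simplification looks roughly like this in sqlalchemy (table and column names are illustrative, not the actual schema): the join table disappears in favor of a plain foreign key on the participant side.

```python
from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship

Base = declarative_base()

class Claim(Base):
    __tablename__ = 'claim'
    id = Column(Integer, primary_key=True)
    participants = relationship('Participant', back_populates='claim')

class Participant(Base):
    __tablename__ = 'participant'
    id = Column(Integer, primary_key=True)
    label = Column(String)
    # a direct foreign key replaces the old participant2claim mapping table
    claim_id = Column(Integer, ForeignKey('claim.id'))
    claim = relationship('Claim', back_populates='participants')
```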

I should get back to cleaning up my list of terms as well.


Exploring Alternatives

One of the nice parts about the switch to pyramid is that it supports a range of alternatives. For example, at the moment I am trying out forms in WTForms after converting the templates to jinja2. In some ways the arachcurator tool has become a hybrid of technologies from pylons and from flask. I guess this means that if pylons disappears tomorrow, I won't be starting from scratch. It was a gut-level instinct call that led me to prefer pyramid over flask in the first place.
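
For a sense of what the WTForms experiment looks like, here is a minimal sketch (the field names are hypothetical, not the actual arachcurator schema):

```python
from wtforms import Form, SelectField, StringField, validators

class ParticipantForm(Form):
    # hypothetical fields for switching between participant flavors
    participant_kind = SelectField('Participant kind',
                                   choices=[('individual', 'Individual'),
                                            ('class', 'Class expression')])
    label = StringField('Label', [validators.InputRequired()])
```

In a view, the form gets bound with something like ParticipantForm(request.POST) and is then rendered from a jinja2 template.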

Meanwhile, all the browsing code is built, tested, and now back in a state of flux as I shift forms technology. There are d3.js graphs for both individuals and their parts, as well as expanded display trees on the claim page. I have added users (mostly for change tracking) and three levels of authorization, with the hope of putting an arachcurator server up on AWS at some point (sharing more than code and demonstrating replicability, I hope). I even went so far as to add password hashing with passlib.
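
The passlib part is small; a minimal sketch (bcrypt is my choice for illustration here, not necessarily what arachcurator uses):

```python
from passlib.context import CryptContext

# passlib handles salting, hashing, and verification behind one interface
pwd_context = CryptContext(schemes=["bcrypt"], deprecated="auto")

stored = pwd_context.hash("correct horse battery staple")  # store the hash only
assert pwd_context.verify("correct horse battery staple", stored)
```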

Meanwhile, as discussed here, I've started building a T-box vocabulary for spider behavior. This will extend what can be said in annotations, in a way that the changes described here should only simplify.


Updating Sesame, starting the move to pyramid/postgresql

All technical stuff this week. I've got the Spider-Behavior server code connected to travis-ci. In the process of doing so, I ran into a problem with Sesame, which turned out to be related to the maven setup. It's working and updated to the most recent 2.8 release. This will allow some more changes, particularly some more formatting work on the narratives page.

While waiting for a response from the Sesame-users group, I started taking a serious look at moving arachadmin off web2py. It looks like I'll be able to move to a python3.5-pyramid-postgresql stack. I have a first pass at the new table structure working with sqlalchemy, and will soon start the process of coding the transfer from mysql. The experience is a lot closer to standard python development than the somewhat cushioned ride offered by web2py.
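
The new table definitions look roughly like this (a sketch with hypothetical columns and connection string, not the actual schema):

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Publication(Base):
    __tablename__ = 'publication'
    id = Column(Integer, primary_key=True)
    doi = Column(String, unique=True, nullable=True)  # external id when available
    title = Column(String, nullable=False)

engine = create_engine('postgresql://localhost/arachadmin')  # hypothetical URL
Session = sessionmaker(bind=engine)
```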

Meanwhile, spending some time reviewing the opentree taxonomy against the World Spider Catalog. So far nothing horrible – mostly groups added since 2009 and some unstable family splits.

No big news for Arachtober

Unidentified spider constructing rolled leaf refuge.

Things weren't really cleaned up enough for a big release announcement in October, though almost every page has had an update in the last month. In addition to narrative 'drill-down' pages, all the other arachb.org/arachb/ARACHB_xxxxxxx identifiers that correspond to something in the database now resolve to simple pages. The narrative drill-down pages (such as this) generate a table of their component events. They can generate both html and json; the latter is (in many cases) simply the 'tunneled' json that Owlbuilder serializes into an rdfs:comment annotation (as discussed here). These still need some tweaking, and there is still the matter of ordering the events within a narrative. This, table pagination, and some javascript cleanup remain before I could consider a release. Then it will be time for some more curation and thoughts of updating the curation tool (python 3.5, pyramid, and mysql -> postgres).
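
The 'tunneling' itself is trivial to consume; a sketch of the idea in Python with rdflib (the actual server code is Java, and the function here is illustrative):

```python
import json
from rdflib import RDFS, URIRef

def tunneled_json(graph, narrative_iri):
    """Return the JSON that OwlBuilder serialized into an rdfs:comment
    on the given individual (illustrative, not the real server code)."""
    for comment in graph.objects(URIRef(narrative_iri), RDFS.comment):
        return json.loads(str(comment))  # the comment string is already JSON
    return None
```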

Owlbuilder and server backend code are gradually migrating to Java 1.8.  Meanwhile Oracle’s cloud-related noises are making me nervous about the future of Java on non-Oracle environments.   Events conspire to make me nervous about Java every couple of years, so maybe it’s not time to worry yet.

Why Arachnolingua will not use dc:terms to describe publications

Along with all the buzz and excitement (and overloaded servers) that accompanied the release of the OpenTree PNAS paper, there was a question about why OpenTree was not publishing RDF. Jonathan Rees ably answered the question here (I refuse to quibble about the implication that I don't know RDF), but the gist was that it could have been done; however, RDF is easy to do badly and hard to do right. This is very true, and it led me to reconsider, not whether to generate OWL, but how well I was living up to best practices. Arachnolingua is not really an ontology and certainly doesn't aspire to OBO Foundry status, but many of the practices they recommend are worth following. All the code and data are available, standard vocabularies are used whenever possible, and I will open this up to wider collaboration once things have settled down.

About the same time that the OpenTree paper came out, I was considering the implementation of landing pages for publications, as well as for terms in the ARACHB namespace. I am implementing a service that will provide a page for all terms of the form http://arachb.org/arachb/ARACHB_XXXXXXX. These include generated identifiers for claims and individuals (more about that in a later post) as well as taxa without an NCBI id and publications without a doi. The landing pages will be minimal to start and will be available as json as well as html. There will be landing pages for publications with a doi as well, simply because there are things that Arachnolingua has to say about a publication apart from what is on the publisher's page that a doi generally resolves to (e.g., number of curated claims, individuals, etc.).
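
As a sketch of the shape such a service might take (the framework, route name, and lookup function are all hypothetical here, not the actual implementation), a single resource that can answer in html or json:

```python
from pyramid.view import view_config

def lookup_term(term_id):
    # placeholder for the real database lookup
    return {'label': 'example label', 'kind': 'claim'}

@view_config(route_name='arachb_term', accept='application/json', renderer='json')
@view_config(route_name='arachb_term', accept='text/html',
             renderer='templates/term.jinja2')
def term_landing_page(request):
    term_id = request.matchdict['term_id']  # e.g. 'ARACHB_0000123'
    record = lookup_term(term_id)
    return {'id': term_id, 'label': record['label'], 'kind': record['kind']}
```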

It is reasonable to expect that a landing page for a publication would include the standard bibliographic information, especially for publications that lack a stable identifier (doi). The information is available in the arachadmin database. I originally thought to just generate pages, then to generate json files for each identifier along with an html template, before realizing that there was no compelling reason not to attach this information to the individual representing each publication.

Of course at this point, it would seem obvious to use the Dublin Core vocabulary. But my experience on the Dryad project, as well as a discussion with Jonathan about two years ago, made me suspicious. I was already using the OBO IAO (Information Artifact Ontology), so I started looking at IAO as an alternative to Dublin Core (DC). Not a lot has been written about this, but Barry Smith has had something to say in this powerpoint. Smith's presentation makes it clear that while IAO is an ontology of information artifacts, DC is a vocabulary for tagging artifacts. But looking at the set of properties provided by DC, it is less a vocabulary about how to find an artifact and more about rights management. It looks like the DC vocabulary is used in places where it can be applied (e.g., dc:title, dc:contributor, dc:creator), but for location it merely provides a place to put a link, not a bibliographic reference. This matches my experience with both Dryad and Opentree: in the former we used DC terms where possible and other vocabularies (e.g., PRISM) for the remaining properties; in Opentree the DC vocabulary is not used at all, in favor of leaving the publication citation as an unparsed string, breaking out only the publication year, and using the ot namespace defined for the jsonized NeXML in which studies are stored.

In my searches, I found one project that, instead of using DC terms in the rdf, transformed all the DC properties into rdfs:comment strings. I plan to implement something similar, but more focused. The bibliographic information will simply be turned into a json string that will appear as an rdfs:comment on the publication individual. This avoids needing to import the DC vocabulary or define a namespace for what are ultimately annotation strings. The json will make page generation easy (the json page will simply be the untransformed comment string), and its meaning will be clear to anyone inspecting it.
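
A minimal sketch of the serialization side (the field names are hypothetical):

```python
import json

def bibliographic_comment(pub):
    """Turn bibliographic fields into the JSON string that becomes an
    rdfs:comment on the publication individual."""
    return json.dumps({
        'title': pub.title,
        'authors': pub.authors,    # e.g. a list of name strings
        'year': pub.year,
        'citation': pub.citation,  # the full citation as one unparsed string
    })
```

The json page for a publication is then literally this string, untransformed.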

Announcement: There is an update of Arachnolingua coming.  Preview soon at http://dev.arachb.org/

Whither arachadmin and web2py?


As some of you know, my 'day job' is developing for the Open Tree of Life project. I started development on arachadmin about the time I was starting with opentree. My first opentree assignment was to add functionality to Rick Ree's phylografter tool. Briefly, phylografter provides a database of trees and tools for displaying and grafting portions of trees together. The important part for now is that the tool was implemented using web2py and used mysql for the database. At that point, the arachnolingua database consisted of a spreadsheet containing details of around 530 publications. I had given thought to moving the publication spreadsheet to a proper database, but I didn't see any obvious lightweight java solutions for building a GUI application to manage this data. In theory I could have used an existing tool via an api, but I wasn't sure where to look. Meanwhile, phylografter presented me with a working model of a web-based GUI interface to a relational database. NoSQL didn't seem like the right place to begin, since I was starting with a (fairly) well defined table in the spreadsheet. It also meant I would be learning python and web2py on both projects, which seemed (and still seems) like a good way to ease task switching.

Starting with the publication table, I built up arachadmin using web2py, learning python and some more bits of javascript along the way. The web2py environment is a very good way to get started with python-based web applications. It provides lots of pieces for working with databases and web page building, and it enforces a model-view-controller architecture. Even now, I wouldn't recommend against it for certain types of projects. Likewise, the rest of the opentree software team used web2py in multiple applications that were assembled to support the opentree website and support tools.

As time went by, I started to notice some problems with the way web2py did things. Likewise, other opentree people, whose python expertise I greatly respect, started raising issues with the quirks of the web2py environment relative to straight python. It turns out that our issues don't exactly overlap (maybe because I've wrestled more with web2py's database access layer (DAL)). The aforementioned opentree investigator actually recoded a piece of the web2py system using pyramid rather than web2py, so that would be one way to go. Of course flask and django would also be contenders, though at present I'm liking pyramid as the ideal middle: flask can be built up to larger sites, but I would be converting a medium-sized application to start, so including a few more 'batteries' may help the process along. Django would have everything I would need, but would also include things I don't (it seems focused on CRM-type applications, not multifaceted data editors), and might make sharing the database with the owlbuilder backend tool more difficult.

So here are my three issues that will eventually force arachadmin to a new platform.

1. Testing at the unit and larger scales is an integral part of modern software development. Unit testing isn't impossible with web2py, but there are serious limitations stemming from the way the environment for code is set up. Python provides two popular unit test frameworks – doctest and unittest. The latter is a pretty standard implementation of the xUnit testing framework found in multiple languages. Doctest is a simplified system that allows inclusion of test cases in a python function's (or method's) documentation string. When doctests are run against a module, each function's doc string is scanned for patterns that indicate a test followed by its expected result. This works well, especially for code in the functional style, which can be tested with minimal or no context. Testing code requiring set-up context (state) is more difficult, since putting code in the doc string doesn't provide much opportunity to set up state. Unfortunately, web2py functions depend on a lot of state (which also makes it difficult to use tools like pylint). Most of the state is implicit, so you don't know what to specify as imports to make a particular variable make sense to testing or code analysis tools. My impression, which I expect to verify one way or another, is that other web environments make things more explicit, allowing unit tests and code analysis to be implemented and run more smoothly.
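
For contrast, here is what doctest looks like on a small, state-free function; this style works precisely because no web2py environment needs to be conjured up first (the function is just an example, not arachadmin code):

```python
def normalize_taxon(name):
    """Collapse runs of whitespace in a taxon name.

    >>> normalize_taxon('Habronattus   sp.')
    'Habronattus sp.'
    """
    return ' '.join(name.split())

if __name__ == '__main__':
    import doctest
    doctest.testmod()  # scans doc strings in this module and runs the tests
```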

2. The Database Access Layer (DAL) is perhaps the most important component of web2py. It provides uniform access to a range of SQL and a few NoSQL database managers. You define your tables and relations in model files (which are python source, but with a lot of environment predefined). The system is fairly robust to table modifications (and would probably be even more so if I were using postgres instead of mysql), and there are usually ways to dig yourself out if things break. In particular, if the system complains, you can dump and reload your database, and if the table specification in your model file agrees with the schema in your SQL file, things almost always clear up. The DAL is well integrated with the web page formatting, providing predefined templates for tables and individual column values, which are fairly easily customizable either through constructor arguments or by tweaking the provided javascript and CSS files, which include the (once?) popular Bootstrap.
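
A model file is concise; a sketch along these lines (with hypothetical fields and credentials), where DAL and Field are among the names the environment predefines:

```python
# web2py model file: DAL and Field are injected by the framework environment,
# which is convenient here and part of the testing pain described above
db = DAL('mysql://user:password@localhost/arachadmin')  # hypothetical credentials

db.define_table('publication',
                Field('doi', 'string'),
                Field('title', 'string'),
                Field('publication_year', 'integer'))
```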

So what's the problem? I rather like that the DAL is not a full object-relational mapping. The DAL uses objects, but they fairly transparently represent rows, sets of rows, and queries, rather than hiding the SQL behind user objects. This also makes escaping to raw SQL easy to integrate with pure DAL code, which is sometimes necessary for speed if you're doing big database exports or table updates. The real problem with the DAL is its approach to indexing. Every table must include an integer field as its primary key. This probably makes a lot of sense to beginners, but it is less natural when you consider real data that comes with its own identifiers, rather than auto-incremented integers. For example, every named individual in OWL is identified by an IRI, so the string version of the IRI is the natural identifier for OWL objects. Likewise, many publications have dois, and academic authors are signing up for ORCID ids. These external identifiers are frequently the best choice for indexing (there isn't a substantial cost to making your primary key a string). Things are slightly more complicated in the case of arachnolingua, because several tables hold rows that either have an external IRI or lack one and need a unique IRI generated. Since it is best to let the database handle uniqueness checking, I currently need to maintain all the generated identifiers in a separate table to enforce the cross-table uniqueness. Regardless, and certainly for imported terms, there is no intrinsic need for an integer index.

In addition, the integer index appears in other tables, such as those supporting many-to-many mappings, where it would be completely useless were it not for the DAL's own need for it. The DAL also makes it very easy to retrieve rows via the integer primary key, relative to the select statements required for any other field. This makes it very easy to let unnecessary dependence on the integer key creep into your code.
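
The asymmetry is easy to see, continuing the model sketch above (table and field names hypothetical):

```python
# one-liner lookup by the integer primary key
row = db.publication[42]

# versus a select for any other, often more natural, identifier
row = db(db.publication.doi == '10.1636/example-doi').select().first()
```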

Overall, this dependence on integer primary key columns may make sense for small systems and databases that aren’t accessed by multiple applications, but I expect that more experienced database analysts would cringe.  I don’t consider myself to be any sort of serious database analyst, but the problem here is becoming more obvious with time.

3. Python 3 is the future of python. This is no longer particularly controversial. The official end-of-life date for python 2.7 is 2020. Although there was an effort to move web2py to python 3 (along with other changes), the effort, called web3py, has apparently stalled or stopped completely. This has consequences both in 2020 and now, as all development of new features is focused on python 3. Only important bug fixes (e.g., security) are being backported to python 2.7. The other major python web frameworks already support python 3. They furthermore seem to inhabit more of a shared software ecosystem, where improvements and widgets are more easily shared. Web2py seems to be a bit of an outlier, and although it has a sizeable community at this point, the decline of python 2 will likely lead to the decline of the web2py community.

If the web2py developer(s) decide to build a python 3 release, it would seem the sooner they start the better. This is both because python development will be a moving target, and because a debugged, reliable python 3 version will be a prerequisite for many web2py users to consider the upgrade to python 3. The sooner web2py is upgraded, the more time before 2020 web2py users will have.

Given these problems, it’s time to start making plans to move arachadmin to pyramid (unless I discover a compelling reason to go with a different platform).  Currently I don’t plan to release a packaged arachadmin prior to the conversion, since the installation process will surely change along the way.  If there is interest, I could put up an installation on a small AWS instance, connected to a copy of the database that people could look at and explore (if it gets trashed, I can always upload a fresh copy).

I've been thinking about this for six months, and I am happy to have it off my chest. If you have read this far, thank you. Comments are most welcome.

ABS 2015

Still traveling – I left for Alaska on June 3, attended the Animal Behavior Society 2015 meeting 10-14 June, and am stopping over in California on my way to the American Arachnological Society meeting starting on the 19th. As usual, Schizocosa was somewhat overrepresented (not just the Uetz and Hebets labs), but widows and (African) tarantulas were also represented. I also gave a talk, which reviewed the poster results from last year and provided (I hope) motivation for the implementation course I have taken for individuals and narratives. You can find it here.

Meanwhile, lots of work on OwlBuilder, but things aren't ready to move on to an update of the web presence. I do have a new AWS server in reserve for this, however.