What’s changed

Well, I got the website back up in under a week. I had to build from scratch since I wasn’t able to salvage anything from the AWS instance. Of course the data and code are all sitting on my laptop, so it might have been just configuration and reloading. However, time marches on, and since I was installing from scratch, I took the opportunity to update the OS (Ubuntu 16.04), Java (1.8), Apache (2.4), and Tomcat (minor update), and to upgrade from Sesame 2.8 to rdf4j, its replacement. One thing that didn’t get upgraded was CORS-filter. Although a newer version was available, I discovered that Tomcat had equivalent functionality built in, just waiting to be configured. CORS-filter has served me well and it seems to still be maintained.

Now that everything is configured, queries seem somewhat faster, which I assume is thanks to rdf4j and possibly the Java upgrade.

A few days later, I took advantage of some posts I had found while figuring out how to reconnect Apache and Tomcat to set things up for https, so now all transactions to arachb.org and arachnolingua should be redirected to an https connection. Switching over to https wasn’t much more difficult than configuring Apache in the first place. I went with a commercial certificate provider since I had paid for a certificate I never activated a few years ago. Hopefully the renewal will be easy.

On the inside, Apache and Tomcat are speaking over the JServ protocol (ajp13) which may help with performance as well.

Work on the curation tool is continuing, and I think I have found the sweet spot mix of methods, url redirects, and jinja2 templates to get the claim editor to finally lay down. The story is all about switching between individual and class expression participants, but I’ll save it for another time, as I do need to finish the post I’m writing on my other blog on the role of machine learning in the study of behavior.

 


Website down

The AWS server I host on died on Wednesday. It did not seem to be a slow death: it was completely unavailable within hours of Amazon notifying me. I tried to capture images from the EBS volume but the image I created didn’t boot. So I’m rebuilding from scratch. I have all the data stored locally, but I am taking the opportunity to update some of the software (Linux to Ubuntu 16.04, Apache to 2.4, Java to 8, Sesame to rdf4j 2). I’m sticking with Tomcat7 on the backend, though I should probably do some experimenting with Tomcat8 since it claims to be easier to manage. I will definitely make an image once everything is in place. There’s a first time for everything and this is my first catastrophic failure on AWS. Live and learn.

I’m working toward having things back up in the coming week.

 

 

Updating Sesame, starting the move to pyramid/postgresql

All technical stuff this week. I’ve got the Spider-Behavior server code connected to travis-ci. In the process of doing so, I ran into a problem with Sesame, which turned out to be related to the maven setup. It’s working and updated to the most recent 2.8 release. This will allow some more changes, particularly some more formatting work on the narratives page.

While waiting for a response from the Sesame-users group, I started taking a serious look at moving arachadmin from web2py.  It looks like I’ll be able to move to a python3.5-pyramid-postgresql stack.  I have a first pass at the new table structure working with sqlalchemy, and will soon start the process of coding the transfer from mysql.  The experience is a lot closer to standard python development, rather than the somewhat cushioned ride offered by web2py.

Meanwhile, I have been spending some time reviewing the opentree taxonomy against the World Spider Catalog. So far nothing horrible, mostly groups added since 2009 and some unstable family splits.

No big news for Arachtober

Spider in leaf refuge

Unidentified spider constructing rolled leaf refuge.

Things weren’t really cleaned up enough for a big release announcement in October, though almost every page has had an update in the last month. In addition to narrative ‘drill-down’ pages, all the other arachb.org/arachb/ARACHB_xxxxxxx identifiers that correspond to something in the database now resolve to simple pages. The narrative drill-down pages (such as this) generate a table of their component events. They can generate both html and json; the latter is (in many cases) simply the ‘tunneled’ json that Owlbuilder serializes into an rdfs:comment annotation (as discussed here). These still need some tweaking and there is still no information for ordering the events in the narrative. This, table pagination, and some javascript cleanup remain before I could consider a release. Then it will be time for some more curation and thoughts of updating the curation tool (python 3.5 and pyramid and mysql -> postgres).

Owlbuilder and server backend code are gradually migrating to Java 1.8.  Meanwhile Oracle’s cloud-related noises are making me nervous about the future of Java on non-Oracle environments.   Events conspire to make me nervous about Java every couple of years, so maybe it’s not time to worry yet.

Why Arachnolingua will not use dc:terms to describe publications

Along with all the buzz and excitement (and overloaded servers) that accompanied the release of the OpenTree PNAS paper, there was a question about why OpenTree was not publishing RDF. Jonathan Rees ably answered the question here (I refuse to quibble about the implication that I don’t know RDF), and the gist was that it could have been done, but RDF is easy to do badly and hard to do right. This is very true, and led me to reconsider, not whether to generate OWL, but how well I was living up to best practices. Arachnolingua is not really an ontology and certainly doesn’t aspire to OBO foundry status, but many of the practices they recommend are worth following. All the code and data are available, standard vocabularies are used whenever possible, and I will open this up to wider collaboration once things have settled down.

About the same time that the OpenTree paper came out, I was considering the implementation of landing pages for publications, as well as terms in the ARACHB namespace. I am implementing a service that will provide a page for all terms of the form http://arachb.org/ARACHB/ARACHB_XXXXXXX. These include generated identifiers for claims and individuals (more about that in a later post) as well as taxa without an NCBI id and publications without a doi. The landing pages will be minimal to start and will be available as json as well as html. There will be landing pages for publications with a doi as well, simply because there are things that Arachnolingua has to say about a publication apart from what is on the publisher’s page that a doi generally resolves to (e.g., number of curated claims, individuals, etc.).

It is reasonable to expect that a landing page for publications would include the standard bibliographic information, especially for publications that lack a stable identifier (doi).  The information is available in the arachadmin database.  I originally thought to just generate pages, later to generate json files for each identifier with an html template, then realized that there was no compelling reason not to add this to the individual representing each publication.

Of course at this point, it would seem obvious to use the Dublin Core vocabulary. But my experience on the Dryad project, as well as a discussion with Jonathan about two years ago, made me suspicious. I was already using the OBO IAO (information artifact ontology) so I started looking at IAO as an alternative to Dublin Core (DC). Not a lot has been written about this, but Barry Smith has had something to say in this powerpoint. Smith’s presentation makes it clear that while IAO is an ontology of information artifacts, DC is a vocabulary for tagging artifacts. But looking at the set of properties provided by DC, it is less a vocabulary about how to find an artifact and more about rights management. It looks like the DC vocabulary is used in places where it can be applied (e.g., dc:title, dc:contributor, dc:creator) but for location, it merely provides a place to provide a link, not a bibliographic reference. This matches my experience with both Dryad and Opentree – in the former we used DC terms where possible and other vocabularies (e.g., prism) for other properties, and in Opentree the DC vocabulary is not used at all, in favor of leaving the publication citation as an unparsed string, breaking out only the publication year, and using the ot namespace defined for the jsonized NeXML that studies are stored in.

In my searches, I found one project that, instead of using dc terms in the rdf, transformed all the dc properties into rdfs:comment strings. I plan to implement something similar, but more focused. The bibliographic information will simply be turned into a json string that will appear as an rdfs:comment on the publication individual. This avoids needing to import the DC vocabulary or define a namespace for what are ultimately annotation strings. The json will make page generation easy (the json page will simply be the untransformed comment string) and its meaning will be clear to anyone inspecting it.
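To make the idea concrete, here is a minimal sketch in Python of packing bibliographic fields into a json string and attaching it as an rdfs:comment. The field names, the example identifier, and the choice to emit a Turtle statement are all assumptions for illustration; the real work happens in Owlbuilder and may look quite different.

```python
import json

RDFS_COMMENT = "http://www.w3.org/2000/01/rdf-schema#comment"


def bibliographic_comment(record):
    """Serialize a bibliographic record as a key-sorted JSON string."""
    return json.dumps(record, sort_keys=True, ensure_ascii=False)


def turtle_literal(text):
    """Escape a string for use as a double-quoted Turtle literal."""
    escaped = (text.replace("\\", "\\\\")
                   .replace('"', '\\"')
                   .replace("\n", "\\n")
                   .replace("\r", "\\r"))
    return '"' + escaped + '"'


def comment_triple(publication_iri, record):
    """Build one Turtle statement attaching the JSON comment to a publication."""
    literal = turtle_literal(bibliographic_comment(record))
    return "<{}> <{}> {} .".format(publication_iri, RDFS_COMMENT, literal)


# Hypothetical record and identifier, for illustration only.
record = {"authors": ["A. Author", "B. Author"], "year": 1998,
          "title": "An example title", "journal": "An Example Journal"}
triple = comment_triple("http://arachb.org/arachb/ARACHB_0000001", record)

# The json landing page is just the comment string handed back unchanged.
json_page = bibliographic_comment(record)
```

Sorting the keys keeps the comment string stable between builds, which should make diffs of the generated OWL easier to read.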

Announcement: There is an update of Arachnolingua coming.  Preview soon at http://dev.arachb.org/

Lots of little things

The pace of Arachnolingua development has picked up a bit. After adding the first set of annotations for a non-NCBI taxon (Leucauge mariana in a courtship description in Eberhard and Huber (1998) – a rich source of annotatable material), the need to support these curator-added taxa more completely became obvious. On the curation side, this meant filling in the details for such taxa – the hook was already there to make a term, but the term was not identified as a taxonomic name, nor was its source authority recorded (in this case the World Spider Catalog). That would suffice to make it available as a taxonomy term, but I also needed to add a mechanism to add one or more levels of parentage to connect the term to something already in NCBI. Fortunately in the case of L. mariana, the parent genus was in NCBI so chaining wasn’t required. Actually linking via parents was not strictly necessary for the curation process – it just needs to capture the parentage to pass to the OWL generation tool. Before I finished those changes, the OWL generator created a number of knowledge base files that attached L. mariana to the root OWL term. That work is now finished, and L. mariana and a small subset of the behavior patterns documented by Eberhard and Huber are now annotated in Arachnolingua.
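The parentage chaining can be sketched roughly as follows. This is only an illustration in Python: the function name, the dictionary of curator-entered parents, and the set standing in for "names NCBI knows about" are all hypothetical, not the arachadmin schema.

```python
def chain_to_ncbi(name, parent_of, in_ncbi):
    """Return [name, parent, ...] ending at the first ancestor found in NCBI."""
    chain = [name]
    seen = {name}
    current = name
    while current not in in_ncbi:
        parent = parent_of.get(current)
        if parent is None:
            raise ValueError("no NCBI ancestor recorded for " + name)
        if parent in seen:
            raise ValueError("cycle in parentage at " + parent)
        chain.append(parent)
        seen.add(parent)
        current = parent
    return chain


# Curator-entered parentage (hypothetical structure); the genus is in NCBI.
parent_of = {"Leucauge mariana": "Leucauge"}
in_ncbi = {"Leucauge"}
chain = chain_to_ncbi("Leucauge mariana", parent_of, in_ncbi)
```

For L. mariana the chain has a single link, which matches the case described above. A taxon whose genus is also missing from NCBI would simply produce a longer list for the OWL generator to walk.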

I also followed through on a long overdue change that rippled throughout the implementation. I have been using the term ‘assertion’ throughout the code to refer to a conjunction of taxon, behavior, body part, publication, and eventually evidence. The problem with calling these ‘assertions’ is simply that the OWL specification talks about assertions that are much smaller – an Arachnolingua assertion can be decomposed into about a dozen OWL assertions or RDF triples (formally, an RDF triple is one way, probably the most common, of implementing an OWL assertion). So, to avoid confusion with OWL assertions (and since OWL also has a subset of assertions referred to as annotations, switching the Arachnolingua terminology to annotation would merely change the problem), I started looking for synonyms.

The best synonym for ‘assertion’ that I found seemed to be ‘claim’ – it has the advantage of not sounding definitive and suggesting the need for provenance (talking about a claim leads easily to the question ‘who’s making this claim?’). Observation was a tempting second choice – most of the ‘assertions’ in Arachnolingua are observations, but not all, and if I were to use the term observation in Arachnolingua, I would prefer it to refer to claims at the level of individuals – individual spiders doing a particular something on a particular substrate. So, in the software taxonomy, I could look at observations as a subclass of claims – though I haven’t yet seen the need to implement such a subclass.

After going back and forth on whether the switch made sense, I went ahead and implemented it everywhere – curation, database, and owl generator now refer to claims.   In the curation screen, it seemed worthwhile to include observation as that was such a strong alternative:

Image

Changes like this are ugly, but only get uglier the longer they are put off.

Finally, I’ve spent some time on the query component of the website. As I’ve described before, I’ve implemented an API that supports queries for the implemented search pages (currently just taxon and behavior). The API generates SPARQL to query the in-memory Sesame triple store and returns the unprocessed JSON that Sesame generates. Turning this JSON into results tables is pretty straightforward (though it represents the only ‘production’ JavaScript I’ve written to this point). The SPARQL query is rather slow, particularly for queries to higher taxonomic groups: queries to Tetragnatha straminea are fast, queries to Arachnida are much slower and definitely not recommended at this time. I took a long hard look at the SPARQL I had written and found a couple of things I thought would really speed it up – the net result was only a 20% speedup for the API call (measured on the browser side using firebug). I’m pretty sure I need to be traversing the taxonomy more intelligently, maybe using some server side inference.
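For readers curious about the JSON-to-table step, here is a rough sketch. The site does this in JavaScript; I’ve written it in Python only to keep it short. It assumes the store returns the standard SPARQL JSON results layout (head.vars plus results.bindings), and the variable names and sample values are made up.

```python
def results_to_rows(sparql_json):
    """Flatten SPARQL JSON results into a header list and a list of rows."""
    header = sparql_json["head"]["vars"]
    rows = []
    for binding in sparql_json["results"]["bindings"]:
        # A variable left unbound (e.g. by OPTIONAL) is absent from the binding.
        rows.append([binding.get(var, {}).get("value", "") for var in header])
    return header, rows


# Made-up response in the standard layout; not real Arachnolingua output.
sample = {
    "head": {"vars": ["taxon", "behavior"]},
    "results": {"bindings": [
        {"taxon": {"type": "literal", "value": "Tetragnatha straminea"},
         "behavior": {"type": "literal", "value": "example behavior"}},
        {"taxon": {"type": "literal", "value": "Tetragnatha straminea"}},
    ]},
}
header, rows = results_to_rows(sample)
```

Each row lines up with the header, so writing table cells is a simple nested loop, and unbound variables show up as empty cells rather than errors.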

I also added lexical filtering to terms from NCBI. There are a lot of ‘taxa’ in NCBI that reflect either samples or unidentified species. As samples are individuals (or collections of individuals), their identifiers don’t belong in the taxonomy for Arachnolingua’s purposes. This is completely aside from whether any behavior data was ever collected from these individuals – I tracked down some underlying Genbank submissions and some were from old collaborators and friends in the Maddison lab, so there might be some video data for some of these submissions. The worst offender in sheer quantity was the Barcode of Life and its BOLD submissions. I know NCBI is planning to move samples to a separate database, but I don’t know whether this only applies to new records or, if existing records are moved, how quickly this will happen. So it is best to filter these out consistently in both curation and OWL generation. Removing these terms did provide some query speedup for the worst cases.
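A lexical filter of this kind might look something like the sketch below. The patterns are my guesses at what sample-like names tend to contain (an "sp." or "cf." marker, a BOLD tag, digits) and are assumptions for illustration, not the exact rules used in the curation tool or Owlbuilder; the sample-style names in the list are invented.

```python
import re

# Hypothetical markers of sample or unidentified-species names.
SAMPLE_LIKE = re.compile(
    r"\bsp\.|\bcf\.|\baff\.|\bBOLD\b|\bunidentified\b|\benvironmental sample\b|\d"
)


def is_taxon_name(name):
    """True when the name shows none of the sample-like markers."""
    return SAMPLE_LIKE.search(name) is None


# The second and third names are invented to show the filter at work.
names = [
    "Tetragnatha straminea",
    "Tetragnatha sp. ABC-2010",
    "Araneae sp. BOLD:XXX0000",
    "Leucauge mariana",
]
kept = [name for name in names if is_taxon_name(name)]
```

Keeping the rules in one place like this is what lets the same filter be applied consistently on both the curation and OWL generation sides.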

Lots of interesting spider behavior posted to the web in the past few weeks, but this post is already long enough.  I hope my northern hemisphere readers are enjoying the spring as I have.

 

 

Updates, taxonomy, and post-publication review

I’m starting this hoping it will be quick, but thinking about it this morning, I realize there is a lot of ground I could be covering here.  So here goes…

Updates: Last week I tweaked the ethogram (taxonomy) view so that entering the name of a higher level taxon will retrieve behaviors for all included (subsumed) taxa. This is implemented in the simple, non-elegant way – crawl the tree and retrieve the annotations using SPARQL for navigating, but the control is all implemented in Java. Of course traversing the tree has one advantage over a reasoner query to retrieve all included taxa – the results are guaranteed to come back in some sort of tree traversal order. It works (try ‘Tetragnatha‘), but it is a bit slow. I’ve also configured a more capable server, but haven’t deployed it yet, so be patient with these queries (there are some that seem to require 2-3 minutes to complete, I’ll let you figure out which).
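In outline, the crawl looks something like this sketch. It is in Python rather than the Java the server uses, and the two dictionaries stand in for the SPARQL lookups (children of a taxon, annotations on a taxon); the function name and the placeholder behaviors are hypothetical.

```python
def collect_behaviors(taxon, children_of, behaviors_of):
    """Pre-order crawl: report a taxon's behaviors, then each child subtree."""
    results = []
    stack = [taxon]
    while stack:
        current = stack.pop()
        for behavior in behaviors_of.get(current, []):
            results.append((current, behavior))
        # Push children in reverse so the alphabetically first one is popped next.
        stack.extend(sorted(children_of.get(current, []), reverse=True))
    return results


# Stand-ins for the triple store; the behavior labels are placeholders.
children_of = {"Tetragnatha": ["Tetragnatha versicolor", "Tetragnatha straminea"]}
behaviors_of = {
    "Tetragnatha straminea": ["behavior A"],
    "Tetragnatha versicolor": ["behavior B"],
}
results = collect_behaviors("Tetragnatha", children_of, behaviors_of)
```

Because every node costs at least one lookup, the number of round trips grows with the size of the subtree, which is consistent with queries on large groups being the slow ones.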

Taxonomy: There’s not a lot new to report here – OpenTree has been keeping me busy these past few weeks. I have been doing some more curation tool work to support taxa outside of NCBI and, thanks to Chris Mungall and James Overton, there will soon be a new OWL rendering of the NCBI taxonomy, which should make its way into the backend database soon. I’m still tracking the addition of Arachnid taxa into NCBI – the majority of updates seem to be sample records, which won’t help with behavior, though new species for ticks and spiders are trickling in as well.

Also, yesterday was Taxonomist Appreciation Day.  Although I have dabbled in taxonomy informatics (TDWG, VTO, a bit in OpenTree, as well as the taxonomy work here) I would never consider myself to be a taxonomist.  I do, as should any biologist, appreciate and thank the generations of taxonomists in the 250+ years since Linnaeus who have brought order and names for the millions of species we share this planet with.

Curation and Post Publication Review: A couple of items I found on Twitter over the past few days have sparked an interesting thought. The first was a paper on how curators of the UniProtKB database responded to a changing understanding of the activity of the SiRT-5 protein. Initially this protein was understood to exhibit deacetylase activity, based primarily on documented activity of other members of the family and some in vitro assays that demonstrated the deacetylase activity. More recent papers have documented that the in vivo activity of this protein is more likely to be desuccinylation. The paper describes how annotations in the UniProtKB were modified to incorporate both classes of activity in the appropriate contexts, providing a review process for the earlier reports in light of later results. Thus the curation process provides a post-publication, albeit specialized, peer review.

This is relevant in light of this post I saw this morning on the likely limits of post-publication peer review.  Now, the particular papers discussed in the UniProt example were published in high profile journals such as Cell and Science, so the particular case does not speak against the 1% notion mentioned in the Dynamic Ecology post.  But not all curation is focused on the sort of topics that make it into the elite 1% of published papers.  My publication database does have a few papers from Science, Nature and one or two other high profile publications.  But the majority come from places such as the Journal of Arachnology, Animal Behavior, or lesser known journals from Japan or Latin America.  This leads me to a somewhat more optimistic conclusion about the future of post-publication peer review than Jeremy Fox.

Next week, I hope to discuss some of the papers I’m in the process of adding and possibly return to the issue of front-end data stores.