Advantages and disadvantages; capturing the big picture

The advantage of rolling your own curation tool is that you can make it work exactly the way you want to match your workflow. The disadvantage (naturally) is that you have to implement the parts that match your workflow. I spent some time over the weekend intending to do some more curation (to have something to share at the Phenotype RCN meeting this week), but this quickly turned into fixing a number of gaps and some minor layout redesign on the assertion (annotation) entry page. Probably the most important change in terms of workflow was redirecting the browser after an assertion or participant was entered, rather than simply returning to a blank page. This was pretty straightforward, as the page already had the ability to parse URL arguments and load the appropriate assertion and participant records into the forms. I also added a button at the bottom of the form to go directly to a blank form for adding the next assertion (previously you needed to go back to the home page and use the enter assertion button in the left column).
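As a rough illustration of the redirect idea, here is a minimal sketch using only the Python standard library. The path and the parameter names (`assertion`, `participant`) are hypothetical and are not taken from the arachadmin code; the point is only that the same helper can produce both the "reload what was just saved" URL and the blank-form URL.

```python
from urllib.parse import urlencode


def entry_url(base, assertion_id=None, participant_id=None):
    """Build the entry-page URL to redirect to after a save.

    With no ids, this is the blank form; with ids, the page can parse
    the query arguments and reload the matching records.
    NOTE: 'assertion' and 'participant' are hypothetical parameter names.
    """
    params = {}
    if assertion_id is not None:
        params["assertion"] = assertion_id
    if participant_id is not None:
        params["participant"] = participant_id
    if not params:
        return base
    return base + "?" + urlencode(params)
```

Calling `entry_url(base, 12, 34)` after a save would send the browser back to the records just entered, while `entry_url(base)` is what a "next assertion" button would link to.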

[Screenshot: new assertion window]


I also added links to the appropriate assertion pages from the participant list table and the assertion list page, which now shows pidgin OWL and a publication reference string rather than id numbers.  The link from the list shows the assertion and the primary participant.

The other issue I’ve been thinking about came up when I started reviewing the Aiken and Coyle (2000) annotations. One of the most interesting behavioral observations in the paper was the multiple methods of prey capture in Tetragnatha straminea (web capture, but also direct capture using leg spines). Clearly the existing annotations fail to capture this. Part of this is the general difficulty of describing negatives, in description logic in general and more particularly in OWL-EL. However, the more immediate problem is that there are two assertions of prey capture, and the notion that one involves web + legs as participants while the other involves only the legs is missing. The necessity of using predator behavior as the behavior tag only compounds the problem.
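To make the gap concrete, here is a small sketch of what the two assertions would need to carry in order to be told apart. The class, field names, and labels are purely illustrative assumptions, not the arachadmin data model; the only thing drawn from the paper is that one capture method involves web and legs and the other only legs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    """Hypothetical, simplified assertion record."""
    behavior: str
    taxon: str
    participants: frozenset


# Same behavior tag and taxon; only the participant sets differ.
web_capture = Assertion("prey capture", "Tetragnatha straminea",
                        frozenset({"web", "leg"}))
direct_capture = Assertion("prey capture", "Tetragnatha straminea",
                           frozenset({"leg"}))


def distinguishing_participants(a, b):
    """Return the participants found in exactly one of the two assertions."""
    return a.participants ^ b.participants
```

Note that this only records which participants are present. Saying that the web is explicitly absent from the second capture method is the negation problem described above, and a set difference does not solve it.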


Refining the curation tools: making the assertion page usable.

I spent some time last night working on the most important, or at least most central, page in the arachadmin curation tool: assertion editing. This page, shown below, brings together assertions (behavior, publication, evidence) with the associated participants (taxon+anatomy, or substrate, meaning living or non-living portions of the environment). It’s a bit of a mess as I smooth out the rough edges with my imperfect understanding of web2py and its forms, but I made a couple of improvements last night in the list of associated participants. The list now appears under the assertion form rather than the participant form. This change was facilitated by removing a couple of fields that more properly belong in the (primary) participant; this shortened the form and freed up some space to move the participant list (and it seems more appropriate to group the list with its owner rather than with one of its members). The other change is that the ‘pidgin functional owl’ description of the participant now has a link that redirects to the assertion page with the appropriate participant (and assertion) filled in. Prior to this the page was difficult to navigate without typing messy URLs with id values obtained by manually querying the database. That was sufficient for the first nine entries, but it clearly won’t be enough for the thousands to come.



It’s Darwin Day – I’ll celebrate by updating the site

I’ve spent a couple of hours cleaning up the arachnolingua site. In particular, the link to the old ontology file is gone from the top menu bar; that file hasn’t been valid in nearly a year and even the current file doesn’t deserve that level of billing. If you want to see the OWL file, there is a link on the project page (available from the ‘about’ menu item). The new OWL file is, I’m happy to announce, a reasoned-over version: inferred subclass, equivalent class, and class assertion axioms have been added to the ontology. ELK didn’t complain (unlike when it reasons over the mashup of supporting ontologies) and a quick inspection in Protege showed no apparent changes when the resulting ontology was re-reasoned with FACT++ (unlike when the unreasoned ontology was loaded, where several top level classes disappeared after reasoning). The lack of improvement resulting from reasoning in Protege doesn’t say much about FACT++ vs. ELK, simply that the KB is expressively simple enough that it is covered by the portion of OWL-EL that ELK reasons over. It’s also still quite small, so the speed advantage of ELK isn’t very obvious in the second pass. Hopefully arachb will keep growing past the point where this is no longer the case.

The project page has also been updated to link to the source ontologies directly rather than the MIREOT’ed owl files that went with the old KB file. Perhaps these and other questionable links have something to do with all the hits on the projects page (and only the projects page) from Russian sources that have been showing up in my logs for the past week or so.

The other big change is the attention I’ve given to the taxonomy status page. First, a word of explanation: the taxonomy status page is a list of taxa I’ve encountered during curation (so the publications listed on this page have undergone some review and behavior annotation and will likely be the next ones to appear in the KB) that do not appear in the NCBI taxonomy. The list here is over a year old, but none of the names listed have appeared in NCBI in the intervening months. In some cases (e.g., synonymy) this is to be expected. In any case, I’m using the World Spider Catalog as my authority for cases where names don’t appear in NCBI. Because the WSC text pages don’t contain full species names and lists of genera are split by family, I’ve used Wikipedia and EoL to get family names for these taxa. Looks like this may trigger some EoL contributions on my part.

In any case, enjoy, and feel free to comment on what you see, suggest papers to review (or bump up the queue), or offer any other suggestions.




A bit of curation

Over the weekend I spent some time messing with publications and merging author strings and made some progress, but decided it wasn’t as much of a priority as shaking down the existing system by adding some more annotations. So last night I added 7 more annotations from Aiken and Coyle’s 2000 Tetragnatha survey. This added variety in behavior, taxonomy and anatomy, as well as forcing me to confront the lack of vocabulary for substrates. In particular there are annotations for prey handling and wrapping. Since the paper doesn’t explicitly identify the prey involved, and assuming anything beyond arthropod is risky (even that isn’t 100% certain), it may be best to identify a term for prey from an ecological ontology (maybe something to pursue at the PCO workshop in Tucson I’ll be attending in two weeks).

I ran into a couple of minor problems, one of which appeared as a crash caused by a publication that had an empty string rather than a NULL in its database DOI field. The more interesting fix was to add the code for pulling in parents and annotations of anatomy terms (I missed that in the first pass and just got lucky with the two anatomy terms I used). This revealed itself when I loaded the OWL file into Protege and saw that ‘whole organism’ was no longer labeled (it appeared by its OBO identifier). In retrospect this is a little strange, since ‘whole organism’ was one of the two anatomy terms that were used with the first two annotations; however, adding the parent and annotation (e.g., rdfs:label) extraction seems to have resolved the problem.
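The empty-string-versus-NULL problem is a common one, and one way to guard against it is to normalize the field as it is read. The sketch below is only an assumption about how such a guard might look (the function name is made up, and this is not the fix that was actually applied); it treats missing, empty, and whitespace-only values the same way.

```python
def normalize_doi(raw):
    """Map NULL (None), empty, and whitespace-only DOI values to None.

    Downstream code then has a single 'no DOI' case to check for.
    Hypothetical helper; not taken from the arachadmin/owlbuilder code.
    """
    if raw is None:
        return None
    cleaned = raw.strip()
    return cleaned or None
```

The alternative is to clean the data once in the database, but a guard at read time also protects against the next record that gets entered with a blank field.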

Yesterday I got a query on this blog about my use of the ELK reasoner. I paid a bit more attention to what it was reporting last night. It is currently run over the merge of all the support ontologies (which constitutes the import closure of the ontologies that actually get used in annotation) and allows querying of the subsuming (parent) classes and, in the case of taxonomy, the subsumed (child) classes. These queries determine what will be pulled into the target, which is represented as a separate OWL ontology. Now this fairly large collection of 11 ontologies covers a range of expressiveness from AL (what Protege calls the base attributive language) through to a number of ontologies that Protege reports as SIQ. Apparently these are complex enough to cause ELK to complain about, but then ignore, a number of axioms. For the present purpose the reasoning is sufficient even if it may be incomplete.

What I haven’t done yet, and probably should, is run ELK across the generated target KB and see whether there are problems.  I noticed that running FACT++ within Protege did reduce the number of root concepts when I reviewed the output of OWLBuilder, so there is something to be gained.

In any case, there are now short ethogram listings for the genera Tetragnatha and Deinopis, as well as some new annotations for T. straminea. I’m not sure yet whether to tackle improved display for the ethogram table or try a behavior catalog (behavior hierarchy with taxon counts) or an anatomical catalog.

Also, the response time is starting to be noticeable (generally 1.5-2 sec for ethogram queries). It may be that speed, rather than memory (which I had expected to be the constraint), will bump me up to the next AWS tier, but I’m not ready to make that move yet. Stay tuned.



Back to curation

So, now that I’ve built the skeleton of a pipeline, I’m going back to the start to work on building it up. First stop is the curation tool (arachadmin) and processing publications. Up until now all the author information has been stored as an undecoded string more or less directly from the citation (e.g., “Last, first; Last, first; Last, initials”). The tool does chew this up a bit in the publication list display, but none of that is getting stored or passed down the pipeline. During the past week I’ve finished implementing an author table that will include assigned ids (intended to be ORCIDs for authors that have them) or generated ids. For the record, I do have an ORCID, it’s 0000-0001-6512-3296, though the information there is a bit out of date (try me on Impact Story, which I find easier to update).
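For a sense of what decoding such a string involves, here is a minimal sketch that splits the “Last, first; Last, first” form into ordered records. The function and field names are assumptions for illustration, and the example names are invented; real citation strings will have irregularities (missing commas, suffixes, “et al.”) that this does not attempt to handle.

```python
def parse_authors(author_string):
    """Split 'Last, first; Last, first' into ordered author records.

    Each record keeps its listing order (first author = 1).
    Hypothetical helper; a name without a comma ends up with an
    empty 'given' field rather than raising an error.
    """
    chunks = [c.strip() for c in author_string.split(";") if c.strip()]
    authors = []
    for position, chunk in enumerate(chunks, start=1):
        last, _, given = chunk.partition(",")
        authors.append({
            "order": position,
            "last": last.strip(),
            "given": given.strip(),
        })
    return authors
```

For example, `parse_authors("Smith, Jane; Jones, R. T.")` yields two records, with Smith at order 1 and Jones at order 2.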

A perennial problem with maintaining citations is the multiple ways an author’s name can appear. To manage this, I’ve been implementing a simple system to allow me to merge these variants. I’ve tried to design something that is flexible; in particular, a merge can be reversed (for example, if two authors with the same last name and first initial are merged by mistake, they can be separated once this is discovered).

The underlying database changes to support this involve tables for authors, merge records (allowing the original usages to be extracted), and a table of authors, publications, and listing orders (e.g., first author = 1, etc.). The tables are there, but currently only the authors table is filled. The merge information can be generated, and after a number of hours of fiddling with forms and tables within forms, there is now a page for displaying candidate merges (generated on the fly, since the database has not been populated). The results from the form are being thrown at a page that currently ignores them.
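The three-table layout can be sketched roughly as follows. This is an assumption-laden illustration, not the actual arachadmin schema: it uses an in-memory SQLite database as a stand-in for MySQL, and every table, column, and function name is invented. The idea it tries to show is that if merges live in their own table and the authorship rows are never rewritten, undoing a merge is just deleting a row.

```python
import sqlite3

# Hypothetical schema: names are illustrative only.
SCHEMA = """
CREATE TABLE author (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE author_merge (
    id INTEGER PRIMARY KEY,
    kept_author INTEGER NOT NULL REFERENCES author(id),
    merged_author INTEGER NOT NULL UNIQUE REFERENCES author(id)
);
CREATE TABLE authorship (
    publication INTEGER NOT NULL,
    author INTEGER NOT NULL REFERENCES author(id),
    position INTEGER NOT NULL  -- first author = 1
);
"""


def merge(conn, kept, merged):
    """Record that author 'merged' is the same person as 'kept'."""
    conn.execute(
        "INSERT INTO author_merge (kept_author, merged_author) VALUES (?, ?)",
        (kept, merged))


def unmerge(conn, merged):
    """Reverse a merge; original authorship rows were never changed."""
    conn.execute("DELETE FROM author_merge WHERE merged_author = ?", (merged,))


def canonical_author(conn, author_id):
    """Resolve an author id through one level of merging."""
    row = conn.execute(
        "SELECT kept_author FROM author_merge WHERE merged_author = ?",
        (author_id,)).fetchone()
    return row[0] if row else author_id
```

This sketch resolves only a single level of merging; chains of merges (A into B, then B into C) would need either a loop in the lookup or a rule that merges always point at the final kept author.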

Once the merging is taken care of, the next step will be to push these to owlbuilder with the promise of a proper publication page for the main webapp.

On the horizon is PostgreSQL: the day job is considering switching over from MySQL to PostgreSQL and I would be happy to follow them. I got a fair amount of PostgreSQL experience from my time with Phenoscape, and the future of MySQL has held a bit of uncertainty ever since Oracle bought it along with the rest of Sun. The experience will help, since administration of PostgreSQL is a bit more involved than MySQL, for which administration is much closer to ‘simply working.’