Tackling OBO imports

Descriptions need vocabulary. Since I am using vocabulary from OBO ontologies, I need to make the terms available for constructing assertions.  Making them available means reading them from somewhere, finding the terms, definitions, etc., and either displaying them directly or reading them from a cache that needs to be updated from time to time.  The NCBI Taxonomy is updated daily, and although most of the daily updates will not include arachnids, it still makes sense to be able to force updates on the taxonomy as well as on other ontologies, either automatically at each startup (as done by Jim Balhoff’s Phenex, an excellent tool but not quite right for present purposes) or on command, trusting the curator to do the updates from time to time.
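The two update policies (refresh when the cache is old, or refresh on command) can be sketched as a single check.  This is only an illustration: the function name cache_is_stale, its parameters, and the idea of judging age by the file's modification time are my assumptions here, not something the tool does yet.

```python
import os
import time


def cache_is_stale(cache_path, max_age_seconds, force=False, now=None):
    """Return True if the cached ontology file should be fetched again.

    Hypothetical helper: `force` models the curator's "update now" command,
    and `max_age_seconds` models a refresh policy based on file age.
    """
    # A forced update or a missing cache file always triggers a refresh.
    if force or not os.path.exists(cache_path):
        return True
    # `now` can be passed in explicitly, which makes the check easy to test.
    current_time = time.time() if now is None else now
    age_seconds = current_time - os.path.getmtime(cache_path)
    return age_seconds > max_age_seconds
```

Updating at every startup is then just the special case of calling this with force=True, while a daily policy would pass max_age_seconds=86400.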

For parsing RDF/XML, which remains the default format for the OBO Foundry’s rendering of OWL files, I’m using the SAX-like iterparse facility of the lxml Python library.  The choice of lxml again comes from my day job, though most NeXML files are, unlike NCBITaxon.owl, small enough to be processed with a DOM parser without blowing out memory.  I’ve tested the parsing and, at least for now, the building of a list of classes with both a relatively small ontology (Evidence codes) and the OWL serialization of the NCBI Taxonomy.  So far, good enough.
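For readers who haven't used iterparse, here is a minimal sketch of the streaming approach.  To keep it runnable without extra installs it uses the standard library's xml.etree.ElementTree.iterparse rather than lxml; lxml.etree.iterparse can be used in much the same way.  The function name collect_classes, the example.org IRIs, and the labels in the sample document are made up for illustration, and this is not the actual code in ontology_tools.

```python
import io
import xml.etree.ElementTree as ET

OWL = "http://www.w3.org/2002/07/owl#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"


def collect_classes(source):
    """Stream an RDF/XML file and return {class IRI: label or None}."""
    classes = {}
    # "end" events fire once an element and all of its children are parsed.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != f"{{{OWL}}}Class":
            continue
        iri = elem.get(f"{{{RDF}}}about")
        # Anonymous classes (no rdf:about) are skipped.
        if iri is not None:
            label = elem.findtext(f"{{{RDFS}}}label")
            # Do not let a label-less mention overwrite a known label.
            if label is not None or iri not in classes:
                classes[iri] = label
        # Drop the element's children and text so memory stays bounded.
        elem.clear()
    return classes


# Tiny made-up document standing in for a real ontology file.
SAMPLE = b"""<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:about="http://example.org/EX_0000001">
    <rdfs:label>first example class</rdfs:label>
  </owl:Class>
  <owl:Class rdf:about="http://example.org/EX_0000002">
    <rdfs:label>second example class</rdfs:label>
    <rdfs:subClassOf rdf:resource="http://example.org/EX_0000001"/>
    <rdfs:subClassOf><owl:Class/></rdfs:subClassOf>
  </owl:Class>
</rdf:RDF>
"""

example_classes = collect_classes(io.BytesIO(SAMPLE))
```

The point of clearing each class element after it has been read is that the parser never holds more than a small part of the tree's content at once, which is what matters for a file the size of NCBITaxon.owl.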

Also made the ontology status page available via a menu item.  The parser (in a module called ontology_tools) is not yet hooked up to the status page; I want to turn the parsing result into something more useful first (probably a db table).  Also added some hooks for linking publication citations on the publication status page back to the editing page for that publication.
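As a rough idea of what "a db table" for parsed terms might look like, here is a sketch using the standard library's sqlite3.  Everything in it is an assumption for illustration: the table name term, its two columns, the function names, and the choice of SQLite itself; the real schema is still undecided.

```python
import sqlite3


def store_terms(connection, terms):
    """Save a {IRI: label} mapping into a (hypothetical) term table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS term (iri TEXT PRIMARY KEY, label TEXT)"
    )
    # INSERT OR REPLACE lets a later ontology update overwrite old labels.
    connection.executemany(
        "INSERT OR REPLACE INTO term (iri, label) VALUES (?, ?)",
        terms.items(),
    )
    connection.commit()


def find_terms(connection, fragment):
    """Return (iri, label) rows whose label contains the given fragment."""
    cursor = connection.execute(
        "SELECT iri, label FROM term WHERE label LIKE ? ORDER BY label",
        (f"%{fragment}%",),
    )
    return cursor.fetchall()
```

A lookup like find_terms is the sort of query a curation form would need when offering terms for an assertion.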




Adding some status tools

Most of my work over the past couple of days has been on a publication status tool.  This page lists the publications and, for each one, the issues identified for curation.


The buttons on the left trigger a couple of update tools.  The first auto-updates new, derivative fields that can be filled from existing fields.  For example, the original spreadsheet contained a disposition field with free text like ‘Not found’ or ‘Downloaded’.  Free text like this is not sufficient for the type of curation and provenance I have in mind, so I’ve added a vocabulary of publication curation status types (the ‘vocabulary’ is currently just a table of ids and strings).  The update tool tries to set the curation status from the contents of the disposition field.
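The mapping from disposition text to a status id can be sketched roughly as follows.  The ids, the dictionary standing in for the vocabulary table, and the function name are all hypothetical; only the two example strings come from the spreadsheet.

```python
# Stand-in for the vocabulary table of ids and strings (ids are made up).
STATUS_VOCABULARY = {
    1: "Not found",
    2: "Downloaded",
}


def status_from_disposition(disposition):
    """Return the matching status id, or None if the text is not recognized."""
    if not disposition:
        return None
    # Collapse runs of whitespace and ignore case before comparing.
    normalized = " ".join(disposition.split()).casefold()
    for status_id, label in STATUS_VOCABULARY.items():
        if label.casefold() == normalized:
            return status_id
    # An unrecognized value is left for the curator to resolve by hand.
    return None
```

A None result is what would be reported as a problem for that publication, rather than guessing at a status.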

The second button links to an as-yet-unimplemented tool for checking and updating doi strings.  The plan is to validate existing dois and, if possible, to query CrossRef to fill in dois I haven’t already found (either pubs I missed or cases where older publications had dois added post hoc).  I’ve run into a couple of places where CrossRef listed multiple dois for what seems to be the same publication, a situation that I was able to resolve manually by checking the publisher’s site, but I’m not planning to deal with the problem of different dois across different registrars discussed by Rod Page.
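The validation half of that plan could start with a purely syntactic check, before any query to CrossRef.  The sketch below is a heuristic under my own assumptions: the function names are hypothetical, and the pattern (a "10." prefix, a 4 to 9 digit registrant code, a slash, and a non-empty suffix) covers the common modern form rather than every doi ever issued.  It does not check that the doi actually resolves.

```python
import re

# Heuristic shape of a doi: 10.<registrant digits>/<non-empty suffix>.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Prefixes that often get pasted in along with the doi itself.
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw):
    """Strip surrounding whitespace and a leading resolver URL or 'doi:'."""
    text = raw.strip()
    lowered = text.lower()
    for prefix in DOI_PREFIXES:
        if lowered.startswith(prefix):
            text = text[len(prefix):].strip()
            break
    return text


def looks_like_doi(raw):
    """Return True if the string has the general shape of a doi."""
    return bool(DOI_PATTERN.match(normalize_doi(raw)))
```

Strings that fail this check are candidates for the problem list; strings that pass still need a lookup to confirm they point at the right publication.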

The current status display is simply a list of problems, which can only include, for now, problems with processing the disposition field as described above.  The colored empty lists are a temporary issue I’ll get back to shortly.

Meanwhile, there is also the start of support for pulling vocabulary (e.g., anatomy, taxonomy) from external ontologies.  This will require some OWL parsing as well as parsing the NCBI taxonomy download files (should be easier than parsing OWL).  There is the start of database support as well as a page for ontology sources (e.g., name, url) and a vocabulary of ontology processing types (file formats).  I expect this will follow the lead of the publication status page.

Small updates

Not much activity recently – finished writing up my experience at the Animal Behavior Society meeting, finally back to development.  Starting to implement the publication curation status page – list of publications with identified problems to address.  Also updated the readme and added the license for the owlbuilder tool.  Resisting temptation to continue playing with the site display, but it’s just a temporary thing, so on with the backend tool development and (real) curation.