Why Arachnolingua will not use dc:terms to describe publications

Along with all the buzz and excitement (and overloaded servers) that accompanied the release of the OpenTree PNAS paper, there was a question about why OpenTree was not publishing RDF.  Jonathan Rees ably answered the question here (I refuse to quibble about the implication that I don’t know RDF).  The gist was that it could have been done, but RDF is easy to do badly and hard to do right.  This is very true, and it led me to reconsider not whether to generate OWL, but how well I was living up to best practices.  Arachnolingua is not really an ontology and certainly doesn’t aspire to OBO Foundry status, but many of the practices the Foundry recommends are worth following.  All the code and data are available, standard vocabularies are used whenever possible, and I will open this up to wider collaboration once things have settled down.

About the same time that the OpenTree paper came out, I was considering the implementation of landing pages for publications, as well as terms in the ARACHB namespace.  I am implementing a service that will provide a page for all terms of the form http://arachb.org/ARACHB/ARACHB_XXXXXXX.  These include generated identifiers for claims and individuals (more about that in a later post) as well as taxa without an NCBI id and publications without a doi.  The landing pages will be minimal to start and will be available as json as well as html.  There will be landing pages for publications with a doi as well, simply because there are things that Arachnolingua has to say about a publication apart from what is on the publisher’s page that a doi generally resolves to (e.g., number of curated claims, individuals, etc.).
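As a rough illustration of what such a service has to do, here is a minimal Python sketch that recognizes an ARACHB term and renders the same record as either json or html.  It is not the actual Arachnolingua code: the function names, the record fields, and the assumption that the XXXXXXX part is exactly seven digits are all hypothetical.

```python
import html
import json
import re

# Assumption: a term IRI ends in "ARACHB_" followed by exactly seven digits.
TERM_PATTERN = re.compile(r"^http://arachb\.org/ARACHB/(ARACHB_\d{7})$")


def local_id(term_iri):
    """Return the local identifier of an ARACHB term IRI, or None if it does not match."""
    match = TERM_PATTERN.match(term_iri)
    return match.group(1) if match else None


def render_landing_page(record, as_json=False):
    """Render one record (a plain dict) as a json string or a minimal html page."""
    if as_json:
        return json.dumps(record, sort_keys=True)
    # Escape keys and values so arbitrary text cannot break the markup.
    rows = "".join(
        "<dt>{}</dt><dd>{}</dd>".format(html.escape(str(key)), html.escape(str(value)))
        for key, value in sorted(record.items())
    )
    return "<html><body><dl>{}</dl></body></html>".format(rows)
```

Keeping both renderings behind one function means the json and html views cannot drift apart, since both are produced from the same record.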

It is reasonable to expect that a landing page for publications would include the standard bibliographic information, especially for publications that lack a stable identifier (doi).  The information is available in the arachadmin database.  I originally thought I would just generate pages, and later that I would generate json files for each identifier and fill an html template from them.  Then I realized that there was no compelling reason not to add this information to the individual representing each publication.

Of course at this point, it would seem obvious to use the Dublin Core vocabulary.  But my experience on the Dryad project, as well as a discussion with Jonathan about two years ago, made me suspicious.  I was already using the OBO IAO (Information Artifact Ontology), so I started looking at IAO as an alternative to Dublin Core (DC).  Not a lot has been written about this, but Barry Smith has had something to say in this powerpoint.  Smith’s presentation makes it clear that while IAO is an ontology of information artifacts, DC is a vocabulary for tagging artifacts.  But looking at the set of properties provided by DC, it is less a vocabulary about how to find an artifact and more about rights management.  It looks like the DC vocabulary is used in places where it can be applied (e.g., dc:title, dc:contributor, dc:creator), but for location it merely provides a place to put a link, not a bibliographic reference.  This matches my experience with both Dryad and OpenTree.  In the former we used DC terms where possible and other vocabularies (e.g., prism) for other properties.  In OpenTree the DC vocabulary is not used at all, in favor of leaving the publication citation as an unparsed string, breaking out only the publication year, and using the ot namespace defined for the jsonized NeXML that studies are stored in.

In my searches, I found one project that, instead of using dc terms in the rdf, transformed all the dc properties into rdfs:comment strings.  I plan to implement something similar, but more focused.  The bibliographic information will simply be turned into a json string that will appear as an rdfs:comment on the publication individual.  This avoids needing to import the DC vocabulary or define a namespace for what are ultimately annotation strings.  The json will make page generation easy (the json page will simply be the untransformed comment string) and its meaning will be clear to anyone inspecting it.
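To make the idea concrete, here is a small Python sketch of the approach, assuming the comment is attached with the standard RDF Schema comment property (rdfs:comment) and written out as a Turtle-style triple.  The field names and the helper functions are hypothetical, and the real pipeline (owlbuilder, via the OWLAPI) would build the annotation differently.

```python
import json

RDFS_COMMENT = "http://www.w3.org/2000/01/rdf-schema#comment"


def bibliographic_comment(fields):
    """Serialize the bibliographic fields as one json string."""
    return json.dumps(fields, sort_keys=True, ensure_ascii=False)


def turtle_literal(text):
    """Quote text as a Turtle string literal, escaping backslashes, quotes, and line breaks."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return '"' + escaped + '"'


def comment_triple(publication_iri, fields):
    """Build a single triple that hangs the json string on the publication individual."""
    literal = turtle_literal(bibliographic_comment(fields))
    return "<{}> <{}> {} .".format(publication_iri, RDFS_COMMENT, literal)
```

A json landing page can then return the comment string as is, while the html page parses it with json.loads and fills a template.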

Announcement: There is an update of Arachnolingua coming.  Preview soon at http://dev.arachb.org/


Trouble with doi’s

My first attempt with Arachnolingua was building the knowledge base directly in OWL using Protege.  Protege is a very good ontology editor, and because the OWL support in Protege 4+ is based on the OWLAPI, which I have used and continue to use (one of my main reasons for continuing to use Java), it should play well with owlbuilder, which uses the OWLAPI to store and generate the ontology files.  Unfortunately Protege isn’t very good with doi’s.  It seemed to do better with http: style doi’s (e.g., http://dx.doi.org/10… vs. doi:10…), at least for naming the individuals for each publication (and declaring them to be publications).  However, when I attempted to refer to them in an assertion that a statement about behavior is part_of a containing publication, Protege complained on loading that the URI had no recognizable prefix.  Defining doi: as a prefix didn’t seem to resolve the complaint, so I went back and generated doi: prefixed URIs.  Protege seemed to accept this, but its display of the individuals as <doi:10…> suggests it isn’t entirely happy.  I’m not sure I have everything correctly (un)escaped in these new doi’s either.  This is not the highest priority in the big picture, but I’ll try to resolve it before adding more logic (taxonomy will be next).
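For the escaping question, one way to check that the two doi styles round-trip is sketched below in Python.  This is only an illustration, not what owlbuilder does: the function names are hypothetical, the example doi is made up, and it assumes that percent-encoding everything except letters, digits, "_", ".", "-", "~" and "/" is acceptable for the http: form.

```python
from urllib.parse import quote, unquote

RESOLVER = "http://dx.doi.org/"


def doi_to_http_uri(doi):
    """Turn 'doi:10…' (or a bare '10…') into an http: style URI with unsafe characters escaped."""
    bare = doi[4:] if doi.lower().startswith("doi:") else doi
    # Keep "/" literal so the prefix/suffix separator stays readable.
    return RESOLVER + quote(bare, safe="/")


def http_uri_to_doi(uri):
    """Reverse the conversion, undoing the percent-encoding."""
    if not uri.startswith(RESOLVER):
        raise ValueError("not a dx.doi.org URI: " + uri)
    return "doi:" + unquote(uri[len(RESOLVER):])
```

Running every doi in the database through both functions and comparing the result with the original would at least show whether anything is lost in the (un)escaping, independent of how Protege chooses to display the individuals.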