Now that I’ve built the skeleton of a pipeline, I’m going back to the start to build it up. First stop is the curation tool (arachadmin) and its processing of publications. Up until now all the author information has been stored as an unparsed string taken more or less directly from the citation (e.g., “Last, first; Last, first; Last, initials”). The tool does chew this up a bit for the publication list display, but none of that is getting stored or passed down the pipeline. During the past week I’ve finished implementing an author table that will include assigned ids (intended to be ORCIDs for authors that have them) or generated ids. For the record, I do have an ORCID; it’s 0000-0001-6512-3296, though the information there is a bit out of date (try me on Impact Story, which I find easier to update).
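As a rough illustration of the kind of string handling involved (not the actual arachadmin code), here is a minimal Python sketch that splits a citation-style author string into (last name, given names) pairs in listing order. The function name `parse_authors` is hypothetical, and the sketch assumes the “Last, first; Last, first” layout shown above with no suffixes or embedded semicolons.

```python
def parse_authors(author_string):
    """Split 'Last, first; Last, first' into (last, given) tuples.

    Hypothetical helper: assumes ';' separates authors and the first ','
    separates the last name from the given names or initials.
    """
    authors = []
    for chunk in author_string.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip empty pieces, e.g. from a trailing ';'
        last, _, given = chunk.partition(",")
        authors.append((last.strip(), given.strip()))
    return authors
```

The position of each tuple in the returned list is the listing order, so the first element is the first author.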
A perennial problem with maintaining citations is the multiple ways an author’s name can appear. To manage this, I’ve been implementing a simple system that lets me merge the variants. I’ve tried to design something flexible, and in particular reversible: if, for example, two authors who share a last name and first initial turn out to be different people, they can be separated after the merge.
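One plausible way to find candidate merges, sketched here purely as an assumption about how such a system might work, is to group names by last name and first initial and flag any group with more than one distinct spelling. The name `merge_candidates` and the matching rule are illustrative only; a real tool would still need a human to confirm each merge, since the rule deliberately over-matches.

```python
from collections import defaultdict


def merge_candidates(authors):
    """Group (last, given) pairs that share a last name and first initial.

    Hypothetical matching rule: case-insensitive last name plus the first
    character of the given names. Returns only groups with more than one
    distinct spelling, each as a sorted list of variants.
    """
    groups = defaultdict(set)
    for last, given in authors:
        key = (last.lower(), given[:1].lower())
        groups[key].add((last, given))
    return [sorted(variants) for variants in groups.values() if len(variants) > 1]
```

Because two different people can share a last name and initial, anything this produces is a suggestion, which is exactly why the merge needs to be reversible.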
The underlying database changes to support this involve a table for authors, a table of merge records (allowing the original usages to be extracted), and a table linking authors to publications with their listing order (e.g., first author = 1, etc.). The tables are there, but currently only the authors table is filled. The merge information can be generated, and after a number of hours of fiddling with forms and tables within forms, there is now a page for displaying candidate merges (generated on the fly, since the database has not been populated). The results from the form are being thrown at a page that currently ignores them.
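The three-table layout described above might look something like the following sketch. Everything here is an assumption for illustration: the table and column names are invented, and it uses Python’s built-in `sqlite3` rather than the MySQL database the tool actually runs on. The point is only to show how keeping merges as separate records, instead of overwriting author rows, makes a merge reversible.

```python
import sqlite3

# Hypothetical schema; names are illustrative, not the real arachadmin tables.
SCHEMA = """
CREATE TABLE author (
    id INTEGER PRIMARY KEY,
    last_name TEXT NOT NULL,
    given_names TEXT,
    assigned_id TEXT,   -- e.g. an ORCID, when the author has one
    generated_id TEXT   -- locally generated fallback id
);
CREATE TABLE author_merge (
    id INTEGER PRIMARY KEY,
    preferred INTEGER NOT NULL REFERENCES author(id),
    merged INTEGER NOT NULL UNIQUE REFERENCES author(id)
);
CREATE TABLE authorship (
    publication INTEGER NOT NULL,
    author INTEGER NOT NULL REFERENCES author(id),
    position INTEGER NOT NULL,  -- listing order: first author = 1
    PRIMARY KEY (publication, position)
);
"""


def record_merge(conn, preferred, merged):
    """Record that author `merged` is another spelling of `preferred`."""
    conn.execute(
        "INSERT INTO author_merge (preferred, merged) VALUES (?, ?)",
        (preferred, merged),
    )


def resolve(conn, author_id):
    """Return the preferred author id, or the id itself if never merged."""
    row = conn.execute(
        "SELECT preferred FROM author_merge WHERE merged = ?", (author_id,)
    ).fetchone()
    return row[0] if row else author_id


def undo_merge(conn, merged):
    """Reverse a merge; the original author row was never modified."""
    conn.execute("DELETE FROM author_merge WHERE merged = ?", (merged,))
```

Since the authorship rows keep pointing at the original author ids, deleting a merge record restores the pre-merge state without any data having to be reconstructed.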
Once the merging is taken care of, the next step will be to push these to owlbuilder with the promise of a proper publication page for the main webapp.
On the horizon is PostgreSQL – the day job is considering switching over from MySQL to PostgreSQL and I would be happy to follow them. I got a fair amount of PostgreSQL experience from my time with Phenoscape, and the future of MySQL has been a bit uncertain ever since Oracle bought it along with the rest of Sun. The experience will help, since administration of PostgreSQL is a bit more involved than that of MySQL, which is much closer to ‘simply working.’