RIOXX application profile – draft 1

Together with Sheridan Brown, I have been tasked with developing some guidelines and a metadata ‘application’ profile for institutional repositories (IRs) in the UK. We are calling this work RIOXX. This post focusses on the application profile more than the guidelines, and describes phase 1 of the project, which aims to deploy this application profile across IRs in the UK by the first quarter of 2013.

Objectives

  • to develop an application profile which enables open access repositories to expose metadata more consistently and which, in particular, conveys information about how the item being described in the metadata was funded
  • to develop general guidelines for repositories which support the use of the application profile
  • to support such technical development as is necessary to implement these recommendations and the application profile in common repository platforms
  • to develop these such that they pave the way for a likely CERIF-based solution in the medium-long term.

Scope and approach

Funder policy regarding Open Access (OA) is being actively developed and the OA landscape is shifting. The emphasis in this phase of RIOXX is on doing something which is adequate and can be implemented quickly. This work will provide an application profile and guidelines which are inherently an interim solution. Broadly speaking, the approach we are taking is as follows:

Develop the simplest possible application profile, based on Dublin Core (DC).

Pretty much all repositories support DC, as another application profile of DC, OAI-DC, is a mandated minimum metadata format for the ubiquitous protocol for harvesting metadata from repositories (OAI-PMH). If all goes well, the development work needed for repository systems should be minimised.
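
As a rough illustration of why this matters, an OAI-PMH harvester asks a repository for records in a named metadata format, and `oai_dc` is the prefix the protocol defines for the format every repository must support. The following is a minimal Python sketch of building such a request. The base URL and the function name are made up for illustration, and the prefix a repository would use for RIOXX records is not defined in this post.

```python
from urllib.parse import urlencode


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH ListRecords request URL for one metadata format."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


# Hypothetical repository endpoint, for illustration only.
url = list_records_url("https://repository.example.ac.uk/oai")
print(url)
```

A harvester would fetch this URL and parse the XML response; that part is left out here to keep the sketch free of network calls.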

Consider other, related guidelines

We have examined two related initiatives: the OpenAIRE guidelines (and the Driver guidelines which preceded these), and the EThOS Toolkit which developed an application profile of DC for eTheses.

Consider a CERIF-XML expression of this application profile

The interest in CERIF as the de facto standard format for exchanging this kind of information between systems is growing steadily. We are liaising with the CERIF Support Project and ensuring that a transition towards a CERIF-based approach remains viable.

Develop a modelled, expressive application profile

In later phases of RIOXX, we hope to develop the application profile more fully. This will take into account such things as:
* greater use of controlled vocabularies
* a move away from DC and towards CERIF
* greater involvement of systems other than repositories – notably Current Research Information Systems (CRIS).
* modelling of ‘access-level semantics’ – i.e. describing how, where and under what license or conditions the resource might be accessed and used

Rationale for some decisions in phase 1

Keeping things very simple

Timescales are very, very tight. From a pragmatic, technical point of view we have restricted ourselves in this phase to developing an approach which allows the repository to emit RIOXX records based on information properties already catered for in the repository system (that is, the placeholders for Sponsor and ProjectID already being there, even if the actual data has not yet been entered). We have deferred a more complete and complex approach to a later phase because the capacity to deliver this kind of information from institutional systems is developing rapidly.

The ProjectID property

We found ourselves unable to simply adopt the OpenAIRE guidelines as these mandate a particular syntax for the ProjectID (designed for EC funded projects) which would preclude certain UK funders. In any case, we consider it to be a mistake to embed semantics into this property and believe it is best provided as a globally-unique, opaque identifier. To this end, we are actively looking at the possibility of funders minting DOIs for the ProjectID. In the meantime, we will be requiring that the ProjectID be whatever identifier is provided by the funder of the output being described in the record.
We have chosen the term ProjectID rather than, for example, GrantID, as we have been advised that the former is the more widely used term in the UK.

The Sponsor property

For phase 1 we are mandating this property, but specifying only that a recognised form of identifier for the funder/sponsor be used. This will mean a free-text string for now. We are actively exploring possibilities for identifying and then mandating a particular authority list of funder names, such that this property becomes underpinned by a controlled vocabulary. However, this will not make it into phase 1.
This property, while essential in the short term, might become more of a convenience than a necessity, as the ProjectID becomes more reliably ‘actionable’. In the medium-term, we would anticipate being able to reliably derive the sponsor/funder from the ProjectID. For this reason, we have not modelled the relationship between these two properties closely – except insofar as they exist in a particular record. This means that some records may contain more than one Sponsor and more than one ProjectID with no direct way to relate a given ProjectID to a given Sponsor. While it would be possible to model this relationship, we have chosen not to do so in this phase, because:

  • it is not the common case that a record would have more than one Sponsor
  • it is more likely that a record might have more than one ProjectID, but only one Sponsor. This happens where a project has multiple versions – such as when the PI moves institution during the project.
  • it is unlikely that current repository systems will be able to provide more richly modelled relationships between these properties without further development
  • it is the common case that a record will have one Sponsor and one ProjectID.

We anticipate that this will need to be modelled more thoroughly in future phases.
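
To make the consequence of this decision concrete, here is a minimal Python sketch of a record that holds Sponsors and ProjectIDs as two independent lists, which is how phase 1 treats them. The class name, the method and all sample values are hypothetical and exist only for illustration; they are not part of the profile.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FlatRecord:
    """A record with two unpaired lists, as in phase 1."""
    sponsors: List[str] = field(default_factory=list)
    project_ids: List[str] = field(default_factory=list)

    def sponsor_for(self, project_id: str) -> Optional[str]:
        """Return the sponsor of a ProjectID only when it is unambiguous."""
        if project_id not in self.project_ids:
            return None
        if len(self.sponsors) == 1:
            return self.sponsors[0]
        # Several sponsors: the flat record cannot say which one applies.
        return None


# One Sponsor, two ProjectIDs (e.g. the PI moved institution).
moved = FlatRecord(sponsors=["Funder A"], project_ids=["P-1", "P-2"])
# Two Sponsors, two ProjectIDs: the pairing is lost.
ambiguous = FlatRecord(sponsors=["Funder A", "Funder B"], project_ids=["P-1", "P-2"])

print(moved.sponsor_for("P-2"))
print(ambiguous.sponsor_for("P-1"))
```

With one Sponsor, every ProjectID can safely be attributed to it. Only the rarer multi-Sponsor record is ambiguous, which is the trade-off accepted for this phase.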

Deferring the ‘access-level-semantics’ question

In order to convey the precise nature of the open-access ‘state’ of a resource, RIOXX will need to develop a richer way of describing such concepts as ‘green’ or ‘gold’ open access, embargoes, licenses etc. The use-cases and operations which will depend on such information are not yet clear and, while the time has now come to model these, this should not be done in a hurry.

The following is a table of proposed elements and recommended formats. We propose to extend the Dublin Core elements with two new elements under the rioxxterms namespace.

  • M: Mandated
  • R: Recommended
  • O: Optional
| Element | Inclusion (M/R/O) | Format | Format (M/R/O) |
| dc:title | M | Free text. It is recommended to use the form: Title:Subtitle | R |
| dc:creator | M | Free text. Recommended practice is to either use the form Last Name, First Name(s) or a unique identifier from a recognised system. Each creator should be given a separate dc:creator element | R |
| dc:identifier | M | A globally unique identifier. It is strongly recommended to use a URI which can be de-referenced (i.e. is ‘actionable’) where this is appropriate | R |
| dc:source | M | Journal title, reference or ISSN | M |
| dc:language | M | Use ISO 639-3 language codes | M |
| rioxxterms.projectid | M | Use the identifier provided by the funder to indicate the project within which this output has been created | M |
| dc:coverage | O | The extent or scope of the content of the resource. Coverage will typically include spatial location (a place name or geographic co-ordinates), temporal period (a period label, date or date range) or jurisdiction (such as a named administrative entity). | |
| dc:rights | O | No agreed vocabulary or semantics exist for this in the context of Open Access papers, and it is common practice for this to be ignored by repositories currently. Some work is being funded to look at this area for the next phase of RIOXX. For now, this element has to be optional. | |
| dc:audience | O | Free text. | |
| dc:format | R | It is recommended to use the IANA registered list of Internet Media Types (MIME types) | M |
| dc:date | M | One date using ISO 8601. Published date is the default and recommended interpretation. | M |
| dc:type | O | This is currently free text and an optional element. However, RIOXX phase 1 will be recommending that a vocabulary be adopted or developed for this element. | O |
| dc:contributor | O | (as for dc:creator) | |
| rioxxterms.sponsor | M | Free text – Funder name using the funder’s preferred format | O |
| dc:publisher | R | Free text indicating the name of the publisher (commercial or non-commercial) | O |
| dc:description | R | Best practice is to use an English language abstract. | O |
| dc:subject | R | Recommended best practice is to select a value from a controlled vocabulary or formal classification scheme. E.g. LOC, MESH. | O |
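
As a rough sketch of how a consumer might check a record against the Inclusion column above, the following Python treats a record as a list of (element, value) pairs and reports any mandated element that is absent. The function name, the pair-based representation and every sample value are assumptions made for illustration; the draft does not prescribe a serialisation or a validation procedure.

```python
from collections import Counter

# Elements marked "M" in the Inclusion column of the draft table.
MANDATED = [
    "dc:title", "dc:creator", "dc:identifier", "dc:source", "dc:language",
    "rioxxterms.projectid", "dc:date", "rioxxterms.sponsor",
]


def check_record(pairs):
    """Return a list of problems for a record given as (element, value) pairs."""
    # Count only elements that actually carry a non-empty value.
    counts = Counter(name for name, value in pairs if value.strip())
    problems = [
        f"missing mandated element: {name}"
        for name in MANDATED
        if counts[name] == 0
    ]
    if counts["dc:date"] > 1:
        problems.append("dc:date: the draft asks for one date only")
    return problems


# Hypothetical record; none of these values refer to a real output.
record = [
    ("dc:title", "An example title:An example subtitle"),
    ("dc:creator", "Surname, Forename"),
    ("dc:identifier", "https://repository.example.ac.uk/id/eprint/123"),
    ("dc:source", "Journal of Examples"),
    ("dc:language", "eng"),
    ("rioxxterms.projectid", "EXAMPLE-PROJECT-1"),
    ("dc:date", "2012-09-01"),
    ("rioxxterms.sponsor", "Example Funder"),
]

print(check_record(record))
print(check_record([p for p in record if p[0] != "rioxxterms.sponsor"]))
```

The check deliberately ignores the Format column: whether a value is, say, a valid ISO 639-3 code is a separate question from whether the element is present at all.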

I would appreciate any comments people might have about the technical aspects of this.

10 thoughts on “RIOXX application profile – draft 1”

  1. Owen Stephens

    Some quick comments on a first reading:

    I wonder if terming this as an application profile for ‘institutional repositories’ is quite right – institutional repositories can cover many different kinds of content, and it feels like this is more in the area of ‘funded research published outputs’ – perhaps worth being specific about what type of materials this application profile is for?

    I found the two columns relating to Format slightly confusing. The Format column is probably conflating two (maybe three) different things. For example for dc.title we have “Free text. It is recommended to use the form: Title:Subtitle” and “R”. I’d suggest that this could be simplified to something like “Title:Subtitle” and “R” – as the fact it is “Free text” isn’t a recommendation?

    Another example is dc.date “One date using ISO 8601. Published date is the default and recommended interpretation.” listed as “M”. Here presumably it is the “One date using ISO 8601” that is the “Mandatory” bit, and the rest is a recommendation as to the nature of the date expressed.

    Finally on this, several fields are missing any value in the “Format M/R/O” column – I think these should have something (or you should drop the ‘optional’ classification for the format and have M/R or blank)

    dc.date should also state which version of ISO 8601 is to be used (assume 2004?), and I’d probably recommend stating whether this is using basic or extended as well.

    How does rioxxterms.sponsor compare to uketdterms:sponsor (is it possible to make statements about the equivalence or otherwise of these?) Same question about rioxxterms.projectid and uketdterms:grantnumber.

    dc.creator and dc.contributor allow the use of identifiers or textual representations, but other fields like dc.publisher or rioxxterms.sponsor seem that they could equally use identifiers or textual representations, but don’t mention allowing this. Also dc.creator says an identifier ‘from a recognised system’ – this probably needs to be explicit if it is to be enforceable.

    Is dc.audience useful?

    dc.source – why offer Journal Title or ISSN – surely better to use one or the other (or, as with dc.creator, say ‘identifier in following formats’)? Note that this format is listed as M – it is unclear that this is meaningful, as “reference” has no specific meaning.

    dc.subject format – says ‘Recommended ….’ but then states this is ‘optional’

    dc.coverage – 3 different types of coverage listed in format – I realise this reflects DC, but this is an application profile for RIOXX, so maybe worth considering if this could be more specific, and if there are specific types of coverage that could be split out into specific elements.

    Does there need to be an element to link the research to an institution? Is this covered by ‘sponsor’ (not how I read it – sponsor is putting up the key funding?) Or assumed will be linked from creator/contributor?

    Should there be some description of how dc.creator differs to dc.contributor?

    Sorry – a bit random. Hope some of it is useful

    1. paulwalk Post author

      Owen,
      many thanks for these very useful comments. I’ll work through them and respond in sequence:

      1. On the name of the application profile:
      I tend to agree with your comment. The slight problem is that this phase of RIOXX is much more tightly scoped to scholarly papers, where later phases are expected to encompass open access publicly funded research outputs more generally. There is ongoing work to create a workable vocab to describe this. However, your point is well taken and we will find a good way to describe this consistently.

      2. “Free text” isn’t a recommendation:
      Quite so! Will amend accordingly

      3. DC:Date and format versus recommended interpretation.
      I agree this could be clearer – it’s not wrong as such but it needs to be more precisely expressed. I’ll think about an alternative presentation

      4. Missing values in the M/R/O
      Yes – but as above, I need to think about this more generally.

      5. dc.date – version of ISO 8601 and basic/extended.
      I’m not sure I’m ready to agree with this – I need to do a little research. Two observations:
      * we have found different approaches in the metadata we have harvested in RepUK, and in this version of RIOXX we’re trying to be permissive where this is viable. Which brings me to the second observation:
      * I hypothesise that most software libraries can take any of these versions of ISO 8601 and derive a date value of sufficient usefulness for the great majority of use-cases. One for a bit of testing / discussion perhaps.

      6. Equivalence or otherwise of rioxxterms and uketdterms:
      Regarding the two rioxxterms introduced so far, I think that they can for practical purposes be considered to be equivalent to their respective uketdterms. However, I’m aware that the relationship between ‘grant’ and ‘project’ is not a simple one in every case – David Shotten convinced me of this a while back. We plan to do some of this modelling in the next phase. For now, I would prefer not to make strong assertions about the relationships here, but to continue to liaise with EThOS and not introduce unnecessary obstacles for interoperability.

      7. The use of identifiers or strings in dc.creator and dc.contributor etc:
      I completely agree with you – I will make this adjustment

      8. Is dc.audience useful?:
      Well, we asked ourselves that too! I just don’t know, possibly not. It’s interesting that the EThOS AP does cover this. We took the decision to mention it because it’s in OAI-DC and is supplied by some repositories. I have a suspicion that it could become useful in future, as the RCs track their investments more and more.

      9. dc.source – why offer Journal Title or ISSN:
      I need to refer this one to my co-author, Sheridan, as I think he had a reason for this recommendation which has temporarily eluded me (sorry, it’s Friday evening and all….). I agree that the ‘M’ is not clear here. Will revise. I should also say that we are preparing a ‘rationale’ document for all of these decisions to be made available with the AP for those who are interested.

      10. dc.subject format
      Will fix.

      11. dc.coverage
      This was pretty much copied from the OpenAIRE guidelines. I think we will revisit this though – your comment confirms my suspicion that this is not yet clear or defensible

      12. Does there need to be an element to link the research to an institution?
      Hmmm. It isn’t a simple relationship. Possibly through dc:contributor(s). Need to consider this – thanks.

      13. Should there be some description of how dc.creator differs to dc.contributor?
      Yes – good suggestion – will do this.

      Thanks again for taking the time – really useful feedback.

  2. PeteJ

    I guess it’s a bit of a personal bugbear of mine (and I see it in lots of other contexts too), but from the point of view of a data consumer, I’m not sure how useful it is to know that the dc:date value is probably the publication date, but might not be: all I can reliably conclude is that it’s some date associated in some way with the resource.

    Especially when aggregating data from sources using different “profiles” each with their own “(re-)interpretations” of dc:date, things quickly become very complicated (“From this dataset, dc:date is probably publication date; from that dataset, dc:date is creation date; from that dataset over there, there are dc:date triples for creation date, publication date, and last modified date”). I can’t do much except treat them as dates associated with the resource in some unspecified way. If you really want to expose a date as publication date and have other applications recognise it as such, then I’d suggest that making that explicit in your data by using a suitable more specific subproperty (like dcterms:issued) (either instead of or as well as dc:date) may be a better bet.

    Re “sponsor”, FWIW there is an existing sponsor property in the Library of Congress MARC Relator vocabulary, with a stable URI of http://id.loc.gov/vocabulary/relators/spn and with descriptions made available via that URI using modern Web conventions. Its definition is “Use for a person or organization that issued a contract or under the auspices of which a work has been written, printed, published, etc.”, which sounds a reasonable fit for your purposes?

    1. paulwalk Post author

      Pete,
      many thanks for the useful feedback. Responding in sequence:
      1. dc:date and pinning this down more precisely.
      I think I’m persuaded by your general point – that this is seriously undermined in its potential usefulness if it is ambiguous in meaning – and that we probably can take the opportunity to be assertive and state clearly what this date means in this context.
      I’ll go back to the dcterms resource and check dcterms:issued as a possible candidate replacement for dc:date. I think I tend to prefer the ‘instead of’ option rather than ‘as well as’.

      2. The sponsor property in the LoC MARC Relator vocab sounds very interesting – new to me – I will definitely take a look.

      Thanks again for taking the time – really useful suggestions.

  3. Emanuil Tolev

    Just a couple of comments:

    1. dc:creator – (just wondering) besides being explicit about “recognised systems” as commented above, what kind of context information would you expect? (e.g. :? ) I suppose I could rephrase this: which systems were you thinking about when you wrote this? I’m just trying to imagine how I would “consume” this data, that’s all.

    Also, I’ve seen dc:creator containing affiliations as well (“Emanuil Tolev (Aberystwyth University)”) although that was OER-s. I’d say you want to avoid this as it tends to then NOT be present in dc:contributor or dc:publisher, and then you’ve got no way to reliably facet on organisation names, and that is very often something people want to do (“all resources by my institution…”). The problem is, I don’t know how to avoid this except explicitly recommend that people put their affiliation elsewhere.

    2. dc:format – is the mimetypes list actually being used as a vocab.? (dc:format is recommended as a field, but if there is any content in it, then all values *must* be mimetypes – or is using a mimetype ALSO just a recommendation?)

    3. Glad to see that dc:date will be an ISO date. It’s not that I mind “Autumn 2009” that much, but it’d be nice to have something that a machine can categorise and visualise easily.

    4. dc:type – DC has something to say about this in terms of vocab.:
    “Recommended best practice is to use a controlled vocabulary such as the DCMI Type Vocabulary [DCMITYPE]” ( http://dublincore.org/documents/dcmi-terms/#elements-type ). Worth looking into making this the recommendation even though you’re on phase 1? (I’m not familiar enough with DC – could this actually be a mandated vocab. or do you generally want to avoid being strict with vocabularies?)

    5. Are people going to be encouraged/discouraged/forbidden from using specialisations of the fields e.g. “dc.contributor.author”?

    Apologies if this sounds a bit random – I am still getting my feet wet in this area, but expect that I’ll have to deal with metadata which adheres to this spec. you’re working on, and am interested from a “how easy is it going to be to ingest, visualise and otherwise automatically process” PoV.

    Finally, it might be worth mentioning that RIOXX stands for “Repository Interoperability Opportunities Extended” (I presume?), I couldn’t even find it on http://www.jisc.ac.uk/whatwedo/programmes/di_researchmanagement/repositories/rioxx.aspx :) .

    1. Emanuil Tolev

      My example for #1 got parsed, forgot I couldn’t do angle brackets here.

      So, what context info are you expecting for “recognised system” info in dc:creator – sth. like {recognised_system}:{identifier_within_system}?

  4. paulwalk Post author

    Emanuil,
    thanks for the very useful comments. Taking each point in turn:

    1. Really, we are trying to anticipate systems such as ORCID becoming used more comprehensively. In the case of ORCID specifically, a dc:creator’s ID should be expressed as an HTTP URI which of course gives the benefit of expressing the system (HTTP), the ‘service’ (orcid.org) and the ID in one easy string. It may not be the case that all possible identifier schemes for creators will allow this, so we cannot be too prescriptive here, but we can, and will, give guidance. For example, using the ORCID example, we should guide people to express the ID as:

    http://orcid.org/0000-0003-1541-5631

    rather than as simply:

    0000-0003-1541-5631

    2. I’m still a little undecided on this. I would like to mandate a mime-type, as a controlled vocab, however it is probably not going to be mandated in this phase. Too many repos do not use this yet and there would need to be some tooling first. However, we are about to kick off a project to help decide vocabs across the community and this will be considered there – my prediction is that this will be mandated in the next version.

    3. Indeed!

    4. This is a good point. There is already work going on in the Research Councils to characterise the ‘types’ of outputs being generated through public funding and this is, inevitably, going to become a richly expressive vocab. I think it may already be too late to usefully apply a simple vocab at this point. However, I suppose we ought to provide some suggestions in the meantime, and the DC vocab has the considerable virtue of simplicity. I need to think about this further – thanks for pointing it up.

    5. I think we are discouraging further qualifications of the elements in this phase, however perhaps these can be allowed so long as the unqualified element is also provided. E.g., you can qualify dc:creator.author so long as you also provide dc:creator directly for the parsers which ignore the qualifier. It isn’t something we would encourage at this point – one of the principles behind this work is for it to be very simple and rapidly implementable.

    As for RIOXX – it no longer stands for anything really – it’s just a unique identifier ;-)

    Thanks again!

    Paul

  5. Ben Ryan

    re. discussion above on whether ‘audience’ is useful, the answer from a Research Council perspective is (as Paul suspected) ‘yes’.

    It might/might not be helpful to know that the Research Outcomes System used by AHRC, BBSRC, EPSRC, ESRC (and from 2013 NERC) tracks target audience in two ways:

    1. For Publications, there is an optional question: ‘Non-academic audience?’ with options ‘Yes’ and ‘No’

    2. For Dissemination/communication activities, there is the option to select from a drop-down list from which multiple selections can be made:
    Schools/students
    Schools/teachers
    Potential/actual post-grad students
    Participants in research
    Media
    Policymakers/parliamentarians
    Industry
    Public
    Other
    and there is also an optional question ‘International audience?’ with options ‘Yes’ and ‘No’

