Call for feedback to the ResourceSync specification for synchronisation of web resources

I have been slightly involved (through Jisc funding) with the ResourceSync specification project, being led by Herbert Van de Sompel of the Los Alamos National Laboratory. The project has just released a draft specification, which is available at http://www.openarchives.org/rs/.

The draft will be available for public comment until March 15th 2013 – you are invited to comment via the ResourceSync Google Group. Group discussions are openly accessible; posting requires group membership.

In Herbert’s words:

The ResourceSync specification describes a synchronisation framework for the web that consists of various capabilities that allow third party systems to remain synchronised with a server’s evolving resources. The capabilities may be combined in a modular manner to meet local or community requirements. The specification also describes how a server can advertise the synchronisation capabilities it supports and how third party systems can discover this information. The document formats used in the synchronisation framework are based on the widely adopted Sitemap protocol.

ResourceSync is a collaboration between the National Information Standardization Organization (NISO) and the Open Archives Initiative (OAI). It is funded by the Alfred P. Sloan Foundation and Jisc.
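
As a rough illustration of why building on the Sitemap protocol is attractive, the sketch below reads a plain Sitemap document and lists the resources modified since the last synchronisation. It uses only the standard Sitemap elements (urlset, url, loc, lastmod); the example.com URLs and the function names are assumptions made for this example, and nothing here is taken from the ResourceSync draft itself.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Namespace of the standard Sitemap protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample document; the URLs are placeholders.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2013-01-03T18:00:00Z</lastmod>
  </url>
</urlset>"""


def parse_lastmod(text):
    """Parse a W3C datetime such as 2013-01-02T13:00:00Z."""
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def changed_since(sitemap_xml, last_sync):
    """Return the URLs whose lastmod is later than last_sync."""
    root = ET.fromstring(sitemap_xml)
    changed = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        if loc and lastmod and parse_lastmod(lastmod) > last_sync:
            changed.append(loc)
    return changed


if __name__ == "__main__":
    last_sync = datetime(2013, 1, 3, tzinfo=timezone.utc)
    print(changed_since(SAMPLE, last_sync))
```

A third party system that remembers the time of its last visit could fetch only the resources returned by changed_since, rather than re-harvesting everything.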

RIOXX application profile – draft 1

Together with Sheridan Brown, I have been tasked with developing some guidelines and a metadata ‘application’ profile for institutional repositories (IRs) in the UK. We are calling this work RIOXX. This post focusses on the application profile more than the guidelines, and describes phase 1 of the project, which aims to deploy this application profile across IRs in the UK by the first quarter of 2013.

Objectives

  • to develop an application profile which enables open access repositories to expose metadata more consistently and which, in particular, conveys information about how the item being described in the metadata was funded
  • to develop general guidelines for repositories which support the use of the application profile
  • to support such technical development as is necessary to implement these recommendations and the application profile in common repository platforms
  • to develop these such that they pave the way for a likely CERIF-based solution in the medium-long term.

Scope and approach

Funder policy regarding Open Access (OA) is being actively developed and the OA landscape is shifting. The emphasis in this phase of RIOXX is to do something which is adequate and able to be quickly implemented. This work will provide an application profile and guidelines which are inherently an interim solution. Broadly speaking, the approach we are taking is as follows:

Develop the simplest possible application profile, based on Dublin Core (DC).

Pretty much all repositories support DC, as another application profile of DC, OAI-DC, is a mandated minimum metadata format for the ubiquitous protocol for harvesting metadata from repositories (OAI-PMH). If all goes well, the development work needed for repository systems should be minimised.
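
To make the OAI-PMH point concrete, here is a minimal sketch of how a harvester might build a request for OAI-DC records. The verb and metadataPrefix parameters are part of OAI-PMH; the repository base URL and the helper name are assumptions for illustration only.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build an OAI-PMH ListRecords request URL.

    base_url is the repository's OAI-PMH endpoint (hypothetical here).
    from_date, if given, restricts the harvest to records changed since then.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Placeholder endpoint, not a real repository.
    print(list_records_url("http://repository.example.ac.uk/oai", from_date="2013-01-01"))
```

A repository exposing RIOXX records would presumably do so under an additional metadataPrefix alongside oai_dc, but the name of that prefix is not settled here.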

Consider other, related guidelines

We have examined two related initiatives: the OpenAIRE guidelines (and the Driver guidelines which preceded these), and the EThOS Toolkit which developed an application profile of DC for eTheses.

Consider a CERIF-XML expression of this application profile

The interest in CERIF as the de facto standard format for exchanging this kind of information between systems is growing steadily. We are liaising with the CERIF Support Project and ensuring that a transition towards a CERIF-based approach remains viable.

Develop a modelled, expressive application profile

In later phases of RIOXX, we hope to develop the application profile more fully. This will take into account such things as:

  • greater use of controlled vocabularies
  • a move away from DC and towards CERIF
  • greater involvement of systems other than repositories – notably Current Research Information Systems (CRIS)
  • modelling of ‘access-level semantics’ – i.e. describing how, where and under what license or conditions the resource might be accessed and used

Rationale for some decisions in phase 1

Keeping things very simple

Timescales are very, very tight. From a pragmatic, technical point of view we have restricted ourselves in this phase to developing an approach which allows the repository to emit RIOXX records based on information properties already catered for in the repository system (that is, the placeholders for Sponsor and ProjectID already being there, even if the actual data has not yet been entered). We have deferred a more complete and complex approach to a later phase because the capacity to deliver this kind of information from institutional systems is developing rapidly.

The ProjectID property

We found ourselves unable to simply adopt the OpenAIRE guidelines as these mandate a particular syntax for the ProjectID (designed for EC funded projects) which would preclude certain UK funders. In any case, we consider it to be a mistake to embed semantics into this property and believe it is best provided as a globally-unique, opaque identifier. To this end, we are actively looking at the possibility of funders minting DOIs for the ProjectID. In the meantime, we will be requiring that the ProjectID be whatever identifier is provided by the funder of the output being described in the record.
We have chosen the term ProjectID rather than, for example, GrantID, as we have been advised that the former is the more widely used term in the UK.

The Sponsor property

For phase 1 we are mandating this property, but specifying only that a recognised form of identifier for the funder/sponsor be used. This will mean a free-text string for now. We are actively exploring possibilities for identifying and then mandating a particular authority list of funder names, such that this property becomes underpinned by a controlled vocabulary. However, this will not make it into phase 1.
This property, while essential in the short term, might become more of a convenience than a necessity, as the ProjectID becomes more reliably ‘actionable’. In the medium-term, we would anticipate being able to reliably derive the sponsor/funder from the ProjectID. For this reason, we have not modelled the relationship between these two properties closely – except insofar as they exist in a particular record. This means that some records may contain more than one Sponsor and more than one ProjectID with no direct way to relate a given ProjectID to a given Sponsor. While it would be possible to model this relationship, we have chosen not to do so in this phase, because:

  • it is not the common case that a record would have more than one Sponsor
  • it is more likely that a record might have more than one ProjectID, but only one Sponsor. This happens where a project has multiple versions – such as when the PI moves institution during the project.
  • it is unlikely that current repository systems will be able to provide more richly modelled relationships between these properties without further development
  • it is the common case that a record will have one Sponsor and one ProjectID.

We anticipate that this will need to be modelled more thoroughly in future phases.
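
The consequence of not modelling the relationship can be sketched as a simple data structure. This is only an illustration of the phase 1 position described above: the class, field names and identifiers are hypothetical, not part of the profile.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RioxxRecord:
    """A phase 1 record: sponsors and project IDs are two flat, unlinked lists."""
    title: str
    sponsors: List[str] = field(default_factory=list)
    project_ids: List[str] = field(default_factory=list)

    def funding_is_unambiguous(self) -> bool:
        # With exactly one sponsor, every ProjectID can be attributed to it.
        # With several sponsors there is no way to tell which issued which ID.
        return len(self.sponsors) == 1


if __name__ == "__main__":
    # Made-up identifiers: one funder, two IDs (e.g. the PI moved institution).
    common = RioxxRecord("A paper", ["Funder A"], ["ID-1", "ID-2"])
    # Two funders, two IDs: the pairing cannot be recovered from the record.
    rare = RioxxRecord("Another paper", ["Funder A", "Funder B"], ["ID-1", "ID-9"])
    print(common.funding_is_unambiguous(), rare.funding_is_unambiguous())
```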

Deferring the ‘access-level-semantics’ question

In order to convey the precise nature of the open-access ‘state’ of a resource, RIOXX will need to develop a richer way of describing such concepts as ‘green’ or ‘gold’ open access, embargoes, licenses etc. The use-cases and operations which will depend on such information are not yet clear and, while the time has now come to model these, this should not be done in a hurry.

The following is a table of proposed elements and recommended formats. We propose to extend the Dublin Core elements with two new elements under the rioxxterms namespace.

  • M: Mandated
  • R: Recommended
  • O: Optional

Element | Inclusion (M/R/O) | Format | Format (M/R/O)
--- | --- | --- | ---
dc:title | M | Free text. It is recommended to use the form: Title:Subtitle | R
dc:creator | M | Free text. Recommended practice is to either use the form Last Name, First Name(s) or a unique identifier from a recognised system. Each creator should be given a separate dc:creator element | R
dc:identifier | M | A globally unique identifier. It is strongly recommended to use a URI which can be de-referenced (i.e. is ‘actionable’) where this is appropriate | R
dc:source | M | Journal title, reference or ISSN | M
dc:language | M | Use ISO 639-3 language codes | M
rioxxterms.projectid | M | Use the identifier provided by the funder to indicate the project within which this output has been created | M
dc:coverage | O | The extent or scope of the content of the resource. Coverage will typically include spatial location (a place name or geographic co-ordinates), temporal period (a period label, date or date range) or jurisdiction (such as a named administrative entity). |
dc:rights | O | No agreed vocabulary or semantics exist for this in the context of Open Access papers, and it is common practice for this to be ignored by repositories currently. Some work is being funded to look at this area for the next phase of RIOXX. For now, this element has to be optional. |
dc:audience | O | Free text. |
dc:format | R | It is recommended to use the IANA registered list of Internet Media Types (MIME types) | M
dc:date | M | One date using ISO 8601. Published date is the default and recommended interpretation. | M
dc:type | O | This is currently free text and an optional element. However, RIOXX phase 1 will be recommending that a vocabulary be adopted or developed for this element. | O
dc:contributor | O | (as for dc:creator) |
rioxxterms.sponsor | M | Free text – Funder name using the funder’s preferred format | O
dc:publisher | R | Free text indicating the name of the publisher (commercial or non-commercial) | O
dc:description | R | Best practice is to use an English language abstract. | O
dc:subject | R | Recommended best practice is to select a value from a controlled vocabulary or formal classification scheme. E.g. LOC, MESH. | O
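
As a sketch of how the table might be applied in practice, the following checks a record for the elements marked M in the inclusion column and for a single, well-formed dc:date. The record representation (a dictionary mapping element names to lists of values), the function name and the sample values are assumptions for illustration. The date check accepts full calendar dates such as 2013-03-15, which is narrower than everything ISO 8601 allows.

```python
from datetime import date

# Elements marked M (mandated) in the inclusion column of the table above.
MANDATED = (
    "dc:title",
    "dc:creator",
    "dc:identifier",
    "dc:source",
    "dc:language",
    "rioxxterms.projectid",
    "dc:date",
    "rioxxterms.sponsor",
)


def validate(record):
    """Return a list of problems found in a record (empty if none).

    record is assumed to map element names to lists of string values.
    """
    problems = []
    for element in MANDATED:
        if not record.get(element):
            problems.append(f"missing mandated element: {element}")

    dates = record.get("dc:date", [])
    if len(dates) > 1:
        problems.append("dc:date: expected exactly one date")
    for value in dates:
        try:
            date.fromisoformat(value)  # assumption: full calendar dates only
        except ValueError:
            problems.append(f"dc:date: not an ISO 8601 calendar date: {value}")
    return problems


if __name__ == "__main__":
    # Entirely made-up sample values.
    sample = {
        "dc:title": ["An example title: with a subtitle"],
        "dc:creator": ["Surname, Forename"],
        "dc:identifier": ["http://repository.example.ac.uk/id/1"],
        "dc:source": ["Journal of Examples"],
        "dc:language": ["eng"],
        "rioxxterms.projectid": ["FUNDER-PROJECT-1"],
        "dc:date": ["2013-03-15"],
        "rioxxterms.sponsor": ["Example Funder"],
    }
    print(validate(sample))
```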

I would appreciate any comments people might have about the technical aspects of this.

Web preservation – a minor anecdote

I have recently resurrected a domain I used to use actively – sockdrawer.org. I started blogging on this site in about 2003 and stopped using it in about 2007. I only started using it again this week because I needed a free domain name and I discovered I was still paying for this one….

Anyway, having installed a new web-server which listens on www.sockdrawer.org, I had cause to examine the server logs. I was surprised to find this line:

[Sun May 20 09:06:21 2012] [error] File does not exist: /opt/web/sockdrawer.org/public/blog, referer: http://www.jroller.com/rickard/entry/word_to_html_in_java

As my server has only been up for 24 hours after a 5-year hiatus, this is immediate evidence of some interest, however limited, in some of the content that was once here. And I don’t have an archive :-(

I have actually found this missing resource – a blog-post – archived on the Internet Archive, and there are a few other resources there from the same blog/website (the majority are gone for good I imagine). I could, if I thought it worth the effort, rebuild the original resources from the Internet Archive’s versions.

All of this has, at least, made me think a little about web preservation at a personal level.

Library systems of the future

Edit: The presentation I gave to accompany this post is available on Slideshare

I was asked by Ben Showers of the JISC to write a ‘challenging and provocative vision’ for library management systems, for a joint JISC / SCONUL workshop. I was given a free hand with this – the only parameters were that the piece should be no more than a side of A4 paper in length, and that it should use 2020 as its target year for prediction. I think I ignored both of these restrictions, but I had fun and it did provoke some discussion….

Dramatis personae:

  • Alby, a young student & researcher in full time employment
  • Charlotte, a venerable librarian
  • Bob, Dan and Eva, semi-autonomous software agents

Following the unprecedented Conservative ‘walk-over’ election victory of 2015 and the subsequent consolidation in 2019, the landscape of higher-education in the UK is all but unrecognisable. The free market dominates the buying and selling of courses, and the provisioning of learning and research resources has, in the end, simply had to follow suit. Copyright has mostly been ‘fixed’ in the virtual world through a combination of an adjustment to more modest expectations of compensation for copyright holders, workable systems to control distribution, and global agreements allowing extradition and prosecution.

The student researcher (1)

Alby works, full time, as a software engineer. As part of his job, he is given some time to pursue research topics of interest to him and to his employer. His firm gives him a small budget to support this. In the evenings he studies part-time for the new Masters++ qualification. He is enrolled at three universities, visiting one of these – the local George Osbourne University (GOU) – every Thursday evening. He finances all of this himself.

On Monday evening, when Alby gets home, he goes straight to his laptop and works through all the notes he has dictated into his smart phone during the day. He has become interested in the evolution of library systems and wants to register this interest on the Research Interest Grid (RIG). While recording notes into his phone, he has also published some of these into StreamingConscious, the latest social network to become popular with researchers, and has gained a few new connections from people with aligned interests, including a promising one with a subject librarian at GOU.

Alby then invokes his Foraging Agent, ‘Bob’. A license for Bob was given to him by a publisher, Coyote, which specialises in resources for software engineers, in return for sending him a steady stream of advertisements. Alby adopted Bob because he liked its interface, but he suspects it has in-built biases towards certain, commercial information sources. He believes that he compensates for this by carefully defining his research questions in Research Question Format (RQF) and filtering the results.

Bob runs constantly on Alby’s ‘slice’ – a portion of Personal Cloud (PC) infrastructure provided by a well-known supermarket chain. After a series of questions and answers, Bob is armed with three carefully defined RQF research questions, and a set of parameters, such as when to report back, and how much of Alby’s research budget to spend on a single transaction before asking him for approval. Bob has learned through observation how Alby likes to work. It knows him, in a sense, well enough to represent his interests when dealing with other agents. Alby then instructs Bob to begin searching, negotiating and shopping for answers, leads and recommendations, while he gets on with some reading. Alby has grown to trust Bob.

The Librarian

Charlotte is a subject librarian with many years’ experience (she tried to retire 3 years ago but has been forced to come back to work), specialising in software & systems engineering, and currently working for George Osbourne University. On Tuesday morning she checks the reports from her Listening Agents over breakfast. She controls several agents running on the library’s slice of the GOU cloud.

Bob, an agent representing someone called Alby, has made contact, coincidentally, with two of her agents – one which represents GOU and which reports to her, and her own personal agent, Eva. Only yesterday, BirdSong (a social network monitoring agent) had suggested that she connect with @alby on StreamingConscious based on their mutual interest in the history of LMS systems. Charlotte’s interest in LMS systems is partly fuelled by nostalgia – she has been working with such systems for more than thirty years.

She sees that Dan, the GOU agent, has supplied Bob with material to which Alby is automatically entitled, and has automatically reserved two books from the local GOU collection for him. In so doing, Dan recommends to Charlotte the purchase of a newer edition of one of these textbooks.

Dan has also made a number of offers to Bob of more restricted material which can be supplied at a cost, including 3 inter-library-loans. Bob has accepted one of these paid-for items on Alby’s behalf and Charlotte is happy to see that it has also observed the protocol of explaining why it has not accepted the others. In one case, she sees that Bob was successful in bidding on eBay for a second-hand copy of a book which Dan had offered as an ILL. Bob has also made an offer to Dan for ownership of the book, once Alby has finished with it, in return for one free ILL. Dan needs Charlotte to approve this. However, she declines, knowing the book to be flawed, despite its 4 star popularity rating. Dan registers this decision, quietly blacklists the book against any future recommendation, and reports this decision to Bob.

Dan notes that Bob has also registered a second book on Alby’s personal virtual book-shelf and indicated a willingness to make this available to the GOU circulation agent for loan to other GOU students as part of the ‘Support Your Library’ protocol, in return for one free ILL token. Charlotte accepts this offer.

Charlotte instructs Dan to negotiate with Bob to arrange a meeting over coffee for Alby and herself. She does this partly because Eva has separately registered Alby’s interest on the RIG and it seems worthwhile meeting with Alby in person to discuss his research. She decides to investigate a couple of other suggestions thrown up by Dan in the meantime. She also notes that Dan has suggested a couple of other contacts to Bob – other people who are enrolled at GOU and whom Alby may wish to befriend on StreamingConscious – as part of a strategy to reinforce the local GOU social network of students and researchers.

The student researcher (2)

Later on Tuesday morning, Alby wakes to find an interesting report from Bob waiting for him. He discovers he is the proud owner of a new book on LMS system design and is pleased to note that it has a four star rating – one star above the threshold he has set in Bob’s book-buying decision parameters.

Bob has, inevitably, also turned up a few offers of information and resources from the ‘invisible market’. Alby knows that if you have the right connections, you can get just about any book in ePub5 format. The penalties for possession of an illegally obtained, copyrighted resource are stiff, however. Although it is not illegal, he is also a little wary of using Turpin, the global federation of Open Access papers and other resources, as he has been culturally conditioned to be suspicious of things which appear to be ‘free’.

He also finds a tentative appointment in his diary for coffee with @charlotte, the subject librarian with whom he connected yesterday on StreamingConscious. As he works close by the university, he accepts the appointment. He can pick up his reservations while he’s there.

Face to face, later that morning

Alby finally puts his pen down, and takes a swig of his coffee. He has been writing furiously for half an hour. Charlotte has just taken him on a whirlwind tour of the evolution of the LMS.

She has described how the library has learned, over the last decade, that client relationship management (CRM) is crucial to its mission. Adjusting to the new realities of social networking and global search, the LMS has become a distributed and loosely-coupled collection of processes, all designed to help connect people with resources and with each other.

Alby learns how the rapid introduction of semi-autonomous software agents into research practice took many by surprise. Although the concepts were not new, and much of the technology existed in one form or another, it took the confluence of a number of factors to finally introduce agent-mediated research:

  • the cultural acceptance of being ‘always online’, brought about through the ubiquity of smart phones, the prevalence of global social networks and the move from the desktop to cloud-based processes
  • the utter complexity of negotiating through ‘permission stacks’ to determine whether or not an individual has the rights to access a given resource in a given context
  • the complexity of relationships between individuals and institutions

Charlotte explains how, from having been a destination for local researchers, the LMS has dissolved into the fabric of a vast, distributed network of research interests, library collections, national, private and open resources.

While the curation of local collections remains important, the facilitation of networking and the handling of transactions, both social and financial, have taken over as the focus of the LMS. She points out that where once it was quite easy to point to the LMS – at least as a line in a budget sheet – it has become somewhat nebulous in recent years. The LMS has become the coffee-shop of cyberspace, where software agents meet to compare notes, register interests, make deals….

Taking a sip of her peppermint tea, Charlotte sighs as she remembers how simple it all once was.

Responsive innovation – change management in a recession

Back in August I gave a short presentation to the JISC Innovation Group about the DevCSI project, introducing some ideas about possible future directions. The DevCSI project is a JISC-funded initiative designed to work directly with (software) developers in Higher Education through the general approach of encouraging them to establish a community of peers, sharing knowledge, experience, code etc. An aspect of this which has emerged during the first year of the project is the potential value in peer-training – where one developer trains a few of their peers. By supporting this kind of activity as an ‘add-on’ to larger events, we seem to have hit on a way to deliver extremely cost-effective training to (and, importantly, by) the sector’s developers (we’ve done some work to calculate the financial value of this). DevCSI, then, provides a channel through which the sector, represented by JISC and UKOLN, can invest in its developers.

In recent years, JISC has invested in some development programmes based around an approach labelled Rapid Innovation. Rapid Innovation, in this context, described an approach of investment in small, short, cheap development projects designed to ‘scratch an itch’. There was more than an echo of the Agile Manifesto in this approach. The Rapid Innovation projects tended to show the following characteristics:

  • they brought developers more to the fore
  • they produced lighter, more frequent documentation
  • they produced working code very early in the process
  • they involved end-users directly, and throughout the project

The early work of DevCSI has been informed by this work – notably in the increased awareness and adoption of agile development methodologies.

So why is this important?

The radical changes currently being introduced to the economic and political landscape around higher education in the UK are forcing universities and colleges to re-examine themselves as ‘businesses’. With the growing interest in commodified hardware and software and remote software as a service (SaaS) options for service delivery, HEIs need to examine how they can best exploit these opportunities. (The JISC’s Flexible Service Delivery Programme has been established to help institutions with this.) While HEIs will have differing levels and types of interest in what are being referred to as cloud services, they are generally going to be searching for efficiency-based savings.

The value proposition of financial cost-reduction from using shared services is something which cannot be ignored by HEIs – but it seems to me that there are some things which need to be borne in mind:

  • the biggest saving in cloud-based services is to the supplier, not the customer (although the supplier will pass on some of this saving)
  • this whole approach is not yet well understood – especially how SaaS sits with an ‘enterprise’ service oriented architecture (SOA) approach which is also of interest to some HEIs
  • some services can be outsourced more easily, or to greater benefit, than others

In The role of the central IT Services organisation in a Web 2.0 world, Joe Nicholls and David I Harrison introduce the useful characterisation of services being either chore or core. Making use of SaaS is a form of outsourcing, and outsourcing is a tricky thing to get right. There are arguments for outsourcing those things you have to do but have no special interest in (e.g. HEIs frequently outsource their catering operation). In the ICT service context such services might include the various administration systems which all HEIs need to operate (e.g. finance). These we might call chore services. However, another reason for outsourcing is a lack of capacity or expertise to deliver a service internally – whether or not that is the preferred option. Services which are core to the HEI’s business might fall into this category occasionally – even if this is not ideal. In a recession, with drastically reduced funding, HEIs might see more core services become unsustainable – or indeed need to reconsider what is core in the first place. Normally, business decisions of this sort are not so simply binary, and some complex judgement will need to be made.

Inevitably, the growing opportunity for outsourcing ICT services will be appealing to many HEIs – whether those services are outsourced to generic or specialist commercial suppliers, or to HE-sector-based consortia such as the Kuali Foundation. But outsourcing can introduce hidden costs. A lessening of control is one obvious concern. But a more insidious risk introduced by an enthusiastic embracing of outsourcing services is a temptation to start to regard the maintenance of local development expertise as a luxury. After all, if we’re going to outsource our ICT, why do we need to retain technical staff and, especially, developers? ICT is just a commodity, right?

Well, no. I think it is a mistake to lose sight of the advantages that come from a local capacity to perform and deal with technical innovation. A local or ‘in house’ development capacity is a valuable resource in the normal run of things. In a recession, it is vital.

The successful organisation will use a recession to examine its business and to change in order to be ready to fully exploit the economic recovery, when it comes. And large organisations are getting better at preparing themselves to be able to innovate internally or locally. Scott Anthony, who has worked with Clayton Christensen (who coined the expression “disruptive innovation”), lists some principles which inform an organisation’s ability to engage in innovation:

  • Put the customer, and their important, unsatisfied job-to-be-done at the center of the innovation equation
  • Embrace the power of simplicity, convenience, and affordability
  • Create organizational space for disruptive growth businesses
  • Consider innovation levers beyond features and functions
  • Become world class at testing, iterating and adjusting

(I’m not entirely enamoured of the ‘disruptive innovation’ label – as my colleague Brian Kelly pointed out at the recent CETIS Conference, the HEI sector is receiving plenty of ‘disruption’ right now from political forces – certainly enough to encourage innovation!)

In Whither Innovation, Adam Cooper of CETIS asks: “Could we leave innovation to the commercial sector and buy it in?”. Answering his own question, he quotes Cohen and Levinthal (1990) who introduce the term absorptive capacity, describing:

…a model of firm investment in research and development (R&D), in which R&D contributes to a firm’s absorptive capacity….

I see a direct parallel between outsourcing too much, and losing the absorptive capacity necessary to respond to change and to innovate to meet new challenges. In my talk to the JISC Innovation Group, I presented this diagram:

change_management.jpg

This diagram tries to express the role of the local developer to act as an agent enabling and supporting change in an HEI. The developer deals with the remote, outsourced ICT system at a technical level, becoming one route through which the HEI ensures it gets the best possible value out of this arrangement. Remote services are, nowadays, guaranteed to offer some sort of application programming interface (API) which allows the more technically capable customer to tailor the service to their needs, rather than simply being obliged to use an undifferentiated, default user-interface for example. Local developers are increasingly networked with their peers in other HEIs (not least because of the efforts of the DevCSI project), so they become quite powerful in being able to exploit commonly used remote services through the free sharing of knowledge, technique and even code. And because local developers are, in some cases, embracing a more agile approach to development, they become the conduit through which the end-user expresses their needs to make the remote, shared service better fit their local, idiosyncratic needs. Developers can become surprisingly aware of ‘business’ processes and information flows through an HEI, as they have to deal with them at several levels (I wrote about experiences of this sort in a previous post, SOA and reusable knowledge).

I see an opportunity for the DevCSI project to focus its efforts on this aspect of change within our HEIs. Change management is going to be crucial for HEIs as they redefine what is core and what is chore, as they decide what they can do best, and what can be best done for them by others. They are going to need a capable, knowledgeable and above all agile capacity to innovate to meet new business challenges and a changed ICT environment.

I’ve taken to using the label responsive innovation to describe the act of dealing with or instigating technical change in a manner which advances the core mission of the institution. Developers are not the only part of the solution, but they are a vital part. Not only do HEIs need to hang on to their best developers, they need to invest in them, if they are to manage change and not be managed by the changes being imposed on them.

Developers are core.

Institutions and the Web done better

Introduction – (warning – old-timer indulgence)

From the mid-nineties through to the end of 2006 I earned my living as a developer of Web applications, or as someone managing Web application development projects. I like to think I was quite good at it, and I certainly have a lot of experience. I worked with CGI writing in Perl and a little C, moving into ColdFusion and Java (via JServ – anyone remember that?), did the whole Java EE thing, undid it again, did SOAP because it was better than J2EE, undid that when we realised it actually wasn’t…. In about 2002/3 I adopted a RESTful approach to building Web-based intranet applications – and some of those applications are, I believe, still being used. The idea that Web applications should be designed such that the functions flowed around the resources being manipulated, rather than the resources being moved about to enrich the functions, made absolute sense to me. I have not deviated from this general approach since then. In 2006, just before I joined UKOLN, I came across Tom Coates’s ‘Native to a Web of Data’ presentation.

A Web of data

One slide in Tom’s presentation really appealed:

native_02.jpg

Very recently I had cause to revisit this, and I began to wonder how this stacked up against current thinking. Over the last couple of years there has been a push to get Linked Data accepted by the mainstream, and there have been arguments over the extent to which this does, or does not, represent a tactic in advancing a Semantic Web agenda. I remain very skeptical about the likelihood of us realising a ‘semantic’ Web through the application of more and more structure, metadata, ontologies etc., and the aspiration toward a ‘giant global graph’ of data interests me little. However, even leading figures in the Semantic Web can be pragmatic – Tim Berners-Lee’s ‘5 Stars of Open Linked Data’ as reported by Ed Summers are somewhat less ambitious than the nine instructions of Tom Coates’s Native to a Web of Data.

I’m also jaded by the notion of the ‘(Semantic) Web Done Right’. The ‘Web done right’ is… the Web we have. That’s the beauty of the Web – it works where many distributed information systems have not, by taking a ‘good enough’ implementation of a really good idea and running with it at a massive scale. But, as ever, there is room for improvement – we can, and certainly should, aspire to a Web ‘done better’.

From documents to data

The Web to date has been largely oriented towards humans manipulating documents through the use of simple desktop tools. Until relatively recently, this was mostly a read-only experience. However, it has been clear for some time that, when content is made available in some sort of machine-readable form, it lends itself to being re-used, especially through being combined with other machine-readable content. This echoes the experience of the document-oriented Web, where it soon became apparent that there was much value to be added by bringing documents together through linking. The data-oriented Web takes this a step further: the linking is still very important, but with machine-readable content in the form of data, the possibility exists to process the content remotely, after it has been published, to merge/change/enhance/annotate/re-format it. Recent years have shown how the Web can function as a platform for building distributed systems through the rise of the ‘mashup’ as an approach to building simple point-to-point services.

The institutional context

So, what does this mean for the Higher Education Institution (HEI)? The HEI tends to already have a large amount of Web content. An HEI of any size will also maintain significant databases of structured information. In more recent years, HEIs have adopted content management systems (CMS) of one sort or another, to manage loosely structured content. In some cases, such systems are also used to expose structured information from back-end databases. It is still rare, however, for a typical institutional Web Team, using a standard CMS, to pay much attention to the sort of instructions listed above. HEI Web Teams tend to work in terms of ‘information architectures’, which often follow organisational structures primarily. Their tools, processes, and expectations from senior management make this the sensible approach. However, this tends to mean that, periodically, the institutional Web site will be re-arranged to re-align with organisational changes. This way of building an institutional Web site is driven by the imperatives of the document-centric Web. It’s about trying to turn a large set of often very disparate documents into a coherent, manageable and navigable whole. The data-oriented Web demands a different approach. The following are a set of pointers to the shift in emphasis that is needed to allow HEIs to participate in the Web of data. These pointers are heavily influenced by Tom Coates’s instructions, but I have condensed and rearranged them and tried to put them into the context of the needs of an HEI.

How HEIs can engage with a Web of data

1. Recognise the potential value of the Aggregate Web of data and invest/engage accordingly

The cost of making data available on the Web is falling steadily, as technology and skills improve and the fixed costs of infrastructure are also reduced. The act of making useful data available on the Web does carry a cost, but it also introduces potential benefits. On a simple cost/benefit analysis, it is becoming apparent that we will soon need to justify not making data available on the Web. The ‘loss-leader’ approach, of making data available speculatively, hoping that someone else will find a use for it to mutual advantage, is one which becomes viable as the costs of doing so become vanishingly small. A lesson learned from open-source software, where the practice of exposing software source-code to ‘many eyes’ is proven to help in identifying and rectifying mistakes or ‘bugs’, is applicable to data too. As a general principle, exposing data which can be combined with data from elsewhere is a path to creating new partnerships and collaborative opportunities. ‘Useful data’ can range from the sorts of research outputs or teaching materials which might already be on the Web, to structured contact details for academics in an institution, to data about rooms, equipment and availability. As an example, some institutions already exploit one of their assets – meeting and teaching spaces – by renting them out to external users, especially during holiday periods. Making data about these assets openly available, in a rich and structured way, opens up possibilities for others to better exploit these assets, and for the HEI to share the benefits of this.

In addition to this, we are witnessing a wholesale cultural shift in the public sector towards opening up publicly-funded information and data to the public which paid for it to be produced. The political momentum behind this cannot be ignored and, while it is focussed on central government departments currently, this focus will inevitably widen to include HEIs.

2. Start designing with data, as well as with pages

The typical CMS is geared towards building Web pages. All modern CMSs allow content to be managed in ‘chunks’ smaller than a whole page, so that elements such as common headers and sidebars can be re-used across many pages. Nonetheless, the average CMS is ultimately designed to produce Web resources which we would recognise as ‘pages’. An HEI’s web team will continue to be concerned with the site in terms of pages for human consumption. However, simply by exposing the smaller chunks of information, in machine-readable ways, the CMS can become a platform for engaging with the Web of data. My colleague Thom Bunting describes such possibilities, having experimented with one popular CMS, Drupal, in Consuming and producing linked data in a content management system.
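As a minimal sketch of what exposing a smaller chunk in a machine-readable way might look like, the following Python renders the same chunk either as an HTML fragment for a page or as JSON for software. The `chunk` structure, its field names and the `render_chunk` function are assumptions made purely for illustration; they are not taken from Drupal or any other CMS.

```python
import json
from html import escape

# A hypothetical content 'chunk' as a CMS might store it (field names are illustrative).
chunk = {
    "id": "contact-library",
    "title": "Library contact details",
    "body": "Open 9am-5pm, Monday to Friday.",
}


def render_chunk(chunk, media_type="text/html"):
    """Render one chunk for a human (HTML) or for software (JSON)."""
    if media_type == "application/json":
        # sort_keys keeps the output stable for consumers
        return json.dumps(chunk, sort_keys=True)
    # Default: an HTML fragment that a page template could include
    return '<div id="{}"><h2>{}</h2><p>{}</p></div>'.format(
        escape(chunk["id"], quote=True),
        escape(chunk["title"]),
        escape(chunk["body"]),
    )


print(render_chunk(chunk))
print(render_chunk(chunk, "application/json"))
```

The point of the sketch is only that both representations come from one stored chunk, so the page and the data cannot drift apart.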

3. Develop websites for end-users, developers and software processes

This is a very important principle, and one which is frequently overlooked. Sites which are designed to allow humans to navigate pages are not necessarily accessible to software which might be able to re-process the information in new and useful ways. Widely adopted standard re-presentations of content, such as RSS feeds, have gone some way towards mitigating this. But the principle of designing for these different types of user up-front is one which is not yet widely accepted. Developers, especially, are not yet generally regarded as important users – yet for the Web of data to deliver value to data publishers we require developers to build new services which exploit that data. If you make your data available for re-use, it makes absolute sense to consider the needs of those developers you hope will try to exploit it.

A perceived problem with this is that it seems expensive to develop websites for different classes of user in this way. After all, the HEI’s web team will already be considering several different sub-classes of human user (students, staff, prospective students, alumni etc.). Human end-users will continue to be the priority audience. However, there are strategies for developing websites in such a way that developers and software are not ‘disenfranchised’. An approach which marries these concerns at the beginning, rather than a bolt-on approach of extra interfaces (APIs) for developers and systems, is preferable if a common ‘anti-pattern’ is to be avoided.

The meteoric rise in popularity of Twitter is in no small part due to its developer-friendly website. Twitter has a simple Application Programming Interface (API) which allows developers to build client applications which use the Twitter service but which add value in some way to end-users. This graph at ReadWriteWeb shows how applications built by third-party developers account for more than half of the usage of Twitter.

Some important principles which, if followed, will ensure a website is ‘friendly’ towards end-users, developers and software are set out in points 4, 5, and 6 below:

4. Identify the important entities and make them addressable, using readable, reliable and hackable URLs

This is crucial – it forms the most important foundation for the Web of data. A traditional, well-designed website will be based on some sort of understood ‘information architecture’, however simple. The idea of starting with important ‘entities’ and making sure that they have sensible, managed and reliable identifiers is a somewhat newer approach, yet this is vital for the Web of data to function. The Web of data is, at one level, entirely about identifiers and how they link together. The ability to create (‘mint’) new identifiers and manage them carefully, such that they are as usable as possible, is a capability which HEI Web teams will need to recognise as important. Identifiers for entities about which the HEI cares will become valuable in their own right in the Web of data. It is already understood that ownership of ‘domains’ in the Web address space can have value. In the world of business, Web domains frequently change ownership for large sums of money. In the UK the value of the HEI’s ‘.ac.uk’ domain is largely connected with reputation.

Breaking this down:

  • identify the important entities: e.g. courses, units, departments, staff, papers, rooms, learning objects, lectures…. etc.
  • make them addressable: give them URLs. For example, if it’s a course, mint a URL which points to a unique resource representing that course.
  • using readable URLs: make the URL intelligible to an end user. If it’s a URL pointing to a course, then a URL which has the word ‘course’ in it will help.
  • using reliable URLs: manage the URLs you mint, and ensure that they are persistent.
  • using hackable URLs: make the URLs predictable and consistent, such that a developer can figure out the logical structure of the URLs and the underlying information architecture. As with ‘readable URLs’ above, do not be cryptic in URLs if this can be avoided.
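The breakdown above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a recommended scheme: the domain `www.example.ac.uk`, the set of entity types, and the `mint_url` and `parent_url` helpers are all hypothetical.

```python
from urllib.parse import quote, urlsplit

BASE = "https://www.example.ac.uk"  # hypothetical institutional domain

# Entity types this sketch knows about, using readable path segments
ENTITY_TYPES = {"courses", "departments", "staff", "rooms"}


def mint_url(entity_type, identifier):
    """Return a readable URL for one entity, e.g. a single course."""
    if entity_type not in ENTITY_TYPES:
        raise ValueError("unknown entity type: " + entity_type)
    # Percent-encode the identifier so it is always a single path segment
    return "{}/{}/{}".format(BASE, entity_type, quote(identifier, safe=""))


def parent_url(url):
    """'Hack' a URL by trimming its last path segment, as a developer might by hand."""
    parts = urlsplit(url)
    trimmed = parts.path.rstrip("/").rsplit("/", 1)[0]
    return "{}://{}{}".format(parts.scheme, parts.netloc, trimmed)


course = mint_url("courses", "introduction-to-astronomy")
print(course)
print(parent_url(course))
```

A URL built this way has the word ‘courses’ in it, and trimming it yields another predictable URL. Reliability is not something code like this can demonstrate: it depends on the institution continuing to serve the URLs it has minted.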

5. Correlate with external identifier schemes

Don’t mint your own URLs for things which have been identified elsewhere. Linking to authoritative identifiers is what will create the critical mass in the Web of data – this is diluted every time someone mints a new URL to point to something already identified with a different URL. This aspect of re-using identifiers is explored in Jon Udell’s post The joy of webscale identifiers.
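One way to picture this, as a sketch only: when a record already carries an external identifier, link to that rather than to a locally minted URL. The record structure, the `canonical_identifier` helper and the DOI value below are made up for illustration; the only real-world detail relied on is that DOIs resolve under `https://doi.org/`.

```python
# A hypothetical record for a paper held in an institutional repository.
# The DOI below is a made-up placeholder, not a real identifier.
paper = {
    "title": "An example paper",
    "local_url": "https://repository.example.ac.uk/papers/1234",
    "doi": "10.1234/example.5678",
}


def canonical_identifier(record):
    """Prefer an existing external identifier (here a DOI) over the locally minted URL."""
    doi = record.get("doi")
    if doi:
        return "https://doi.org/" + doi
    # Fall back to the local URL only when nothing external exists
    return record["local_url"]


print(canonical_identifier(paper))
```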

6. Consider individual entities, and lists of entities in Web design

URLs can be for lists of related entities, as well as individual entities. All other guidelines apply to this use of URLs. Lists of things are pretty fundamental to the Web (or just about any information system).
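A small sketch of a list with its own URL, in which every member also has one. The domain, the room records and the `list_resource` function are hypothetical and chosen only to illustrate the idea.

```python
import json

BASE = "https://www.example.ac.uk"  # hypothetical institutional domain

# Hypothetical room records; identifiers and fields are illustrative only
rooms = [
    {"id": "lecture-theatre-1", "capacity": 200},
    {"id": "seminar-room-4", "capacity": 30},
]


def list_resource(entity_type, entities):
    """Describe a list of entities, giving the list and every member its own URL."""
    list_url = "{}/{}".format(BASE, entity_type)
    return {
        "url": list_url,
        # Copy each record and add its URL, leaving the input untouched
        "items": [dict(e, url="{}/{}".format(list_url, e["id"])) for e in entities],
    }


print(json.dumps(list_resource("rooms", rooms), indent=2))
```

Because each member URL is the list URL plus an identifier, the list and its entities follow the same readable and hackable pattern.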

Conclusion

If I wanted to abbreviate this even more into three brief instructions, they would be:

  • think in terms of information entities, identifiers and relationships, as well as pages
  • integrate into the wider Web by re-using existing identifiers and by linking to other information
  • realise that developers are a potentially important stakeholder in any modern website

OR10 Challenge

In case you missed it, the OR10 Developers Challenge is now live!

Andy McGregor has explained why he thinks you should enter the challenge and, I’m pleased to say, there have been some expressions of intent already. If you do decide to enter, please register your intention on the OR10 Crowdvine forum.

A reminder of the challenge:

Create a functioning repository user-interface, presenting a single metadata record which includes as many automatically created, useful links to related external content as possible.

We had one comment suggesting that the challenge was limited to dealing with Linked Data – this is certainly not the case – we are interested in linking in its broader sense.

 

Draft OR10 Challenge idea

Please note that what follows is a draft.

A few weeks ago I posted some thoughts about a Developer Challenge for OR10, with a plea for ideas for specific challenges. I’m pleased to say that this post got a really good response, with plenty of useful ideas and comments. Thank you to all who responded. I think it fair to say that all of the comments influenced our thinking, but the interest in linking content (most fully expressed by Andy Powell) stood out from several comments, so we have concentrated on trying to create a challenge around this. While linked data was mentioned often (naturally enough), we wanted to stick to our principle of involving non-developers (or users) as much as possible: this can be difficult when dealing with the more esoteric aspects of linked data. So, after some discussion within the DevCSI team, we have worked up the following challenge:

Create a functioning repository user-interface, presenting a single metadata record which includes as many automatically created, useful links to related external content as possible.

Definitions:

  • “functioning” in this sense means that mockups/screenshots are not sufficient – however a working prototype is OK
  • “related” in this sense means that the external content is related to this particular metadata record in some way.
  • “as many useful links” means that marks will be awarded for useful links, so an interface with fifty meaningless links does not beat one with three genuinely useful links!
  • links must be related to content, not just a system. So, for example, a link to the page at http://www.wikipedia.org is not legitimate, but a link to a specific page in Wikipedia could be. Only one link of each ‘type’ counts: i.e. having four links to URLs which reference ‘topics’ in a given system is fine but will count as one link for the challenge.

Rules:

Entries must come from a team of at least one developer and one person representing users. The entries must be presented, in person, at OR10. If a team is responsible for the entry then not all of the team members need be present at OR10, but at least one team-member must be.

Judging:

The entries will be presented/demonstrated at OR10 in a show and tell session in a room dedicated to this. The show and tell will be open to OR10 delegates to come along and see the presentations as they are being made. These presentations/demonstrations will be video-recorded. There will be an opportunity for those delegates present (the ‘audience’) to ask questions and/or comment on the presentations. There will be a panel of judges who will observe and make notes. The judges will take note of the responses from the audience. Following the show and tell, the judges will privately discuss the entries and draw up a shortlist. The videos of the shortlisted entries will be presented at the conference dinner for the assembled delegates to vote for a winner and a runner-up.

The judges will particularly take into account the following:

  • functionality – the links must work and must have been created automatically as part of the repository system
  • usefulness – the usefulness of the links to an end-user of the developed interface must be demonstrated
  • number of links – the number and variety of links will be considered
  • audience reaction – favourable and unfavourable reactions from the audience will be taken into account

General points:

The Challenge will be issued well in advance of the conference, giving people plenty of time to develop an entry. We will make facilities available at OR10, such as a Developers’ Lounge area, for further work to be done at the conference itself.

We are very interested in any comments people may have about this – we intend to publish the final version of this, and open up the Developer Challenge, at the end of this week.

Ideas for the OR10 Developer Challenge?

Update: I have closed comments on this post now. Thank you very much to all who commented and suggested ideas for a challenge. I have now posted a draft Challenge here and would welcome comments on that post. Thanks again!

Through the JISC-funded DevCSI project, UKOLN has been asked to arrange a ‘Developer Challenge’ for the Fifth International Conference on Open Repositories (OR10), to be held in Madrid in July of this year.

This will be the third consecutive year that the Developer Challenge has been a feature of this conference. Previous challenges have been both competitive and creative.

OR09_dev_challenge.jpg

Photo by Graham Triggs

This year we have been considering doing something slightly different. Previously, a general challenge has been issued, inviting developers to submit prototypes for anything which they feel is relevant and useful to the repository community. But now that the community has a better appreciation of the sort of creativity which developers can bring to these events, we wonder if we might try something a little different.

A general challenge?

We have been thinking about the possibility of the repository community issuing a particular challenge to the developers planning to attend OR10. This could be decided on by the community well in advance of the conference. If we managed to ‘crowd source’ a few ideas, we could organise a simple vote. Something we are trying to do more with the DevCSI project is to get developers together with non-developers from the same ‘domain’ (repositories in this case) – so we are quite interested in pursuing this approach with OR10.

The OR10 organisers have helpfully couched the conference itself in terms of some challenges:

In a world of increasingly dispersed and modularized digital services and content, it remains a grand challenge for the future to cross the borders between diverse poles:

  • the web and the repository
  • knowledge and technology
  • wild and curated content
  • linked and isolated data
  • disciplinary and institutional systems
  • scholars and service providers
  • ad-hoc and long-term access
  • ubiquitous and personalized environments
  • the cloud and the desktop.

Perhaps one or more of these could serve as the inspiration for a more concrete developers challenge?

What this boils down to is finding a challenge in the general area of repositories, recognised as important by the community generally, which could only be met by getting developers to work with non-developers at the conference. For it to be fair, the challenge would need to be non-specific with regard to any particular repository software.

I would welcome some feedback:

  • is this general approach a good idea?
  • do you have any ideas for a challenge?

Please feel free to comment here if you have any ideas, or alternatively drop me an email at p.walk@ukoln.ac.uk

Thanks!