Archive for the ‘Metadata’ Category

Repository architecture #83

Monday, July 7th, 2008

At a JISC workshop last Thursday I was invited to present some ideas around an architecture to support and exploit repositories in the UK. I gave the presentation the title Repository Architecture #83. ;-)

My intention was to suggest some starting principles and then explore how they held up in the face of real-world issues. Here is the slide where I outlined these principles:

[Image: presentation.004.png]

I also asked the question: “do we actually need a new architecture?” - suggesting that there is already a ubiquitous & successful architecture supporting much/most/(all?) of the functionality we want from repositories. Taking a resource oriented approach also seems to offer all kinds of advantages. Applying this approach is certainly not a new idea - others have been here before. However, I suggest that the resource oriented approach and the service oriented approach can be most effective when used to complement each other. I think that there is still a place for the institutional repository as the collection of systems which surround what I call the source repository. I define the ‘source repository’ as an (ideally) quite simple system which contains:

  • the resources themselves, individually addressed with HTTP URIs
  • simple, item-level metadata records
  • site-map(s) to aid remote search engines
  • public, HTTP interfaces
  • feeds to notify remote agents of the deposit of new resources in the repository (RSS and/or Atom)
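As a rough illustration of the last item in that list, here is a minimal sketch of a feed announcing new deposits, using only the Python standard library. The function name, the dictionary keys and the example.org URIs are all assumptions made for illustration, and the output is deliberately incomplete as Atom (there is no author element, for instance). Each entry links both to the resource and to its metadata record, treating the record as a resource with its own URI.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def build_deposit_feed(feed_uri, title, updated, deposits):
    """Return an Atom document (as a string) announcing deposited resources.

    `deposits` is a list of dicts with the hypothetical keys
    'uri', 'title', 'deposited' and 'metadata_uri'.
    """
    # Serialise Atom elements in the default namespace (no prefix).
    ET.register_namespace("", ATOM_NS)

    def tag(name):
        return f"{{{ATOM_NS}}}{name}"

    feed = ET.Element(tag("feed"))
    ET.SubElement(feed, tag("id")).text = feed_uri
    ET.SubElement(feed, tag("title")).text = title
    ET.SubElement(feed, tag("updated")).text = updated

    for deposit in deposits:
        entry = ET.SubElement(feed, tag("entry"))
        ET.SubElement(entry, tag("id")).text = deposit["uri"]
        ET.SubElement(entry, tag("title")).text = deposit["title"]
        ET.SubElement(entry, tag("updated")).text = deposit["deposited"]
        # Link to the resource itself, addressed by its HTTP URI.
        ET.SubElement(entry, tag("link"), rel="alternate", href=deposit["uri"])
        # Link to the simple item-level metadata record, also a resource.
        ET.SubElement(
            entry, tag("link"), rel="describedby", href=deposit["metadata_uri"]
        )

    return ET.tostring(feed, encoding="unicode")


if __name__ == "__main__":
    print(
        build_deposit_feed(
            "http://repo.example.org/feed",
            "New deposits",
            "2008-07-07T00:00:00Z",
            [
                {
                    "uri": "http://repo.example.org/item/1",
                    "title": "A scholarly paper (PDF)",
                    "deposited": "2008-07-06T12:00:00Z",
                    "metadata_uri": "http://repo.example.org/item/1/metadata",
                }
            ],
        )
    )
```

The point of the sketch is only that the public face of a source repository can be this plain: URIs, a record per item, and a feed that any remote agent can poll.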

An ‘institutional’ or ‘subject’ or ‘learning object’ repository contains one or more source repositories plus any systems needed to manage it in its particular context. These larger repositories might be very complex: the important point is that the logical component I call the source repository should be as simple as possible in its public-facing interface: basically a bunch of resources, with an address space. So, a resource is given a Cool URI, and a (probably) simple metadata record is made available, also as a resource with a URI. I suggested that an ORE resource map could be used to relate the metadata record to the resource - from the point of view of the web or ORE, a metadata record is a resource just like, for example, a PDF of a scholarly paper. Elsewhere, more and richer metadata might be created through mechanisms ranging from automatic metadata creation, to further human effort which might be in the nature of traditional cataloguing by trained and motivated individuals, or ‘crowd-sourced’ tagging by untrained but still motivated people.

Complexity is introduced, where necessary, in services developed to manage and exploit resources held in source repositories. Crucially, such activity does not happen unless there is a clear incentive for it, and then it happens close to the point of incentive. As an example, if a particular domain has a strong need to classify papers then someone might go to the trouble of harvesting, aggregating and text-mining the text of these papers with a view to extracting terms to use for classification. Or something similar might be achieved through the application of a team of professional cataloguers using an agreed vocabulary. However it is done, the new metadata thus created could be made available as a web resource where it could be used and combined with other resources as required.

I was asked to illustrate this with a few diagrams which provoked a fair amount of discussion.

[Images: deposit.png, discovery.png]

The point was made, strongly, that it is subject repositories which have the content, rather than institutional repositories. Regardless of whether this is, or will continue to be true, I think the architectural principles hold up. The business drivers are, I guess, quite different!

I learned a lot from the workshop and had some of these ideas challenged quite robustly. I think they held up but the clarity of presentation could be improved - this is what I will be working on now.


Linked data from OAI repositories

Thursday, May 1st, 2008

Here’s an interesting approach. Bernhard Haslhofer at Media Spaces has developed OAI2LOD Server, a system which harvests metadata with OAI-PMH, processes the records to create a triple store and exposes interfaces to this for linked-data clients, SPARQL clients and web-browsers.

According to the web-page:

The OAI2LOD Server exposes any OAI-PMH compliant metadata repository according to the Linked Data guidelines. This makes things and media objects accessible via HTTP URIs and query able via the SPARQL protocol.

I find myself wondering if there is an application for this software in the institutional repositories space. Leaving the SPARQL aspect aside for a moment, note that this system makes resources available via URLs, having harvested metadata via OAI-PMH. I know from experience that there are all kinds of issues with simply identifying a link to a ‘thing or media object’ in many metadata records harvested from institutional repositories, so how well this works in practice remains to be seen. However, this could provide another approach to getting digital objects buried in repositories exposed as resources in the web-architecture. And while I don’t suppose that OAI2LOD is particularly aimed at institutional repositories, the SPARQL & linked-data interfaces do perhaps offer a route for some suitable repositories to participate in the web of data.

I’m also currently working with large, heterogeneous aggregations of metadata from repositories, so I’m curious to see how this software might fit with that kind of dataset. My guess is that this system will work best with collections which already have some semantic coherence, so it might suit a subject-based repository rather better than an institutional repository, although the three examples demonstrated on the OAI2LOD site are for national libraries.

So, what’s the real value of this software? There are some perfectly good alternative systems offering triple stores with similar interfaces. And there is plenty of OAI-PMH harvester software out there. I haven’t seen these two things joined together directly in this way before, which is what piqued my interest initially. But I assume that the real value must lie in the processing of the metadata records (and other information gleaned as part of the OAI-PMH transaction) into the triple store.
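To make that processing step a little more concrete, here is a minimal sketch of turning one harvested oai_dc record into triples. This is emphatically not how OAI2LOD Server does it (I have not looked at its internals); the sample record, the function name and the rule for choosing a subject URI are all assumptions for illustration. The subject rule is the interesting part: it takes the first dc:identifier that looks like an HTTP URI and falls back to the OAI identifier, which is exactly where records from real repositories tend to cause trouble.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# A made-up record, shaped like one <record> from a ListRecords response.
SAMPLE_RECORD = """
<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:repo.example.org:123</identifier>
    <datestamp>2008-04-30</datestamp>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>An example paper</dc:title>
      <dc:creator>Example, A.</dc:creator>
      <dc:identifier>http://repo.example.org/123/</dc:identifier>
    </oai_dc:dc>
  </metadata>
</record>
""".strip()


def record_to_triples(record_xml):
    """Return (subject, predicate, object) tuples for one oai_dc record."""
    record = ET.fromstring(record_xml)
    oai_id = record.findtext("oai:header/oai:identifier", namespaces=NS)
    dc_root = record.find("oai:metadata/oai_dc:dc", NS)
    if dc_root is None:
        # No metadata payload (for example, a deleted record).
        return []

    # Assumed rule: prefer an HTTP URI from dc:identifier as the subject,
    # otherwise fall back to the OAI identifier from the header.
    subject = oai_id
    for ident in dc_root.findall("dc:identifier", NS):
        value = (ident.text or "").strip()
        if value.startswith(("http://", "https://")):
            subject = value
            break

    triples = []
    for child in dc_root:
        value = (child.text or "").strip()
        if not value or not child.tag.startswith("{"):
            continue
        # ElementTree tags look like '{namespace}local'; join them
        # to form the predicate URI.
        namespace, _, local = child.tag[1:].partition("}")
        triples.append((subject, namespace + local, value))
    return triples


if __name__ == "__main__":
    for triple in record_to_triples(SAMPLE_RECORD):
        print(triple)
```

All objects are emitted as plain literals here; deciding which values deserve to become URIs in their own right is presumably where much of the real work lies.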

Anyway - it’s an interesting idea coupled with some working code - always a valuable thing in my book!


Google gives up on supporting OAI-PMH for Sitemaps

Wednesday, April 23rd, 2008

For some time now I have occasionally advised people involved in repository administration that they should consider registering the Base URL of their OAI-PMH interface (if they have one) with Google as a proxy for a Sitemap. Until recently, Google has supported the use of OAI-PMH Base URLs in its Webmaster Tools which site owners can use to create and register sitemaps in order to give hints about the structure of the website to Google’s web-crawler.

A while ago, I noticed that there was no longer any reference to this particular support in any of the documentation and began to suspect that this was being deprecated. Today, Google announced via their official blog that:

…we’ve found that the information we gain from our support of OAI-PMH is disproportional to the amount of resources required to support it. Fewer than 200 sites are using OAI-PMH for Google Sitemaps at the moment.

In order to move forward with even better coverage of your websites, we have decided to support only the standard XML Sitemap format by May 2008. We are in the process of notifying sites using OAI-PMH to alert them of the change.

Fewer than 200 sites…..

There are a few ways of looking at this. Perhaps ‘open access’ repositories are less concerned with Google rankings than the typical website owner. Perhaps the penetration of OAI-PMH in the world is still below any level that Google could find particularly interesting - certainly they never went to great lengths to advertise this support while it lasted. Clearly, Google have come to the end of a ‘trial period’ for their support for this protocol in their main indexing service.

Can we conclude anything from this? Probably not - surely OAI-PMH can thrive without Google Sitemap support? It certainly plays a fairly significant part in my professional life at present! Or should we view this as a symptom of decline….?

The official Google announcement is here.
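For anyone who was relying on their Base URL, moving to the standard XML Sitemap format need not be much work. Here is a minimal sketch, assuming you already have a list of item URLs and datestamps (harvested from your own OAI-PMH interface, say). The function name and the example.org URLs are made up for illustration; the urlset, url, loc and lastmod elements and the namespace are those of the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(items):
    """Return an XML Sitemap (as a string) for (url, lastmod) pairs.

    `lastmod` may be None or empty, in which case the element is omitted.
    """
    # Serialise sitemap elements in the default namespace (no prefix).
    ET.register_namespace("", SITEMAP_NS)

    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc, lastmod in items:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        if lastmod:
            ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod

    body = ET.tostring(urlset, encoding="unicode")
    # The declaration says UTF-8, so encode as UTF-8 when writing to disk.
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    print(
        build_sitemap(
            [
                ("http://repo.example.org/123/", "2008-04-20"),
                ("http://repo.example.org/124/", None),
            ]
        )
    )
```

The harder part, as ever, is working out which URL in each record actually points at the thing you want indexed.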


Bespoke metadata creation tools are commonplace

Tuesday, September 18th, 2007

RLG Programs has conducted a survey of partner institutions which have “multiple metadata creation centers” to:

…gain a baseline understanding of current descriptive metadata practices and dependencies, the first project in our program to change metadata creation processes.

Some intriguing statements in this summary post (I look forward to getting hold of the report when it’s completed). For example:

76 listed the tools they used to create metadata. Guess how many tools were named? Over 270 in total, 88 different ones. And the most common? A custom system. Besides an integrated library system, the tool most frequently cited was MS Access. In several cases, a single institution used more than a dozen different tools.

In the complex world of metadata standards, it is perhaps not surprising that the range of tools used to author metadata is broad. Is this a permanent state of affairs? I wonder if this says anything about the maturity of this space - or is this just the nature of this particular beast?

Read more at hangingtogether.org

