Repository architecture #83

At a JISC workshop last Thursday I was invited to present some ideas around an architecture to support and exploit repositories in the UK. I gave the presentation the title Repository Architecture #83. ;-)

My intention was to suggest some starting principles and then explore how they held up in the face of real-world issues. Here is the slide where I outlined these principles:

presentation.004.png

I also asked the question: “do we actually need a new architecture?” - suggesting that there is already a ubiquitous & successful architecture supporting much/most/(all?) of the functionality we want from repositories. Taking a resource-oriented approach also seems to offer all kinds of advantages. Applying this approach is certainly not a new idea - others have been here before. However, I suggest that the resource-oriented approach and the service-oriented approach can be most effective when used to complement each other. I think that there is still a place for the institutional repository as the collection of systems which surround what I call the source repository. I define the ‘source repository’ as an (ideally) quite simple system which contains:

  • the resources themselves, individually addressed with HTTP URIs
  • simple, item-level metadata records
  • site-map(s) to aid remote search engines
  • public, HTTP interfaces
  • feeds to notify remote agents of the deposit of new resources in the repository (RSS and/or Atom)
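
As a rough illustration of how little a source repository needs to expose, here is a minimal Python sketch of the list above. Everything in it is an assumption made for illustration: the `Item` fields, the function names and the `repo.example.org` URIs are hypothetical, and a real repository would serve these documents over HTTP rather than build them in memory.

```python
from dataclasses import dataclass
from xml.etree import ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


@dataclass(frozen=True)
class Item:
    """One deposited resource plus its simple metadata record (hypothetical model)."""
    resource_uri: str  # HTTP URI of the resource itself
    metadata_uri: str  # HTTP URI of the item-level metadata record
    title: str
    deposited: str     # ISO 8601 date, e.g. "2008-07-07"


def build_sitemap(items):
    """Return a sitemap document listing every resource URI."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for item in items:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = item.resource_uri
        ET.SubElement(url, "lastmod").text = item.deposited
    return ET.tostring(urlset, encoding="unicode")


def newest_first(items, limit=10):
    """Pick the most recent deposits, i.e. the entries a feed would announce."""
    return sorted(items, key=lambda item: item.deposited, reverse=True)[:limit]
```

The sketch stops at choosing which items a feed would carry; serialising them as RSS or Atom is left out to keep it short.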

An ‘institutional’ or ‘subject’ or ‘learning object’ repository contains one or more source repositories plus any systems needed to manage it in its particular context. These larger repositories might be very complex: the important point is that the logical component I call the source repository should be as simple as possible in its public-facing interface: basically a bunch of resources, with an address space. So, a resource is given a Cool URI, and a (probably) simple metadata record is made available, also as a resource with a URI. I suggested that an ORE resource map could be used to relate the metadata record to the resource - from the point of view of the web or ORE, a metadata record is a resource just like, for example, a PDF of a scholarly paper. Elsewhere, more and richer metadata might be created through mechanisms ranging from automatic metadata creation, to further human effort which might be in the nature of traditional cataloguing by trained and motivated individuals, or ‘crowd-sourced’ tagging by untrained but still motivated people.
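
To make the resource map idea a little more concrete, here is a small Python sketch that states, as plain RDF triples, that a resource map describes an aggregation containing both the paper and its metadata record. The `ore:` terms used (`ResourceMap`, `Aggregation`, `describes`, `aggregates`) come from the OAI-ORE vocabulary; the function names and the example URIs are hypothetical, and this is not a complete or validated resource map.

```python
# Vocabulary namespaces: OAI-ORE terms and rdf:type.
ORE = "http://www.openarchives.org/ore/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def resource_map_triples(rem_uri, aggregation_uri, aggregated_uris):
    """Build (subject, predicate, object) triples for a minimal resource map."""
    triples = [
        (rem_uri, RDF_TYPE, ORE + "ResourceMap"),
        (rem_uri, ORE + "describes", aggregation_uri),
        (aggregation_uri, RDF_TYPE, ORE + "Aggregation"),
    ]
    # The paper and its metadata record are aggregated on equal terms:
    # both are simply web resources with URIs.
    triples += [(aggregation_uri, ORE + "aggregates", uri) for uri in aggregated_uris]
    return triples


def to_ntriples(triples):
    """Serialise URI-only triples in N-Triples syntax."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)
```

Note that the sketch only groups the two resources together; saying which one is metadata about the other would need an additional relation, which is left out here.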
Complexity is introduced, where necessary, in services developed to manage and exploit resources held in source repositories. Crucially, such activity does not happen unless there is a clear incentive for it, and then it happens close to the point of incentive. As an example, if a particular domain has a strong need to classify papers then someone might go to the trouble of harvesting, aggregating and text-mining the text of these papers with a view to extracting terms to use for classification. Or something similar might be achieved through the application of a team of professional cataloguers using an agreed vocabulary. However it is done, the new metadata thus created could be made available as a web resource where it could be used and combined with other resources as required.
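
As a toy version of the text-mining example, the following Python sketch counts words across a set of already-harvested paper texts and proposes the most frequent ones as candidate classification terms. This is a deliberately naive stand-in: the stopword list, the function name and the frequency-based ranking are all assumptions for illustration, and a real service would use proper term-extraction techniques.

```python
import re
from collections import Counter

# A tiny, illustrative stopword list; real lists are far longer.
STOPWORDS = frozenset(
    {"the", "a", "an", "of", "and", "in", "to", "is", "for", "on", "with", "we"}
)


def candidate_terms(texts, top_n=5):
    """Return the most frequent non-stopword terms across all texts."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [term for term, _ in counts.most_common(top_n)]
```

The resulting term list is itself just data, so it could be published at its own URI alongside the papers it was derived from.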
I was asked to illustrate this with a few diagrams which provoked a fair amount of discussion.

deposit.png discovery.png

The point was made, strongly, that it is subject repositories which have the content, rather than institutional repositories. Regardless of whether this is, or will continue to be, true, I think the architectural principles hold up. The business drivers are, I guess, quite different!

I learned a lot from the workshop and had some of these ideas challenged quite robustly. I think they held up but the clarity of presentation could be improved - this is what I will be working on now.


11 Responses to “Repository architecture #83”

  1. Rachel Heery Says:

    There was some discussion on the day as to how far one should label workflow and services around the source repository as being ‘repository’ workflow or ‘repository’ services - both at institutional and wider level. I think there are some benefits in articulating these associated workflows and services more generically as ‘research information systems’ and ‘scholarly communication’. This might go some way to acknowledge the flexibility and differences between the various approaches out there.

  2. JISC meetings « Names Project Blog Says:

    [...] the morning about how such an architecture might look - links to his slides are available from his blog entry about the day. As Paul points out, there was a fair amount of discussion about repository content [...]

  3. Unilever Centre for Molecular Informatics, Cambridge - Jim Downing » Blog Archive » Repository architectures, leaky abstractions and Paul’s principles Says:

    [...] Walk kicked off the day by presenting his work on a next generation architecture for repositories. His presentation started off with a number of starting principles and moved on to some diagrams [...]

  4. Richard Akerman Says:

    I wouldn’t say reusability is over-rated as a virtue. I would say that reusability is really, really hard. In any case I think we frame the discussion differently these days. Originally reusability was so that you could maximize your ROI. Now we’re more concerned with flexibility, to maximize your… hmm… Change On Investment? So instead of “design & build an architecture that supports reusability”, let’s say “design & build an architecture that is flexible”. Flexibility Oriented Architecture?

  5. Chris Rusbridge Says:

    Three thoughts. I just don’t get the Technorati versus Slideshare reference. Technorati sucks! I don’t know if this is because it’s by reference, or just that its business model is so broken that it can’t afford sufficient grunt to make it work, but it loses things, known searches fail, multi-page search results loop… I wouldn’t use it at all except for the “authority” figure (which all of the above makes doubtful). Slideshare, OTOH, just works.

    Second, I agree that most of the data that are in repositories are in subject repositories. Likewise most of the papers that are in repositories are in IRs (perhaps with the honourable exception of Arxiv). But most data are not in repositories at all. (Likewise most papers.) Most data cannot go into subject repositories because there are not enough of them, because the business model for subject repositories is fragile in the extreme. The business model for IRs, however, looks more robust the more content they can capture. So I think we need to find more ways of getting our data into sustainable IRs.

    Thirdly, (I think I catch your support for this by implication but not explicitly) getting the stuff in involves both incentives (in making science easier) and integration with researcher workflow. If they have to come back and do it later, I believe it just won’t happen. It’s why I was suggesting the (possibly over-complicated) research repository system idea in my blog post…

  6. paul Says:

    Chris,
    three quick responses:
    First: I agree that Technorati does not always deliver…. I think the authority figure is utterly suspect - my own feeble ‘authority’ figure seems to change with the weather….
    However, while the execution is not great, I still think the model is a good one. The problem which Technorati tries to solve is considerable - it allows the creation of content to remain completely distributed, unlike Slideshare.

    Second: I agree with you - I believe the institution is well placed to manage this - we should see this activity grow as institutional workflows become established and smarter.

    Three: yes - rather than giving overt incentives to ‘deposit in the repository’, give the incentive to ‘use our fabulous workflow tools’ - which happen to deposit resources as appropriate.

  7. Steve Hitchcock Says:

    Three points arising from Paul’s presentation and the comments:

    1 On subject vs institutional repositories, *some* of the content (papers) is in subject Rs, of course. But the real question is: what about all the rest? Where is that to become OA if 100% is the target?
    2 Paul’s idea of ‘source’ repositories is good if it means getting the repository interfaces closer to the end-users, perhaps school or dept. repositories, with a central services approach institutionally.
    3 Complexity appears to be rife. Data, metadata, workflows, etc. - all need to be worked out to ensure value always exceeds cost. We cannot afford to burden IRs with unconstrained expectations, responsibilities, and costs, especially in these uncertain economic times.

  8. paul Says:

    Regarding point 3 - as Jim Downing suggests, the best we might do is to move the complexity around. I suggest moving it away from the source repository - but the institutional repository may have to pick up much of this.

    But yes - those expectations should be managed!

  9. Dorothea Salo Says:

    Paul, may I borrow that screenshot (with credit, of course) for a presentation I am giving?

  10. paul Says:

    Dorothea,
    Be my guest :-)

    I’d be interested to see how you use it…..

    Paul

  11. Bookmarks about Repository Says:

    [...] - bookmarked by 2 members originally found by rickenriquericky on 2008-08-09 Repository architecture #83 http://blog.paulwalk.net/2008/07/07/repository-architecture-83/ - bookmarked by 3 members [...]
