Making digitised content available for searching and harvesting

I have been invited to give a short presentation to the JISC Digitisation Programme on Friday, giving an overview of different ways of exposing content and metadata. I’ll be talking to projects which are concerned with Cultural Heritage content which is being surfaced in websites to support eLearning. Formats vary tremendously.

Aside from the obvious stuff like OAI-PMH, Google, RSS, what should I be talking about? Persistent identifiers? Cool URLs? Any other suggestions?

12 Responses to “Making digitised content available for searching and harvesting”

  1. Jim Downing Says:

    Mainly on making the content reusable (not a hard sell in eLearning?). Recent use of RDF and Atom in a cultural setting: Asemantics BBC aggregator

  2. paul Says:

    Thanks for this, Jim. I was planning to introduce resource-oriented architecture, but you’ve reminded me that there is plenty to say about re-usable content in terms of machine-parsable pages, microformats, RDF etc. as well. The BBC example is perfect - I’ll use it on Friday (with an appropriate nod towards the person who suggested I use it, of course!).

    Paul

  3. Owen Stephens Says:

    I think your ‘obvious stuff’ looks good, and you may already have these covered but:

    As well as formats for reuse, what about licensing - Open Data work by Talis and others, Creative Commons etc.

    Integration of resources into the wider web - e.g. LoC experiment with Flickr to expose content. Many projects in this area create a new silo of material that is hidden from the wider web - need to change this.

    Reusable metadata as well as objects.

    OAI-ORE - although I’m not sure if this is the next step or a dead-end of course!

  4. paul Says:

    Thanks Owen. Re-usable metadata is an angle I didn’t have down to cover, so I’ll have a think about this. I guess you mean metadata which can be re-purposed in some way, rather than simply harvested and made available again.

    I think that the thorny issues of licensing are outside my remit for Friday - but as you point out, there have been some recent developments which I could certainly mention.

    The LoC / Flickr thing is new to me - looks like an example I can use.

    Thanks,

    Paul

  5. Ian Ibbotson Says:

    One of our experiences from the MLA People’s Network Discover project (e.g. http://www.peoplesnetwork.gov.uk/discover/search.do?operation=searchRetrieve&version=1.1&landscape=Default&query_type=pqf&query=%40attrset+bib-1+%40attr+1%3D1016+%22Nottingham%22+%22lace%22+%22market%22+&pageno=1&hpp=10&eset=b&qrymode=simple&link_style=no), which aggregates and makes searchable upwards of 20-odd digitisation projects, is that it’s really, really helpful if the metadata contains an easily usable link to the actual manifestations (different formats, sizes, etc.) of the digital resource, and that this link is identified differently to the “front page” link for that resource.

    It’s very hard to engineer a consistent search user interface when half the metadata refers to the actual digital artefact, and half to a front page. It’s useful to have both links, as you can then negotiate with providers if they feel you need to go through a front page for stats and marketing, but we still don’t have any real consistency in this at the moment. This isn’t so much about standards as convention, I feel (I’m on the convention-over-configuration bandwagon again).
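
A minimal Python sketch of that convention: one record carries a "front page" link and, separately, a list of links to the actual manifestations, so a search interface can choose between them. The class names, field names and example.org URLs are all assumptions made for illustration; they do not come from the Discover service.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Manifestation:
    """One concrete rendering of a digital resource (hypothetical structure)."""
    url: str
    media_type: str
    label: str  # e.g. "thumbnail" or "full-size"


@dataclass
class Record:
    """Aggregated metadata record that keeps the two kinds of link apart."""
    title: str
    landing_page: Optional[str] = None  # the provider's "front page"
    manifestations: List[Manifestation] = field(default_factory=list)


def links_for_display(record: Record) -> Dict[str, Optional[str]]:
    """Pick a click-through link and a preview image for a results page."""
    # Preview: the first manifestation that is an image, if there is one.
    preview = next(
        (m.url for m in record.manifestations if m.media_type.startswith("image/")),
        None,
    )
    # Click-through: prefer the front page, else fall back to the first manifestation.
    first = record.manifestations[0].url if record.manifestations else None
    return {"click_through": record.landing_page or first, "preview": preview}


record = Record(
    title="Lace Market, Nottingham",
    landing_page="http://example.org/items/42",
    manifestations=[
        Manifestation("http://example.org/items/42/thumb.jpg", "image/jpeg", "thumbnail"),
        Manifestation("http://example.org/items/42/full.tif", "image/tiff", "full-size"),
    ],
)
print(links_for_display(record))
```

Because the two link types are labelled differently, the policy of sending users through the front page (or not) becomes a one-line decision in the interface rather than guesswork per provider.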

    Another thing we had issues with in the search engine was thumbnail sizing. If you’re extracting thumbnails from the content source, rather than creating them locally, it’s a bit of a battle to get them sized nicely for your interface.

    We also had pretty severe issues with getting consistent spatial metadata. Mixed place names, Getty-like (but not actual Getty) place names, and unqualified place names in spatial data elements have been hard to deal with. The classic mix of Lat-Long with Northings and Eastings was a walk in the park compared to the place name issues :)

    In terms of recent repository events, one of our biggest saviours has been having both OAI for “pull-based” population and a SWORD-like (and now actual SWORD) deposit API. This deposit API for the search index and preservation in the metadata repository means that where the data providers have just plain failed to implement OAI (I know, I know, it’s not *that* hard) we were able to give them a special upload tool that converts Access databases, Excel spreadsheets, or just directories of XML into a stream of deposit requests. We even used the tool to convert a static OAI repo into a stream of deposit requests. The projects loved this, as the upshot was that they were then able to use the Discover service’s own OAI function to expose their own data to the rest of the community.
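
A rough Python sketch of the "spreadsheet to stream of deposit requests" idea. It is not the upload tool described above and not a SWORD client: it only reads CSV rows and prepares one Atom entry per row, in the spirit of the Atom Publishing Protocol on which SWORD builds. The column names, the use of `dc:subject` and the collection URL are assumptions, and nothing is sent over the network.

```python
import csv
import io
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
DC_NS = "http://purl.org/dc/elements/1.1/"


def row_to_atom_entry(row):
    """Turn one spreadsheet row (a dict) into a serialised Atom entry."""
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = row["identifier"]
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = row["title"]
    ET.SubElement(entry, f"{{{DC_NS}}}subject").text = row["subject"]
    return ET.tostring(entry, encoding="utf-8")


def deposit_requests(csv_text, collection_url):
    """Yield one (url, headers, body) tuple per row; sending is left to the caller."""
    headers = {"Content-Type": "application/atom+xml;type=entry"}
    for row in csv.DictReader(io.StringIO(csv_text)):
        yield collection_url, dict(headers), row_to_atom_entry(row)


# Hypothetical spreadsheet export with three assumed columns.
SAMPLE = (
    "identifier,title,subject\n"
    "urn:example:item-1,Lace shawl,Lace Market\n"
    "urn:example:item-2,Mill photograph,Industry\n"
)

requests = list(deposit_requests(SAMPLE, "http://example.org/deposit/collection"))
print(len(requests), "deposit requests prepared")
```

Keeping the conversion separate from the sending step means the same generator could be fed from a spreadsheet, a database export or a directory of XML files.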

    Finally, shared controlled vocabs for attributes like subject have been issues for us. There’s growing interest in mapping from uncontrolled vocabs at particular providers into a single vocab-spine at particular aggregators. This makes it easier for people to subscribe to a given RSS subject feed and know that the “Nottingham Lace Market” feed also includes all the “Lace Market, Nottingham” category resources from other providers.
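
To make the vocab-spine idea concrete, here is a deliberately simplistic Python sketch. It assumes that provider terms differ from spine terms only in case, commas and word order, which is enough for the "Lace Market, Nottingham" example but far short of a curated mapping table. The spine terms and resources are invented for illustration.

```python
from collections import defaultdict


def normalise(term):
    """Lower-case, drop commas and sort the words so that word order is ignored."""
    words = term.lower().replace(",", " ").split()
    return " ".join(sorted(words))


# Hypothetical spine of preferred subject terms held by the aggregator.
SPINE = ["Nottingham Lace Market", "Textile industry"]
SPINE_INDEX = {normalise(term): term for term in SPINE}


def to_spine(provider_term):
    """Return the spine term for a provider's term, or None if nothing matches."""
    return SPINE_INDEX.get(normalise(provider_term))


def group_by_spine(resources):
    """Group (title, subject) pairs into per-subject feeds keyed by spine term."""
    feeds = defaultdict(list)
    for title, subject in resources:
        feeds[to_spine(subject) or "unmapped"].append(title)
    return dict(feeds)


resources = [
    ("Street scene, 1905", "Lace Market, Nottingham"),
    ("Warehouse interior", "Nottingham Lace Market"),
    ("Pottery kiln", "Ceramics"),
]
print(group_by_spine(resources))
```

Terms that fail to map are kept under "unmapped" rather than dropped, so they can be reviewed and added to the mapping by hand.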

    Hope there’s something in there that might be helpful.

    Cheers,
    Ian.

  6. Ian Ibbotson Says:

    Ooh sorry, this also may be of interest:

    http://k-int.blogspot.com/2008/02/mlas-cultural-learning-objects-project.html

    Outlines MLA’s cultural tagging tool project. Here we extract DC-based metadata from the Discover service to get title and identifier, then go and fetch a full record in its native schema (e.g. RSLP) and convert it into IEEE LOM, which is then uploaded to a LOM learning object metadata repository. I’ve popped some screenshots in there; it’s a pretty nice demo that covers RESTful SRU search of cultural data, conversion into LOM, support for the user adding all the missing data, and then SWORD for the final deposit of the LOM record.

    HTH.

    Cheers, Ian

  7. paul Says:

    Wow! Great stuff Ian - I’m sure I can use plenty of this.

    Thanks!

    Paul

  8. Pete Johnston Says:

    Rather in haste, but:

    Linked Data

    http://sites.wiwiss.fu-berlin.de/suhl/bizer/pub/LinkedDataTutorial/

    = (roughly) Web Arch + lower-level Semantic Web technologies (without getting too drawn into the inferencing, ontology side of things).

    And a shift away from the “repository” towards the “collection” or “collections” (which I think is the consequence of a more “resource-oriented view”) i.e. instead of issuing a “listIdentifiers” request or similar, you GET a representation of a collection and that representation includes a list of identifiers of the constituent items, and so on.
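
The contrast can be sketched in Python. The first function builds an OAI-PMH `ListIdentifiers` request (the `verb` and `metadataPrefix` parameters are part of that protocol); the second reads item identifiers straight out of a representation of the collection. Using an Atom feed as that representation is an assumption made for the example, as are the example.org URLs and the sample feed.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def oai_list_identifiers_url(base_url, metadata_prefix="oai_dc"):
    """Repository-style access: ask a protocol endpoint to list identifiers."""
    query = urlencode({"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


def item_ids_from_collection(atom_feed_xml):
    """Resource-style access: the collection's own representation lists its items."""
    feed = ET.fromstring(atom_feed_xml)
    return [entry.findtext(ATOM + "id") for entry in feed.findall(ATOM + "entry")]


# Hypothetical body returned by a plain GET on the collection's URL.
COLLECTION = """<feed xmlns="http://www.w3.org/2005/Atom">
  <id>http://example.org/collections/lace</id>
  <title>Lace collection</title>
  <entry><id>http://example.org/items/1</id><title>Lace shawl</title></entry>
  <entry><id>http://example.org/items/2</id><title>Lace sampler</title></entry>
</feed>"""

print(oai_list_identifiers_url("http://example.org/oai"))
print(item_ids_from_collection(COLLECTION))
```

In the second style each identifier is itself a URL that can be fetched in turn, which is the "and so on" of following links from collection to item.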

    Also the importance of shared domain models e.g. FRBR, CIDOC CRM.

  9. paul Says:

    Thanks Pete - good stuff. I’ve just been talking about linked data to the JISC JIIE Committee this morning!
    Hadn’t thought of mentioning shared domain models - I may work this in as well.

    Cheers,
    Paul

  10. Mike Ellis Says:

    Paul

    I’d go for simple (notice my recurring theme here..):

    1. sweet urls, unlike the one posted above which stretched about 4 foot off the rhs of the page :-)
    2. RESTful data sourcing
    3. XML template delivery (ie, have key pages such as search results delivered using a programmed rather than visual template) - a kind of poor man’s API
    4. RSS, and possibly “programmable” RSS (for example, surfacing search results by adding query parameters to the feed address, etc)
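
Point 4 might look something like the following Python sketch: the search query travels in the feed address, and the feed body is built from whatever matches. The `q` parameter name, the catalogue and the example.org URLs are assumptions for illustration, and the "search" is just a substring match on titles.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
import xml.etree.ElementTree as ET

# Hypothetical catalogue standing in for a real search index.
CATALOGUE = [
    {"title": "Lace Market street scene", "link": "http://example.org/items/1"},
    {"title": "Nottingham lace sampler", "link": "http://example.org/items/2"},
    {"title": "Canal warehouse", "link": "http://example.org/items/3"},
]


def feed_url(base, **params):
    """Build a feed address that carries the search as query parameters."""
    return base + "?" + urlencode(params)


def rss_for(url, catalogue):
    """Return an RSS 2.0 document listing the items that match the URL's q parameter."""
    query = parse_qs(urlsplit(url).query).get("q", [""])[0]
    matches = [i for i in catalogue if query.lower() in i["title"].lower()]
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = f"Search results for {query}"
    ET.SubElement(channel, "link").text = url
    ET.SubElement(channel, "description").text = f"{len(matches)} matching items"
    for match in matches:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = match["title"]
        ET.SubElement(item, "link").text = match["link"]
    return ET.tostring(rss, encoding="unicode")


url = feed_url("http://example.org/search.rss", q="lace")
print(url)
print(rss_for(url, CATALOGUE))
```

Anyone who can edit a URL can then subscribe to their own search, with no further API to learn.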

    Personally I wouldn’t bother with the deep stuff like OAI, but that’s me. I like simplicity and don’t like things I don’t understand..

    ta

    Mike

  11. paul Says:

    Thanks Mike. I like the ‘poor man’s API’ comment - will maybe try to use that.

    Paul

  12. paul walk’s weblog » Blog Archive » Making digitised content available for searching and harvesting(2) Says:

    [...] in February I was asked to give a talk to the JISC Digitisation Programme meeting. I blogged about this shortly beforehand asking for comments and suggestions. The response was fantastic - I received a bunch of great [...]
