Archive for the ‘Information Management’ Category

Blog commons?

Friday, July 25th, 2008

You may have noticed that I have included a statement on this blog’s ‘home-page’ to the effect that:

This work is licensed under a Creative Commons Attribution 2.0 UK: England & Wales License.

This is standard blurb from the Creative Commons (CC) site. In the context of my blog this means - well, what exactly? Feel free to use anything you find here, for whatever purpose you like, so long as you credit me? What about material I include from elsewhere? What about other people’s comments on my posts? It seems to me that this just isn’t clear enough….

And another problem - I don’t necessarily want to apply the same license, indiscriminately, to all of my posts. I probably want credit/attribution for anything I write here, true, but I might feel differently about commercial re-use of the contents of different posts (although I’m probably deluding myself if I think that my blog has potential for commercial exploitation!).

In point of fact, I actually changed the license on my blog a while ago, to remove the non-commercial use clause from my Creative Commons 2.0 license. I guess this is pretty poor practice as it has, by implication, retrospectively changed the license I applied to past entries. So far, no one has complained…. ;-)

Would it be better practice to attach a license to the text of each post, rather than to the blog as a whole? Is the ‘post’ closer to being a ‘work’ in CC terms? Even better, should I embed the license as a footnote to the content itself? Currently, my CC license declaration is simply an artefact of the user interface I host at http://blog.paulwalk.net/index.php - it doesn’t even appear in the RSS feed. If I licensed each post, rather than the blog as a whole, I could be selective about licensing content (perhaps maintaining a sensible default to avoid unnecessary work). And I could move to a different license later without feeling vaguely guilty. I guess I could include a statement making clear to people who want to post comments on my blog just how their comments are going to be licensed. Or even allow them to select a license themselves….!

It occurred to me that someone might have developed a ‘Creative Commons License plugin’ for Wordpress, the blog engine used to manage this blog. In fact, I found two very easily: WpLicense and the Creative Commons Configurator. However, both of them apply the CC license in a system-wide manner, rather than to each individual post. This is an improvement over my current practice, as the license will show up in the blog’s public RSS feeds for example, but it’s not really what I have in mind. I’m pretty sure I could insert license statements in the necessary templates if it came to it, and maybe code up a plugin to allow me to select from a menu of licenses. However, it occurs to me that I don’t particularly want to use Wordpress as the ‘author’ tool (currently I use Ecto).
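As a rough illustration of per-post licensing with a sensible default, here is a minimal Python sketch. It is not how either plugin works: the `Post` class, the `LICENSES` table and the function names are all made up for illustration, and the license URLs are assumed to follow the usual Creative Commons deed pattern.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical menu of licenses: key -> (display name, assumed deed URL).
LICENSES = {
    "by": ("Creative Commons Attribution 2.0 UK: England & Wales",
           "http://creativecommons.org/licenses/by/2.0/uk/"),
    "by-nc": ("Creative Commons Attribution-NonCommercial 2.0 UK: England & Wales",
              "http://creativecommons.org/licenses/by-nc/2.0/uk/"),
}
DEFAULT_LICENSE = "by"  # blog-wide default, used when a post does not choose


@dataclass
class Post:
    title: str
    body: str
    license_key: Optional[str] = None  # None means "fall back to the default"


def license_footer(post: Post) -> str:
    """Build the license statement for one post."""
    name, url = LICENSES[post.license_key or DEFAULT_LICENSE]
    return f'"{post.title}" is licensed under a {name} License: {url}'


def render(post: Post) -> str:
    """Append the footer to the body, so it travels with the content (e.g. into a feed)."""
    return f"{post.body}\n\n{license_footer(post)}"
```

Because the footer is part of the rendered content rather than the page template, it would also appear wherever the post body is syndicated, and changing the default later would not silently re-license posts that recorded an explicit choice.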

Whatever. I can’t help thinking that attaching a license to a blog feels a little like licensing the deployment of a content management system, rather than the content itself. Anyone care to comment?


Repository architecture #83

Monday, July 7th, 2008

At a JISC workshop last Thursday I was invited to present some ideas around an architecture to support and exploit repositories in the UK. I gave the presentation the title Repository Architecture #83. ;-)

My intention was to suggest some starting principles and then explore how they held up in the face of real-world issues. Here is the slide where I outlined these principles:

[Slide image: presentation.004.png]

I also asked the question: “do we actually need a new architecture?” - suggesting that there is already a ubiquitous & successful architecture supporting much/most/(all?) of the functionality we want from repositories. Taking a resource oriented approach also seems to offer all kinds of advantages. Applying this approach is certainly not a new idea - others have been here before. However, I suggest that the resource oriented approach and the service oriented approach can be most effective when used to complement each other. I think that there is still a place for the institutional repository as the collection of systems which surround what I call the source repository. I define the ‘source repository’ as an (ideally) quite simple system which contains:

  • the resources themselves, individually addressed with HTTP URIs
  • simple, item-level metadata records
  • site-map(s) to aid remote search engines
  • public, HTTP interfaces
  • feeds to notify remote agents of the deposit of new resources in the repository (RSS and/or Atom)
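To make the list above a little more concrete, here is a minimal Python sketch of what such a source repository might expose: a site-map covering every resource and metadata record, and an Atom feed announcing deposits. The `Resource` class, the function names and the example.org URIs used with it are assumptions for illustration only, and the feed omits elements (such as author) that a fully valid Atom feed requires. It also assumes at least one resource and timestamps in a single, sortable ISO 8601 form.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
ATOM_NS = "http://www.w3.org/2005/Atom"


@dataclass
class Resource:
    uri: str           # HTTP URI of the resource itself
    metadata_uri: str  # HTTP URI of its simple, item-level metadata record
    title: str
    deposited: str     # ISO 8601 timestamp of deposit


def sitemap(resources):
    """List every resource and metadata record so search engines can find them."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for r in resources:
        for loc in (r.uri, r.metadata_uri):
            url = ET.SubElement(urlset, "url")
            ET.SubElement(url, "loc").text = loc
    return ET.tostring(urlset, encoding="unicode")


def atom_feed(resources, feed_uri, feed_title):
    """Announce deposits, newest first, linking each resource to its metadata."""
    feed = ET.Element("feed", xmlns=ATOM_NS)
    ET.SubElement(feed, "id").text = feed_uri
    ET.SubElement(feed, "title").text = feed_title
    ET.SubElement(feed, "updated").text = max(r.deposited for r in resources)
    for r in sorted(resources, key=lambda r: r.deposited, reverse=True):
        entry = ET.SubElement(feed, "entry")
        ET.SubElement(entry, "id").text = r.uri
        ET.SubElement(entry, "title").text = r.title
        ET.SubElement(entry, "updated").text = r.deposited
        ET.SubElement(entry, "link", href=r.uri)
        # Using rel="describedby" for the metadata record is a choice made here.
        ET.SubElement(entry, "link", rel="describedby", href=r.metadata_uri)
    return ET.tostring(feed, encoding="unicode")
```

Everything here is plain HTTP and XML; nothing in the sketch needs repository-specific protocols, which is rather the point.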

An ‘institutional’ or ‘subject’ or ‘learning object’ repository contains one or more source repositories plus any systems needed to manage it in its particular context. These larger repositories might be very complex: the important point is that the logical component I call the source repository should be as simple as possible in its public-facing interface: basically a bunch of resources, with an address space. So, a resource is given a Cool URI, and a (probably) simple metadata record is made available, also as a resource with a URI. I suggested that an ORE resource map could be used to relate metadata record to resource - from the point of view of the web or ORE, a metadata record is a resource just like, for example, a PDF of a scholarly paper. Elsewhere, richer metadata might be created through mechanisms ranging from automatic metadata creation, to further human effort which might be in the nature of traditional cataloguing by trained and motivated individuals, or ‘crowd-sourced’ tagging by untrained but still motivated people.

Complexity is introduced, where necessary, in services developed to manage and exploit resources held in source repositories. Crucially, such activity does not happen unless there is a clear incentive for it, and then it happens close to the point of incentive. As an example, if a particular domain has a strong need to classify papers then someone might go to the trouble of harvesting, aggregating and text-mining the text of these papers with a view to extracting terms to use for classification. Or something similar might be achieved through the application of a team of professional cataloguers using an agreed vocabulary. However it is done, the new metadata thus created could be made available as a web resource where it could be used and combined with other resources as required.

I was asked to illustrate this with a few diagrams, which provoked a fair amount of discussion.

[Diagrams: deposit.png, discovery.png]

The point was made, strongly, that it is subject repositories which have the content, rather than institutional repositories. Regardless of whether this is, or will continue to be, true, I think the architectural principles hold up. The business drivers are, I guess, quite different!

I learned a lot from the workshop and had some of these ideas challenged quite robustly. I think they held up but the clarity of presentation could be improved - this is what I will be working on now.


Making digitised content available for searching and harvesting (2)

Monday, April 28th, 2008

Back in February I was asked to give a talk to the JISC Digitisation Programme meeting. I blogged about this shortly beforehand asking for comments and suggestions. The response was fantastic - I received a bunch of great suggestions and incorporated many of them into the presentation. Everyone who commented got a public ‘thankyou’ at the event, and I included all names in the slides I used.

I have finally gotten around to making the slides available (someone who was at the meeting has asked for them so they made some sort of impression with someone!).

Thanks again.


Digital library pipeline for a million books.

Sunday, March 16th, 2008

I was pleased to be invited by Brian Fuchs to a ‘Million Books Workshop’ at Imperial College, London last Friday. A fascinating day, in the company of what was, for me, an unusual group of 20-30 linguists, classical scholars and computer scientists. The morning session consisted of three presentations (following an introduction from Gregory Crane which I missed thanks to the increasingly awful transport system between London and the South West) which brought us up to speed with some advances in OCR, computer aided text analysis and translation, and classification. The presentations were intended to form an ordered progression:

  1. From Image to Text: OCR and Mass Digitisation (Dr. Thomas Breuel, DFKI and Technical University Kaiserslautern)
  2. From Text to Information: Machine Translation and Syntax Recognition (David Smith, Johns Hopkins University, & David Bamman, Perseus Project)
  3. From Information to Learning: Machine Learning and Classification Techniques (David Mimno, U Mass, Amherst)

Listening to these presentations, I quickly found myself well outside of my comfort zone, in terms of both the science and the domain (classical literature), so it was a challenging and exhilarating morning! It was difficult to take comprehensive notes as I had to really concentrate on the presentations at times in order to follow them - the pace was pretty smart, with jargon and ‘in jokes’ galore.

David Smith (Johns Hopkins University) gave a fascinating and entertaining presentation which outlined some of the challenges, and advances, in language parsing and translation. He pointed out that although the structured view of the semantic web is a seductive one, even the newer online, digital genres such as email and blogs mostly use unstructured or semi-structured text. However, parsing free text is very difficult, especially with the growing scale and diversity of texts available on the web. To illustrate this he employed a series of (sometimes amusing) translations from the Google translation service. The best available technology today uses supervised machine learning techniques to build a treebank. An alternative approach employs semi-supervised modelling techniques. Parallel texts in different languages are useful but, for some languages, only the Universal Declaration of Human Rights exists as a parallel text! As an aside, David pointed to the potential advantage in search engines searching several languages: if you enter your query in English, for example, then by searching resources in other languages the search engine automatically expands the search, utilises synonyms etc. ‘for free’. This can then be more effective than monolingual searching. David offered a future based in pragmatism: translation support rather than fluent translation.

David Mimno presented on classification, sequences & topic modelling. In an interesting talk, it was the visualisation (as a topical transfer graph) of topic relations extrapolated from citations in a set of scholarly communications which really got the audience engaged - a series of questions ensued before David could move on. He illustrated his work with accessible examples: for example, it turned out from one experiment that the single term most likely to identify email spam was, believe it or not, the word “red” showing in the markup, owing to the fact that “only spammers use red text”! Apparently, he had a system which could classify any of Shakespeare’s plays as tragedy, comedy or history…. with the exception of Romeo and Juliet, which comes out as a comedy for some reason….

The takeaway for me was that some of the technology in these spaces is maturing. Thomas Breuel, for instance, made a compelling case for really effective OCR (Optical Character Recognition) in his description of the open-source OCRopus project, which he leads and which is sponsored by Google. Building on previous systems like the character recognition system Tesseract, OCRopus employs a modular design with components which offer the following workflow, focussing on the processing of scanned books:

layout analysis -> isolated character recognition -> statistical language modelling -> text
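The shape of that workflow, a chain of stages where each consumes the output of the one before, can be shown with a toy Python sketch. To be clear, this is not OCRopus code or its API: every function name is invented, the ‘page’ is just a string rather than an image, the recogniser merely simulates the familiar confusion between ‘rn’ and ‘m’, and the ‘language model’ is a small made-up table of word counts.

```python
from functools import reduce

# Made-up word counts standing in for a statistical language model.
WORD_COUNTS = {"the": 500, "modern": 50, "library": 40, "modem": 5}


def layout_analysis(page):
    """Toy stage: split a 'page' (a plain string here) into lines of tokens."""
    return [line.split() for line in page.splitlines() if line.strip()]


def character_recognition(lines):
    """Toy stage: offer alternative readings wherever an 'm' might really be 'rn'."""
    def candidates(token):
        alts = [token]  # the reading as seen comes first
        for i, ch in enumerate(token):
            if ch == "m":
                alts.append(token[:i] + "rn" + token[i + 1:])
        return alts
    return [[candidates(tok) for tok in line] for line in lines]


def language_modelling(lines):
    """Toy stage: keep the candidate the word counts rate most likely.

    max() returns the first of equally rated candidates, so an unknown
    word is left as it was read.
    """
    return [[max(c, key=lambda w: WORD_COUNTS.get(w, 0)) for c in line]
            for line in lines]


def to_text(lines):
    """Final stage: join the chosen words back into plain text."""
    return "\n".join(" ".join(line) for line in lines)


STAGES = (layout_analysis, character_recognition, language_modelling, to_text)


def run_pipeline(page, stages=STAGES):
    """Feed the page through each stage in order."""
    return reduce(lambda data, stage: stage(data), stages, page)
```

Because the stages only agree on what they pass to each other, any one of them could be swapped for a better implementation without touching the rest, which is presumably part of the appeal of a modular design.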

The project is heading towards a beta release this year, and the team plan to create a deployment ‘bundle’ in the form of an Amazon EC2 AMI. I didn’t quite catch the details but I think they have found a way to monetise this through the Amazon referral program, which sounds interesting. In any case, the idea is that one could take the AMI, deploy it, run it for a few hours to process a particular scan, and then shut it down again - potentially a very cost-effective way of proceeding. Thomas made the point that, as OCR technology continues to improve, we are likely to want to process scans of books several times. He explained how the project was aiming towards a “full digital library pipeline”, a system which could be deployed from a connected laptop: with the new affordability of powerful digital cameras, a researcher might photograph a book’s pages themselves before feeding the resulting images into the OCRopus workflow (OCRopus can handle the distortion effect of non-flat pages very effectively). Another interesting aspect of this work is the distributed parallel training which underpins the statistical language modelling: a large model is achieved by combining many little models created by many people, through the web. If you are interested in this area, then you should also check out the hOCR format specification and tools.

I had been invited to this workshop because of my role and interest in the deployment of services at a community and network level. I joined a panel at the very end of the day where we were invited to consider what services and infrastructure might be required, in the UK, to support the digitisation and useful processing of a ‘million books’. We didn’t get very far with this because we had run out of time and, I suspect, energy by this point, but the question remains…. I’ll be picking this up with some colleagues in due course.

Fascinating day, and topped off with a quick pint standing outside a packed London pub in a light drizzle, which was actually a refreshing and pleasant way to conclude!


A minor response to Repositories thru the looking glass

Thursday, February 14th, 2008

In Repositories thru the looking glass over on the eFoundations blog, Andy Powell gives a summary of a keynote he gave to the Vala Conference last week. It’s interesting stuff, and I will take the time to look at the presentation slides as well. I mostly agree (vehemently in some instances) with Andy’s points, though I do find myself questioning some parts of this, so I’ll quote some snippets and make a few comments here.

Firstly, that our current preoccupation with the building and filling of ‘repositories’ (particularly ‘institutional repositories’) rather than the act of surfacing scholarly material on the Web means that we are focusing on the means rather than the end (open access)

It’s hard to deny that there is a current preoccupation with establishing repository systems of one kind or another and populating them with content, and also that there is a focus on institutional deployments. However, I’m not convinced that open access is (or at least is going to remain) the sole driver behind the development of institutional repositories. From an institutional perspective, it absolutely makes sense to want to manage the outputs of research conducted within the auspices of that institution.

A common use for an institutional repository is to house eprints. Were it not for the open-access imperative, we might have expected software designed to manage eprints to fall somewhere between a document-management and a content-management system - both familiar to a large number of institutions. I think it is interesting that it might be considered to be open-access which has skewed the development of repository software in some respects - the community has largely started from scratch, building repository software, where it might have made more sense to simply adapt what was there.

So I half agree with Andy - we do seem to be focussed on the means, but I think I am sympathetic to those (institutions at least) who find themselves pre-occupied with this.

Secondly, that our focus on the ‘institution’ as the home of repository services is not aligned with the social networks used by scholars, meaning that we will find it very difficult to build tools that are compelling to those people we want to use them. As a result, we resort to mandates and other forms of coercion in recognition that we have not, so far, built services that people actually want to use. We have promoted the needs of institutions over the needs of individuals.
Instead, we need to focus on building and/or using global scholarly social networks based on global repository services.

There are four sentences here, and I completely agree with the first three and a half! I find myself wondering who ‘we’ are in this. Now that institutional repositories are becoming a reality, the ‘we’ is going to expand to include people who simply have institutional interests - who have no real interest in open-access for example beyond it being a requirement for them to support. The MIS Manager of your average institution, for example, will start to get involved once institutional repositories get embedded into the business which is a university. The half sentence I don’t quite buy is the “global repository services”. Why can’t we “focus on building and/or using global scholarly social networks” (which I support) based on institutional repository services? We don’t have a problem with institutional web sites do we? Or institutional library OPACs? We have certainly managed to network the latter on a global scale, and built interesting services around this….

Finally, that the ‘service oriented’ approaches that we have tended to adopt in standards like the OAI-PMH, SRW/SRU and OpenURL sit uncomfortably with the ‘resource oriented’ approach of the Web architecture and the Semantic Web. We need to recognise the importance of REST as an architectural style and adopt a ‘resource oriented’ approach at the technical level when building services.

Absolutely - couldn’t agree more. Yesterday, at a JISC committee meeting, I argued that a resource-oriented-architecture and the service-oriented-approaches being encouraged by the e-Framework could complement each other if intelligently and judiciously applied. Incidentally, last Friday, I attended an excellent CRIG workshop devoted to exploring the relevance of ReST to repositories. Matt Zumwalt of MediaShelf showed a working ReST interface on Fedora, and Oxford University’s Ben O’Steen used this to develop a client app, in real time, in Python.

I think we agree that the individual’s interests may often be orthogonal to those of the institution. This may have always been the case but it is, perhaps, increasingly an issue as recent developments and trends on the Web empower the individual at an accelerating rate. I wonder if the user-centric/institutional/global debate around repositories is just symptomatic of a tension about to become apparent all over the (institutional) Web?

Having said all this, when visiting the outer limits of repository software development, I am occasionally reminded of the Knight:

‘I see you’re admiring my little box,’ the Knight said in a friendly tone. ‘It’s my own invention - to keep clothes and sandwiches in. You see I carry it upside-down, so that the rain can’t get in.’
‘But the things can get OUT,’ Alice gently remarked. ‘Do you know the lid’s open?’

(from Alice Through the Looking Glass, via Project Gutenberg)


Finely tuned antennae

Monday, December 17th, 2007

The discovery to delivery hook-line has been used for a while to describe a goal of those information services which support the academic researcher. The challenge to academic libraries, national information services etc. has been to support the researcher from the moment they begin the process of searching to the delivery of the digital or physical artefact which satisfies their enquiry.

Lately, I’ve been thinking about discovery to delivery, wondering why it just doesn’t quite work for me. I’ve been preoccupied with this mainly because I was invited to devise a diagram to express discovery to delivery - an architecture if you will - and found myself either focussing on discovery or delivery, but not really both together.

I think it is the way in which it implies a ‘round-trip’ which bothers me. It sounds synchronous - almost ‘client-server’. What it is missing in particular, I think, is any notion of how the researcher has registered their interest. There is, perhaps, an implication that the researcher has just initiated a search operation.

At a meeting last week involving, among others, JISC and some of the services they fund, as well as the British Library and CURL, I gave a short presentation (working to a ‘maximum 5 slides’ rule) outlining the idea that it might be interesting to consider the proposition that more and more of our information is being delivered to us without it having been explicitly asked for, and that there might be an interesting model in this for the next generation of services supporting scholarly research.

I considered the fact that, like many, I’m engaged in a sort of continuous, low-level, background research activity. Firstly, I have registered my interests explicitly with a number of online services, and receive regular deliveries of content which is often useful. Secondly, I have registered interests in a less explicit way by choosing to subscribe to the output of a number of academic and non-academic bloggers. Thirdly, some of the systems with which I am registered are starting to make recommendations to me about stuff I might want to look at. This is the technique used most prominently by Amazon, where the system offers suggestions of other items I might be interested in (‘recommendations’) using its database which relates me to other users and to our respective activities in the system.
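For what it’s worth, the general idea of relating me to other users through our respective activities can be sketched in a few lines of Python. This is a deliberately naive co-occurrence scheme and makes no claim to resemble what Amazon actually does; the `recommend` function and the shape of the `activity` data (a mapping from user to the set of items they have touched) are assumptions for illustration.

```python
from collections import Counter


def recommend(user, activity, limit=3):
    """Suggest items that users with overlapping activity have and `user` lacks.

    Each candidate item is scored by how much its owner's activity overlaps
    with the user's own; ties are broken alphabetically.
    """
    mine = activity[user]
    scores = Counter()
    for other, items in activity.items():
        if other == user:
            continue
        overlap = len(mine & items)
        if overlap == 0:
            continue  # nothing in common, so no evidence either way
        for item in items - mine:
            scores[item] += overlap
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:limit]]
```

A usage example, with invented data:

```python
activity = {
    "me": {"paper-a", "paper-b"},
    "colleague": {"paper-a", "paper-b", "paper-c"},
    "stranger": {"paper-b", "paper-d"},
}
print(recommend("me", activity))  # ['paper-c', 'paper-d']
```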

Note that I’m not explicitly seeking particular content here - I’m establishing finely-tuned-antennae to catch useful intelligence. The fine-tuning is a continuous ongoing activity but, importantly, not all of it is conscious, and not all of it is initiated by me. Currently this sort of thing is still done in a fairly passive way - I go to Amazon for example with the intention of making a purchase and Amazon tries to tempt me with what is, essentially, targeted advertising. We might not want Amazon to actively ‘send’ us suggestions when its algorithms detect a possible sale to be made. But imagine this model applied to a repository, or a library system. At the meeting last week we considered this scenario - and what might be possible if various services were ‘joined up’ and able to share networks of users and preferences. It seems to me that the ultimate utility of this kind of system is when it feeds useful stuff to me that I didn’t previously know I was interested in. I sometimes discover the things that it turns out I should be interested in this way.

I like the notion of ‘gestures’, recently popularised by Steve Gillmor to describe these ways in which our interests are communicated to others, registered by systems, or mined from transaction logs. As I go about my professional life, I make these gestures or indications of interest, and I ensure that my personal information system is tuned to catch the responses from these gestures. My current toolset for this is based primarily around RSS-based harvesting and subscription, but it is not limited to this.

Of course, if this type of activity continues to grow apace, then the problem of managing information discovery remains; it is just transferred closer to the researcher/user. In fact, the activity of discovery follows the (semi)automatic delivery.

Perhaps there is a new model, complementary to the first:

gesture -> delivery -> discovery

where the different elements happen asynchronously.
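A small Python sketch may help show how the three steps are decoupled. The `Antennae` class and its methods are invented for illustration, and matching on words in a title is a crude stand-in for whatever a real system would do with my gestures; the point is only that each step can happen at a different time and be triggered by a different party.

```python
from collections import deque


class Antennae:
    """Toy personal information system: gesture -> delivery -> discovery."""

    def __init__(self):
        self.interests = set()  # terms registered through gestures
        self.inbox = deque()    # delivered items awaiting discovery

    def gesture(self, *terms):
        """Register interest; can happen at any time, explicitly or implicitly."""
        self.interests.update(term.lower() for term in terms)

    def deliver(self, item):
        """Called by some remote service or feed harvester when a new item appears.

        The item is kept only if its title shares a word with a registered interest.
        """
        words = set(item["title"].lower().split())
        if words & self.interests:
            self.inbox.append(item)
            return True
        return False

    def discover(self):
        """Called later, whenever the researcher chooses to look; empties the inbox."""
        items = list(self.inbox)
        self.inbox.clear()
        return items
```

Nothing in `deliver` waits on the researcher, and nothing in `discover` triggers a search; in that sense no step is a ‘round-trip’.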

Is ‘discovery to delivery’ sufficient any more?

