Archive for the ‘Information Management’ Category

Library hackers FTW

Friday, November 28th, 2008

Yesterday I went along to Mashed Library UK 2008 in London. Quickly abbreviated to ‘mashlib’, the event was the brain-child of Owen Stephens. Owen did most of the organising, aided by David Flanders who provided the space at Birkbeck College, and our excellent events team at UKOLN. The event was sponsored by UKOLN, using funding from the JISC.

I thought the balance of activities on the day was excellent – a healthy mixture of short presentations, demonstrations and a good amount of hands-on hacking. The group comprised commercial vendors (Talis, ExLibris, OCLC), academic-library folk (the majority), a lone representative from the public library world (Paul Bevan for the National Library of Wales), and a few developers from various (mostly JISC-funded) services.

Rob Styles from Talis gave us a demo of the Talis Platform. There is an open API which you can play with – it’s quite impressive. I was very struck by some of the language Rob used in his demo – he talked about dipping, where a result-set from a query (in RSS 1.0 format) is “dipped into” another – with the original data-set accreting more information from the second. (Jim Downing and I had an interesting chat about this over lunch, with Jim proposing that we could visualise data-sets as molecules – having a certain shape which allows them to bond with other molecules which have a complementary shape). Rob also talked about ‘mixing in’ in a similar vein. The Talis Platform APIs appear to be quite RESTful, with a good deal of passing URLs around rather than result-sets. I plan to have a closer look at this.
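To make the ‘dipping’ idea a little more concrete for myself, here is a minimal sketch in Python. It is not the Talis Platform API: the function name `dip`, the record fields and the sample data are all hypothetical, and I am assuming that both result-sets are simple lists of records sharing a URI as their key.

```python
def dip(primary, secondary, key="uri"):
    """Return copies of the primary records, each enriched with any extra
    fields found for the same key in the secondary records.

    Illustrative only: this is not the Talis Platform API.
    """
    lookup = {record[key]: record for record in secondary}
    enriched = []
    for record in primary:
        merged = dict(record)  # copy, so the original result-set is untouched
        for field, value in lookup.get(record[key], {}).items():
            # Fields already in the original win; only new fields accrete.
            merged.setdefault(field, value)
        enriched.append(merged)
    return enriched


# Hypothetical sample data, invented for this sketch
books = [{"uri": "http://example.org/book/1", "title": "An Example Title"}]
holdings = [{"uri": "http://example.org/book/1", "library": "Example Library"}]
result = dip(books, holdings)
```

The original record keeps its own fields and simply picks up whatever the second data-set knows about the same URI, which is roughly the ‘accreting’ behaviour as I understood it from the demo.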

Timm-Martin Siewert spoke next about the ExLibris Open Platform. I did get a URL for this but it takes me to a page which challenges me for a username and password which I do not have. The Open Platform is, apparently, open to paying customers only. Edward Corrado suggested via a tweet that:

I think they mean open in the sense of the open systems movement of about 20 years ago

Next up was Mark Alcock, standing in for Tim McCormick and representing OCLC, to talk about the WorldCat Developer Network. Mark came armed with a bunch of limited-life API keys, so that people could try out some of the WorldCat services. OCLC appear to be offering a spectrum of services, from the commercial pay-for-use variety, to the ‘affiliate’ model – i.e. form a business partnership with us and use our services, to some free services. I’m interested in several of the WorldCat services but am wary of getting too fond of something I cannot, in the end, afford to use. Unfortunately, I did not get time on the day to make use of Mark’s API keys.

I noted that the three vendors represented seem to be spaced evenly along a spectrum of openness, with Talis at the ‘very open’ end of the spectrum, ExLibris at the ‘closed’ end, and OCLC (specifically WorldCat) somewhere in between. I can’t yet see how Talis are going to monetise the completely open model, and I think ExLibris will certainly need to open up somewhat. Perhaps OCLC have hit a sweet-spot of openness? I really don’t know enough about these services in detail, but I noticed some comments from Dorothea Salo which are somewhat critical about the business model behind WorldCat.

Ashley Sanders followed, with a quick description of an Atom (APP) based object store he is developing as part of his work extending the COPAC service. I’m following COPAC developments with interest – I’m very much in favour of the general direction they seem to be taking (I recently blogged about one aspect of this).

Tony Hirst, mashup maestro, gave a tour-de-force demonstration of using Yahoo Pipes and Google Spreadsheets as mashup tools. This went down very well with the technically-minded-but-mostly-not-developers group – especially Yahoo Pipes. I gave a presentation at the Shock of the Social in March 07 where I remarked that the potential of Yahoo Pipes was to do for web development what the spreadsheet did for non-web development before it (Microsoft Excel has been described as the most widely used Integrated Development Environment). Tony showed us how the spreadsheet is certainly relevant in a web-mashup world with his demonstrations of using Google Spreadsheets to mash up data-feeds.

Later on, after lunch, the group got down to some general hackery. On Twitter, Chris Awre (who wasn’t at the event but had been following comments on Twitter) remarked:

Silence from #mashlib08 this afternoon. The mashing must be going well…

And he was right! There was a fair stream of Twitter commentary in the morning – but it dried up as people got absorbed in hacking code and testing interfaces. I saw people exploring the Talis Platform and, in particular, Yahoo Pipes. I expect there will be some blogging about this activity – look out for the official tag:

mashlib08

Andrew McGregor of JISC has already written up his experience of this, as has Jo Alcock – I think these posts describe representative experiences of the event.

Paul Bevan rounded off proceedings with a view from public libraries – the National Library of Wales to be precise. I learned a lot from this presentation about the unique challenges facing the public non-academic sector.

I thoroughly enjoyed the day – kudos to Owen for getting the right balance of people, subjects and activities. There was a ‘buzz’ generated as the day went on which was excellent. I have been to a fair number of ‘hacker’ events where the emphasis is on the tools and the running code – I generally enjoy this kind of thing. But mashlib08 was different – what was really good about this day was that the enthusiasm came from doing stuff with information, more than from the actual development.

I think Tony Hirst deserves a special tip o’ the hat for firing up a real enthusiasm for mashups on the day.

We should definitely do this again!

Blog commons?

Friday, July 25th, 2008

You may have noticed that I have included a statement on this blog’s ‘home-page’ to the effect that:

This work is licensed under a Creative Commons Attribution 2.0 UK: England & Wales License.

This is standard blurb from the Creative Commons (CC) site. In the context of my blog this means – well, what exactly? Feel free to use anything you find here, for whatever purpose you like, so long as you credit me? What about material I include from elsewhere? What about other people’s comments on my posts? It seems to me that this just isn’t clear enough….

And another problem – I don’t necessarily want to apply the same license, indiscriminately, to all of my posts. I probably want credit/attribution for anything I write here, true, but I might feel differently about commercial re-use of the contents of different posts (although I’m probably deluding myself if I think that my blog has potential for commercial exploitation!).

In point of fact, I actually changed the license on my blog a while ago, to remove the non-commercial use clause from my Creative Commons 2.0 license. I guess this is pretty poor practice as it has, by implication, retrospectively changed the license I applied to past entries. So far, no one has complained…. ;-)

Would it be better practice to attach a license to the text of each post, rather than to the blog as a whole? Is the ‘post’ closer to being a ‘work’ in CC terms? Even better, should I embed the license as a footnote to the content itself? Currently, my CC license declaration is simply an artefact of the user interface I host at http://blog.paulwalk.net/index.php – it doesn’t even appear in the RSS feed. If I licensed each post, rather than the blog as a whole, I could be selective about licensing content (perhaps maintaining a sensible default to avoid unnecessary work). And I could move to a different license later without feeling vaguely guilty. I guess I could include a statement making clear to people who want to post comments on my blog just how their comments are going to be licensed. Or even allow them to select a license themselves….!
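As a rough sketch of what per-post licensing with a sensible default might look like, here is a small Python model. It has nothing to do with how Wordpress actually stores posts: the `Post` class, its fields and the `render_for_feed` method are all hypothetical names of my own.

```python
from dataclasses import dataclass
from typing import Optional

# Blog-wide default, used only when a post does not declare its own license.
DEFAULT_LICENSE = "Creative Commons Attribution 2.0 UK: England & Wales"


@dataclass
class Post:
    """Hypothetical model of a blog post carrying its own license."""
    title: str
    body: str
    license: Optional[str] = None  # None means "fall back to the default"

    def effective_license(self) -> str:
        return self.license or DEFAULT_LICENSE

    def render_for_feed(self) -> str:
        # The license travels with the content, as a footnote,
        # so it would appear in a feed as well as on the web page.
        return f"{self.body}\n\nLicense: {self.effective_license()}"


default_post = Post("Blog commons?", "Some text.")
special_post = Post("Slides", "Some other text.", license="All rights reserved")
```

Changing `DEFAULT_LICENSE` later would only affect posts that never declared a license of their own, which is still a retrospective change for those posts, so the default probably ought to be written into each post at the time of publishing.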

It occurred to me that someone might have developed a ‘Creative Commons License plugin’ for Wordpress, the blog engine used to manage this blog. In fact, I found two very easily, WpLicense and the Creative Commons Configurator. However, both of them apply the CC license in a system-wide manner, rather than to each individual post. This is an improvement over my current practice, as the license will show up in the blog’s public RSS feeds for example, but it’s not really what I have in mind. I’m pretty sure I could insert license statements in the necessary templates if it came to it, and maybe code up a plugin to allow me to select from a menu of licenses. However, it occurs to me that I don’t particularly want to use Wordpress as the ‘author’ tool (currently I use Ecto).

Whatever. I can’t help thinking that attaching a license to a blog feels a little like licensing the deployment of a content management system, rather than the content itself. Anyone care to comment?

Making digitised content available for searching and harvesting (2)

Monday, April 28th, 2008

Back in February I was asked to give a talk to the JISC Digitisation Programme meeting. I blogged about this shortly beforehand asking for comments and suggestions. The response was fantastic – I received a bunch of great suggestions and incorporated many of them into the presentation. Everyone who commented got a public ‘thankyou’ at the event, and I included all names in the slides I used.

I have finally gotten around to making the slides available (someone who was at the meeting has asked for them so they made some sort of impression with someone!).

Thanks again.

Digital library pipeline for a million books.

Sunday, March 16th, 2008

I was pleased to be invited by Brian Fuchs to a ‘Million Books Workshop’ at Imperial College, London last Friday. A fascinating day, in the company of what was, for me, an unusual group of 20-30 linguists, classical scholars and computer scientists. The morning session consisted of three presentations (following an introduction from Gregory Crane which I missed thanks to the increasingly awful transport system between London and the South West) which brought us up to speed with some advances in OCR, computer aided text analysis and translation, and classification. The presentations were intended to form an ordered progression:

  1. From Image to Text: OCR and Mass Digitisation (Dr. Thomas Breuel, DFKI and Technical University Kaiserslautern)
  2. From Text to Information: Machine Translation and Syntax Recognition (David Smith, Johns Hopkins University, & David Bamman, Perseus Project)
  3. From Information to Learning: Machine Learning and Classification Techniques (David Mimno, U Mass, Amherst)

Listening to these presentations, I quickly found myself well outside of my comfort zone, in terms of both the science and the domain (classical literature), so it was a challenging and exhilarating morning! It was difficult to take comprehensive notes as I had to really concentrate on the presentations at times in order to follow them – the pace was pretty smart, with jargon and ‘in jokes’ galore.

David Smith, Johns Hopkins University, gave a fascinating and entertaining presentation which outlined some of the challenges, and advances, in language parsing and translation. He pointed out that although the structured view of the semantic web is a seductive one, even the newer online, digital genres such as email and blogs mostly use unstructured or semi-structured text. However, parsing free text is very difficult, especially with the growing scale and diversity of texts available on the web. To illustrate this he employed a series of (sometimes amusing) translations from the Google translation service. The best available technology today uses supervised machine learning techniques to build a treebank. An alternative approach employs semi-supervised modelling techniques. Parallel texts in different languages are useful but, for some languages, only the Universal Declaration of Human Rights exists as a parallel text! As an aside, David pointed to the potential advantage in search engines searching several languages: if you enter your query in English for example, by searching resources in other languages, the search engine automatically expands the search, utilises synonyms etc. ‘for free’. This can then be more effective than monolingual searching. David offered a future based in pragmatism: translation support rather than fluent translation.

David Mimno presented on classification, sequences & topic modelling. In an interesting talk, it was the visualisation (as a topical transfer graph) of topic relations extrapolated from citations in a set of scholarly communications which really got the audience engaged – a series of questions ensued before David could move on. He illustrated his work with accessible examples: for example, it turned out from one experiment that the single term most likely to identify email spam was, believe it or not, the word “red” showing in the markup, owing to the fact that “only spammers use red text”! Apparently, he had a system which could classify any of Shakespeare’s plays as tragedy, comedy or history…. with the exception of Romeo and Juliet, which comes out as a comedy for some reason….

The takeaway for me was that some of the technology in these spaces is maturing. Thomas Breuel, for instance, made a compelling case for really effective OCR (Optical Character Recognition) in his description of the open-source OCRopus project, which he leads and which is sponsored by Google. Building on previous systems like the character recognition system Tesseract, OCRopus employs a modular design with components which offer the following workflow, focussing on the processing of scanned books:

layout analysis -> isolated character recognition -> statistical language modelling -> text

The project is heading towards a beta release this year, and the team plan to create a deployment ‘bundle’ in the form of an Amazon EC2 AMI. I didn’t quite catch the details but I think they have found a way to monetise this through the Amazon referral program, which sounds interesting. In any case, the idea is that one could take the AMI, deploy it, run it for a few hours to process a particular scan, and then shut it down again – potentially a very cost-effective way of proceeding. Thomas made the point that, as OCR technology continues to improve, we are likely to want to process scans of books several times. He explained how the project was aiming towards a “full digital library pipeline”, a system which could be deployed from a connected laptop: with the new affordability of powerful digital cameras, a researcher might photograph a book’s pages themselves before feeding the resulting image into the OCRopus workflow (OCRopus can handle the distortion effect of non-flat pages very effectively). Another interesting aspect of this work is the distributed parallel training which underpins the statistical language modelling: a large model is achieved by combining many little models created by many people, through the web. If you are interested in this area, then you should also check out the hOCR format specification and tools.
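I don’t know how OCRopus actually represents or combines its language models, so the following Python sketch is only my own illustration of the general idea of building a large model from many little ones. It assumes the simplest possible model, a table of word-pair (bigram) counts, and every function name and the sample sentences are invented for the example.

```python
from collections import Counter


def train_bigram_counts(text):
    """Build a 'little model': counts of adjacent word pairs in one text."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))


def combine(models):
    """Build a 'large model' by summing the counts of many little ones."""
    total = Counter()
    for model in models:
        total.update(model)  # Counter.update adds counts together
    return total


def most_likely_next(model, word):
    """Return the word seen most often after `word`, or None if unseen."""
    candidates = {nxt: n for (first, nxt), n in model.items() if first == word}
    return max(candidates, key=candidates.get) if candidates else None


# Two contributors each train a small model on their own (toy) text.
little_models = [
    train_bigram_counts("the cat sat on the mat"),
    train_bigram_counts("the cat ate the fish"),
]
big_model = combine(little_models)
```

Because count tables simply add, the little models can be trained independently, in any order and by different people, which is what makes this kind of distributed training attractive. A recogniser unsure of a smudged word after “the” could then ask `most_likely_next(big_model, "the")` for the most plausible candidate.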

I had been invited to this workshop because of my role and interest in the deployment of services at a community and network level. I joined a panel at the very end of the day where we were invited to consider what services and infrastructure might be required, in the UK, to support the digitisation and useful processing of a ‘million books’. We didn’t get very far with this because we had run out of time and, I suspect, energy by this point, but the question remains…. I’ll be picking this up with some colleagues in due course.

Fascinating day, and topped off with a quick pint standing outside a packed London pub in a light drizzle, which was actually a refreshing and pleasant way to conclude!

A minor response to Repositories thru the looking glass

Thursday, February 14th, 2008

In Repositories thru the looking glass over on the eFoundations blog, Andy Powell gives a summary of a keynote he gave to the Vala Conference last week. It’s interesting stuff, and I will take the time to look at the presentation slides as well. I mostly agree (vehemently in some instances) with Andy’s points, though I do find myself questioning some parts of this, so I’ll quote some snippets and make a few comments here.

Firstly, that our current preoccupation with the building and filling of ‘repositories’ (particularly ‘institutional repositories’) rather than the act of surfacing scholarly material on the Web means that we are focusing on the means rather than the end (open access)

It’s hard to deny that there is a current preoccupation with establishing repository systems of one kind or another and populating them with content, and also that there is a focus on institutional deployments. However, I’m not convinced that open access is (or at least is going to remain) the sole driver behind the development of institutional repositories. From an institutional perspective, it absolutely makes sense to want to manage the outputs of research conducted within the auspices of that institution.

A common use for an institutional repository is to house eprints. Were it not for the open-access imperative, we might have expected software designed to manage eprints to fall somewhere between a document-management and a content-management system – both familiar to a large number of institutions. I think it is interesting that it might be considered to be open-access which has skewed the development of repository software in some respects – the community has largely started from scratch, building repository software, where it might have made more sense to simply adapt what was there.

So I half agree with Andy – we do seem to be focussed on the means, but I think I am sympathetic to those (institutions at least) who find themselves pre-occupied with this.

Secondly, that our focus on the ‘institution’ as the home of repository services is not aligned with the social networks used by scholars, meaning that we will find it very difficult to build tools that are compelling to those people we want to use them. As a result, we resort to mandates and other forms of coercion in recognition that we have not, so far, built services that people actually want to use. We have promoted the needs of institutions over the needs of individuals.
Instead, we need to focus on building and/or using global scholarly social networks based on global repository services.

There are four sentences here, and I completely agree with the first three and a half! I find myself wondering who ‘we’ are in this. Now that institutional repositories are becoming a reality, the ‘we’ is going to expand to include people who simply have institutional interests – who have no real interest in open-access for example beyond it being a requirement for them to support. The MIS Manager of your average institution, for example, will start to get involved once institutional repositories get embedded into the business which is a university. The half sentence I don’t quite buy is the “global repository services”. Why can’t we “focus on building and/or using global scholarly social networks” (which I support) based on institutional repository services? We don’t have a problem with institutional web sites do we? Or institutional library OPACs? We have certainly managed to network the latter on a global scale, and built interesting services around this….

Finally, that the ‘service oriented’ approaches that we have tended to adopt in standards like the OAI-PMH, SRW/SRU and OpenURL sit uncomfortably with the ‘resource oriented’ approach of the Web architecture and the Semantic Web. We need to recognise the importance of REST as an architectural style and adopt a ‘resource oriented’ approach at the technical level when building services.

Absolutely – couldn’t agree more. Yesterday, at a JISC committee meeting, I argued that a resource-oriented-architecture and the service-oriented-approaches being encouraged by the e-Framework could complement each other if intelligently and judiciously applied. Incidentally, last Friday, I attended an excellent CRIG workshop devoted to exploring the relevance of ReST to repositories. Matt Zumwalt of MediaShelf showed a working ReST interface on Fedora, and Oxford University’s Ben O’Steen used this to develop a client app, in real time, in Python.
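To illustrate the difference in style, here is a small Python sketch which builds two addresses for the same record. The `verb`, `identifier` and `metadataPrefix` parameters are those defined by OAI-PMH for a `GetRecord` request; everything else (the host name, the `/oai` and `/items/` paths, and the function names) is hypothetical and not taken from Fedora or any other real repository.

```python
from urllib.parse import urlencode

BASE = "http://repository.example.org"  # hypothetical repository host


def oai_pmh_get_record(identifier, metadata_prefix="oai_dc"):
    """Service-oriented style: a single endpoint, with the operation
    (the 'verb') and its arguments passed as query parameters."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{BASE}/oai?{query}"


def resource_url(local_id):
    """Resource-oriented style: the record has a URL of its own,
    which can be fetched, linked to and passed around directly."""
    return f"{BASE}/items/{local_id}"


service_style = oai_pmh_get_record("oai:repository.example.org:1234")
resource_style = resource_url("1234")
```

In the first case the thing being addressed is really the service, and the record is an argument to it; in the second the record is a first-class citizen of the Web. The two can live side by side on the same repository, which is the sense in which I think the approaches could complement each other.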

I think we agree that the individual’s interests may often be orthogonal to those of the institution. This may have always been the case but it is, perhaps, increasingly an issue as recent developments and trends on the Web empower the individual at an accelerating rate. I wonder if the user-centric/institutional/global debate around repositories is just symptomatic of a tension about to become apparent all over the (institutional) Web?

Having said all this, when visiting the outer limits of repository software development, I am occasionally reminded of the Knight:

‘I see you’re admiring my little box.’ the Knight said in a friendly tone. `It’s my own invention — to keep clothes and sandwiches in. You see I carry it upside-down, so that the rain can’t get in.’
‘But the things can get OUT,’ Alice gently remarked. ‘Do you know the lid’s open?’

(from Alice Through the Looking Glass, via Project Gutenberg)

Finely tuned antennae

Monday, December 17th, 2007

The discovery to delivery hook-line has been used for a while to describe a goal of those information services which support the academic researcher. The challenge to academic libraries, national information services etc. has been to support the researcher from the moment they begin the process of searching to the delivery of the digital or physical artefact which satisfies their enquiry.

Lately, I’ve been thinking about discovery to delivery, wondering why it just doesn’t quite work for me. I’ve been preoccupied with this mainly because I was invited to devise a diagram to express discovery to delivery – an architecture if you will – and found myself either focussing on discovery or delivery, but not really both together.

I think it is the way in which it implies a ‘round-trip’ which bothers me. It sounds synchronous – almost ‘client-server’. What it is missing particularly, I think, is any notion of how the researcher has registered their interest. There is, perhaps, an implication that the researcher has just initiated a search operation.

At a meeting last week involving, among others, JISC and some of the services they fund, as well as the British Library and CURL, I gave a short presentation (working to a ‘maximum 5 slides’ rule) outlining the idea that it might be interesting to consider the proposition that more and more of our information is being delivered to us without it having been explicitly asked for, and that there might be an interesting model in this for the next generation of services supporting scholarly research.

I considered the fact that, like many, I’m engaged in a sort of continuous, low-level, background research activity. Firstly, I have registered my interests explicitly with a number of online services, and receive regular deliveries of content which is often useful. Secondly, I have registered interests in a less explicit way by choosing to subscribe to the output of a number of academic and non-academic bloggers. Thirdly, some of the systems with which I am registered are starting to make recommendations to me about stuff I might want to look at. This is the technique used most prominently by Amazon, where the system offers suggestions of other items I might be interested in (‘recommendations’) using its database which relates me to other users and to our respective activities in the system.
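I have no idea how Amazon really does this, but the basic idea of relating me to other users through our respective activities can be sketched in a few lines of Python. This is a deliberately naive illustration: the `recommend` function, the scoring rule (count the interests I share with each other user) and the sample data are all my own assumptions.

```python
from collections import Counter


def recommend(activity, user, limit=3):
    """Suggest items that `user` has not touched, weighted by how much
    they have in common with the other users who did touch them."""
    mine = activity.get(user, set())
    scores = Counter()
    for other, items in activity.items():
        if other == user:
            continue
        overlap = len(mine & items)  # how similar is this user to me?
        if overlap == 0:
            continue  # nothing in common, so their activity tells me nothing
        for item in items - mine:
            scores[item] += overlap
    return [item for item, _ in scores.most_common(limit)]


# Hypothetical record of which users have shown interest in which topics
activity = {
    "alice": {"repositories", "oai-pmh"},
    "bob": {"repositories", "oai-pmh", "rest"},
    "carol": {"repositories", "ocr"},
    "dave": {"cookery"},
}
suggestions = recommend(activity, "alice")
```

Here ‘alice’ is offered ‘rest’ ahead of ‘ocr’ because she shares more interests with ‘bob’ than with ‘carol’, and nothing at all from ‘dave’, with whom she has nothing in common.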

Note that I’m not explicitly seeking particular content here – I’m establishing finely-tuned-antennae to catch useful intelligence. The fine-tuning is a continuous ongoing activity but, importantly, not all of it is conscious, and not all of it is initiated by me. Currently this sort of thing is still done in a fairly passive way – I go to Amazon for example with the intention of making a purchase and Amazon tries to tempt me with what is, essentially, targeted advertising. We might not want Amazon to actively ‘send’ us suggestions when its algorithms detect a possible sale to be made. But imagine this model applied to a repository, or a library system. At the meeting last week we considered this scenario – and what might be possible if various services were ‘joined up’ and able to share networks of users and preferences. It seems to me that the ultimate utility of this kind of system is when it feeds useful stuff to me that I didn’t previously know I was interested in. I sometimes discover the things that it turns out I should be interested in this way.

I like the notion of ‘gestures’, recently popularised by Steve Gillmor to describe these ways in which our interests are communicated to others, registered by systems, or mined from transaction logs. As I go about my professional life, I make these gestures or indications of interest, and I ensure that my personal information system is tuned to catch the responses from these gestures. My current toolset for this is based primarily around RSS-based harvesting and subscription, but it is not limited to this.

Of course, if this type of activity continues to grow apace, then the problem of managing information discovery remains; it is just transferred closer to the researcher/user. In fact, the activity of discovery follows the (semi)automatic delivery.

Perhaps there is a new model, complementary to the first:

gesture -> delivery -> discovery

where the different elements happen asynchronously.
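The asynchronous flavour of this can be sketched in Python with `asyncio`. This is only a toy: the `service` and `researcher` coroutines, the in-memory queue standing in for my feed reader, and the sample topics are all hypothetical.

```python
import asyncio


async def service(gestures, inbox):
    """Delivery: push an item for each registered gesture, unprompted."""
    for topic in gestures:
        await asyncio.sleep(0)  # yield control; delivery happens in the background
        await inbox.put(f"new item about {topic}")
    await inbox.put(None)  # signal that there is nothing more to deliver


async def researcher(inbox):
    """Discovery: work through whatever has been delivered, when it arrives."""
    discovered = []
    while True:
        item = await inbox.get()
        if item is None:
            break
        discovered.append(item)
    return discovered


async def main():
    inbox = asyncio.Queue()
    gestures = ["ocr", "repositories"]  # interests registered earlier
    _, discovered = await asyncio.gather(
        service(gestures, inbox), researcher(inbox)
    )
    return discovered


discovered = asyncio.run(main())
```

The point of the sketch is that the researcher never issues a query: the gestures were made at some earlier time, delivery happens whenever the service has something, and discovery is whatever the researcher does with the inbox afterwards.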

Is ‘discovery to delivery’ sufficient any more?