“Any any any old data”
Over on ZDNet, Paul Miller has blogged some thoughts about what he calls the ‘Data Cloud’. He points out that in the evolution of the ‘cloud computing’ paradigm, the:
…emphasis for much of this wider discussion remains firmly rooted in the realm of computation and storage. On many levels it’s about offloading the costs of scaling and maintaining local infrastructure, and ‘data’ doesn’t really enter the conversation at all. Something is ‘stored,’ but it’s a nameless, faceless, shapeless something that merely exists in order to be stored or computed upon.
Initially, Paul posted the germ of this idea to Twitter, where I responded with a degree of scepticism. Having given it a little thought, I remain sceptical. However, I have realised that my own, internal, ideas of what the ‘Cloud’ entails have informed my scepticism, so I figure it might be worthwhile externalising these ideas. (Note that Paul has helpfully included in his post a variety of definitions from good sources, so I won’t revisit these here.) Like such celebrated memes as ‘Web 2.0’, the meaning of ‘cloud’ in this context is delineated by broad consensus, rather than strict definition. Also, I suggest that the cloud is highly connotative - depending on the exact context within which it is used it can imply much.

The word itself must surely have come from all those network diagrams which included a cloud to denote the ‘great outdoors’ - i.e. the stuff beyond the local area network. (I actually remember seeing such a diagram years ago with “here be dragons” written inside the cloud).
Anyway, for what it’s worth, here are some of the characteristics which I think are important, and why I disagree (perhaps not very strongly) with Paul:
Remotely hosted:
In a literal, basic sense, if services or data are in the cloud, then they are hosted remotely, on someone else’s infrastructure. The immediate implication might be that the user also doesn’t particularly care, or even know about the details of this arrangement. At one level, this is nothing new - and if the data cloud is just meant to signify data out there, then OK - but this notion is almost as old as computer networking itself, and was certainly present at the birth of the Web.
However, the reason that the cloud meme has gained such traction over the last two years lies in the new possibilities for moving not just data, but applications, services and even infrastructure onto remote servers. Closely aligned with the Cloud in this context is Software as a Service (SaaS), which in contemporary terms means the delivery of application-specific functionality from a remote source, typically to a modern browser.
Ubiquitous:
If it’s in the Cloud, then it is available anywhere. There are many examples of where this statement could be challenged but there is, nonetheless, an expectation that if an application is delivered to me from the Cloud then I ought to be able to access and use it from any connected device with the requisite software. There is a weaker assumption that the requisite software might be simply a modern web browser.
Commodified:
One of the really interesting developments of recent years has been the introduction of infrastructure services to the Cloud. This moves an important aspect of computing services closer to the ‘utility’ model. I know which company ‘supplies’ my electricity because they take large amounts of money off me and regularly send me ‘advice’ on how to reduce my bill (in case you’re wondering, the best advice is to “switch off things which are powered by electricity when you’re not using them”). However, I don’t know where that electricity is being generated, and frankly, one lot of electricity is much like another, regardless of who supplies it (in the UK at least!). So, I suggest that commodification works best where the commodity is undifferentiated. The history of computing is filled with examples of evolution towards undifferentiated supply of functionality - abstraction is the method used to achieve this. For example, if I want to run Linux on my servers, then I can use a variety of hardware, without having to worry much about this. If I pay someone else to provide me with Linux servers in the Cloud (this blog is running on one such), then I can get away with not even knowing the specifics of the hardware which hosts my system. To an extent, in trusting your infrastructure to a third party, you are saying “I trust you, look after this lot for me please and don’t bother me with the details”.
In fact, we have now reached the point, with services such as Amazon’s EC2 service, where we can say, “I’d like some computing power please - any old cycles will do”.
And right here is why I think I disagree with Paul. If you believe, as I do, that the Cloud implies a move towards undifferentiated, commodified hardware and services, then I don’t see how to include data, at least most data. How often do you hear a user say, “I’d like some data please - any old data will do”? The value of data is often measured in terms of scarcity, provenance, authority, quality. When Paul describes data as a:
nameless, faceless, shapeless something that merely exists in order to be stored or computed upon.
I think he’s right - this is how data is represented in the Cloud. Where we differ, I guess, is that I think that this is a reasonable and useful way for the Cloud to treat data - it allows the Cloud to become ubiquitous and undifferentiated, freeing up our time to concentrate on what we really care about - our data.
I’ll end with a song……Any old iron, any old iron, Any any any old iron….
October 8th, 2008 at 1:12 am
Paul, maybe you are saying that the cloud weakens provenance? I followed through on the connected data sources image on Paul’s post, and then linked to Geonames. In their example, “[2] http://sws.geonames.org/3020251/about.rdf” stood for a document about the particular place. This document could be anything; it could be from Wikipedia, from a tourist brochure, from the town council… and it’s easy to see that the document would be different in each case. The provenance would help me distinguish and decide how much to trust.
It’s perhaps a little different with real data… do I care who measured the temperature at station 435 yesterday noon, as long as someone did? Well, for scientific purposes, I guess one would care, so the provenance is important. But for many ordinary purposes I think one would not care much, unless one found reasons to distrust those data; then chasing up the provenance would be an issue. (A colleague told me once he was annoyed how much the BBC’s weather forecasts for his home town differed from the Met Office forecast on which they were supposedly based… after parallel correspondences with someone in the Met Office and someone in the BBC, it became clear that the BBC’s parameters for identifying the town or otherwise selecting the data for the forecast were wrong, in other words the BBC had been passing off a forecast from somewhere else for years… apocryphal, alleged, etc).
There are also different ways of things being “in the cloud”. Some services are hosted by EC2/S3 and so are “in the cloud”, but have perfectly “real-looking” URIs. Nothing wrong with doing the same thing for data.
I’ve argued with Lorcan sometimes that “in the cloud” is not too meaningful, ditto “moving to the network level”. There’s real hardware, real servers, real OS’s providing these services. What the hell difference does it make from my hardware and servers? Maybe the only thing is, I don’t have to be bothered about some things that I used to have to worry about.
So, not knowing whether I’m agreeing with you or not, maybe the thing we want is data, as authoritative as we need, with provenance available, clearly identified. But _where_ it is? Am I bovvered?
October 30th, 2008 at 4:45 pm
I’ve done my own blog post on this, but I feel that actually EC2/S3 etc. are really ‘on the other side of the cloud’ - they aren’t really ‘of it’. I agree there is an issue of provenance, but actually think for S3 etc. provenance is just as important for this as a service as it is for data.
I suspect there is something lurking in both ideas about being ‘distributed’ - which seems to me to be linked to the idea of ‘remotely hosted’ but goes beyond this.
My understanding is that the Internet was designed in such a way as not to have a single point of failure - you could always use an alternative route. I think this is an essential point of the idea of the internet as a commoditised network. When we talk about Cloud computing I can’t see that this is true - so Cloud computing is not analogous to the network Cloud. I think the idea of a ‘data cloud’ makes more sense in this context - it has more similarities to what is being called cloud computing than the idea of the network cloud.