A brief comment, as I hop across the North Sea back to Bristol.
With the news that arXiv will now accept deposits from institutional repositories, Dorothea Salo continues her theme about a deposit flow which goes from author, to institutional repository, to subject/discipline repository. Dorothea offers some scenarios, including:
Achaea University adopts a Harvard-style open-access mandate. If she wants her articles in arXiv as well, Dr. Troia must rather annoyingly dual-deposit… unless Achaea’s IR implements a deposit pipeline to arXiv, in which case the most she has to do is tick a ticky-box (and I can imagine ways to abstract away the ticky-box).
In an abstract sense I appreciate the notion of the ‘deposit pipeline’. I also agree with the main point, which is about the direction of the flow. Indeed, I have previously characterized the institutional repository as being, or more usually containing, the source repository. However, I remain slightly doubtful about the need for the flow to be initiated by the source. If there were some mechanism by which the subject/discipline repository could be alerted to the appearance of relevant materials in the institutional repository, then doesn’t it make sense for the subject repository to fetch the record/artefact, rather than wait to have it sent? Well, we already have the mechanism: it’s called RSS (or Atom), and it’s already supported by some of our most popular repository software.
Come to think of it, an even better approach might be for the subject repository, having been alerted to a new and relevant deposit in the institutional repository, to simply maintain a pointer to the original (optionally creating new and related resources).
In other words, as a certain generation of programmers would put it, pass by reference, not by copy.
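To make the "alert, then point" idea a little more concrete, here is a minimal sketch of a subject repository reading an institutional repository's Atom feed and recording pointers rather than copies. The feed content, the `ir.achaea.example` URLs, and the `Pointer` and `pointers_from_feed` names are all made up for illustration; a real harvester would fetch the feed over HTTP and persist the set of seen ids.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical feed of new deposits published by an institutional repository.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Achaea IR: recent deposits</title>
  <entry>
    <id>urn:example:achaea:101</id>
    <title>On the Walls of Ilium</title>
    <link rel="alternate" href="https://ir.achaea.example/101"/>
  </entry>
  <entry>
    <id>urn:example:achaea:102</id>
    <title>Horses and Gifts</title>
    <link rel="alternate" href="https://ir.achaea.example/102"/>
  </entry>
</feed>"""


@dataclass(frozen=True)
class Pointer:
    """A reference to the original item; no copy of the artefact is held."""
    entry_id: str
    title: str
    href: str


def pointers_from_feed(feed_xml, seen_ids):
    """Return pointers for feed entries whose ids are not yet in seen_ids.

    seen_ids is updated in place so a later poll skips the same entries.
    """
    root = ET.fromstring(feed_xml)
    new_pointers = []
    for entry in root.findall(ATOM + "entry"):
        entry_id = entry.findtext(ATOM + "id", default="").strip()
        if not entry_id or entry_id in seen_ids:
            continue
        href = ""
        for link in entry.findall(ATOM + "link"):
            # In Atom, a link without a rel attribute is treated as "alternate".
            if link.get("rel", "alternate") == "alternate":
                href = link.get("href", "")
                break
        title = entry.findtext(ATOM + "title", default="").strip()
        new_pointers.append(Pointer(entry_id, title, href))
        seen_ids.add(entry_id)
    return new_pointers


if __name__ == "__main__":
    already_seen = {"urn:example:achaea:101"}
    for pointer in pointers_from_feed(SAMPLE_FEED, already_seen):
        print(pointer.title, "->", pointer.href)
```

The subject repository keeps only the id, title and link, so the institutional repository remains the single home of the artefact itself.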

Not only do we already have the mechanism, as you say, it is already a MUST for an AtomPub collection (http://tools.ietf.org/html/rfc5023#section-10). Adding back this piece of AtomPub that SWORD has pruned away opens up new possibilities.
Yes, and there are other wrinkles, too, such as metadata-only repos like RePEc.
Calling myself out for tunnel vision here — my thinking tends to be constrained by the available standards. OAI-PMH doesn’t have querying, so it can’t be used as your alert service. SWORD is one-way, so it can’t harvest, and OAI-ORE neither queries nor (exactly) harvests. So we have some technical hurdles to think about.
But I’m also thinking from the point of view of Dr. Troia, who is not going to be happy with the idea of “just sit back; arXiv will harvest your stuff eventually.” She’s going to want to push that button or tick that ticky box and be assured that magic happens (whatever kind of magic it is) and the next time she searches arXiv her thing will be there.
Does that make sense? It’s not necessarily that Dr. Troia ticking a ticky box has to be the ONLY way stuff gets from Achaea’s IR to arXiv, just that for users’ sake, it has to be ONE way.
Dorothea: I completely understand the desire for a tick-box.
I also think that the flow should happen both ways: from IR into arXiv and from arXiv into IR.
I have no problems with duplicated data… it’s called LOCKSS.
It’s also why we have things like OpenURLRouters: to solve the “Appropriate Copy” problem.
Dorothea,
undoubtedly there are aspects of user expectation and confidence which need to be addressed.
I think the technical issues of transport and alerting mechanisms are very tractable – we are more than half-way there in many cases.
The aspect I find most interesting is how, to use your example, arXiv could register its interest in certain material being deposited in Achaea’s IR. It might simply subscribe to a feed alerting it to new deposits by Dr. Troia, for example. arXiv could apply whatever sanity checks they wanted (even human moderation) in case Dr. Troia has deposited an inappropriate paper. A more sophisticated feed might be based on a ‘saved search’ with several parameters. I think EPrints may support this kind of thing to an extent.
Having said this, I do accept that the tick-box approach may well turn out to be the most manageable approach to actually deciding what gets added. But I am very interested in looking at alerting and event-driven workflows nonetheless.
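As a rough illustration of the ‘saved search’ plus sanity-check idea, the sketch below filters incoming deposits against a registered interest and routes anything that fails a check to a moderation queue. Everything here is hypothetical (the `Deposit`, `SavedSearch` and `triage` names, the fields, the example check); it is not how arXiv or any repository software actually works.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Deposit:
    """Minimal description of a new item announced by the source repository."""
    url: str
    author: str
    subjects: frozenset


@dataclass
class SavedSearch:
    """A registered interest; an empty criterion places no constraint."""
    authors: set = field(default_factory=set)
    subjects: set = field(default_factory=set)

    def matches(self, deposit):
        author_ok = not self.authors or deposit.author in self.authors
        subject_ok = not self.subjects or bool(self.subjects & deposit.subjects)
        return author_ok and subject_ok


def triage(deposits, search, sanity_check):
    """Split matching deposits into accepted URLs and URLs needing moderation."""
    accepted, needs_review = [], []
    for deposit in deposits:
        if not search.matches(deposit):
            continue  # not of interest to this subscriber at all
        target = accepted if sanity_check(deposit) else needs_review
        target.append(deposit.url)
    return accepted, needs_review


if __name__ == "__main__":
    incoming = [
        Deposit("https://ir.achaea.example/201", "Troia", frozenset({"physics"})),
        Deposit("https://ir.achaea.example/202", "Troia", frozenset({"poetry"})),
        Deposit("https://ir.achaea.example/203", "Paris", frozenset({"physics"})),
    ]
    interest = SavedSearch(authors={"Troia"})
    accepted, needs_review = triage(
        incoming, interest, sanity_check=lambda d: "physics" in d.subjects
    )
    print("accepted:", accepted)
    print("needs review:", needs_review)
```

A tick-box in the IR could sit alongside this: the tick would simply become one more criterion the subscriber filters on.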
Sean,
I am familiar with AtomPub, less so with SWORD. I’ll run your comment past colleagues at UKOLN who are involved in SWORD to see if they want to respond.
Code Gorilla,
the duplication of artefacts on the web causes issues not addressed by appropriate copy solutions. LOCKSS is one approach to preservation – not necessarily the best.
Duplication of web resources can play havoc with an individual item’s ranking in global search engines, for example, seriously undermining ‘discoverability’.
“In other words, as a certain generation of programmers would put it, pass by reference, not by copy.”
I understand your point about pass by reference and not by copy. However, if we are taking a leaf out of the old-school programmers’ book, how are we going to avoid issues like null pointer exceptions?
@paul: Surely you can’t be aspiring to having a single (“golden”) copy on T’Internet?
The whole premise of the Internet is built around redundancy, so surely building that idea into any back^H^H^H^Harchiving (aka Repository) system is A Good Thing[tm] – no?
Code Gorilla: While it could be argued that redundancy is a feature of the internet, I don’t think it’s a defining characteristic of the Web at all. In fact, the Web Architecture essentially denies the possibility of reliance on copies to achieve redundancy.
If we choose to introduce redundancy at a higher level, with LOCKSS for example, this is not because the underlying architecture especially encourages this.
Your point about the single authoritative source for an artefact on the Web is interesting – without having reflected for more than a couple of minutes, I’m tempted to say that I do aspire to this principle in the context of the Web, because I think it does actually fit well.
Thanks for the comment – it has set me off on a train of thought about how the Web and the Internet it is layered onto may have some important differences at a fundamental level. One encourages redundancy perhaps, the other feigns indifference but secretly prefers to avoid it?
I’m coming from way the heck outside of this debate, but having once worked as a higher ed webmaster, I can say the duplication problem can be quite obnoxious, depending on the specific content. (I remember multiple departments having their own copy-pasted versions of their sections of the college catalog, quite often woefully out of date. Not a good thing for students, and I worked very hard to try to get people to point instead of copying!)
Elaine,
indeed! The most obvious manifestation of this problem in the repositories/scholarly comms space is in the issue of versions of documents – something the community has been wrestling with for some time. I think you’re correct in that duplication without very careful management tends towards a degradation in data quality.