I have received several questions, a number of them from journalists, about what exactly my goals are with the analysis of the feed access patterns. I had planned to postpone this explanation until we arrived at the point where I could show how to use the feed analysis results to drive scenarios that investigate the scalability of the way we currently use feed readers. To avoid being preempted by other postings, I would like to briefly explain some of the reasoning behind this research.
First some history: my 'official' area of research is scalability and reliability in mission-critical distributed systems, work that I have been doing since the late 80's, mostly in collaboration with Ken Birman and Robbert van Renesse here at Cornell. My investigative approach is that of applied research: I focus on issues that inhibit scalability and robustness in current enterprise systems, through a continuous feedback cycle with industry. This has allowed me to transfer a number of research results directly into industry, breaking barriers in scalability and reliability. But all these years of interaction with those who have to build and deploy real-life large-scale distributed systems have also taught us a number of negative lessons: we have gained many insights into which technology approaches are absolutely not scalable.
The current situation, with the explosive growth in the use of desktop-based aggregators, is similar to the scalability problems that many other distributed systems and applications have faced. Those systems were forced to restructure in order to maintain scalability from a functionality and/or efficiency point of view. It is likely that the publish/subscribe system formed by feeds and aggregators will need to address the same type of issues in the near future. In future postings on this topic I will go into detail about what exactly these issues are and why history suggests that the need for scalability will turn these issues into problems that can bring the overall system to its knees.
For example, there are two simple axes of growth in the feed/aggregator system: the number of available feeds, and the number of aggregators polling those feeds.
There are other axes of scalability in the feed/aggregator setting, but we'll get to those when we examine the scenarios in more detail.
The increases in scale noted above signal potential problems at different levels.
The increase in the number of feeds will leave many users frustrated, as there is a limit to the number of feeds one can scan and read. Current numbers suggest that readers can handle 150-200 feeds without too much stress. But users will want to read more and more as new interesting feeds become available, and they will run into the limitations of the metaphor of current aggregator applications. The central abstraction of current aggregators is the feed, and there is a limit to how many individual feeds one can actually handle. Aggregators will need to find ways in which users can be subscribed to a select set of feeds because they want to read everything that comes from those feeds, but also subscribe to a much larger set of publishers for which the feed abstraction may not be the right metaphor. Aggregation, fusion and selection at the information-item level instead of at the feed level seem to be the first abstractions to investigate. Advances in how users can specify what information they would like to see will be enablers for scalability at the human level. It is not yet clear what the real questions in this area are; XPath and XQuery are not the answer, and 3D graphical representations of related information units are more likely to be the way to approach this. How applications can learn and adapt to shifting interests and new information sources will be equally important in the scalable handling of the continuous stream of new feeds becoming available. This support does not necessarily have to be desktop-based, but could be assisted through (de-)centralized information fusion services, possibly commercial ones. www.pubsub.com is an example of such a service in its infancy.
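To make the item-level idea a bit more concrete, here is a minimal sketch in Python. Everything in it is illustrative: the feed URLs are made up, the `fuse` and `select` functions are hypothetical names, and the keyword match is a crude stand-in for whatever richer selection mechanism a real aggregator would offer. The point is only the shift in abstraction: the user browses one selected stream of items rather than hundreds of individual feeds.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Item:
    feed: str        # URL of the originating feed
    title: str
    published: datetime
    body: str

def fuse(feeds):
    """Merge the items of many feeds into a single stream, newest
    first, so the user browses items instead of individual feeds."""
    items = [item for feed in feeds for item in feed]
    return sorted(items, key=lambda i: i.published, reverse=True)

def select(items, interests):
    """Keep only the items that match the user's interests -- a crude
    keyword match here, standing in for a richer mechanism."""
    return [i for i in items
            if any(kw in (i.title + " " + i.body).lower()
                   for kw in interests)]

# Two hypothetical feeds; a real aggregator would hold thousands.
feeds = [
    [Item("http://example.org/a.xml", "Scalability of polling",
          datetime(2004, 4, 10), "Why pull-based polling hits a wall.")],
    [Item("http://example.org/b.xml", "Cooking with RSS",
          datetime(2004, 4, 11), "Recipes, delivered as a feed.")],
]
for item in select(fuse(feeds), interests={"scalability", "polling"}):
    print(item.published.date(), item.title)
```

With this framing, subscribing to one more publisher adds items to a single stream instead of adding yet another feed the user has to visit.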
A massive increase in the number of aggregators is a threat to the infrastructure on which the feed/aggregator interaction is built. Aggregators use a pull-based approach in a closed feedback loop: data is not only fetched when a user actually wants to read it, but sources are polled continuously to ask whether new material is available, independent of the user's actions. Polling for data was an approach also used in many older distributed systems, and it has been shown to be something that can bring a system to a halt as the number of subscribers increases. There is a range of approaches that can help with this problem: combinations of push and pull, asynchronous notification mechanisms, feed hosting in decentralized clouds, feed and information brokers, intermediary feed aggregators, collaborative delivery of feed content, optimizations in feed transfer, etc. Most of these potential solutions will be the subject of the scenarios once we go through the example cases. I won't give too many spoilers about these results; stay tuned over the coming month or two for a number of interesting postings on this subject.
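One of the simplest feed-transfer optimizations alluded to above is HTTP conditional GET, which well-behaved aggregators can already use: the client remembers the Last-Modified and ETag values from its previous poll and sends them back, so an unchanged feed costs a small 304 response instead of a full download. Below is a minimal sketch in Python; the feed URL is just a placeholder, and `poll` is a hypothetical helper, not any particular aggregator's API.

```python
import urllib.request
from urllib.error import HTTPError

def poll(url, last_modified=None, etag=None):
    """Fetch a feed with HTTP conditional GET: if the server reports
    the feed unchanged (304), no feed body crosses the wire."""
    request = urllib.request.Request(url)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    if etag:
        request.add_header("If-None-Match", etag)
    try:
        with urllib.request.urlopen(request) as response:
            # New content: remember the validators for the next poll.
            return (response.read(),
                    response.headers.get("Last-Modified"),
                    response.headers.get("ETag"))
    except HTTPError as err:
        if err.code == 304:          # Not Modified: a cheap response
            return None, last_modified, etag
        raise

# The first poll downloads the feed; later polls are nearly free as
# long as the feed has not changed.
body, modified, etag = poll("http://example.org/feed.xml")
body, modified, etag = poll("http://example.org/feed.xml", modified, etag)
```

Note that conditional GET reduces bandwidth but not the number of requests: every aggregator still knocks on the server's door at every polling interval, which is why the push and notification approaches listed above remain necessary as the number of subscribers grows.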
There appears to be no immediate crisis in the infrastructure, and I am sure many will ridicule the notion that this may become a problem in the future. But it seems ridiculous to dismiss all the lessons learned from building scalable distributed systems and not apply them to this instance of a global publish/subscribe system. If the growth in the number of aggregators continues, we will hit a wall with the current approach. Research and product development are starting to look at how these scalability issues can be addressed; we just have to hope that the current content management system developers and the aggregator developers will keep an open mind in helping to find and implement solutions that will guarantee a healthy future.
Posted by Werner Vogels at April 12, 2004 10:13 AM

There was a discussion of RSS and scaling a while back, here: http://www.teledyn.com/mt/archives/001496.html
Posted by: Seb on April 18, 2004 03:04 PM