All Things Distributed: Reliability

These are the old pages from the weblog as they were published at Cornell. Visit www.allthingsdistributed.com for up-to-date entries.

February 18, 2003

Reliability - Act II

Mark picks up on my earlier post about the place of reliability in a web-services world, and the right place to implement it.

I'm for "doing reliability" in the application layer [Mark Baker]

I agree with Mark that coordination/choreography should play an important role in achieving end-to-end reliability, especially in terms of overall application semantics. However, I do not agree with him that this means there is limited or no use for other reliability mechanisms. To build high-performance runtimes you will need to account for high degrees of parallelism, for example with multiple asynchronous invocations. With a few simple abstractions you will be able to build runtimes that guarantee correct operation while maintaining high performance, without the need for heavy-weight coordination technology (BTW this is not the way the techniques are used in ws-reliability). We are building some software in this manner (reliability provisions both in the runtime and at the coordination level) and we can pick up the discussion once people have had a chance to play with the software.

BUT, there are a few things in Mark's response about me and my group that are a bit out of date and require some rectification (just to avoid some misunderstandings):

Wrong - We are in the group communication research business:

I haven't done any academic group communication systems for 4-5 years.
The last complete academic GCS (Ensemble) from our group was done by Mark Hayden and Robbert van Renesse and was completed more than 3 years ago. Ohad Rodeh from IBM Haifa is maintaining that system.
Ken Birman last did a multicast protocol (PBCast) about 3-4 years ago.

Wrong - We believe in reliability as a layer:

I believe that reliability and other more advanced distributed systems concepts should be provided through a combination of runtimes and toolkits, and that if you want to build correct and high-performance distributed applications, the application should be aware of its distribution aspects. I am a strong advocate of the anti-transparency movement.

Wrong - These systems only scale to LAN size settings:

Robbert and I built a high-performance, ultra-scalable system based on probabilistic techniques that was specifically targeted at large-scale cluster environments (10K nodes and beyond). It was only available commercially, and it was finished about 4 years ago. Parts of it were also available in Galaxy.
Astrolabe is a distributed state management and data-mining system that is built on epidemic communication techniques and can be applied at massive scale.
Newswire, which is a peer-to-peer pub/sub system that uses Astrolabe techniques for subscription management and maintenance of forwarding paths, scaled to 100K nodes and higher in lab conditions. We still have to prove that we can do this under production conditions.

Posted by Werner Vogels at February 18, 2003 08:44 PM

TrackBacks

MT: It just works (and it's in Perl)
Excerpt: In the middle of scanning some weblogs (by the way, there's an excellent set of exchanges going on between Mark Baker, Jorgen Thelin and Werner Vogels at the moment about
Weblog: Bill de hÓra
Tracked: February 19, 2003 07:51 PM

Comments

Hello,

First of all, please let me introduce myself.
I am a researcher at Rome University and we are also interested in providing a form of reliability for web services.
In particular we are working on the HTTPR specs by IBM... I would like to know if you could take a look at them and tell me what you think about them.
Paolo

Posted by: Paolo Romano on February 20, 2003 05:08 AM