Advanced Topics in Distributed Systems

Monday, Wednesday, and Friday from 1:30-2:20

RHPH 162 (note change)

Chris Clifton

Email:

Design and control of distributed computing systems (operating systems and database systems). Topics include principles of naming and location, atomicity, resource sharing, concurrency control and other synchronization, deadlock detection and avoidance, security, distributed data access and control, integration of operating systems and computer networks, distributed systems design, consistency control, and fault tolerance.

A more detailed course description prepared for the CEE program is available, as is a course preview briefing containing more detailed information on requirements and expectations. The course outline is given below.

More course information may be available in WebCT (direct link).

Please add yourself to the course mailing list. Send mail to mailer@cs.purdue.edu containing the line:

add your email to cs603

Feel free to send things to the course mailing list if you feel it is appropriate. An example might be a pointer to a particularly helpful on-line manual describing an API used in one of the projects.

Course Methodology

The course will be taught through lectures, with class participation expected and encouraged. There will be frequent reading assignments to supplement the lectures.

For now, Professor Clifton will not have regular office hours. Feel free to drop by anytime, or send email with some suggested times to schedule an appointment. You can also try H.323/T.120 desktop videoconferencing (e.g., SunForum, Microsoft NetMeeting.) You can try opening an H.323 connection to blitz.cs.purdue.edu - send email if there is no response.

Prerequisites

The official requirement is CS 503 (Operating systems), with CS 542 (Distributed Database systems) recommended. The practical requirement is a solid undergraduate background in computer science including some database and operating systems theory, and substantial programming experience. If you don't have 503, but feel you have sufficient background, please send me an explanation of why you feel you are prepared, along with a number/times for me to call and discuss approving your registration.

Text

The following is recommended (it will be a useful reference for much of the lab work in the course):

Internetworking with TCP/IP Vol.III: Client-Server Programming and Applications,
D. E. Comer and D. Stevens,
Prentice Hall,
(choose appropriate version for your favorite platform),
0-13-032071-4

The following have been recommended in the past, and may provided useful background reading. However, none are required.

Distributed Systems, 1993
Sape Mullender
Prentice Hall
0-201-62427-3
Distributed Algorithms, 1997
Nancy Lynch
Morgan Kaufmann
1-55860-348-4
Distributed Operating Systems, 1995
Tanenbaum
Prentice Hall
0-13-219908-4

Evaluation/Grading:

Evaluation will be a subjective process, however it will be based primarily on your understanding of the material as evidenced in:

Midterm Exam (25%)
Final Exam (35%)
Projects (4-5) (40%)

Exams will be open note / open book. To avoid a disparity between resources available to different students, electronic aids are not permitted. (If everyone has a notebook with wireless connection and all agree they want to use them in the exams, I could relax this.)

I will evaluate projects on a five point scale:

5: Exceptional work. So good that it makes up for substandard work elsewhere in the course. These will be rare.
4: What I'd expect of a Ph.D. candidate. This corresponds to an A grade.
3: Good enough for a Master's degree, but not what I'd like to see for a Ph.D. candidate. This corresponds to a B grade.
2: Okay for a Master's candidate who does extremely well in other courses. This corresponds to a C grade.
1: Not good enough for a graduate student. But something.
0: Missing work, or so bad that you needn't have bothered.

Projects

A substantial portion of your education in this course will come through performing programming projects: building components of a distributed system. Some examples of what projects might involve are:

Building a server capable of handling multiple simultaneous TCP/IP connections using the Socket API. The server would be trivial (e.g., calculate the square of the input and return the result after a five second delay), the key effort would be the API.
Implement an application that connects to a (provided) CORBA server.
Implement a clock synchronization protocol.

My current expectation is that all projects will be done individually, as it is probable that some of the CEE students will not be collocated with other students in the course.

Policy on Intellectual Honesty

Please read the above link to the policy written by Professor Spafford. This will be followed unless I provide written documentation of exceptions.

Late work will be penalized except in case of documented emergency (e.g., medical emergency), or by prior arrangement if doing the work in advance is impossible due to fault of the instructor (e.g., you are going to a conference and ask to start the project early, but I don't have it ready yet.)

The penalty for late work is 1 point (of the possible 5) if turned in after the deadline, and one additional point for each week late.

Syllabus (numbers correspond to week):

Project start/due dates are tentative!

Course overview, Components of a distributed system
Communication Mechanisms
- Message Passing
- Stream-oriented communications
- Remote Procedure Call
- Remote Method Invocation
Remote Method Invocation: Mechanisms
- DCE RPC (reading)
First project starts January 23
- Java RMI (reading)
- SOAP (Reading: SOAP 1.1 spec, XML Protocol Working Group, Apache SOAP)
Naming
First project design due January 30
Clock Synchronization
- What is clock synchronization?
  Leslie Lamport, "Time, clocks, and the ordering of events in a distributed system", Communications of the ACM 21(7) (July 1978).
- Possibility and impossibility
  Lundelius, J. and Lynch, N., "An Upper and Lower Bound for Clock Synchronization," Information and Control, Vol. 62, Nos. 2/3, pp. 190-204, 1984.
  Danny Dolev, Joe Halpern, and H. Raymond Strong, "On the possibility and impossibility of achieving clock synchronization", Journal of Computer and System Sciences 32(3) 230-250. April 1986.
  Michael J. Fischer, Nancy A. Lynch, and Michael Merritt, "Easy impossibility proofs for distributed consensus problems" Proceedings of the fourth annual symposium on Principles of distributed computing 1985 , Minaki, Ontario, Canada.
- Practical solution: NTP (Reading)
Other Reading:
Leslie Lamport and P. M. Melliar-Smith, "Synchronizing clocks in the presence of faults" Journal of the ACM 32(1) (January 1985).
Jennifer Lundelius and Nancy Lynch, "A new fault-tolerant algorithm for clock synchronization, Proceedings of the third annual ACM symposium on Principles of distributed computing 1984 , Vancouver, British Columbia, Canada.

First project due February 11.
Process Synchronization
- Overview: Global State, Mutual Exclusion
  Leslie Lamport, ``The Mutual Exclusion Problem'', Journal of the ACM 33(2) (April 1986). Read Part II section 2 - the rest is optional.
  Leslie Lamport, ``1983 Invited address: Solved problems, unsolved problems and non-problems in concurrency, Proceedings of the third annual ACM symposium on Principles of distributed computing, 1984, Vancouver, British Columbia, Canada.
  Optional - Global State:
  K. Mani Chandy and Leslie Lamport, ``Distributed Snapshots: Determining Global States of Distributed Sytems'', ACM Transactions on Computer Systems 3(1) (February 1985) 63-75.
- Fault Tolerant Solutions
  Michael J. Fischer, Nancy A. Lynch, James E. Burns and Allan Borodin, ``Distributed FIFO allocation of identical resources using small shared space'' ACM Transactions on Programming Languages and Systems 11(1) (1989) pp. 90-114.
- Multiple resources Requirements
  Please don't check these out - others may want to read them.
  Dijkstra, E. ``Hierarchical Ordering of Sequential Processes'', ACTA Informatica 1 (1971), 115-138.
  M. Rabin and D. Lehmann, ``On the Advantages of Free Choice: A Symmetric and Fully Distributed Solution to the Dining Philosophers Problem'', Proceedings of the 8th Symposium on Principles of Programming Languagues (1981) pp. 133-138.
Second project starts February 15.
Distributed Transactions
- 2-Phase Commit
- Formal Models for failure and recovery
- 3-Phase Commit
Reading:
Skeen, Dale, ``A Formal Model of Crash Recovery in a Distributed System,'' IEEE Transactions on Software Engineering 9(3), May 1983, pp.219-228. (preliminary on-line version from SIGMOD'81)
Philip A. Bernstein, Vassos Hadzilacos, Nathan Goodman, Concurrency Control and Recovery in Database Systems, Chapter 7: Distributed Recovery, Addison Wesley, 1987.
Distributed Data: Replication
- Basics
  Reading:
  Philip A. Bernstein, Vassos Hadzilacos, Nathan Goodman, Concurrency Control and Recovery in Database Systems, Chapter 8: Replicated Data, Addison Wesley, 1987.
- Example: Replication in Oracle
- Advanced Techniques: Quasi-Copies
  Reading:
  Rafael Alonso, Daniel Barbará, and Hector Garcia-Molina, ``Data caching issues in an information retrieval system'', ACM Transactions on Database Systems (TODS) 15(3), September 1990.
Second project due March 1.
Mid-Semester Review
March 8, in class: Midterm on material from weeks 1-7.
Processes, code migration
Third project starts March 20.
Distributed Object systems: CORBA (OMG)
Reading: CORBA Overview from The Common Object Request Broker: Architecture and Specification, OMG group, 2001.
CORBA Security Service (reading).
Third project due April 3, fourth project starts.
Distributed Object Systems:
- DCOM
  Reading: DCOM vs. .NET
- .NET
Distributed Coordination: Jini. Further reading: Jan Newmarch's Guide to JINI Technologies.
Fault Tolerance
- Failure models. Reading:
  Dr. Flaviu Cristian, Understanding Fault-Tolerant Distributed Systems, Communications of the ACM 34(2) February 1991.
- Fault Tolerance Reading:
  Felix C. Gärtner, Fundamentals of Fault-Tolerant Distributed Computing in Asynchronous Environments ACM Computing Surveys 31(1), March 1999.
- Reliable communication
- Recovery Optional reading:
  Richard Golding and Elizabeth Borowsky, Fault-Tolerant Replication Management in Large-Scale Distributed Storage Systems, in Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems 18-21 October, 1999, Lausanne, Switzerland.
  Hector Garcia-Molina, Christos A. Polyzois and Robert B. Hagmann, Two Epoch Algorithms for Disaster Recovery, in Proceedings of the 1990 conference on Very Large Data Bases, Brisbane, Australia, August 13-16 1990.
Fourth project due April 19.
Review

Final exam Thursday, May 2, 2002 from 1:00pm to 3:00pm in RHPH 164.