Advanced Topics in Distributed Systems |
Monday, Wednesday, and Friday from 1:30-2:20 |
RHPH 162 (note change) |
|
Email:  |
Design and control of distributed computing systems (operating systems and database systems). Topics include principles of naming and location, atomicity, resource sharing, concurrency control and other synchronization, deadlock detection and avoidance, security, distributed data access and control, integration of operating systems and computer networks, distributed systems design, consistency control, and fault tolerance.
A more detailed course description prepared for
the CEE
program is available,
as is a course preview briefing containing
more detailed information on requirements and expectations. The
course outline is given below.
More course information may be available in
WebCT
(direct link).
Please add yourself to the course mailing list.
Send mail to mailer@cs.purdue.edu containing the line:
add your email to cs603
Feel free to send things to the course mailing list if you feel
it is appropriate. An example might be a pointer to a particularly
helpful on-line manual describing an API used in one of the projects.
Course Methodology
The course will be taught through lectures, with class
participation expected and encouraged. There will be frequent
reading assignments to supplement the lectures.
For now, Professor Clifton will not have regular office hours.
Feel free to drop by anytime, or send email with some suggested
times to schedule an appointment. You can also try H.323/T.120
desktop videoconferencing (e.g.,
SunForum,
Microsoft NetMeeting.)
You can try opening an H.323 connection to blitz.cs.purdue.edu - send email
if there is no response.
Prerequisites
The official requirement is CS 503 (Operating systems), with
CS 542 (Distributed Database systems) recommended. The practical
requirement is a solid undergraduate background in computer science
including some database and operating systems theory, and substantial
programming experience. If you don't have 503, but feel you have
sufficient background, please send me an explanation of why you
feel you are prepared, along with a number/times for me to call
and discuss approving your registration.
Text
The following is recommended (it will be a useful reference for
much of the lab work in the course):
Internetworking with TCP/IP Vol.III: Client-Server Programming and Applications,
D. E. Comer and D. Stevens,
Prentice Hall,
(choose appropriate version for your favorite platform),
0-13-032071-4
The following have been recommended in the past, and may provided
useful background reading. However, none are required.
Distributed Systems, 1993
Sape Mullender
Prentice Hall
0-201-62427-3
Distributed Algorithms, 1997
Nancy Lynch
Morgan Kaufmann
1-55860-348-4
Distributed Operating Systems, 1995
Tanenbaum
Prentice Hall
0-13-219908-4
Evaluation/Grading:
Evaluation will be a subjective process, however it will be based primarily on
your understanding of the material as evidenced in:
- Midterm Exam (25%)
- Final Exam (35%)
- Projects (4-5) (40%)
Exams will be open note / open book.
To avoid a disparity between resources available to different students,
electronic aids are not permitted.
(If everyone has a notebook with wireless connection and
all agree they want to use them in the exams, I could relax this.)
I will evaluate projects on a five point scale:
- 5
- Exceptional work. So good that it makes up for substandard
work elsewhere in the course. These will be rare.
- 4
- What I'd expect of a Ph.D. candidate. This corresponds
to an A grade.
- 3
- Good enough for a Master's degree, but not what I'd like
to see for a Ph.D. candidate. This corresponds to a B grade.
- 2
- Okay for a Master's candidate who does extremely well
in other courses. This corresponds to a C grade.
- 1
- Not good enough for a graduate student. But something.
- 0
- Missing work, or so bad that you needn't have bothered.
Projects
A substantial portion of your education in this course will come through
performing programming projects: building components of a distributed
system. Some examples of what projects might involve are:
- Building a server capable of handling multiple simultaneous TCP/IP
connections using the Socket API. The server would be trivial (e.g.,
calculate the square of the input and return the result after a five
second delay), the key effort would be the API.
-
Implement an application that connects to a (provided) CORBA server.
-
Implement a clock synchronization protocol.
My current expectation is that all projects will be done individually,
as it is probable that some of the CEE students will not be collocated
with other students in the course.
Please read the above link to the policy written by
Professor Spafford.
This will be followed unless I provide written documentation of exceptions.
Late work will be penalized except in case of documented emergency
(e.g., medical emergency), or by prior arrangement if doing the work
in advance is impossible due to fault of the instructor (e.g., you
are going to a conference and ask to start the project early, but
I don't have it ready yet.)
The penalty for late work is 1 point (of the possible 5) if turned
in after the deadline, and one additional point for each week late.
Syllabus (numbers correspond to week):
Project start/due dates are tentative!
- Course overview,
Components of a distributed system
- Communication Mechanisms
- Message Passing
- Stream-oriented communications
- Remote Procedure Call
- Remote Method Invocation
- Remote Method Invocation: Mechanisms
First project starts January 23
- Naming
First project design due January 30
- Clock Synchronization
- What is clock synchronization?
Leslie Lamport,
"Time, clocks, and the ordering of events in a distributed system",
Communications of the ACM 21(7) (July 1978).
- Possibility and impossibility
Lundelius, J. and Lynch, N., "An Upper and Lower Bound for Clock Synchronization," Information and Control, Vol. 62, Nos. 2/3, pp. 190-204, 1984.
Danny Dolev, Joe Halpern, and H. Raymond Strong,
"On the possibility and impossibility of achieving clock synchronization",
Journal of Computer and System Sciences 32(3) 230-250. April 1986.
Michael J. Fischer, Nancy A. Lynch, and Michael Merritt,
"Easy impossibility proofs for distributed consensus problems"
Proceedings of the fourth annual symposium on Principles of distributed computing
1985 , Minaki, Ontario, Canada.
- Practical solution: NTP
(Reading)
Other Reading:
Leslie Lamport and P. M. Melliar-Smith,
"Synchronizing clocks in the presence of faults"
Journal of the ACM 32(1) (January 1985).
Jennifer Lundelius and Nancy Lynch,
"A new fault-tolerant algorithm for clock synchronization,
Proceedings of the third annual ACM symposium on Principles of distributed computing
1984 , Vancouver, British Columbia, Canada.
First project due February 11.
- Process Synchronization
- Overview: Global State, Mutual Exclusion
Leslie Lamport, ``The Mutual Exclusion Problem'',
Journal of the ACM 33(2) (April 1986).
Read Part II section 2 - the rest is optional.
Leslie Lamport,
``1983
Invited address: Solved problems, unsolved problems and non-problems in concurrency,
Proceedings of the third annual ACM symposium on Principles of distributed computing,
1984, Vancouver, British Columbia, Canada.
Optional - Global State:
K. Mani Chandy and Leslie Lamport,
``Distributed
Snapshots: Determining Global States of Distributed Sytems'',
ACM Transactions on Computer Systems 3(1) (February 1985) 63-75.
- Fault Tolerant Solutions
Michael J. Fischer, Nancy A. Lynch, James E. Burns and Allan Borodin,
``Distributed
FIFO allocation of identical resources using small shared space''
ACM Transactions on Programming Languages and Systems 11(1) (1989)
pp. 90-114.
- Multiple resources Requirements
Please don't check these out - others may want to read them.
Dijkstra, E. ``Hierarchical Ordering of Sequential Processes'',
ACTA Informatica 1 (1971), 115-138.
M. Rabin and D. Lehmann,
``On the Advantages of Free Choice:
A Symmetric and Fully Distributed Solution to the Dining Philosophers Problem'',
Proceedings of the 8th Symposium on Principles of Programming Languagues
(1981) pp. 133-138.
Second project starts February 15.
- Distributed Transactions
Reading:
Skeen, Dale, ``A Formal Model of Crash Recovery in a Distributed System,''
IEEE Transactions on Software Engineering 9(3),
May 1983, pp.219-228.
(preliminary on-line version from SIGMOD'81)
Philip A. Bernstein, Vassos Hadzilacos, Nathan Goodman,
Concurrency Control and Recovery in Database Systems,
Chapter 7: Distributed Recovery,
Addison Wesley, 1987.
- Distributed Data: Replication
- Basics
Reading:
Philip A. Bernstein, Vassos Hadzilacos, Nathan Goodman,
Concurrency Control and Recovery in Database Systems,
Chapter 8: Replicated Data,
Addison Wesley, 1987.
- Example: Replication in Oracle
- Advanced Techniques: Quasi-Copies
Reading:
Rafael Alonso,
Daniel Barbará,
and Hector Garcia-Molina,
``Data caching issues in an information retrieval system'',
ACM Transactions on Database Systems (TODS)
15(3), September 1990.
Second project due March 1.
- Mid-Semester Review
March 8, in class: Midterm on material from weeks 1-7.
- Processes, code migration
Third project starts March 20.
- Distributed Object systems: CORBA (OMG)
Reading: CORBA Overview from
The Common Object Request Broker: Architecture and Specification, OMG group, 2001.
CORBA Security Service
(reading).
Third project due April 3,
fourth project starts.
- Distributed Object Systems:
- Distributed Coordination: Jini.
Further reading: Jan Newmarch's Guide to JINI Technologies.
- Fault Tolerance
- Failure models. Reading:
Dr. Flaviu Cristian,
Understanding Fault-Tolerant Distributed Systems,
Communications of the ACM 34(2) February 1991.
- Fault Tolerance Reading:
Felix C. Gärtner,
Fundamentals of Fault-Tolerant Distributed Computing in Asynchronous Environments
ACM Computing Surveys 31(1), March 1999.
- Reliable communication
- Recovery
Optional reading:
Richard Golding and Elizabeth Borowsky,
Fault-Tolerant Replication Management in Large-Scale Distributed Storage Systems,
in
Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems
18-21 October, 1999, Lausanne, Switzerland.
Hector Garcia-Molina, Christos A. Polyzois and Robert B. Hagmann,
Two Epoch Algorithms for Disaster Recovery,
in Proceedings of the
1990 conference on Very Large Data Bases,
Brisbane, Australia, August 13-16 1990.
Fourth project due April 19.
-
Review
Final exam
Thursday, May 2, 2002 from 1:00pm to 3:00pm in RHPH 164.