CPSC 416 (Winter 22 Term 2)

Starting in January 2023 I will be teaching the Computer Science Department’s course in Distributed Systems (CPSC 416). This course has previously been taught by Ivan Beschastnikh, which means I have an interesting challenge ahead in filling his shoes. The course I am teaching will be based upon his prior course, albeit with some modifications that I bring from my own experience as part of the instructional team for Georgia Tech’s CS 7210 course.

First: this course is challenging. It is not gratuitously challenging (e.g., I won’t make you suffer because I suffered and think you somehow need to “pay your dues”). My goal is to help you build a mental model for understanding the field, but the field is difficult.

Distributed Systems are difficult because:

  • The primary concern for this field is failure. Most of the courses you have taken teach you how to do things. Frequently “failure” means killing the program, rebooting the computer, or fixing the bug in the code. In distributed systems we have to figure out how to make things work in the presence of failure.
  • A major challenge is that, unlike programs running on a single computer, there is limited fate sharing. In other words, parts of the system can keep running while other parts have failed. When different parts of the system do not have the same view of reality we lose the ability to know what the “correct” state of the system is.
  • The most common technique we use is to construct consensus mechanisms – getting different independent parts of the system to agree on how to handle failures. The benefit of this is that if we do it right we can project the aura of a single, coherent system capable of providing concrete guarantees to our “customers.”

The dirty secret is that, in fact, we cannot guarantee the ability to handle all possible failures. A concrete example of this is the Two Generals Problem. Thus, what we do in distributed systems is provide specific guarantees within some defined set of failures.
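To make the Two Generals Problem a little more concrete, here is a minimal sketch of my own (a toy model, not a proof, and not taken from the course materials). Two generals, A and B, alternate messages over a channel that may lose them. The function names `run_exchange` and `last_sender_view` are made up for illustration. The point it shows: whoever sends the final message sees exactly the same thing whether that message arrived or was lost, so adding more acknowledgements never removes the uncertainty.

```python
def run_exchange(delivered):
    """Simulate generals A and B alternating messages over a lossy channel.

    delivered[i] is True if message i (0-based) arrives. A sends message 0,
    B replies with message 1, and so on. The exchange stops at the first
    lost message, because the reply it would have triggered is never sent.
    Returns (received_by_a, received_by_b).
    """
    received = {"A": 0, "B": 0}
    for i, arrived in enumerate(delivered):
        if not arrived:
            break
        receiver = "B" if i % 2 == 0 else "A"
        received[receiver] += 1
    return received["A"], received["B"]


def last_sender_view(delivered):
    """Return what the sender of the final message can observe.

    In this toy model a general's only knowledge is the number of
    messages it has received so far.
    """
    last_sender = "A" if (len(delivered) - 1) % 2 == 0 else "B"
    received_by_a, received_by_b = run_exchange(delivered)
    return received_by_a if last_sender == "A" else received_by_b


if __name__ == "__main__":
    for n in range(1, 6):
        all_arrive = [True] * n
        last_lost = [True] * (n - 1) + [False]
        same = last_sender_view(all_arrive) == last_sender_view(last_lost)
        print(f"{n} message(s): last sender cannot tell the difference: {same}")
```

However long the planned exchange is, the two scenarios look identical to the last sender. That is why we settle for guarantees within a defined failure model rather than guarantees against everything.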

Why do distributed systems matter? Because, while they are not 100% proof against all possible failures, they can achieve high – but not absolute – reliability. Cloud computing vendors use the techniques we will discuss in this course to build “high availability systems.” Their failure model extends to protecting against regional-level disasters.

Over the term I will be updating the materials I have provided here as necessary: