CPSC 416 Winter 2023 Term 1

Note: This is preliminary and will be changing over the next couple of weeks. The course starts September 5, 2023. The first in-person lecture is September 7, 2023.

The first time that I taught this course (Distributed Systems (CPSC 416)) was January to April 2023 (UBC called it Winter 2022 Term 2). This offering will differ from the prior one, so while you are free to review the materials from that offering, do not expect this one to be the same.

First: this course is challenging. It is not gratuitously challenging (e.g., I won’t make you suffer because I suffered and think you somehow need to “pay your dues”). My goal is to help you build a mental model for understanding the field, but the field is difficult.

Distributed Systems are difficult because:

  • The primary concern for this field is dealing with failure. Most of the courses you have taken teach you how to do things; in those courses, “failure” frequently means killing the program, rebooting the computer, or fixing the bug in the code. In distributed systems we have to figure out how to make things work in the presence of failure.
  • A major challenge is that, unlike programs running on a single computer, there is limited fate sharing. In other words, parts of the system can keep running while other parts have failed. When different parts of the system do not have the same view of reality, we lose the ability to know what the “correct” state of the system is.
  • We care about persistent state. What this means is that, unlike ephemeral state where we can reboot to recover, any errors we make will be there after we reboot. That’s why this version of the course is organized around the two primary ways we store persistent data: databases and file systems. They’re really mostly the same, with different access interfaces. Even databases come in a variety of flavours, depending upon the access model for which they optimize (e.g., SQL, NoSQL, Graph, Vector, etc.)
  • We are primarily interested in just two core problems:
    • How do we ensure that related changes to two different databases are made consistently? In other words, your account balance should be adjusted if and only if the bank machine gives you the cash you requested. For this we use transactional consistency mechanisms (see the two-phase commit sketch after this list).
    • How do we ensure that our databases can recover from the failure of a single copy (or instance) of the database? For this we use consensus algorithms (see the quorum sketch after this list).
  • We can easily develop working solutions that are not practically usable because they are too slow. The complication is that we must optimize our techniques while preserving their correctness.
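
To make the transactional consistency problem concrete, here is a minimal, single-process sketch of two-phase commit (2PC), one classic mechanism for committing related changes everywhere or nowhere. The Participant type and its methods are invented for illustration; a real implementation would run participants on separate machines and handle coordinator failure, timeouts, and durable logging.

```go
package main

import "fmt"

// Participant is a stand-in for one database in the transaction
// (e.g., the account ledger or the ATM's cash drawer).
type Participant struct {
	name string
}

// Prepare is phase 1: the participant votes on whether it can
// commit (here the vote is simply passed in for illustration).
func (p *Participant) Prepare(canCommit bool) bool {
	return canCommit
}

// Commit and Abort are phase 2: apply or discard the change.
func (p *Participant) Commit() { fmt.Printf("%s: committed\n", p.name) }
func (p *Participant) Abort()  { fmt.Printf("%s: aborted\n", p.name) }

// twoPhaseCommit ensures all participants commit or none do.
func twoPhaseCommit(parts []*Participant, votes []bool) bool {
	// Phase 1: collect votes; any "no" aborts the transaction.
	for i, p := range parts {
		if !p.Prepare(votes[i]) {
			for _, q := range parts {
				q.Abort()
			}
			return false
		}
	}
	// Phase 2: everyone voted yes, so everyone commits.
	for _, p := range parts {
		p.Commit()
	}
	return true
}

func main() {
	ledger := &Participant{name: "account ledger"}
	drawer := &Participant{name: "cash drawer"}
	both := []*Participant{ledger, drawer}

	twoPhaseCommit(both, []bool{true, true})  // both commit
	twoPhaseCommit(both, []bool{true, false}) // both abort
}
```

The property to notice is the “if and only if” from the bank example: the ledger is debited exactly when the drawer dispenses, because a single “no” vote in phase 1 makes every participant abort.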
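
Similarly, here is a minimal sketch of the majority-quorum idea at the heart of consensus protocols such as Paxos and Raft: a write counts as durable once a majority of replicas acknowledge it, so with three replicas the data survives the failure of any single copy. The Replica type is invented for illustration; real protocols must also handle leader election, log ordering, and recovery of failed replicas.

```go
package main

import "fmt"

// Replica is a stand-in for one copy of the database.
type Replica struct {
	up    bool   // is this replica reachable?
	value string // the last value it stored
}

// replicate writes to every live replica and succeeds only when a
// majority acknowledges. With 2f+1 replicas, the write survives
// the failure of up to f copies.
func replicate(replicas []*Replica, value string) bool {
	acks := 0
	for _, r := range replicas {
		if r.up {
			r.value = value
			acks++
		}
	}
	return acks > len(replicas)/2
}

func main() {
	// Three replicas, one of which has failed.
	rs := []*Replica{{up: true}, {up: true}, {up: false}}
	fmt.Println(replicate(rs, "balance=40")) // true: 2 of 3 acked

	// A second failure leaves no majority, so the write fails.
	rs[1].up = false
	fmt.Println(replicate(rs, "balance=10")) // false: 1 of 3 acked
}
```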

We cannot protect against all possible failures. What we do, instead, is identify the failures we do want to protect against. Over time, we can expand the set of failures we can handle. Of course, it turns out that verifying you got it right is hard. It’s like backups: the time you discover they didn’t work is when you need them, and failure at that point often means data loss. There are numerous articles that describe the impact of data loss on organizations. Here’s a good quotation from one: “[M]any companies don’t understand what constitutes their most critical data and how to protect it.”

Why do distributed systems matter? Because, while they are not 100% proof against all possible failures, they can achieve high – but not absolute – reliability. Cloud compute vendors use the techniques we will discuss in this course to build “high availability” systems. Their failure model extends to protecting against regional-level disasters.

During the term I will be updating the materials I have provided here as necessary: