Distributed Systems

CSCI189A PO-01 — Spring 2016 — Pomona College


This schedule is tentative and subject to change during the semester.

Date Lecture Reading Assignments
W 1/20 Introduction to distributed systems 1.1, 1.2  
M 1/25 Communication: Internet 1 2.1, Design philosophy of the Internet  
W 1/27 Communication: Internet 2   Homework 1
M 2/1 Introduction to Go Intro to programming in Go Chapters 1-7  
W 2/3 Concurrency in Go Intro to programming in Go Chapters 8-14
Concurrency notes by Kesden
M 2/8 Communication: Remote procedure calls 4.1, 4.2  
W 2/10 No class    
M 2/15 Coordination: Time 6.1, 6.2 Project 1 due 2/26 at 11:59pm on Sakai
W 2/17 Coordination: Mutual exclusion
Coordination: Leader election
6.3, 6.5 Homework 2 due 2/29 at 11:59pm on Sakai
M 2/22 Coordination: Distributed hash tables    
W 2/24 Fault tolerance: Error detection and correction    
M 2/29 Fault tolerance: RAID   Homework 3 due 3/11 at 11:59pm on Sakai
W 3/2 Concurrency control 8.5 Project 2
M 3/7 Fault-tolerance: logging and crash recovery 8.6  
W 3/9 Exam 1    
  Spring recess    
M 3/21 Case study: distributed file systems    
W 3/23 Replication: sequential consistency    
M 3/28 Replication: causal, eventual consistency    
W 3/30 Replication: strong eventual consistency, CRDTs   Homework 4 due 4/8 at 11:59pm on Sakai
M 4/4 Consensus: 2PC and 3PC 8.5  
W 4/6 Consensus: Paxos Paxos made simple Project 2 part A due 4/6 at 11:59pm on Sakai
M 4/11 Case study: Chubby lock service The Chubby Lock Service for Loosely-Coupled Distributed Systems  
W 4/13 Security and cryptography 9.1, 9.2, 9.4.1  
M 4/18 MapReduce, Pregel    
W 4/20 Case study: Bitcoin Michael Nielsen’s blog post Homework 5 due 4/20 at 11:59pm on Sakai
M 4/25 Multicast and membership 4.5.2., 8.2.4  
W 4/27 Distributed snapshots An introduction to snapshot algorithms in distributed computing Project 2 part B due 4/27 at 11:59pm on Sakai
Homework 6 due 4/29 at 11:59pm on Sakai
M 5/2 Presentations    
W 5/4 Presentations   Exam 2 due 5/5 at 11:59pm on Sakai
  Finals week: no meeting    

Introduction to distributed systems

What is a distributed system?

Why distributed systems?

Design challenges

Key approaches

Projects in this class

Project 1: multi-echo server

Project 2: Bitcoin miner

Project 3: Tribbler

Project 4: design your own distributed service


↑ back to the top

Internet 1

Top level goal: “develop an effective technique for multiplexed utilization of existing interconnected networks”

Design philosophy of the DARPA internet protocols (1988)

Secondary goals

Note: priorities for a military use, but would be different for commercial use

Internet communication must continue despite loss of network or gateways

The Internet must support multiple types of communication service.

The Internet architecture must permit distributed management of its resources

The Internet architecture must be cost effective.

↑ back to the top

Internet 2

Review question

Of TCP and UDP, which is most similar to postal mail? to the telephone service?

What are the similarities?

Review question 2

Why datagrams?

Internet wrap up

Network diagnostic tools

Coping with addresses running out


The Internet is not secure:


Web architecture

Acknowledgement: this section is based on the lecture notes by Roxana Geambasu, Peter Du, Yu Qiao.

External caching

Front-end tier: webservers

Application server tier

Database tier

Internal caching tier

Miscellaneous services



↑ back to the top

Remote Procedure calls

Why remote procedure calls?

How it works under the hood

How it’s different from a local procedure call


↑ back to the top


Synchronizing clocks

Measuring time


Why synchronize clocks in a distributed system?

The formal synchronization problem

Solution 1: Christian’s algorithm

Solution 2: Berkeley algorithm

Solution 3: NTP

Logical clocks

Lamport timestamps (1978)

Vector clocks

↑ back to the top

Last time: crucial aspect of coordinated work: achieving global state
Today: two other aspects of coordinated work: reservation of shared resources and selecting a coordinator

Mutual exclusion

Solution 1: Token ring

Solution 2: Centralized algorithm

Solution 3: Decentralized algorithm (Lin et al. 2004) see homework

Solution 4: Distributed algorithm (Ricart and Agrawala 1981)

Comparing the 4 solutions

Leader election

Why? (applications)

What? (problem 1)

Solution 1: ring algorithm

Solution 2: bully algorithm (Garcia-Molina, 1982)

Unrealistic assumptions of the problem

Problem: election in a wireless environment

Solution (proposed by Vasudevan et al. 2004)

Problem: election in a P2P network



↑ back to the top

Distributed hash tables

Chord P2P DHT or Tannenbaum 5.2.3

Intro: caching problem

Stable hashing

↑ back to the top

Error detection and correction


Error detection and correction

↑ back to the top


Why use multiple disks?

Simplest way of using multiple disks: JBOD

Performance and load-balancing


RAID - redundant array of inexpensive/independent disks

Reliability in numbers

↑ back to the top

Concurrency control

Problem: how to implement concurrent bank transfers in a distributed fashion?

transfer(val, i, j):
  if withdraw(val, i):
    deposit(val, j)

withdraw(val, i):
  if bal[i] >= val
    bal[i] -= val
    return true
  return false

deposit(val, j):
  bal[j] += val

Database concepts

Solutions on a single server

  1. global lock: causes bottleneck
  2. one lock per account, wrapping lock around each operation separately: not isolated
  3. acquire both locks, implement transfer as usual, release locks: can deadlock

Coping with deadlocks

  1. timeouts: but these can lead to livelocks
  2. build wait-for graph (nodes are processes) and break the cycles
  3. order the nodes of the graph

The 2 phases of a 2-phase lock:

  1. acquire the locks
  2. release the locks

Solution viewed as a 2-phase commit:

  1. preparation: acquire locks, decide on list of update operations
  2. action: commit or abort

Solution in a distributed system

↑ back to the top

Logging and crash recovery


Very simplified view of databases

Solution 1: shadow writing

Solution 2: write-ahead logging

↑ back to the top

Distributed file systems

Goal: have a consistent namespace for files across computer and let authorized users access files on any computer


Other goals: interface, fault tolerance, scalability, concurrency, security

Design approach

Naive design




↑ back to the top

Sequential consistency




Strict consistency

Sequential consistency

The CAP theorem

↑ back to the top

Causal and eventual consistency

Strong/strict consistency: after update completes, any subsequent access returns updated value. Achieved through mutual exclusion.

Weak consistency: system not guaranteed to return updated value after update is complete. A number of conditions must be satisfied.

Eventual consistency: specific form of weak consistency. Guarantee that if no new updates made to object, eventually all accesses will return last updated value. If no failure occurs, max size of inconsistency window is a function of communication delays, load on system and number of replicas.


Causal consistency: operations from the same process with independent causal chains can execute in relative order. Example: comments from a blog post.

Read-your-writes: after a process has updated a value always sees the new value, not any older ones.

Session consistency: process accesses storage system in context of a session.

Monotonic write: serialized writes by the same process, otherwise hard to program

Monotonic read: if a process sees a particular value for an object subsequent access will not return a previous value.

Sequential vs eventual

↑ back to the top

Strong eventual consistency and CRDTs

What are CRDTs?

CRDT types







Strong eventual consistency

Strong consistency

Eventual consistency

Strong eventual consistency

Formal definition of eventual consistency

Formal definition of strong eventual consistency


(An analogous result exists for operation-based updates.)

Observed-remove set

Example of observed-remove set with add-wins rule (Idea is to internally keep track of multiple copies of a element added at different nodes)


↑ back to the top

2PC and 3PC

Consensus problem




The protocol (recall from previous discussion) and recovery mechanism

2PC is a blocking protocol

2PC is safe but not live


Idea: 2PC’s problem stems from allowing commits without all parties knowing of the decision. Separate these into two phases.

The protocol and recovery mechanism

3PC solves the blocking problem

3PC is not safe

FLP impossibility result

↑ back to the top


Problem of consensus

Failure model

Idea of Paxos




↑ back to the top

The Chubby lock service

Responsibilities of a distributed lock manager


Architecture diagram

Chubby cell

Electing a master

Locating the master

File system interface:

Differences from UNIX file systems:

Files as locks



Example: electing a primary

Cache (server-side)

Scaling mechanisms

Study of use as DNS

Abusive uses

↑ back to the top

Security and cryptography

Internet security weaknesses

Secure channel

Symmetric-key encryption

Public-key encryption

Key distribution

↑ back to the top

MapReduce and Pregel


Etymology: from functional programming

Map-Reduce functions in Hadoop

Need to be able to group values of the same key, use of consistent hashinng

Example: word counter

Example: reverse weblink graph

Example: word frequency

Example: sort by value

Implementation details


Graph processing

Examples of graphs: Internet, WWW, social, biology

Graph algorithms

Processing steps (outline)

Example: PageRank

Example: Shortest Path

Why not use MapReduce?


Pregel execution

  1. master sends workers workload
  2. workers fetch input data from persistent storage (eg GFS)
  3. (repeat the following)
  4. master signals to workers beginning of iteration
  5. worker receives messsages, updates local value, sends messages

Pregel API

Bulk synchronization programming model (Valiant 1980s)

↑ back to the top


Bitcoin is a peer-to-peer digital currency.

Problem: how to prevent forging?

Solution: use digital signature to sign letter of intent, eg “I, Alice, give Bob one bitcoin,” signed by Alice.

Problem: this solves the problem of forging the letter, but not replay attacks by a 3rd party or Alice double-spending.

Solution: use a central bank to keep track of everyone’s balance, issuing serialization numbers for transactions or the bitcoins themselves and keep track of all transactions in a public ledger. (Example scenario)

Problem: must trust a central authority.

Solution: make everyone using the currency take on the bank’s responsibilities. The public ledger is a complete record of all transactions, called the block chain. Everyone has a copy of it. (Example scenario)

Problem: if everyone only checks against their own copy of the block chain in order to accept money from Alice, Alice can still double-spend. What should be the protocol for validating a transaction now the bank is decentralized, ie., now that the block-chain is replicated? And what kind of consistency should we use?

Solution: to validate a transaction, maybe validate against a certain number of other people’s block chains?

Problem: to double-spend, Alice could create multiple accounts in order to validate an invalid transaction.

Solution: ok, then make the process of validating a transaction computationally expensive by requiring the solution of to a puzzle (called proof of work) with the validation. Finding the solution must be computationally expensive, but verifying must not so others can validate your work. (example scenario). See project 2 Part B for details of puzzle. The difficulty of a computation can be controlled.

Problem: why would anybody want to do this work?

Solution: reward those who validate transactions by giving newly created currency or via transactions fees. Currently, 25 newly minted bitcoins awarded for every validated block of transactions. Reward is halved every 210,000 validated blocks. Eventually, will no longer be paid in new bitcoins, but in transaction fees. Those who validate transactions are called miners.

Problem: still did not address consistency. How is order determined?

Solution: New blocks of validated transactions have a pointer to last validated block on the block chain, creating a DAG of alternative histories of validated transactions (Forks can happen if different miners successfully create a new block at around the same time). The rule is for all miners to work on the longest fork where length is determined by total difficulty for producing that chain. A transaction is considered confirmed if it is part of the longest fork and at least 5 blocks follow it in the longest fork.

Security: Alice tries to get Bob and Charlie to accept the same bitcoin

Problem: who serializes the bitcoins?

Solution: don’t need to. Each transaction consists of an input and multiple outputs. Inputs point to an output of a previous transaction (source of fund) and outputs point to one or more recipient (specifically a hash of the recipient’s public key), and the paying party (mechanism for making change). This, together with the ledger, allows for all funds to be tracked.

Open questions

↑ back to the top

Multicast and membership

The multicast problem

Solution 1: centralized approach

Solution 2: Tree-based approach

Solution 3: Gossip protocol

Variant 1: push style

Variant 2: pull style

Variant 3: push-then-pull

Variant: Topology-aware gossip


Problem solved by a membership protocol

Desirable properties of failure detectors

Centralized heartbeating

Ring heartbeating

All-to-all hearbeating

Gossip-style heartbeating

SWIM failure detector

↑ back to the top

Distributed snapshots


Running example


Formal definition

Question: how can a channel’s state be recorded?

Chandy-Lamport algorithm for FIFO channels


Proof of correctness of the algorithm

Implementation detail: collecting a global snapshot

Lai-Yang algorithm for non-FIFO channels

↑ back to the top