TDT4225 Very Large, Distributed Data Volumes




These are my notes for TDT4225 . These are my own understandings of the subject and it may not be completely correct or accurate. It is also written in slightly below average Norwegian English.

Why do I post them here?

There is a couple of reasons. The first is for convenience. I like to write Markdown as it is very easy to structure text easily and it does not take any focus away from the text, it also has great readability both uncompiled and compiled. It is also very practical to have this available on the web as i like to read my notes on my iPad, and it is faster to access here. The last reason is that someone may find them useful.

Google LevelDB

Chapter: 2.1, 2.2, 2.3, 2.6, 4.1-4.4


  • in-memory data structure
  • sorted by keys
  • key/value pairs or deletion markers
  • iterators
  • kept small, a few megabytes


  • Sorted string table
  • Sequence of entries sorted by key
  • key/value pairs or deletion markers
  • index at end
  • Kept small, typically 2 MB


Organized in levels:


Memtable and Sstable

Write path

Read path

Snapshots and iterators



Recovery and durability

The log contains complete records, which makes things recoverable.

The log may be disabled to increase performance.

Sstables are immutable and only new files are created during compaction.

Symas LMDB

Paper: Christian Forfang, Evaluation of High-Performance Key-Value Stores Chapter: 3.1, 3.2, 3.3, 3.5, 4.1-4.4

Developed for OpenLDAP as a replacement for BerkeleyDB.

Key/value store implemented using B+-trees.

Ordered map interface for range queries.

Transactions through MVCC (multi-version concurrency control)

Writers don't block readers

Readers don't block writers

Memory-mapped files, recovery free restart

No background compaction

CoW B+tree

Write transactions create a new version of the tree so that readers can use the old version. New version becomes current when write is finished.

New transactions are redirected to the new tree. Existing transactions may use the old tree.

Only one write transaction at a time.

When pages become unavailable, they are marked as reusable in a subdatabase.

Neighbour pages may be spread around on disk...

Peer-to-peer systems

Book: Distributed Systems - Concepts and Design Chapter: 10


First really big filesharing application. Quickly grew. Used some central servers for indexing. After initial query, the data was directly shared between clients.


Time and global states

Book: Distributed Systems - Concepts and Design Chapter: 14

Skew: Diff at a given time

Drift: Diff over time

Physical time

  • Point in time
  • Can be used to figure out order
  • Needs synchronized clocks

Perfect synchronization is impossible.

Logical time

  • Ordering of events
  • Ordering in focus
  • Usualy implemented using a counter

If the order of events is the only important thing, physical time is overkill.

Lamport clock

Vector clocks


  • Very accurate
  • Based on Atomic clock
  • Adjusted based on the earths rotation
  • Timezones are offsets of UTC
  • Distributed via GPS (1micros) and ground (1ms)


External sync: Distributed system synchronizing against external sources (e.g GPS)

Internal sync: Distributed system synchronizing internally in a distributed system. This does not necessarily give the correct time.

Problem: Communication takes time.

Christians algorithm

Implementation of External sync.

External time server S Local server p

  1. p sends a request to S
  2. S sends a response with the current time t
  3. When p gets response, set clock to time t + half the time since initial request.

Berkeley algorithm

Implementation of Internal sync.

One master, rest slaves

  1. Master polls all slaves
  2. Slaves responds with current clock
  3. Master calculates the average clock (ignores big diffs, takes transfer time into account)
  4. Master sends out individual diffs to every slave.

Network Time Protocol (NTP)

Protocol for time synchronization over the internet. Uses UTC.

Focus: - Security - Scalability - Correct - Reliability

The NTP servers i built up in a hierarchy where the root node is directly connected to a UTC source and the leaf nodes are client machines.

Synchronization: - Multicast - Procedure call (like Christians algorithm) - Symmetric mode (Communication between different servers, high accuracy)

Global state

Why: - Distributed garbage collection - Distributed deadlock detection - Distributed debugging

How? Cuts and global consistent states.

Distributed garbage collection

Objects without a active reference is garbage.

Reference: - Local - At other nodes/hosts/processes - In messages

Distributed deadlock detection

wait-for loop

Distributed debugging


A subset of the global history.

Consistent Cut

For all events e in cut C......

Coordination and agreement

Book: Distributed Systems - Concepts and Design Chapter: 15


Reliable channel: The message always makes it through, it is not changed and faults are handled before the end system.

Async system: Cannot make assumptions of time.

Sync system: - Max time of message transfer - Max time of instruction execution - Timeouts to detect errors

Distributed Mutual exclusion

Multiple processes shares a resource.

Now, distributed with muliple machines, uses messages to communicate.

Algorithms: - Central server - Token-ring - Multicast - Polling

Central server

All processes ask for access. The access is granted to one process at a time. The server can quickly become a bottleneck.

Messagetypes: request, grant and release.


A token is sent throug the ring, the process which holds the token can access the critical resource. The token is located at the same process during the entier operation.

Needs 0 to N messages to enter, 1 message to exit.


Needs 2(N-1) messages to enter and 0 messages to exit.


Needs 2sqrt(N) to enter and sqrt(N) to exit. Better than multicast if N > 4.

File systems

Book: Operating Systems in Depth Chapter: 6






Improving performance


Fast storage for storing full tracks of information. Improves reads, not writes.

Larger blocks, improved layout

Major innovations of FFS/UFS.

Problem with larger blocks is wasted space. FFS used complicated strategy, split blocks into fragments, where a fragment where two sectors. The FS must keep track of all the free blocks, but also the partially filled blocks. Files must be put in continouos fragments.

FFS tries to arrange tings so files can be expanded.

Optimization-for-time policy

As long as there is plenty of free blocks, whole allocation is used.

Optimization-for-space policy

Allocating partial blocks containing just the number of fragments needed and pays the price for copying when files grow.

Keep data blocks close to the corresponding inodes on disk. Keep inodes for directories close to one another.

Regions / Cylinder groups / Block groups

A group contains a space set aside for inodes as well as a space for data and indirect blocks.

It is important to identify cylinder groups with plenty of free space. FFS uses quadratic hashing to quickly find groups with sufficient free space and allocates groups uniformly.

Reducing rotational latency

Reducing latency is done by placing successive blocks one block apart. This technique is called block interleaving.

Clustering and extents

Block clustering

Used in ext2 and later versions of FFS. Instead of using block interleaving, clustering is done by grouping blocks together so that more can be read in a sequence. The preallocations is stashed together as fixed-sized blocks.


Files are treated as collections of larger contiguous regions of disk space called extents. One large file can consist of a single extent, allowing fast access with little metadata. External fragmentation is a problem.

Aggressive Caching and Optimized writes

Soft updates

Journaled FS

Shadow-paged file system / Copy-on-write

Amazon Dynamo


Facebook TAO




Google Spanner