DDIA - 2025
Chapter 1: Reliable, Scalable, and Maintainable Applications
A data-intensive application is typically built from standard building blocks. They usually need to:
- Store data (databases)
- Speed up reads (caches)
- Search data (search indexes)
- Send a message to another process asynchronously (stream processing)
- Periodically crunch data (batch processing)
The ultimate reason for all these different databases: different applications have different requirements.
Thinking about Data Systems
Note: Kafka and the system we are building are very similar
- Reliability. To work correctly even in the face of adversity.
- Scalability. Reasonable ways of dealing with growth.
- Maintainability. Be able to work on it productively.
Reliability
A fault is usually defined as one component of the system deviating from its spec, whereas failure is when the system as a whole stops providing the required service to the user.
Note: check the Netflix Chaos Monkey
Hardware faults
There is a move towards systems that tolerate the loss of entire machines. A system that tolerates machine failure can be patched one node at a time, without downtime of the entire system (rolling upgrade).
Note: talk about the number of instance failures in our system
Software Errors
- Software errors. It is unlikely that a large number of hardware components will fail at the same time. Software errors are systematic errors within the system; they tend to cause many more system failures than uncorrelated hardware faults.
Human Errors
- Human errors. Humans are known to be unreliable. Configuration errors by operators are a leading cause of outages. You can make systems more reliable:
- Minimising the opportunities for error, e.g. with admin interfaces that make it easy to do the “right thing” and discourage the “wrong thing”.
My note: check the name of the script
- Provide fully featured non-production sandbox environments where people can explore and experiment safely.
- Automated testing.
- Quick and easy recovery from human errors: make it fast to roll back configuration changes, roll out new code gradually, and provide tools to recompute data.
- Set up detailed and clear monitoring, such as performance metrics and error rates (telemetry).
- Implement good management practices and training.
Note: me causing an availability outage
Scalability
Scalability is how we cope with increased load. We need to succinctly describe the current load on the system; only then can we discuss growth questions.
Twitter example
Twitter moved to a hybrid of both approaches. Tweets continue to be fanned out to home timelines, but tweets from a small number of users with a very large number of followers are fetched separately and merged with the user’s home timeline when it is read, as in approach 1.
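A toy sketch of that hybrid in Python; the celebrity threshold, the data structures, and the function names are illustrative assumptions, not Twitter’s actual design.
```python
# Hybrid fan-out sketch: fan out on write for ordinary users,
# fetch-on-read for celebrities. All names/thresholds are made up.
from collections import defaultdict

CELEBRITY_THRESHOLD = 1_000_000        # assumed follower cut-off

followers = defaultdict(set)           # author -> follower ids
home_timelines = defaultdict(list)     # user -> precomputed tweets
celebrity_tweets = defaultdict(list)   # celebrity -> own tweets

def post_tweet(author, tweet):
    if len(followers[author]) >= CELEBRITY_THRESHOLD:
        # Too many followers: skip the fan-out on write.
        celebrity_tweets[author].append(tweet)
    else:
        # Fan out on write: push the tweet into each follower's timeline.
        for user in followers[author]:
            home_timelines[user].append(tweet)

def read_home_timeline(user, following):
    # Merge the precomputed timeline with celebrity tweets fetched
    # at read time (like approach 1).
    merged = list(home_timelines[user])
    for author in following:
        if len(followers[author]) >= CELEBRITY_THRESHOLD:
            merged.extend(celebrity_tweets[author])
    return merged
```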
Describing performance
Performance numbers:
- Batch processing systems (e.g. Hadoop): throughput
- Online systems: server response time
Note: how EFS measures performance
Latency and response time
It is better to use percentiles than averages.
Tail latency (the high percentiles) directly affects customers.
Note: how we measure latency
Percentile in practice
Always try to run the workload and measure response times from the client side.
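Since the notes recommend percentiles over averages, here is a minimal sketch of computing them from raw client-side measurements using the nearest-rank method; the sample response times are made up.
```python
# Compute response-time percentiles from raw measurements.
import math

def percentile(samples, p):
    """Nearest-rank percentile over a list of measurements."""
    ranked = sorted(samples)
    index = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[index]

response_times_ms = [12, 14, 15, 15, 17, 21, 22, 25, 31, 480]
for p in (50, 95, 99):
    print(f"p{p} = {percentile(response_times_ms, p)} ms")
# The high percentiles are dominated by the one slow request:
# that outlier is the tail latency your slowest customers see.
```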
Approaches for coping with load
Distributing stateless services across multiple machines is fairly straightforward. Taking stateful data systems from a single node to a distributed setup can introduce a lot of complexity. Until recently it was common wisdom to keep your database on a single node.
Maintainability
- Operability. Make it easy for operation teams to keep the system running.
- Simplicity. Easy for new engineers to understand the system by removing as much complexity as possible.
- Evolvability. Make it easy for engineers to make changes to the system in the future.
Operability: making life easy for operations
A good operations team is responsible for:
- Monitoring and quickly restoring service if it goes into a bad state
Note: lighthouse and tickets
- Tracking down the cause of problems
Note: RCA
- Keeping software and platforms up to date
Note: patching
- Keeping tabs on how different systems affect each other
- Anticipating future problems
- Establishing good practices and tools for development
- Performing complex maintenance tasks, like platform migrations
- Maintaining the security of the system
- Defining processes that make operations predictable
Note: Runbook
- Preserving the organisation’s knowledge about the system
Note: accidental complexity is a major part of the problem. How to solve it: abstraction
Chapter 2: Data Models and Query Languages
Most applications are built by layering one data model on top of another. Each layer hides the complexity of the layers below by providing a clean data model. These abstractions allow different groups of people to work effectively.
Relational model vs document model
The roots of relational databases lie in business data processing, transaction processing and batch processing.
The goal was to hide the implementation details behind a cleaner interface.
Not Only SQL (NoSQL) has a few driving forces:
- Greater scalability
- Preference for free and open source software
- Specialized query optimizations
- Desire for a more dynamic and expressive data model
The network model
Standardized by a committee called the Conference on Data Systems Languages, the CODASYL model was a generalisation of the hierarchical model. In the tree structure of the hierarchical model every record has exactly one parent, while in the network model a record could have multiple parents.
The links between records are like pointers in a programming language. The only way of accessing a record was to follow a path from a root record; such a chain of links was called an access path.
A query in CODASYL was performed by moving a cursor through the database, iterating over lists of records. If you didn’t have a path to the data you wanted, you were in a difficult situation, and this made it difficult to change an application’s data model.
The relational model
By contrast, the relational model was a way to lay out all the data in the open: a relation (table) is simply a collection of tuples (rows), and that’s it.
The query optimizer automatically decides which parts of the query to execute in which order, and which indexes to use (the access path).
The relational model thus made it much easier to add new features to applications.
The main arguments in favour of the document data model are schema flexibility, better performance due to locality, and sometimes data structures closer to the ones used by the application. The relational model counters by providing better support for joins, and for many-to-one and many-to-many relationships.
Schema flexibility
Most document databases do not enforce any schema on the data in documents. Arbitrary keys and values can be added to a document; when reading, clients have no guarantees as to what fields the documents may contain.
Document databases are sometimes called schemaless, but maybe a more appropriate term is schema-on-read, in contrast to schema-on-write.
Schema-on-read is similar to dynamic (runtime) type checking, whereas schema-on-write is similar to static (compile-time) type checking.
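As a small illustration of schema-on-read, here is a sketch in which the application code, not the database, copes with documents written under both an old and a new schema; the field names follow the book’s first_name example, but the documents themselves are made up.
```python
# Schema-on-read: the reader handles both document shapes.
old_doc = {"name": "Barbara Liskov"}                       # old schema
new_doc = {"first_name": "Barbara", "last_name": "Liskov"} # new schema

def first_name(document):
    if "first_name" in document:           # written with the new schema
        return document["first_name"]
    return document["name"].split(" ")[0]  # fall back to the old schema

assert first_name(old_doc) == first_name(new_doc) == "Barbara"
```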
Data locality for queries
If your application often needs to access the entire document, there is a performance advantage to this storage locality.
The database typically needs to load the entire document, even if you access only a small portion of it. On updates, the entire document usually needs to be rewritten, so it is recommended that you keep documents fairly small.
Convergence of document and relational databases
PostgreSQL does support JSON documents. RethinkDB supports relational-like joins in its query language and some MongoDB drivers automatically resolve database references. Relational and document databases are becoming more similar over time.
Query languages for data
SQL is a declarative query language. In an imperative language, you tell the computer to perform certain operations in order.
In a declarative query language you just specify the pattern of the data you want, but not how to achieve that goal.
A declarative query language hides implementation details of the database engine, making it possible for the database system to introduce performance improvements without requiring any changes to queries.
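To make the contrast concrete, here is a small sketch, loosely after the book’s sharks example (the animals data is made up), showing the same query written imperatively in Python and declaratively in SQL via SQLite.
```python
# Imperative vs declarative: same result, different styles.
import sqlite3

animals = [("sharks", "Carcharodon carcharias"),
           ("sharks", "Carcharias taurus"),
           ("whales", "Balaenoptera musculus")]

# Imperative: we spell out *how* to walk the list and what to keep.
sharks = []
for family, species in animals:
    if family == "sharks":
        sharks.append(species)

# Declarative (SQL): we state *what* we want; the query optimizer
# chooses the access path and execution order.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE animals (family TEXT, species TEXT)")
db.executemany("INSERT INTO animals VALUES (?, ?)", animals)
rows = [r[0] for r in db.execute(
    "SELECT species FROM animals WHERE family = 'sharks'")]

assert sharks == rows
```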
MapReduce querying
MapReduce is a programming model for processing large amounts of data in bulk across many machines, popularised by Google.
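As a rough illustration of the programming model, here is a toy single-machine sketch in Python; a real framework runs the map and reduce phases across many machines. The observations data and field names are hypothetical, loosely following the book’s shark-sightings example.
```python
# Toy MapReduce: map emits (key, value) pairs, a "shuffle" groups
# values by key, and reduce aggregates each group.
from collections import defaultdict

observations = [
    {"date": "1995-12-25", "family": "Sharks", "num_animals": 3},
    {"date": "1995-12-12", "family": "Sharks", "num_animals": 4},
    {"date": "1995-11-10", "family": "Sharks", "num_animals": 1},
]

def map_fn(doc):
    # Emit one (year-month, count) pair per document.
    yield (doc["date"][:7], doc["num_animals"])

def reduce_fn(key, values):
    return key, sum(values)

# Shuffle: group all emitted values by key.
groups = defaultdict(list)
for doc in observations:
    for key, value in map_fn(doc):
        groups[key].append(value)

print([reduce_fn(k, vs) for k, vs in sorted(groups.items())])
# -> [('1995-11', 1), ('1995-12', 7)]
```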
Graph-like data models
If many-to-many relationships are very common in your application, it becomes more natural to start modelling your data as a graph.
A graph consists of vertices (nodes or entities) and edges (relationships or arcs).
There are several ways of structuring and querying graph data: the property graph model (implemented by Neo4j, Titan, and Infinite Graph) and the triple-store model (implemented by Datomic, AllegroGraph, and others). There are also three declarative query languages for graphs: Cypher, SPARQL, and Datalog.
The foundation: Datalog
Datalog provides the foundation that later query languages build upon. Its model is similar to the triple-store model, generalised a bit: instead of writing a triple as (subject, predicate, object), we write it as predicate(subject, object).
We define rules that tell the database about new predicates; rules can refer to other rules, just as functions can call other functions or themselves recursively.
Rules can be combined and reused in different queries. It’s less convenient for simple one-off queries, but it can cope better if your data is complex.
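To make the rule mechanism concrete, here is a hedged sketch of naive bottom-up Datalog evaluation in plain Python (not a real Datalog engine); the facts and the within/within_recursive predicates are hypothetical, modelled on the classic example.
```python
# Facts are predicate(subject, object) triples stored as Python tuples.
facts = {
    ("within", "Lucerne", "Switzerland"),
    ("within", "Switzerland", "Europe"),
    ("born_in", "Lucy", "Lucerne"),
}

def step(facts):
    """Apply the rules once; return the (possibly enlarged) fact set."""
    derived = set(facts)
    for (p, x, y) in facts:
        # within_recursive(X, Y) :- within(X, Y)
        if p == "within":
            derived.add(("within_recursive", x, y))
    for (p1, x, y) in facts:
        for (p2, y2, z) in facts:
            # within_recursive(X, Z) :- within(X, Y), within_recursive(Y, Z)
            if p1 == "within" and p2 == "within_recursive" and y == y2:
                derived.add(("within_recursive", x, z))
    return derived

# Iterate until no new facts can be derived (a fixpoint).
while True:
    bigger = step(facts)
    if bigger == facts:
        break
    facts = bigger

# Query: who was born somewhere within Europe?
print([s for (p, s, o) in facts
       if p == "born_in"
       and ("within_recursive", o, "Europe") in facts])
# -> ['Lucy']
```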
Property graphs
Add graph here
- Cypher is a declarative language for property graphs created by Neo4j
- Graph queries in SQL. In a relational database, you usually know in advance which joins you need in your query. In a graph query, the number of joins is not fixed in advance. In Cypher, :WITHIN*0.. expresses “follow a WITHIN edge, zero or more times” (like the * operator in a regular expression). In SQL, this idea of variable-length traversal paths can be expressed using recursive common table expressions (the WITH RECURSIVE syntax), as sketched below.
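Here is a minimal runnable sketch of such a variable-length traversal with WITH RECURSIVE, using SQLite from Python; the within_edge table and the place names are assumptions for illustration.
```python
# Variable-length traversal with a recursive common table expression.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE within_edge (child TEXT, parent TEXT);
INSERT INTO within_edge VALUES
  ('Lucerne', 'Switzerland'),
  ('Switzerland', 'Europe');
""")

# All places reachable from 'Europe' by following WITHIN edges,
# zero or more times: the SQL analogue of Cypher's :WITHIN*0..
rows = db.execute("""
WITH RECURSIVE in_europe(name) AS (
  SELECT 'Europe'
  UNION
  SELECT child FROM within_edge
  JOIN in_europe ON within_edge.parent = in_europe.name
)
SELECT name FROM in_europe;
""").fetchall()
print(rows)  # [('Europe',), ('Switzerland',), ('Lucerne',)]
```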
Triple-stores and SPARQL
In a triple-store, all information is stored in the form of very simple three-part statements: subject, predicate, object (e.g. Jim, likes, bananas). The subject of a triple is equivalent to a vertex in a graph.
The SPARQL query language
SPARQL is a query language for triple-stores using the RDF data model.
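As a small closing illustration (not from the book’s text), here is a sketch that loads a single RDF triple and queries it with SPARQL via the rdflib library (pip install rdflib); the ex:likes vocabulary is hypothetical.
```python
# Querying a tiny in-memory triple-store with SPARQL via rdflib.
from rdflib import Graph

turtle_data = """
@prefix ex: <http://example.org/> .
ex:jim ex:likes ex:bananas .
"""

g = Graph()
g.parse(data=turtle_data, format="turtle")

results = g.query("""
PREFIX ex: <http://example.org/>
SELECT ?thing WHERE { ex:jim ex:likes ?thing . }
""")
for row in results:
    print(row.thing)  # http://example.org/bananas
```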