DDIA - 2025
Chapter 1: Reliable, Scalable, and Maintainable Applications
A data-intensive application is typically built from standard building blocks. They usually need to:
- Store data (databases)
- Speed up reads (caches)
- Search data (search indexes)
- Send a message to another process asynchronously (stream processing)
- Periodically crunch data (batch processing)
The ultimate reason for all these different databases: different applications have different requirements.
Thinking about Data Systems
Note: Kafka and the system we are building are very similar
- Reliability. To work correctly even in the face of adversity.
- Scalability. Reasonable ways of dealing with growth.
- Maintainability. Be able to work on it productively.
Reliability
A fault is usually defined as one component of the system deviating from its spec, whereas failure is when the system as a whole stops providing the required service to the user.
Note: check the Netflix Chaos Monkey
Hardware faults
There is a move towards systems that tolerate the loss of entire machines. A system that tolerates machine failure can be patched one node at a time, without downtime of the entire system (rolling upgrade).
Note: talk about the number of instance failures in our system
Software Errors
- Software errors. It is unlikely that a large number of hardware components will fail at the same time. Software errors are systematic errors within the system; they tend to cause many more system failures than uncorrelated hardware faults.
Human Errors
- Human errors. Humans are known to be unreliable. Configuration errors by operators are a leading cause of outages. You can make systems more reliable:
- Minimising the opportunities for error, e.g. with admin interfaces that make it easy to do the “right thing” and discourage the “wrong thing”.
My note: check the name of the script
- Provide fully featured non-production sandbox environments where people can explore and experiment safely.
- Automated testing.
- Quick and easy recovery from human errors: make it fast to roll back configuration changes, roll out new code gradually, and provide tools to recompute data.
- Set up detailed and clear monitoring, such as performance metrics and error rates (telemetry).
- Implement good management practices and training.
Note: me causing an availability outage
Scalability
Scalability is how we cope with increased load. We need to succinctly describe the current load on the system; only then can we discuss growth questions.
Twitter example
Twitter moved to a hybrid of both approaches. Tweets continue to be fanned out to home timelines, but tweets from a small number of users with a very large number of followers are fetched separately and merged with the user’s home timeline when it is read, as in approach 1.
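A toy sketch of that hybrid in Python; the celebrity threshold, the data structures, and the function names are illustrative assumptions, not Twitter’s actual design.
```python
# Hybrid fan-out sketch: fan out on write for ordinary users,
# fetch-on-read for celebrities. All names/thresholds are made up.
from collections import defaultdict

CELEBRITY_THRESHOLD = 1_000_000        # assumed follower cut-off

followers = defaultdict(set)           # author -> follower ids
home_timelines = defaultdict(list)     # user -> precomputed tweets
celebrity_tweets = defaultdict(list)   # celebrity -> own tweets

def post_tweet(author, tweet):
    if len(followers[author]) >= CELEBRITY_THRESHOLD:
        # Too many followers: skip the fan-out on write.
        celebrity_tweets[author].append(tweet)
    else:
        # Fan out on write: push the tweet into each follower's timeline.
        for user in followers[author]:
            home_timelines[user].append(tweet)

def read_home_timeline(user, following):
    # Merge the precomputed timeline with celebrity tweets fetched
    # at read time (like approach 1).
    merged = list(home_timelines[user])
    for author in following:
        if len(followers[author]) >= CELEBRITY_THRESHOLD:
            merged.extend(celebrity_tweets[author])
    return merged
```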
Describing performance
Performance numbers:
- Batch processing systems (e.g. Hadoop): throughput
- Online systems: server response time
Note: how EFS measures performance
Latency and response time
It is better to use percentiles than averages.
Tail latency (the high percentiles) directly affects customers.
Note: how we measure latency
Percentile in practice
Always try to run the workload and measure response times from the client side.
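Since the notes recommend percentiles over averages, here is a minimal sketch of computing them from raw client-side measurements using the nearest-rank method; the sample response times are made up.
```python
# Compute response-time percentiles from raw measurements.
import math

def percentile(samples, p):
    """Nearest-rank percentile over a list of measurements."""
    ranked = sorted(samples)
    index = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[index]

response_times_ms = [12, 14, 15, 15, 17, 21, 22, 25, 31, 480]
for p in (50, 95, 99):
    print(f"p{p} = {percentile(response_times_ms, p)} ms")
# The high percentiles are dominated by the one slow request:
# that outlier is the tail latency your slowest customers see.
```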
Approaches for coping with load
Distributing stateless services across multiple machines is fairly straightforward. Taking stateful data systems from a single node to a distributed setup can introduce a lot of complexity. Until recently it was common wisdom to keep your database on a single node.
Maintainability
- Operability. Make it easy for operation teams to keep the system running.
- Simplicity. Easy for new engineers to understand the system by removing as much complexity as possible.
- Evolvability. Make it easy for engineers to make changes to the system in the future.
Operability: making life easy for operations
A good operations team is responsible for:
- Monitoring and quickly restoring service if it goes into a bad state
Note: lighthouse and tickets
- Tracking down the cause of problems
Note: RCA
- Keeping software and platforms up to date
Note: patching
- Keeping tabs on how different systems affect each other
- Anticipating future problems
- Establishing good practices and tools for development
- Performing complex maintenance tasks, like platform migrations
- Maintaining the security of the system
- Defining processes that make operations predictable
Note: Runbook
- Preserving the organisation’s knowledge about the system
Note: accidental complexity is a major part of the problem. How to solve it: abstraction
Chapter 2: Data Models and Query Languages
Most applications are built by layering one data model on top of another. Each layer hides the complexity of the layers below by providing a clean data model. These abstractions allow different groups of people to work effectively.
Relational model vs document model
The roots of relational databases lie in business data processing, transaction processing and batch processing.
The goal was to hide the implementation details behind a cleaner interface.
Not Only SQL (NoSQL) has a few driving forces:
- Greater scalability
- Preference for free and open source software
- Specialized query optimizations
- Desire for a more dynamic and expressive data model
The network model
Standardized by a committee called the Conference on Data Systems Languages, the CODASYL model was a generalisation of the hierarchical model. In the tree structure of the hierarchical model every record has exactly one parent, while in the network model a record could have multiple parents.
The links between records are like pointers in a programming language. The only way of accessing a record was to follow a path from a root record; such a chain of links was called an access path.
A query in CODASYL was performed by moving a cursor through the database, iterating over lists of records. If you didn’t have a path to the data you wanted, you were in a difficult situation, and this made it difficult to change an application’s data model.
The relational model
By contrast, the relational model was a way to lay out all the data in the open: a relation (table) is simply a collection of tuples (rows), and that’s it.
The query optimizer automatically decides which parts of the query to execute in which order, and which indexes to use (the access path).
The relational model thus made it much easier to add new features to applications.
The main arguments in favour of the document data model are schema flexibility, better performance due to locality, and sometimes data structures closer to the ones used by the application. The relational model counters by providing better support for joins, and for many-to-one and many-to-many relationships.
Schema flexibility
Most document databases do not enforce any schema on the data in documents. Arbitrary keys and values can be added to a document; when reading, clients have no guarantees as to what fields the documents may contain.
Document databases are sometimes called schemaless, but maybe a more appropriate term is schema-on-read, in contrast to schema-on-write.
Schema-on-read is similar to dynamic (runtime) type checking, whereas schema-on-write is similar to static (compile-time) type checking.
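As a small illustration of schema-on-read, here is a sketch in which the application code, not the database, copes with documents written under both an old and a new schema; the field names follow the book’s first_name example, but the documents themselves are made up.
```python
# Schema-on-read: the reader handles both document shapes.
old_doc = {"name": "Barbara Liskov"}                       # old schema
new_doc = {"first_name": "Barbara", "last_name": "Liskov"} # new schema

def first_name(document):
    if "first_name" in document:           # written with the new schema
        return document["first_name"]
    return document["name"].split(" ")[0]  # fall back to the old schema

assert first_name(old_doc) == first_name(new_doc) == "Barbara"
```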
Data locality for queries
If your application often needs to access the entire document, there is a performance advantage to this storage locality.
The database typically needs to load the entire document, even if you access only a small portion of it. On updates, the entire document usually needs to be rewritten, so it is recommended that you keep documents fairly small.
Convergence of document and relational databases
PostgreSQL does support JSON documents. RethinkDB supports relational-like joins in its query language and some MongoDB drivers automatically resolve database references. Relational and document databases are becoming more similar over time.
Query languages for data
SQL is a declarative query language. In an imperative language, you tell the computer to perform certain operations in order.
In a declarative query language you just specify the pattern of the data you want, but not how to achieve that goal.
A declarative query language hides implementation details of the database engine, making it possible for the database system to introduce performance improvements without requiring any changes to queries.
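To make the contrast concrete, here is a small sketch, loosely after the book’s sharks example (the animals data is made up), showing the same query written imperatively in Python and declaratively in SQL via SQLite.
```python
# Imperative vs declarative: same result, different styles.
import sqlite3

animals = [("sharks", "Carcharodon carcharias"),
           ("sharks", "Carcharias taurus"),
           ("whales", "Balaenoptera musculus")]

# Imperative: we spell out *how* to walk the list and what to keep.
sharks = []
for family, species in animals:
    if family == "sharks":
        sharks.append(species)

# Declarative (SQL): we state *what* we want; the query optimizer
# chooses the access path and execution order.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE animals (family TEXT, species TEXT)")
db.executemany("INSERT INTO animals VALUES (?, ?)", animals)
rows = [r[0] for r in db.execute(
    "SELECT species FROM animals WHERE family = 'sharks'")]

assert sharks == rows
```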
MapReduce querying
MapReduce is a programming model for processing large amounts of data in bulk across many machines, popularised by Google.
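As a rough illustration of the programming model, here is a toy single-machine sketch in Python; a real framework runs the map and reduce phases across many machines. The observations data and field names are hypothetical, loosely following the book’s shark-sightings example.
```python
# Toy MapReduce: map emits (key, value) pairs, a "shuffle" groups
# values by key, and reduce aggregates each group.
from collections import defaultdict

observations = [
    {"date": "1995-12-25", "family": "Sharks", "num_animals": 3},
    {"date": "1995-12-12", "family": "Sharks", "num_animals": 4},
    {"date": "1995-11-10", "family": "Sharks", "num_animals": 1},
]

def map_fn(doc):
    # Emit one (year-month, count) pair per document.
    yield (doc["date"][:7], doc["num_animals"])

def reduce_fn(key, values):
    return key, sum(values)

# Shuffle: group all emitted values by key.
groups = defaultdict(list)
for doc in observations:
    for key, value in map_fn(doc):
        groups[key].append(value)

print([reduce_fn(k, vs) for k, vs in sorted(groups.items())])
# -> [('1995-11', 1), ('1995-12', 7)]
```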
Graph-like data models
If many-to-many relationships are very common in your application, it becomes more natural to start modelling your data as a graph.
A graph consists of vertices (nodes or entities) and edges (relationships or arcs).
There are several ways of structuring and querying graph data: the property graph model (implemented by Neo4j, Titan, and Infinite Graph) and the triple-store model (implemented by Datomic, AllegroGraph, and others). There are also three declarative query languages for graphs: Cypher, SPARQL, and Datalog.
The foundation: Datalog
Datalog provides the foundation that later query languages build upon. Its model is similar to the triple-store model, generalised a bit: instead of writing a triple as (subject, predicate, object), we write it as predicate(subject, object).
We define rules that tell the database about new predicates; rules can refer to other rules, just as functions can call other functions or themselves recursively.
Rules can be combined and reused in different queries. It’s less convenient for simple one-off queries, but it can cope better if your data is complex.
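To make the rule mechanism concrete, here is a hedged sketch of naive bottom-up Datalog evaluation in plain Python (not a real Datalog engine); the facts and the within/within_recursive predicates are hypothetical, modelled on the classic example.
```python
# Facts are predicate(subject, object) triples stored as Python tuples.
facts = {
    ("within", "Lucerne", "Switzerland"),
    ("within", "Switzerland", "Europe"),
    ("born_in", "Lucy", "Lucerne"),
}

def step(facts):
    """Apply the rules once; return the (possibly enlarged) fact set."""
    derived = set(facts)
    for (p, x, y) in facts:
        # within_recursive(X, Y) :- within(X, Y)
        if p == "within":
            derived.add(("within_recursive", x, y))
    for (p1, x, y) in facts:
        for (p2, y2, z) in facts:
            # within_recursive(X, Z) :- within(X, Y), within_recursive(Y, Z)
            if p1 == "within" and p2 == "within_recursive" and y == y2:
                derived.add(("within_recursive", x, z))
    return derived

# Iterate until no new facts can be derived (a fixpoint).
while True:
    bigger = step(facts)
    if bigger == facts:
        break
    facts = bigger

# Query: who was born somewhere within Europe?
print([s for (p, s, o) in facts
       if p == "born_in"
       and ("within_recursive", o, "Europe") in facts])
# -> ['Lucy']
```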
Property graphs
Add graph here
- Cypher is a declarative language for property graphs created by Neo4j
- Graph queries in SQL. In a relational database, you usually know in advance which joins you need in your query. In a graph query, the number of joins is not fixed in advance. In Cypher, :WITHIN*0.. expresses “follow a WITHIN edge, zero or more times” (like the * operator in a regular expression). In SQL, this idea of variable-length traversal paths can be expressed using recursive common table expressions (the WITH RECURSIVE syntax), as sketched below.
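Here is a minimal runnable sketch of such a variable-length traversal with WITH RECURSIVE, using SQLite from Python; the within_edge table and the place names are assumptions for illustration.
```python
# Variable-length traversal with a recursive common table expression.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE within_edge (child TEXT, parent TEXT);
INSERT INTO within_edge VALUES
  ('Lucerne', 'Switzerland'),
  ('Switzerland', 'Europe');
""")

# All places reachable from 'Europe' by following WITHIN edges,
# zero or more times: the SQL analogue of Cypher's :WITHIN*0..
rows = db.execute("""
WITH RECURSIVE in_europe(name) AS (
  SELECT 'Europe'
  UNION
  SELECT child FROM within_edge
  JOIN in_europe ON within_edge.parent = in_europe.name
)
SELECT name FROM in_europe;
""").fetchall()
print(rows)  # [('Europe',), ('Switzerland',), ('Lucerne',)]
```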
Triple-stores and SPARQL
In a triple-store, all information is stored in the form of very simple three-part statements: subject, predicate, object (e.g. Jim, likes, bananas). The subject of a triple is equivalent to a vertex in a graph.
The SPARQL query language
SPARQL is a query language for triple-stores using the RDF data model.
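As a small closing illustration (not from the book’s text), here is a sketch that loads a single RDF triple and queries it with SPARQL via the rdflib library (pip install rdflib); the ex:likes vocabulary is hypothetical.
```python
# Querying a tiny in-memory triple-store with SPARQL via rdflib.
from rdflib import Graph

turtle_data = """
@prefix ex: <http://example.org/> .
ex:jim ex:likes ex:bananas .
"""

g = Graph()
g.parse(data=turtle_data, format="turtle")

results = g.query("""
PREFIX ex: <http://example.org/>
SELECT ?thing WHERE { ex:jim ex:likes ?thing . }
""")
for row in results:
    print(row.thing)  # http://example.org/bananas
```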