How does a Relational Database Work ? [Part 2]

Malo Le Goff
7 min readJun 20, 2021

--

This article is the second of a series in which we’ll talk about the internal workings of Relational Database Management Systems (RDBMS). You don’t really need to read the first part which is talking about how a database store data but I put the link here in case : https://malolegoff.medium.com/how-does-a-relational-database-work-d2d9e2d9c400

In fact, RDBMS such as PostgreSQL, MySQL, …. are widely used today and among the most complete software architectures in use. They are the result of decades of research and optimization. Yet, they are used as black boxes by most users.

This article aims to look at how a SQL query is processed by an RDBMS so we can see how it works

My goal is not to make you experts in RDBMS in one article as it is a very complex topic but to give a basic understanding of what is going on inside these mysterious systems.

I. Overview of an RDBMS

I will start by giving a broad overview of what a RDBMS is.

An RDBMS has five main components. They are deeply intertwined and interact with each other. We’ll go through each of them during the study of the query. (The Transactional Storage Manager will be for another article)

  • First, the client communication manager : It manages the client connections
  • Then, the relational query processor : Which reformats and optimizes and executes the query
  • When the query is executed, the transactional storage manager makes sure the query is executed properly. IE respecting the ACID properties
  • We also have a process manager. In fact, databases have a pool of processes/threads to manage in order to process queries. That’s the role of the process manager which manages the threads/processes with the “Dispatch and Scheduling” block and checks if the system has enough resources to actually process the query with the admission control block
  • Finally, we have the shared components and utilities : It contains components which don’t belong to any particular block but are still useful. Like the Catalog Manager which stores metadata about our DB. Or the memory manager which handles the RAM of the DBMS. Or monitoring for logging the activity of the DB and providing tools to monitor a DB.

Now that we gave a broad overview of the DBMS, we’ll go through the life of a query together to see how the different components behave and interact with each other.

II. The client connection

The client calls an API to send the query to the client manager of the RDBMS via the ODBC or JDBC protocol.

Once being called, the client manager performs several tasks :

  • It checks the authentication and checks if we have the authorizations for querying the DB
  • If the DB doesn’t have the required resources for a certain time, the client manager closes the connection
  • It stores the results in an in-memory buffer before sending them back to the client

III. The process manager

At the same time, the process manager is called and takes care of 2 things :

  • It maps a thread to the query with the dispatcher
  • It performs admission control to make sure the system is not overused. If so, the query will wait before being sent to the relational query processor.

But if everything is okay, the query is sent to the relational query processor.

IV. The Relational Query Processor

This is where magic happens. We’ll go through each of the sections of the Relational Query Processor to understand fully what it does

A. The Query Parser

After arriving from the client manager, the query goes directly into the query parser.

It carries out several steps like :

  • Switch table names for full names of the form server.database.schema.table. Tables aliases are also substituted by full name
  • Invokes the Catalog manager to check if the table exists in the catalog and that the attribute references are correct.
  • Checks the query syntax : compatibility of tables combined with operators like Union or INTERSECT or EXCEPT, the usage of the attributes, the nesting of subqueries, …
  • Check if the users have enough permissions on the tables.
  • And so on

It also converts the SQL statement into a format easier to process which is known as the parse tree structure of the SQL statement :

Then the parse tree of the query is passed to the query rewrite module for further processing.

B. The Query Re-writer

Query re-writer pre-optimizes the query before passing it to the query optimizer

The re-writer responsibilities are :

  • View merging : If you are using a view in your query, it is replaced by the SQL code of the view
  • Constant arithmetic simplification : R.x < 10+2+R.y is converted into R.x < 12+R.y
  • Subquery flattening and other heuristic rewrites : To prevent optimizing too complex queries, most DBMSs re-write the query in the form SELECT FROM WHERE and do not optimize across blocks (called query normalization)
  • Removal of the unnecessary operators : We don’t have to write a DISTINCT keyword on a field with UNIQUE constraint for instance
  • Partition pruning : If we are using a partitioned table, the re-writer is able to find which partition must be ran
  • And so on

Example :

C. The Query Optimizer

The optimizer transforms the query internal representation (the parse tree we saw just above) into a query plan

A query plan is like a dataflow pipeline where data goes through a sequence of operators (sort, group by, …). These operators can also be responsible for getting the data from the DB (like the HeapScan operator) or join table together (like the IndexJoin operator)

Example of such query plan :

There are many different operators for the same operation. There are for instance several ways of doing a join : the Index Join operator, Index Nested Loop Join operator, Index Loop Join operator, … Also, the operators can have a different order. This 2 factors combined make a lot of possible query plans for a single query.

Given the many possibilities, the optimizer has to choose the cheapest query plan in terms of resources. To do this, it uses a heuristic algorithm.

So, this algorithm takes into account the CPU cost, the memory usage and the disk I/O as metrics and tries to find the cheapest query plan i.e. the cheapest chain of operators.

To find the cheapest query plan SQLite implements a heuristic algorithm called the Nearest Neighbor algorithm

You might be wondering how the query optimizer knows the CPU cost, the memory usage and the disk I/Os for a query plan. Well that’s a very good question. The RDBMS estimates these metrics with statistics it computed about the tables like the most frequent values, the quantiles, …

NB : Statistics of the queried table must be up to date because they help us find the best query plan

Now, based on the resource needed by the query, the admission control decides whether the query must wait or not before being passed to the executor

D. The Query Executor

The query plan is finally executed by the query executor

The query executor invokes the procedures of the operators in the query plan such as :

  • Table access methods (filescan, index scan, …)
  • Query execution algorithms (nested-loops join, hash join, merge join, …), also can be executed in parallel or sequence, it is up to the executor to decide.

Access methods : These are the methods that manage access to the various data structures on disk. Like B+-trees or hash tables. I.E, access methods are our way of accessing data from the disk.

The operators which get the data (access methods) communicate with the transactional storage manager to make sure the query is executed properly.

Once it has been computed, the result is sent back to the client manager

Thanks for reading this ! I hope you enjoyed the article, let me know in the comment what you thought about it !

Here is my LinkedIn if you want to contact me : https://www.linkedin.com/in/malo-le-goff-480452170/?locale=en_US

All the pictures I have taken come from here : All the pictures I have taken come from : https://dsf.berkeley.edu/papers/fntdb07-architecture.pdf. It is an excellent paper about the architecture of a Database System, feel free to read it to complete what I shared with you !

--

--

Malo Le Goff
Malo Le Goff

Written by Malo Le Goff

Student Engineer | Engineering school : IMT Atlantique | Software Engineering & Data Science & Cybersecurity