Yunpeng's Blog

Life, coding and everything

In a previous post, we discussed how the various relational operators are implemented in relational database systems. If you have read that post, you probably still remember that there are a few alternative implementations for every operator. Thus, how should RDBMS determine which algorithm (or implementation) to use?

Obviously, to optimize the performance for any query, RDBMS has to select the correct the algorithm based on the query. It would not be desirable to always use the same algorithm. Also, SQL is a declarative language (i.e., as a programmer we only declare what we want to do with the language, not tell how the language should accomplish the task). Therefore, it would be an anti-pattern if the user of the database system needs to specify which algorithm to use when writing the query. Instead, the correct approach would be that the user would treat the entire system as a blackbox. The end-user should not care about which algorithm is picked but expect the performance optimization is guaranteed.

Read more »

We all know that DBMS (database management system) is used to store (a massive amount of) data. However, have you ever wondered how is data stored in DBMS? In this post, we will focus on data storage in RDBMS, the most traditional relational database systems.

Physical Storage

Data can be stored in many different kinds of medium or devices, from the fastest but costy registers to the slow but cheap hard drives, or even magnetic tapes. Nowadays, IaaS providers such as AWS even provides services such as S3 Glacier as a low-cost archiving storage solution. The diagram below shows the memory hierarchy of common devices.

Read more »

This post talks about some basic implementation of relational operators in traditional RDBMS (relational database management systems). It was based on Chapter 14 of the textbook by Raghu Ramakrishnan and Johannes Gehrke.

Below we will talk about the classical evaluation & implementation of relational operators one-by-one, namely:

Read more »

Recently, I was looking at some research papers on the join reorderability. To start with, let’s understand what do we mean by “join reorderability” and why it is important.

Background Knowledge

Here, we are looking at a query optimization problem, specifically join optimization. As mentioned by Benjamin Nevarez, there are two factors in join optimization: selection of a join order and choice of a join algorithm.

As stated by Tan Kian Lee’s lecture notes, common join algorithms include iteration-based nested loop join (tuple-based, page-based, block-based), sort-based merge join and partition-based hash join. We should consider a few factors when deciding which algorithm to use: 1) types of the join predicate (equality predicate v.s. non-equality predicate); 2) sizes of the left v.s. right join operand; 3) available buffer space & access methods.

For a query attempting to join n tables together, we need n - 1 individual joins. Apart from the join algorithm applied to each join, we have to decide in which order these n tables should be joined. We could represent such join queries on multiple tables as a tree. The tree could have different shapes, such as left-deep tree, right-deep tree and bushy tree. The 3 types of trees are compared below on an example of joining 4 tables together.

Read more »