Chapter 11 Join Algorithms¶
Note
Since this part is particularly difficult in CMU 15-445, I chose to use the UCB CS186 material instead.
This part is adapted from the UCB CS186 course notes.
Numerous process diagrams, which are essential for deeply understanding the course, are omitted here; you can find the details in the original notes.
Introduction¶
Let’s begin with the simplest question: what, exactly, is a join?
If you remember the SQL project, you’ll remember writing things like R INNER JOIN S ON R.name = S.name and other similar statements. What that actually means is that you take two relations, R and S, and combine every record of R with every record of S that satisfies the join condition (here, equality on the name column).
The SQL lecture slides are a great resource for more clarification on what joins actually are. Before we get into the different join algorithms, we need to discuss what happens when the new joined relation, consisting of all the matched pairs of records, is written out.
Don’t worry if this sounds confusing right now; we will revisit it in the Query Optimization module, but the important thing to remember for now is that the final write cost is not included in our join cost models!
Comparison of Loop Joins¶
- Simple Nested Loop Join
- Page Nested Loop Join
- Block Nested Loop Join
Simple Nested Loop Join¶
(SNLJ, just For-Loops)
Let’s start with the simplest strategy possible. Let’s say we have a buffer of B pages, and we wish to join two tables, R and S, on some join condition. The most naive thing we could do is compare every record of R with every record of S:
```python
for each record r in R:
    for each record s in S:
        if join_condition(r, s):
            yield (r, s)
```
Caution⚠️
Be careful: we are counting I/Os here, not time complexity.
This would be a perfectly correct thing to do, but the theme of the class is really centered around optimization and minimizing I/Os. For that, this is a pretty poor scheme, because we take each record in R and read in every single page of S for it.
The I/O cost of this would then be [R] + |R|[S], where [R] is the number of pages in R, |R| is the number of records in R, and [S] is the number of pages in S.
And while we might be able to optimize things a slight amount by switching the order of R and S (putting the smaller relation on the outside), this is still very expensive.
[R] rather than |R|
SNLJ does not incur |R| I/Os to read every record in R.
It will cost [R] I/Os, because it’s really doing something more like “for each page p_r in R, for each record r in p_r, scan all of S” — R is still read one page at a time.
This point is also the reason for the statement "[R] + |R|[S]" rather than "[R] + |R|*|S|"
Question about the format above
Why is it [R] + |R|[S] rather than [R] + |R|*|S|?
This is because the basic unit of an I/O operation is a page, not a record. Each scan of S performed for one record of R reads [S] pages, no matter how many records those pages contain.
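To make the formula concrete, here is a small sketch of the cost computation; the table sizes used below are hypothetical, chosen only for illustration:

```python
def snlj_io_cost(pages_r, records_r, pages_s):
    """I/O cost of Simple Nested Loop Join: [R] + |R| * [S]."""
    # We read R once, page by page ([R] I/Os), and for each of the
    # |R| records of R we scan all [S] pages of S.
    return pages_r + records_r * pages_s

# Hypothetical sizes: R = 1,000 pages holding 100,000 records; S = 500 pages.
print(snlj_io_cost(1000, 100_000, 500))  # 1000 + 100000 * 500 = 50001000
```

Even for these modest tables, the record-granularity outer loop makes the cost explode, which is exactly why the page-granularity variants below exist.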
Page Nested Loop Join¶
It’s clear that we don’t want to read in every single page of S for each record of R. Instead, we can read in one page of R and one page of S at a time, and compare all pairs of records between the two pages before moving on.
That’s called page nested loop join (PNLJ). Here’s the pseudocode for it:
```python
for each page p_r in R:
    for each page p_s in S:
        for each record r in p_r:
            for each record s in p_s:
                if join_condition(r, s):
                    yield (r, s)
```
The I/O cost of this is somewhat better. It’s [R] + [R][S]: we now read all of S once per page of R rather than once per record of R.
This can be optimized by keeping the smaller relation (in pages) of R and S as the outer relation.
Caution⚠️
We're discussing I/O counts rather than time complexity here.
That's why the cost is [R] + [R][S] rather than a record-level count like |R|*|S| — the number of record comparisons hasn't changed, only the number of page reads.
Block Nested Loop Join¶
Page Nested Loop Join is a lot better! The only problem is that we’re still not fully utilizing our buffer as powerfully as we can. We have B buffer pages available, but PNLJ only uses three of them: one input page for R, one input page for S, and one output page.
Remember that the fewer times we read in S, the better. So instead of reading one page of R at a time, we can read in a whole chunk of R and only scan S once per chunk rather than once per page.
This is called Chunk Nested Loop Join (or Block Nested Loop Join). The key idea here is that we want to utilize our buffer to help us reduce the I/O cost, and so we reserve as many pages as possible for a chunk of R: B-2 pages for the chunk, one page for S, and one page for the output buffer.
```python
for each chunk of B-2 pages of R:
    for each page p_s in S:
        for each record r in the chunk:
            for each record s in p_s:
                if join_condition(r, s):
                    yield (r, s)
```
Then, the I/O cost of this can be written as $[R] + \lceil [R]/(B-2) \rceil \cdot [S]$: we read R once, and we read all of S once per chunk of R.
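A quick sketch comparing the page- and block-level cost formulas; the sizes used (1,000 pages of R, 500 pages of S, B = 102 buffer pages) are made up for illustration:

```python
import math

def pnlj_cost(pages_r, pages_s):
    # Page Nested Loop Join: [R] + [R] * [S]
    return pages_r + pages_r * pages_s

def bnlj_cost(pages_r, pages_s, buffer_pages):
    # Block Nested Loop Join: [R] + ceil([R] / (B - 2)) * [S]
    chunks = math.ceil(pages_r / (buffer_pages - 2))
    return pages_r + chunks * pages_s

print(pnlj_cost(1000, 500))       # 1000 + 1000 * 500 = 501000
print(bnlj_cost(1000, 500, 102))  # 1000 + 10 * 500   = 6000
```

With 100 usable chunk pages, BNLJ scans S ten times instead of a thousand times, which is where the large saving comes from.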
Visual Comparison of Loop Joins¶
Index Nested Loop Join¶
There are times, however, when Block Nested Loop Join isn’t the best thing to do. Sometimes, if we have an index on S on the field we are joining on, it can be very fast to look up the matches of each record of R in S using that index.
```python
for each record r in R:
    for each matching record s found via the index on S:
        yield (r, s)
```
The I/O cost is $[R] + |R| \times{}$ (cost to look up matching records in S).
The cost to look up matching records in S differs by index type. For an Alternative 1 B+ tree (records stored directly in the leaves), it is the cost of traversing from the root to a leaf, plus the cost of reading any additional leaf pages that contain matching records. For Alternative 2 or 3 indices, we additionally pay for fetching the matching records from their data pages — in the worst case, one I/O per matching record.
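As a sketch, suppose each index lookup costs about 4 I/Os — say, a height-3 traversal plus one data page. These numbers are hypothetical, not from the notes:

```python
def inlj_cost(pages_r, records_r, lookup_cost):
    # Index Nested Loop Join: [R] + |R| * (per-record lookup cost in S)
    return pages_r + records_r * lookup_cost

# 1,000 pages of R with 100,000 records, ~4 I/Os per index lookup.
print(inlj_cost(1000, 100_000, 4))  # 1000 + 400000 = 401000
```

Note that this is much worse than BNLJ for a full join of these tables; INLJ shines when R is small or the join is selective, since we never scan S at all.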
Hash Join¶
Notice that in this entire sequence, we’re really trying to look for matching records. Hash tables are really nice for looking up matches, though; even if we don’t have an index, we can construct a hash table that is at most B-2 pages big on the records of R, fit it into memory, and then read in each page of S, probing the hash table with every record of S to find matches.
There’s a problem with this, however: it relies on the hash table of R fitting in memory, i.e. being at most B-2 pages big.
Why ≤ B-2 pages big
You can take a look at "Block Nested Loop Join" above.
To fix this, we repeatedly hash R and S into partitions, using the same hash function on both, until the partitions are small enough to fit in memory.
More specifically, consider each pair of corresponding partitions R_i and S_i — the records of R and S that hashed to the same value, and which therefore are the only records that can match each other.
This procedure is called Grace Hash Join, and the I/O cost of this is: the cost of hashing plus the cost of Naive Hash Join on the subsections.
Procedure
- Hash R and S into B-1 partitions each, using the same hash function.
- For each pair of corresponding partitions R_i and S_i:
    - If R_i and S_i are both > B-2 pages big, hash both partitions into smaller ones (using a different hash function).
    - Else, if either R_i or S_i is ≤ B-2 pages, stop partitioning: load the smaller partition into memory to build an in-memory hash table, and perform a Naive Hash Join with the larger partition in the pair.
Pseudocode:
```python
def grace_hash_join(R, S):
    # partition both inputs with the same hash function
    partitions_R = partition(R)        # B-1 partitions, written to disk
    partitions_S = partition(S)
    for R_i, S_i in zip(partitions_R, partitions_S):
        if pages(R_i) > B - 2 and pages(S_i) > B - 2:
            grace_hash_join(R_i, S_i)  # recurse with a new hash function
        else:
            # naive hash join: build on the smaller, probe with the larger
            build_and_probe(R_i, S_i)
```
The cost of hashing can change based on how many times we need to repeatedly partition.
The cost of hashing a partition P includes the I/Os needed to read all the pages in P, plus the I/Os needed to write out all the resulting partitions after hashing P.
The Naive Hash Join portion costs, per partition pair, the I/Os to read each page in both partitions once partitioning has finished. Grace Hash Join is great, but it’s really sensitive to key skew, so you want to be careful when using this algorithm.
Key Skew
Key skew is when we try to hash but many of the keys go into the same bucket.
Key skew happens when many of the records have the same key. For example, if we’re hashing on a column that only contains the value "yes", then no matter which hash function we use, every record ends up in the same bucket, and repartitioning never makes the partitions smaller.
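The build-and-probe step that runs once a partition pair is small enough can be sketched in plain Python; in-memory lists stand in for pages here, and all names are mine rather than the notes':

```python
from collections import defaultdict

def naive_hash_join(build_side, probe_side, key):
    """Build an in-memory hash table on one input, probe with the other."""
    table = defaultdict(list)
    for r in build_side:            # build phase: hash every record
        table[key(r)].append(r)
    matches = []
    for s in probe_side:            # probe phase: one lookup per record
        for r in table[key(s)]:
            matches.append((r, s))
    return matches

R = [{"name": "bob", "age": 30}, {"name": "alice", "age": 20}]
S = [{"name": "bob", "dept": "db"}, {"name": "carol", "dept": "os"}]
print(naive_hash_join(R, S, key=lambda rec: rec["name"]))
# one match: bob's R record paired with bob's S record
```

In the real algorithm the build side must fit in B-2 buffer pages; Grace Hash Join's partitioning exists purely to get the inputs down to that size.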
Sort-Merge Join¶
There are also times when it helps for us to sort R and S first, especially if we want our joined table to be sorted on some specific column. In those cases, what we do is first sort R and S on the join key. Then:
(1) We begin at the start of R and S, and advance one pointer or the other until we get to a match: if the current R key is smaller, advance R; otherwise advance S.
```python
while r.key != s.key:
    if r.key < s.key:
        advance r
    else:
        advance s
```
(2) Now, let’s assume we’ve gotten to a match. Let’s say this pair is r_i and s_j. We output the pair, and then check the records of S following s_j against r_i until they stop matching.
```python
while r.key == s.key:
    yield (r, s)
    advance s
```
|
(3) Now, go to the next record in R and reset S back to s_j, the first record that matched — the next R record may have the same key, in which case it matches the same run of S records.
```text
mark = position of s_j (the first matching record of S)
advance r to the next record of R
reset s back to mark
repeat steps (1) and (2) from there:
    while the keys match, output pairs and advance s
    when the keys stop matching, advance r and reset s to mark again
stop when either table is exhausted
```
|
This is called Sort-Merge Join.
The average I/O cost is: (cost to sort R) + (cost to sort S) + ([R] + [S]). Note that this is the average case, not the worst case — if many records share the same key, pages of S must be re-read on resets, and in the degenerate case where every record matches every record the merge phase approaches |R| · [S].
```c
// sort-merge join of R and S (simplified pseudocode)
sort R and S on the join key;
r = first record of R;  s = first record of S;  mark = NULL;
while (r != NULL && s != NULL) {
    if (mark == NULL) {
        while (r.key < s.key) { advance r; }
        while (r.key > s.key) { advance s; }
        mark = s;                  // start of the matching run in S
    }
    if (r.key == s.key) {
        yield (r, s);
        advance s;
    } else {
        reset s to mark;           // the next r may match the same run
        advance r;
        mark = NULL;
    }
}
```
Let’s take a look at an example. Let the table on the left be R and the table on the right be S, both sorted on the join key. We advance the pointer on S until its key matches the current record of R, mark that spot, and output matching pairs while advancing S. When the keys stop matching, we advance the pointer on R, reset S to the mark, and continue, repeating until either table is exhausted. (The original notes walk through this with a step-by-step series of pointer diagrams, which are omitted here.)
An Important Refinement¶
You can learn more details from the original CS186 notes.