Chapter 20 Recovery¶
Before Class
Since time is really limited, I have to use CS186 instead in this chapter.
In CS186, Chapter 19 and Chapter 20 are integrated into one chapter called "Logging and Recovery", so the content here is actually the same as Ch19.
Motivation¶
In prior modules we discussed the ACID properties of transactions. In this note we will discuss how to make our database resilient to failures. The two ACID properties that we will learn how to enforce in this note are:
- Durability: If a transaction finishes (commits), we will never lose the result of the transaction.
- Atomicity: Either all or none of the operations in the transaction will persist. This means that we will never leave the database in an intermediate state.
We can take a look at these two properties in the example of swapping a class in CalCentral:
- Durability: Once we say that we have made the change, the student should not suddenly find themselves in old section later on!
- Atomicity: There are two operations: dropping your old class and adding the new one. If the database crashes before it can add your new class, you do not actually want to be dropped out of your old class.
Force/No Force¶
Durability can be a very simple property to ensure if we use a force policy. The force policy states when a transaction finishes, force all modified data pages to disk before the transaction commits. This would ensure durability because disk is persistent; in other words, once a page makes it to disk it is saved permanently. The downside of this approach is performance. We end up doing a lot of unnecessary writes.
The policy we would prefer is no force which says to only write back to disk when the page needs to be evicted from the buffer pool. While this helps reduce unnecessary writes, it complicates durability because if the database crashes after the transaction commits, some pages may not have been written to disk and are consequently lost from memory, since memory is volatile. To address this problem, we will redo certain operations during recovery.
Steal/No-Steal¶
Similarly, it would be easy to ensure atomicity with a no-steal policy. The no-steal policy states that pages cannot be evicted from memory (and thus written to disk) until the transaction commits. This ensures that we do not leave the database in an intermediate state because if the transaction does not finish, then none of its changes are actually written to disk and saved. The problem with this policy is that it handcuffs how we can use memory. We have to keep every modified page in memory until a transaction completes.
We would much rather have a steal policy, which allows modified pages to be written to disk before a transaction finishes. This will complicate enforcing atomicity, but we will fix this problem by undoing bad operations during recovery.
Steal, No-Force¶
To review, we chose to use two policies (steal, no force) that make it difficult to guarantee atomicity and durability, but get us the best performance. The rest of this note will cover how to ensure atomicity and durability while using a steal, no force policy.
Write Ahead Logging¶
To solve these complications we will use logging. A log is a sequence of log records that describe the operations that the database has done.
Update Log Record¶
Each write operation (SQL insert/delete/update) will get its own log UPDATE record.
An UPDATE log record looks like this:
SQL | |
---|---|
1 |
|
The fields are:
- XID: transaction ID - tells us which transaction did this operation
- pageID: what page has been modified
- offset: where on the page the data started changing (typically in bytes), start of the modification
- length: how much data was changed (typically in bytes), length of the modification
- old_data: what the data was originally (used for undo operations)
- new_data: what the data has been updated to (used for redo operations)
Other Log Records¶
There are a few other record types we will use in our log. We will add fields to these log records throughout the note as the need for these fields becomes apparent.
COMMIT
: signifies that a transaction is starting the commit processABORT
: signifies that a transaction is starting the aborting processEND
: signifies that a transaction is finished (usually means that it finished committing or aborting)
WAL Requirements¶
Just like regular data pages, log pages need to be operated on in memory, but need to be written to disk to be stored permanently.
Write Ahead Logging (WAL) imposes requirements for when we must write the logs to disk. In short, write ahead logging is a policy where log records describing an action such as modifying a data page or committing a transaction are flushed to disk before the actual action is flushed to disk or occurs, respectively. The two rules are as follows:
-
Log records must be written to disk before the corresponding data page gets written to disk. This is how we will achieve atomicity. The intuition for this is that if a data page is written first and then the database crashes we have no way of undoing the operation because we don’t know what operation happened!
-
All log records must be written to disk when a transaction commits. This is how we will achieve durability. The intuition is that we need to persistently track what operations a committed transaction has performed. Otherwise, we would have no idea what operations we need to redo. By writing all the logs to disk, we know exactly which operations we need to redo in the event that the database crashes before the modified data pages are written to disk!
Note
1) log打标记完成并被冲刷进硬盘后,内容才可以从内存冲刷进硬盘 —— 这样便于记录“灾后”需要的操作 2) 事务提交时,log必须也要被写入硬盘 —— 这样便于“灾后”的重新操作
WAL Implementation¶
To implement write ahead logging(WAL) we’re going to add a field to our log records called the LSN, which stands for Log Sequence Number. The LSN is a unique increasing number that helps signify the order of the operations (if you see a log record with LSN = 20 then that operation happened after a record with LSN = 10). In this class the LSNs will increase by 10 each time, but this is just a convention(公约). We will also add a \(prevLSN\) field to each log record which stores the last operation from the same transaction (this will be useful for undoing a transaction).
The database will also keep track of the \(flushedLSN\) which is stored in RAM. The flushedLSN keeps track of the LSN of last log record that has been flushed to disk. When a page is flushed, it means that the page has been written to disk; it usually also implies that we evict the page from memory because we don’t need it there anymore. The flushedLSN tells us that any log records before it should not be written to disk because they are already there. Log pages are usually appended to the previous log page on disk, so writing the same logs multiple times would mean we are storing duplicate data which would also mess up the continuity of the log.
We will also add a piece of metadata to each data page called the \(pageLSN\). The pageLSN stores the LSN of the operation that last modified the page. We will use this to help tell us what operations actually made it to disk and what operations must be redone.
-
引入LSN字段:
- 文本中提到,日志记录中将添加一个字段,称为LSN(日志序列号,Log Sequence Number)。
- LSN是一个唯一递增的数字,用来表示操作的顺序。例如,LSN为20的日志记录发生在LSN为10的日志记录之后。
- 在这个实现中,LSN每次增加10,这是一个特定的约定(不是必须的,只是在这个例子中如此约定)。
-
引入prevLSN字段:
- 每个日志记录还会有一个prevLSN字段,存储同一事务中上一个操作的LSN,这对于撤销事务时非常有用。
-
引入flushedLSN:
- 数据库还会在内存中记录一个flushedLSN,它跟踪最后被刷写到磁盘的日志记录的LSN。
- 刷写页面表示该页面已经被写入磁盘,通常还意味着我们会将该页面从内存中移除,因为不再需要在内存中保留它。
- flushedLSN告诉我们,在它之前的任何日志记录都不需要再写入磁盘,因为它们已经存在于磁盘中。
-
引入pageLSN:
- 每个数据页面会添加一块元数据,称为pageLSN,它存储了最后一次修改该页面的操作的LSN。
- pageLSN帮助我们确定哪些操作实际上已经写入磁盘,哪些操作还需要重新执行。
Obviously, we have this:
\(pageLSN_i \leq flushedLSN\)
用一个具体的例子来讲解这段文字中涉及的概念
假设我们有一个数据库,其中有三个操作:
- 操作1:向数据表中插入一条记录。
- 操作2:更新一条现有的记录。
- 操作3:删除一条记录。
我们使用写前日志(WAL)机制来保证数据的一致性和恢复能力。
1)LSN(日志序列号)
- 每个操作都会生成一个日志记录,并且每个日志记录都会有一个唯一的LSN。
- 假设我们从LSN = 10开始,LSN每次增加10。
- 操作1生成的日志记录有LSN = 10。
- 操作2生成的日志记录有LSN = 20。
- 操作3生成的日志记录有LSN = 30。
这些LSN表明了操作的顺序,LSN越大,操作越晚。
2)prevLSN(前一个LSN)
- prevLSN记录了同一事务中上一个操作的LSN,用于追踪事务的操作顺序。
- 比如,我们有一个事务进行以下操作:
- 操作1(LSN = 10):向数据表中插入一条记录。因为这是事务的第一个操作,所以它的
prevLSN
为空。 - 操作2(LSN = 20):更新记录。此时
prevLSN = 10
,指向操作1的LSN。 - 操作3(LSN = 30):删除记录。此时
prevLSN = 20
,指向操作2的LSN。
通过prevLSN,我们可以回溯到事务中所有之前的操作,方便在需要时撤销这些操作。
3)flushedLSN(已刷写的LSN)
- flushedLSN记录了最后一个被写入磁盘的日志记录的LSN。
- 假设数据库系统将日志记录写入磁盘时,flushedLSN更新为操作2的LSN(即flushedLSN = 20)。
- 这意味着LSN为10和20的日志记录都已经安全地写入磁盘。
- 如果此时系统崩溃,恢复时只需要重新执行LSN大于20的操作,即操作3,因为前面的操作已经持久化在磁盘上了。
4)pageLSN(页面LSN)
- pageLSN记录了最后一次修改该数据页面的操作的LSN。
- 假设:
- 操作1(LSN = 10)插入了一条记录,这条记录所在的数据页面的pageLSN更新为10。
- 操作2(LSN = 20)更新了同一页面中的另一条记录,pageLSN更新为20。
如果系统在操作2之后崩溃,通过检查数据页面的pageLSN,我们知道这次更新已经写入了磁盘,不需要重新执行操作2。
Aborting a Transaction¶
Before getting into recovering from a crash, let’s figure out how a database can abort a transaction that is in progress. We may want to abort a transaction because of deadlock, or a user may decide to abort because the transaction is taking too long. Transactions can also be aborted to guarantee the C for consistency in ACID if an operation violates some integrity constraint. Finally, a transaction may need to be aborted due to a system crash! We need to ensure that none of the operations are still persisted to disk once the abort process finishes.
Abort and CLR Log Records
The first thing we will do is write an ABORT record to the log to signify that we are starting the process. Then we will start at the last operation in the log for that transaction. We will undo each operation in the transaction and write a CLR record to the log for each undone operation.
A CLR (Compensation Log Record) is a new type of record signifying that we are undoing a specific operation. It is essentially the same thing as an UPDATE record (it stores the previous state and the new state), but it tells us that this write operation happened due to an abort.
这里我们在讨论数据库如何中止一个正在进行的事务,以及如何在中止过程中使用日志记录来保证数据的一致性。
在数据库系统中,事务是一个逻辑操作单元,事务中的所有操作要么全部成功,要么全部失败,这就是所谓的ACID原则中的“原子性”。但是,有时我们可能需要中止(Abort)一个事务,原因包括:
- 死锁:两个或多个事务互相等待对方释放资源,导致无法继续执行。
- 用户中止:用户可能觉得事务执行时间过长,选择手动中止事务。
- 一致性:如果一个操作违反了数据库的完整性约束(例如试图插入重复的主键),事务必须中止以确保数据的一致性。
- 系统崩溃:如果系统崩溃,正在执行的事务可能需要中止。
为了确保中止事务后数据库能够正确恢复,我们使用日志记录来跟踪中止过程。具体步骤如下:
- 写入ABORT记录
- ABORT记录:一旦决定中止事务,首先会在日志中写入一条ABORT记录,以标识中止过程的开始。这条记录告诉系统和管理员,该事务的所有操作将被撤销。
- 撤销操作并写入CLR记录
- 从最后一个操作开始撤销:从该事务的最后一个操作开始,逐步撤销每个操作。
- CLR记录(补偿日志记录):对于每个被撤销的操作,都会在日志中写入一条CLR记录。CLR记录是一种新的日志记录类型,它与更新记录类似,但特别标识这是由于事务中止而执行的撤销操作。
- CLR记录的作用:CLR记录存储了被撤销操作的原始状态和撤销后的新状态。与普通的UPDATE记录不同的是,CLR记录明确表明这个写操作是由于中止事务而发生的,不是正常的操作。
Recovery Data Structures¶
We will keep two tables of state to make the recovery process a little bit easier. The first table is called the transaction table and it stores information on the active transactions. The transaction table has three fields:
- XID: transaction ID
- status: either running, committing, or aborting
- \(lastLSN\): the LSN of the most recent operation for this transaction
An example of the transaction table is here:
The other table we maintain is called the Dirty Page Table (DPT). The DPT keeps track of what pages are dirty (recall from many modules ago that dirty means the page has been modified in memory, but has not been flushed to disk yet). This information will be useful because it will tell us what pages have operations that have not yet made it to disk. The DPT only has two columns:
- Page ID
- \(recLSN\): the first operation to dirty the page
An example of the DPT is here:
One thing to note is that both of these tables are stored in memory; so when recovering from a crash, you will have to use the log to reconstruct the tables. We will talk about a way to make this easier (check-pointing) later in the note.
We just ignore these inequalities for now.
Undo Logging¶
We’ve covered a lot of background information on how the database writes to its log and how it aborts transactions when it is running normally. Now, let’s finally get into the reason for all this logging - recovering from failures.
One possible recovery mechanism is Undo Logging. Note that Undo Logging does not, in fact, use write ahead logging (WAL) that we previously discussed (we will get back to that in a bit). Further, it uses the force and steal mechanisms in terms of buffer pool management.
Force + Steal
Undo Logging 是一种数据库恢复机制,主要用于撤销未提交事务的影响,同时保留已提交事务的结果。它与我们之前讨论的写前日志(WAL, Write Ahead Logging)不同,Undo Logging有其独特的规则和处理方式。
Undo Logging中有四种日志记录类型:
- Start记录:标记事务的开始。
- Commit记录:标记事务的成功提交。
- Abort记录:标记事务的中止。
- Update记录:记录事务对数据的修改,特别是修改前的旧值。
Undo Logging中的两个关键规则如下:
- 更新日志记录必须先于数据页写入磁盘:
- 如果一个事务修改了一个数据元素,则必须先将相应的更新日志记录写入磁盘,然后再将包含该修改的数据页写入磁盘。
- 这样做是为了确保在新值永久替换旧值之前,旧值已经被安全地记录在磁盘上。
- 事务提交前,必须先将修改后的数据页写入磁盘:
- 如果一个事务提交,则 必须先将所有修改后的数据页写入磁盘,然后再将提交记录本身写入磁盘。
- 这个规则确保所有事务的更改在事务实际提交之前已经写入磁盘。
- 这一点与WAL不同。在Undo Logging中,修改后的数据页会在写入提交记录之前被写入磁盘。
规则的实现:Force和Steal机制
- Steal策略:在事务提交之前,脏页(即修改过但还未写入磁盘的数据页)可以被写入磁盘。这对应于规则1,因为数据页可以在事务提交之前被写入磁盘,但前提是相应的日志记录已经被写入。
- Force策略:在提交记录写入之前,所有修改过的脏页必须写入磁盘。这对应于规则2,因为它确保了所有更改在事务提交时都已经持久化。
故障恢复过程
当系统崩溃后,恢复管理器会启动,并扫描日志来决定如何恢复:
- 从日志的末尾开始扫描,逐条检查日志记录以确定每个事务是否已经完成:
- COMMIT/ABORT记录:标记该事务为已完成,无需撤销。
- UPDATE记录:如果事务未完成,恢复旧值并写入磁盘;如果事务已完成,忽略这个更新。
- START记录:忽略,因为它只是标记事务的开始,没有其他作用。
- 扫描范围:通常从日志的末尾回溯到日志的开始,以确保所有未提交的事务都能被正确处理。
- 除非使用了检查点(checkpointing),可以减少需要扫描的日志范围(检查点在后续讨论中会提到)。
Redo Logging¶
Now, lets talk about another form of logging based recovery called Redo Logging. Here, Redo Logging implements the no-force no-steal strategy for buffer management. In Redo Logging, we have the same type of log records as Undo Logging, though only difference is for the Update log record, where instead of storing the previous value for particular data elements, we store the new value it is going to write.
The idea behind Redo Logging is similar to Undo Logging, except that at recovery time instead of undoing all the transactions that are incomplete, we redo the actions of all the transactions that were committed instead. Meanwhile, we leave all the uncommitted transactions alone. Like Undo Logging, we also have also have a rule to abide by
- If a transaction modifies a data element \(X\), then both the update record and commit record must be written to to disk before dirty data page itself - this is the no-steal policy. Hence, the dirty data page is written later than the transaction commit record, and is essentially write ahead logging.
Recovery for Redo Logging is rather straightforward: we just read the log from the beginning, and redo all updates of committed transactions. While this may seem like a lot of operations, it can be optimized, like Undo Logging, through check-pointing.
日志记录类型:Redo Logging与Undo Logging类似,记录类型包括Start
、Commit
、Abort
和Update
。但两者在Update
日志记录上有一个关键区别:
- Undo Logging 记录的是数据元素被修改之前的旧值
- Redo Logging 则记录数据元素被修改之后的新值,即将要写入的数据
No-Force + No-Steal
- No-Force策略:
- 在事务提交时,数据页不需要立即写入磁盘(即不强制写回)。这意味着即使事务提交了,修改后的数据页仍然可以留在内存中,等待适当的时候再写入磁盘。
- No-Steal策略:
- 事务对数据的修改必须先记录在日志中,然后再将日志记录(包括提交记录)写入磁盘。只有在日志写入磁盘之后,数据页才会被写入磁盘。
- 这实际上实现了写前日志(WAL, Write Ahead Logging)的概念,即确保在数据页被写回磁盘之前,相应的日志记录已经被安全地写入磁盘。
ARIES Recovery Algorithm¶
When a database crashes, the only things it has access to are the logs that made it to disk and the data pages on disk. From this information, it should restore itself so that all committed transactions’ operations have persisted (durability) and all transactions that didn’t finish before the crash are properly undone (atomicity). The recovery algorithm consists of 3 phases that execute in the following order:
- Analysis Phase: reconstructs the Xact Table and the DPT
- Redo Phase: repeats operations to ensure durability
- Undo Phase: undoes operations from transactions that were running during the crash to ensure atomicity
Let’s go through each phase in detail.
Analysis Phase¶
The entire purpose of the analysis phase is to rebuild what the Xact Table and the DPT looked like at the time of the crash. To do this, we scan through all of the records in the log beginning from the start. We modify our tables according to the following rules:
On any record that is not an END record: add the transaction to the the Xact Table (if necessary). Set the lastLSN of the transaction to the LSN of the record you are on
- If the record is a COMMIT or an ABORT record, change the status of the transaction in the Xact Table accordingly
- If the record is an UPDATE record, if the page is not in the DPT add the page to the DPT and set recLSN equal to the LSN
- If the record is an END record, remove the transaction from the Xact Table.
At the end of the analysis phase, for any transactions that were committing we will also write the END record to the log and remove the transaction from the Xact Table. Additionally, any transactions that were running at the time of the crash need to be aborted and the abort record should be logged. (Note: on some past exam questions we have asked for the status of the tables before this final pass without explicitly stating that assumption. However, that assumption does not hold for this semester’s material - make sure to perform this final pass through the transaction table to find the state at the end of analysis.)
Checkpoint Technique
One problem with the analysis phase so far is that it requires the database to scan through the entire log. In production databases this is not realistic as there could be thousands or millions of records. To speed up the analysis phase, we will use checkpointing.
Checkpointing writes the contents of the Xact Table and the DPT to the log. This way, instead of starting from the beginning of the log, we can start from the last checkpoint. For the rest of this note, we consider a variant of checkpointing called fuzzy checkpointing that actually writes two records to the log, a \(<BEGIN\_CHECKPOINT>\) record that says when checkpointing started and an \(<END\_CHECKPOINT>\) record that says when we finished writing the tables to the log.
The tables written to the log can be the state of tables at any point between the \(<BEGIN\_CHECKPOINT>\) and \(<END\_CHECKPOINT>\). This means we need to start at the \(<BEGIN\_CHECKPOINT>\) because we’re not sure if the records after it are actually reflected in the tables that were written to the log.
Example
Analysis Phase Example:
click here to see the whole process
Redo Phase¶
The next phase in recovery is the Redo Phase which ensures durability. We will repeat history in order to reconstruct the state at the crash. We start at the smallest recLSN in the DPT because that is the first operation that may not have made it to disk. We will redo all UPDATE
and CLR
operations unless one of the following conditions is met:
- The page is not in the DPT. If the page is not in the DPT it implies that all changes (and thus this one!) have already been flushed to disk.
- recLSN > LSN. This is because the first update that dirtied the page occurred after this operation. This implies that the operation we are currently at has already made it to disk, otherwise it would be the recLSN.
- pageLSN (disk) ≥ LSN. If the most recent update to the page that made it to disk occurred after the current operation, then we know the current operation must have made it to disk.
click here to see the whole process
Undo Phase¶
The final phase in the recovery process is the undo phase which ensures atomicity. The undo phase will start at the end of the log and works its way towards the start of the log. It undoes every UPDATE
(only UPDATEs!) for each transaction that was active (either running or aborting) at the time of the crash so that we do not leave the database in an intermediate state. It will not undo an UPDATE
if it has already been undone (and thus a CLR record is already in the log for that UPDATE
).
For every UPDATE
the undo phase undoes, it will write a corresponding CLR record to the log. CLR records have one additional field that we have not yet introduced called the undoNextLSN. The undoNextLSN stores the LSN of the next operation to be undone for that transaction (it comes from the prevLSN of the operation that you are undoing). Once you have undone all the operations for a transaction, write the END record for that transaction to the log.
Appendix 1 explains how this is implemented efficiently.
click here to see the whole process
Conclusion¶
We’ve now covered the entire ARIES recovery algorithm! We first reconstruct the state of the database before the crash by recreating the transaction and dirty-page tables and re-applying any modifications that were not flushed. We then abort any transactions that were running before the crash and undo all of their effects in a single, efficient pass. Below is a higher-level view of how the three phases interact with log records to bring the database back to a consistent state:
In this note, we first covered how databases can guarantee that they will recover from failure even while using a steal, no-force policy. Here is a summary of the performance and logging implications of the different types of policies:
We then covered how the database uses Write Ahead Logging policies to log all operations when it is running correctly. We finally covered how the database uses the log in a 3 step process (analysis, redo, undo) to recover from failure and restore the database to a correct state.
Appendix LSN list¶
There are a lot of different LSNs, so here is a list of what each one is:
- LSN: stored in each log record. Unique, increasing, ordered identifier for each log record
- flushedLSN: stored in memory, keeps track of the most recent log record written to disk
- pageLSN: LSN of the last operation to update the page (in memory page may have a different pageLSN than the on disk page)
- prevLSN: stored in each log record, the LSN of the previous record written by the current record’s transaction
- lastLSN: stored in the Xact Table, the LSN of the most recent log record written by the transaction
- recLSN: stored in the DPT, the log record that first dirtied the page since the last checkpoint
- undoNextLSN: stored in CLR records, the LSN of the next operation we need to undo for the current record’s transaction
When a machine crashes, the bits in memory are "erased," which is why memory is not persistent and we have to rely on persistent storage devices such as disk to guarantee durability.