Thursday, November 26, 2009

SQL Server: Recovering from Disasters Using Backups


Paul S. Randal

There isn't much point taking SQL Server backups unless you know how to restore them. If you have anything more complicated than just a full database backup, you're going to need to know some RESTORE options to be able to successfully restore your database to the desired point in time.
This is even more the case if you have a complicated database layout or a complex backup strategy and you want to be able to restore, for example, a single file or filegroup, or take advantage of partial database availability.

As long as you have an effective backup strategy and your backups are valid, you should be able to recover from a disaster within your Recovery Time Objective (RTO) and to your Recovery Point Objective (RPO). In the first article in this three-part series, I discussed the various types of backups and how to formulate a backup strategy (see "Understanding SQL Server Backups" in the July 2009 issue).
In this article, I'll explain how restore works and how to perform some of the more common restore operations. It will be helpful if you've read the backup article, and the background material I mentioned in that article's introduction. I'm also going to explain a few more tricky operations, such as doing a point-in-time restore and an online piecemeal restore with partial database availability.
Just as in the previous article on BACKUP, I'm not going to explain how the RESTORE syntax works or go through the specific steps of all restore operations. SQL Server Books Online does an excellent job of that. See "RESTORE (Transact-SQL)" for more info, especially the examples spread throughout the topic. There are actually so many options to RESTORE that it's a whole other topic to explain them! "Backing Up and Restoring How-to Topics (SQL Server Management Studio)" explains how to use the tools to perform restores.

The Four Phases of Restore

Let's start with how restore actually works. A restore operation has up to four phases:
  1. File creation and initialization
  2. Data and/or transaction log copy
  3. REDO phase of recovery
  4. UNDO phase of recovery
One of the primary goals of disaster recovery is to bring the database online as quickly as possible. If your disaster-recovery plan involves restoring backups (instead of, say, failing over to a database mirror), you're going to want the restore process to be as fast as possible. With each of the four restore steps in mind, is there anything you can do to speed them up?
The first step can be essentially skipped if the data and log files already exist. This means that if you're going to overwrite an existing database, don't drop the database before doing the restore. Instead, use the WITH REPLACE option on the first restore operation to tell SQL Server to use the existing files. If the files don't exist, they will be created and then initialized. Creating the files is very fast, but the process of zero-initializing them can be very slow.
For security reasons, and by default, all database files are zero-initialized. You can enable instant file initialization for the SQL Server instance, which skips the zeroing process for data file create and grow operations, including those required during a restore -- possibly saving hours of downtime, even if the data files are only gigabytes in size. Transaction log files are always zero-initialized because of the circular nature of the transaction log itself.
You can read more about all of this, with references to Books Online, in my "Instant Initialization" blog post category.
You may be wondering about the second phase -- why does the item say "and/or transaction log"? If you read the previous article, you'll remember that all full and differential backups also include some transaction log records, to enable the database to be restored to a transactionally consistent point in time. Phase two is a pure copy operation -- no processing of the data is performed -- so the main way to speed this up is to have a better-performing I/O subsystem. This is one of the few times when it's acceptable to "throw hardware at the problem."
The other way to speed up the copy phase is to use some kind of backup compression technology, either native to SQL Server 2008 or through one of the various third-party solutions. The two parts of the copy phase are reading from the backup media and writing out to the data and/or log files. If you can do fewer reads (using compressed backup media), you can speed up the overall process, at the expense of a small amount of CPU resources.
Phases three and four are about running recovery on the transaction log to bring the database to a transactionally consistent point in time. I explained the details of recovery in the February 2009 article "Understanding Logging and Recovery in SQL Server". Note that phase four is optional, as I'll explain later.
Suffice it to say that the more transaction log that needs to be recovered during a restore, the longer the restore will take. This means that if, for instance, you have a full database backup from a week ago and hourly transaction log backups since then, the restore process will essentially replay all the transactions from the last week before completing. I discussed a solution for this in the previous article -- adding differential backups into a full-plus-log backup strategy.
A differential database backup contains all the datafile pages that have changed since the last full database backup, and can be used in a restore operation to avoid having to replay all the transactions that occurred in the period between the full database backup and the differential database backup. This can vastly reduce the time it takes to restore the database, but at the expense of a slightly more complicated backup strategy.
You can find more information in the Books Online topic "Optimizing Backup and Restore Performance in SQL Server".

What Do You Need to Restore?

When disaster strikes, the first thing you need to do is work out what has been damaged, as this is going to dictate the actions you must take to recover from the disaster. For storage media failure, the possibilities include:
  • Damage to the entire database (for instance, whatever was storing the database was destroyed, or the database was a single data file and it was damaged).
  • Damage to a single filegroup of a multi-filegroup database.
  • Damage to a single file of a multi-file filegroup.
  • Damage to a single page in the database.
  • Damage spread through the database.
You can ascertain the damage by looking through the SQL Server error log for notifications that file(s) are inaccessible, that page-read errors occurred (for instance, page checksum failures or a torn-page detection error), or that general corruption was encountered. If damage occurred, it is usual practice to run the DBCC CHECKDB consistency-checking operation to get an idea of how pervasive the damage is.
An explanation of consistency checking is beyond the scope of this article, but you can watch a video of a presentation I made at the Tech-Ed IT Forum in November 2008 titled "Corruption Survival Techniques", and listen to a TechNet Radio interview from earlier this year where I discuss database corruption (direct download links are here).
Disasters are not limited to I/O subsystem or server failures -- there's also human error to consider. Database tables (or data from them) are often accidentally deleted by poorly programmed applications or careless Transact-SQL statements (the "I didn't realize I was on the production server" scenario). In such cases, it can be very difficult to figure out what needs to be restored and from what point in time, especially if no one owns up to making the mistake. You might get lucky using standard reports from the default trace where the DDL operation is still available or the DELETE statement was caught by your own logging -- but often there's no record of who did what to the database. I'll discuss recovering from this situation in more detail later. Regardless of who performed the accidental data deletion or when it happened, the longer you wait to recover -- or the more time that passes before you are made aware of the problem -- the more complex it can be to recover.
So, as a first step if the database is running in the FULL recovery model and the transaction log is undamaged, perform a tail-of-the-log backup to ensure that all transactions up to the point of the disaster are backed up. This "final" transaction log backup will have everything up until the time of the disaster and can be used to bring the database being restored as far as possible, possibly up-to-the-minute.
In a nutshell, you need to work out what you have to restore. Then it becomes a question of what you are able to restore.

What Are You Able to Restore?

The aim of any restore operation is to restore the fewest possible backups, so the restore operation is as fast as possible and completes within your RTO, while also allowing you to meet your RPO.
The main question to ask here is "What backups do I have?" If the only backup you have is a full database backup from a week ago and the whole database has been lost, there's only one restore option -- to a point in time a week ago, losing all work since then. Simply put, your backup strategy should always ensure that you are able to restore what you need to in the event of a disaster, as I discussed in the previous article.
So how can you determine what backups you have available? First, you can query that various backup history tables in the msdb database. These tables contain a record of all backups that have been taken in the SQL Server instance since the last time the backup history tables were cleared out.
As far as the backups themselves are concerned, it is a best practice to name backup files to include the database, type of backup, date and time so that the backup can be identified at a glance. If you haven't done this, you can find out what a backup file contains using the RESTORE HEADERONLY command. This will display the contents of the backup file's header, essentially the metadata that describes the backup itself. You can read more in the Books Online topic "Viewing Information About Backups".
Using either method, you are trying to work out the restore sequence to use to restore the damaged or deleted data. A restore sequence is a set of backups that must be restored and the appropriate order in which to restore them. The restore sequence may be as simple as just one full backup (of a database, filegroup or file), or a complicated set of full, differential and transaction log backups.
For instance, imagine a scenario where the backup strategy involves full database, differential database and transaction log backups. If a system crash occurs and the data files are damaged, what is the restore sequence? Figure 1 illustrates this example.
In this case, the shortest and fastest restore sequence is the most recent full database backup (F), the most recent differential database backup (D2), and then all subsequent transaction log backups, up to and including the tail-of-the-log backup (L7 and L8).
One of the tricky problems when planning a restore sequence is finding the earliest required transaction log backup to restore (sometimes called finding the "minimum-LSN," or "minimum-Log Sequence Number"). In the example in Figure 1, only transaction log backups L7 and L8 are required, because the differential database backup D2 brings the database to a more recent point in time than all the previous transaction log backups.

Figure 1: Example Restore Sequence
SQL Server will allow previous, unneeded transaction log backups to be restored, but they will not be used and essentially just waste disaster-recovery time.Continuing my example, what would happen if the differential database backup D2 was damaged or missing? Figure 2 shows this case.

Figure 2: Restore Sequence with a Damage Differential Database BackUp
In this scenario, the shortest and fastest restore sequence is the most recent full database backup (F), the next most recent differential database backup (D1), and then all subsequent transaction log backups (L4, L5, L6, L7 and L8). This is possible only as long as backups D1, L4, L5 and L6 are still available. It is important that you do not delete backups too soon; otherwise you could run into problems during a disaster.
For instance, if the full database backup F is damaged, unless the previous full database backup is still available, the database will not be recoverable. If the differential database backup D1 is deleted as soon as D2 completes, then the scenario in Figure 2 will not be possible, and the restore sequence will involve all transaction log backups since the full database backup -- a potentially very long restore sequence.
This raises the question of "When should you delete your previous backups?" The answer is definitely "It depends!" If you don't have a legal obligation to keep your backups around for a certain length of time, then it's up to you, and depends on the backup strategy you have and how much disk space is required. Regardless, don't immediately delete previous backups as soon as a new one is taken; it's best to keep at least one or two complete cycles of backups before getting rid of older backups. Ideally, you should test your backups before removing older ones.
For transaction log backups, in general you must have all of them since the last full database backup was taken, as that is the final fall-back restore sequence. If a single transaction log backup from the transaction log backup "chain" is missing or damaged, the restore operation can't proceed past the gap. As I mentioned in the previous article, verifying the integrity of your backups is a key part of being able to restore successfully.
You can find more details on figuring out what you're able to restore in the comprehensive Books Online topic "Working with Restore Sequences for SQL Server Databases".

Example Restore Scenarios

The most common restore scenario involves a full database backup and then one or more transaction log backups to bring the database forward in time. You can do this through SQL Server Management Studio (SSMS) or through Transact-SQL, although there is something you need to be aware of if you're going to use RESTORE commands directly.
When a backup is restored, there are three options for how the restore operation completes and they all relate to the UNDO phase of recovery. As each successive backup in the restore sequence is restored, the REDO phase of recovery is always performed, but the UNDO phase cannot be performed until the very last backup in the transaction log backup chain has been restored. This is because as soon as recovery is completed, no further transaction log backups can be applied, so all restores in the restore sequence must specify not to run the UNDO phase of recovery.
The default, unfortunately, is to run the UNDO phase of recovery -- the equivalent of using the WITH RECOVERY option on the RESTORE statement. When restoring multiple backups, you must be careful that each one specifies WITH NORECOVERY. In fact, the safest way is to use the WITH NORECOVERY option on all restores in the restore sequence, and then manually complete recovery afterward. Here is some example Transact-SQL code to restore a full database backup and two transaction log backups, and then manually complete recovery to bring the database online:
RESTORE DATABASE DBMaint2008 FROM
DISK = 'C:\SQLskills\DBMaint2008_Full_051709_0000.bak'
WITH REPLACE, CHECKSUM, NORECOVERY;
GO
RESTORE LOG DBMaint2008 FROM
DISK = 'C:\SQLskills\DBMaint2008_Log_051709_0100.bak'
WITH NORECOVERY;
GO
RESTORE LOG DBMaint2008 FROM
DISK = 'C:\SQLskills\DBMaint2008_Log_051709_0200.bak'
WITH NORECOVERY;
GO
RESTORE DATABASE DBMaint2008 WITH RECOVERY;
GO
Notice that I also used the CHECKSUM option on the restore of the full database backup to ensure that any page checksums present in the database being restored are verified as they are restored.
If WITH NORECOVERY was not specified on the first RESTORE statement, the following error is returned:
Msg 3117, Level 16, State 1, Line 1
The log or differential backup cannot be restored because no files are ready to rollforward.
Msg 3013, Level 16, State 1, Line 1
RESTORE LOG is terminating abnormally.
You must be very careful to use the right option, otherwise you risk having to start a long restore sequence again -- there's no way to undo recovery once it's completed.
There is, however, an interesting option which kind of does that -- the WITH STANDBY option. This is the last of three options I mentioned earlier. It works by running the UNDO phase of recovery, but it keeps a note of what it did (in an "undo" file whose name and path you specify) and then allows read-only access to the database. The database is transactionally consistent, but you have the ability to continue with the restore sequence. If you decide to continue, the UNDO is reversed (using the contents of the undo file) and then the next transaction log file is restored. This is useful in two scenarios: for allowing read-only access to a log-shipping secondary database and for looking at the contents of the database during the restore sequence.
If the disaster you're recovering from involves accidental deletion of a table, for instance, you may want to do a point-in-time restore. There are several ways to do this, but the most common is where you want to restore the database but ensure that recovery does not proceed past a certain time. In that case you can use the WITH STOPAT option to prevent transaction log restore from going past the time you know the table was deleted. For instance, using the Transact-SQL example above, if I wanted to prevent the database from being restored past 1:45 a.m., I could use the following syntax on the second RESTORE LOG statement:
RESTORE LOG DBMaint2008 FROM
DISK = 'C:\SQLskills\DBMaint2008_Log_051709_0200.bak'
WITH NORECOVERY, STOPAT = '2009-05-17 01:45:00.000';
GO
I could even combine STOPAT and STANDBY to see whether that was the correct point in time and then, if not, restore the same transaction log backup with a time a few seconds later and so on. This kind of operation becomes very tedious, but it may be the only solution if you don't know what time an operation took place.
A comprehensive discussion of these and other options for the RESTORE statement can be found in the Books Online topic "RESTORE Arguments (Transact-SQL)".
One of the coolest new features introduced in SQL Server 2005 Enterprise Edition was partial database availability. This feature allows a multi-filegroup database to be online and available as long as at least the primary filegroup is online. Obviously, data in any offline filegroups can't be accessed, but this feature allows a very large database to be split into separate filegroups for easier and faster recoverability. Another Enterprise-only feature that was added is the ability to perform piecemeal restores (for instance, a single filegroup from a multi-filegroup database) online, while the rest of the database is being used for processing.
These two features combined enable some quite sophisticated and efficient restore scenarios, as long as the database has been architected that way and the correct backups exist.
You'll find an excellent, in-depth SQL Server Technical Article titled "Microsoft SQL Server 2005 Partial Database Availability" with some extensive examples available at tinyurl.com/mbpa65. There's also a 75-minute recording of Kimberly L. Tripp delivering a Tech-Ed EMEA session titled "SQL Server 2005 VLDB Availability and Recovery Strategies" that is well worth watching.

Considerations When Restoring to a Different Location

The simplest restore scenario is when the database is being restored on the same SQL Server instance to which it is usually attached, and with the same name. As you move further away from that scenario, the aftermath of the restore operation becomes more complicated.
If the database is being restored on the same instance, but with a different name, you may need to make changes to elements like DTS/SSIS packages, database maintenance plans, application strings and anything that relies on a database name.
If the database is being restored on a different instance on the same server, things get a lot more complicated:
  • The SQL Server logins will be different or may not exist.
  • SQL Agent jobs and DTS/SSIS packages will be different or may not exist.
  • The master database is different, so any user-defined stored procedures may be missing.
  • The SQL Server instance name will be different, so there may be client connectivity issues.
  • If the database is being restored on an instance on a different server, everything listed applies, but there may be added security issues as Windows accounts may be different, and they may be in a different Windows domain.
  • One other consideration is the edition of SQL Server the database is being restored on. There are some features that, if used in the database, make the database "Enterprise-only" -- it cannot be restored on a Standard Edition (or lower) SQL Server instance.
  • In SQL Server 2000 and earlier, this is not an issue. In SQL Server 2005, if table or index partitioning is used, the database is "Enterprise-only." In SQL Server 2008, the feature list is:
  • Change data capture
  • Transparent data encryption
  • Data compression
  • Partitioning
All of these require sysadmin privileges to enable except data compression, which can be enabled by a table owner, thus potentially breaking a disaster-recovery plan involving restoring to a Standard Edition instance. You can tell whether any of these features are being used in the database using the DMV sys.dm_db_persisted_sku_features and adjust your disaster-recovery plan accordingly.

Dig Deeper

Just as with the first article in the series on backups, there are lots of facets of restore operations that I didn't have space to cover. Now that you know the basics, however, you can dive into some of the Books Online and blog links for deeper information. The best place to start in Books Online is the topic "Restore and Recovery Overview (SQL Server)". You can also find a lot of information on my blog, starting with the Backup/Restore category.
The main takeaway I'd like you to gain from this article is that to successfully recover a database using backups, you need to practice to make sure you know what to do. You don't want to be learning the syntax for the RESTORE command during a high-pressure disaster-recovery situation. You may also find that your backup strategy does not allow you to recover within your business requirements. Maybe the backups take too long to restore, maybe your log backups are accidentally overwriting each other, or maybe you forgot to back up the server certificate used to enable transparent database encryption in SQL Server 2008.
By far the best way to prepare for disaster is to have a restore plan that lists the steps to go through, and to have a set of scripts that will help identify what backups exist and the order in which to restore them. I always like to say that this should be written by the most senior DBA on the team and tested by the most junior DBA -- to ensure that everyone can follow the steps safely. However, if you're an involuntary DBA, you're going to need to put a plan together yourself and make sure you can follow it.
In the next article, I'll explain how to recover from database corruption if you don't have any backups, and why you might choose to run a repair operation even if you do have backups.
In the mean time, and as always, if you have any feedback or questions, drop me a line -- Paul@SQLskills.com.

Thanks to Kimberly L. Tripp for providing a technical review of this article.
Paul S. Randal is the managing director of SQLskills.com, a SQL Server MVP and Microsoft regional director. He worked on the SQL Server Storage Engine team at Microsoft from 1999 to 2007. Randal wrote DBCC CHECKDB/repair for SQL Server 2005 and was responsible for the Core Storage Engine during SQL Server 2008 development. Randal is an expert on disaster recovery, high availability and database maintenance and is a regular presenter at conferences around the world. He blogs at SQLskills.com/blogs/paul and is on Twitter as @PaulRandal.