Tuesday, October 6, 2009

Protecting, Backing up and Recovering the Control File

1.0 Introduction
Since it is usually only a few megabytes in size and contains no tables or indexes, the Control File usually gets the Cinderella treatment when it comes to protective maintenance and backup operations: sometimes overlooked entirely, otherwise casually dealt with as being of no special importance.
This article hopes to set the record straight: the Control File is one of the most important parts of an Oracle database; it needs protecting and it needs to be properly backed up. And should the day come that it ever needs recovering, you will need to know what you're doing just as much as if it had been the SYSTEM tablespace that had been lost or corrupted.
2.0 Multiplexing the Control Files
If you corrupt or lose your only functioning control file, you are in trouble. Hardware mirroring will protect your control file from that sort of problem to an extent: if a hard disk or disk controller fails, it is not the end of the world, because the other hard disk or controller is still functioning. But hardware mirroring only protects you from hardware failure: it doesn't help when the problem is a colleague who can't quite work out what the rm * command does, nor when it is one of the background processes introducing corruption into the control file during one of its writes (rare, but it happens). In such cases of user or software madness, the hardware mirror will faithfully replicate the disaster onto the other hard disk, and you end up with zero functioning control files.
Hardware mirroring is good, therefore... but not sufficient on its own.
Instead, you want to multiplex your control files. That means making Oracle write to, read from and maintain two (or more) identical copies of the control file. If a user deletes one copy, a second copy remains. If Oracle writes badly to one copy, thus introducing corruption, it is unlikely to do so as it writes to the second file. Multiplexing your control files should be considered compulsory, therefore. By all means use hardware mirroring to protect your two (or three) control files from hardware failure, but multiplex first and mirror second.
To multiplex your control files, you simply need to tell Oracle that there are two or three copies for it to maintain. That's handled by modifying the CONTROL_FILES parameter in the init.ora/spfile.ora. You also need to ensure that if you've said there's a second or third copy of the file to maintain, those extra copies actually exist in the locations you specified. There's no automatic way of doing that: it's just down to using your operating system's copy command. However, you can't yourself copy a control file if it's in use: so the copy operation will have to be performed when the database has been cleanly shut down.
Summarising all those long-winded considerations, the process can simply be listed as follows:
1.    Edit the parameter file so that CONTROL_FILES mentions a second (or third) location
2.    Shut down the database cleanly (shutdown immediate is fine, shutdown abort isn't)
3.    Copy the existing control file into the locations specified in the modified CONTROL_FILES parameter
4.    Startup the database
The order of events here is very important if you are using an spfile rather than an init.ora to configure your instance: editing an spfile requires a running instance to process the necessary alter system commands, so you have to shut down the instance after you've edited the file. But if you're using an init.ora, you could pretty much vary the order at will: shutdown -> copy -> edit -> startup would work just fine in that case, because editing an init.ora can be done with any old text editor at any time. To avoid having to learn 53 different possibilities, though, I'd suggest you stick with the ordering I've shown above. It will work unchanged regardless of whether you're using an spfile or an init.ora.
A short example will help clarify the procedures, I hope. First, let's establish (a) whether I've already multiplexed my control file and (b) whether I'm using an spfile or not:
SQL> select name from v$controlfile;

NAME
--------------------------------------------------------------------------------
/oracle/oradata/lx10/control01.ctl

SQL> show parameter spfile;

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
spfile                               string      /oracle/10g/dbs/spfilelx10.ora
The first query shows me only a single control file. It might be mirrored, but it's certainly not multiplexed. The second query tells me I'm using an spfile, so I must do my edit of CONTROL_FILES before I shut down my instance:
SQL> alter system set control_files=
  2  '/oracle/oradata/lx10/control01.ctl',
  3  '/oracle/oradata/lx10/control02.ctl' 
  4  scope=spfile;

System altered.
The command here simply lists each control file path and filename in a series of quoted, comma-separated entries. If you were editing an init.ora to achieve the same thing, you could omit the quotation marks, so that the parameter value ends up just being a set of comma-separated paths and filenames:
control_files=/mydb1/control01.ctl,/mydb2/control02.ctl,/mydb3/control03.ctl
...and so on. Either way, once whatever parameter file you're using has been successfully edited, you can now proceed to take a cold copy of the existing control file:
SQL> shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down.

SQL> exit
Disconnected from Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Production
With the Partitioning, OLAP and Data Mining options

$ cp -p /oracle/oradata/lx10/control01.ctl /oracle/oradata/lx10/control02.ctl
Note that I used the -p switch there. This ensures that the copied file has the same permissions and ownership as the original, which is necessary if the instance is going to have the same level of access to the copy as it has to the original. You need to be concerned about that on Windows machines too, of course, though the specifics of how you go about making it so are different.
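The effect of -p is easy to see for yourself with a throwaway file; this is just a sketch using a temporary directory, not the database's real control files:

```shell
# Show that cp -p preserves the file mode, using a disposable file in a
# temporary directory -- not the database's real control files.
d=$(mktemp -d) && cd "$d"
touch control01.ctl
chmod 640 control01.ctl              # rw-r-----, as Oracle typically creates
cp -p control01.ctl control02.ctl
stat -c '%a %n' control01.ctl control02.ctl
# prints:
# 640 control01.ctl
# 640 control02.ctl
```

Without -p, the copy would instead be created with permissions derived from your current umask, which may not match the original's at all.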
Anyway, with the copy in place, I can now re-start my instance and check the results of my efforts:
[oracle@pertinax ~]$ sqlplus / as sysdba

SQL*Plus: Release 10.2.0.1.0 - Production on Thu Aug 31 10:16:34 2006
Copyright (c) 1982, 2005, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> startup
ORACLE instance started.

Total System Global Area  167772160 bytes
Fixed Size                  1218316 bytes
Variable Size              79694068 bytes
Database Buffers           83886080 bytes
Redo Buffers                2973696 bytes
Database mounted.
Database opened.

SQL> select name from v$controlfile;

NAME
--------------------------------------------------------------------------------
/oracle/oradata/lx10/control01.ctl
/oracle/oradata/lx10/control02.ctl
Good news: If V$CONTROLFILE reports that more than one control file exists, that's successful multiplexing. I should, however, point out that it is singularly daft to have multiple copies of your control file on the same disk, let alone in the same directory, as I have it here! In my defence, I've only been that daft on this occasion in the interests of keeping the example short and simple!
You also need to be careful to ensure that when you multiplex, you multiplex not just into different directories but actually onto different physical hard disks -preferably, even completely different hard disk controllers. I've seen, for example, a number of databases on Windows (though it's not just a Windows problem) that have carefully multiplexed their controlfiles onto the D: and E: drives. Unfortunately, D: and E: were both merely partitions of the same physical hard disk, so the effort was rather wasted! As my little example demonstrates, though, Oracle won't check that your multiplexing scheme is not foolish, and neither will it complain if it is!
3.0 User backups of the Control File
3.1 Binary backups
Even though you've now multiplexed your control file, you should back it up regularly. You cannot, however, back it up simply by copying the file(s) using operating system commands... not, at least, if the database is up and running. If you do, you might get lucky, but rather more probably you will end up copying an internally inconsistent copy of the file which is useless for the purposes of future recoveries. Control Files cannot reliably be O/S copied whilst 'hot'.
Instead, you can produce a hot copy of the control file using nothing more demanding than a simple SQL statement:
alter database backup controlfile to '/home/oracle/controlbkp.ctl';
This causes Oracle momentarily to lock the control file so that it can be certain nothing is actually modifying it -in a sense, it turns the file very briefly 'stone cold'. Then your server process can safely read the existing control file and is then able to output a clean, consistent copy in the directory and with whatever file name you specified. It's all over and done with in about 10 seconds.
A comparison of the original control file with the fresh backup I've just produced shows this:
$ ls -l /oracle/oradata/lx10/*.ctl
-rw-r-----  1 oracle oinstall 7061504 Aug 31 10:25 /oracle/oradata/lx10/control01.ctl
-rw-r-----  1 oracle oinstall 7061504 Aug 31 10:25 /oracle/oradata/lx10/control02.ctl

$ ls -l /home/oracle/*.ctl
-rw-r-----  1 oracle oinstall 7061504 Aug 31 10:25 /home/oracle/controlbkp.ctl
You will note that both files are identical in size (in this case, about 7MB each). This should tell you that the alter database backup controlfile to '/somewhere' command produces a byte-for-byte binary backup of the control file: an identical copy, in other words. This is a good thing, of course: it's what you tend to expect from a backup utility, after all. But it poses an issue for future recoveries that use this backup, as I'll now explain.
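If you want to go beyond comparing sizes and check two files byte for byte, cmp is the usual tool; a throwaway sketch with stand-in files:

```shell
# Check two files byte for byte with cmp: exit status 0 means identical.
# f1 and f2 are disposable stand-ins for a control file and its backup.
f1=$(mktemp); f2=$(mktemp)
head -c 4096 /dev/urandom > "$f1"
cp "$f1" "$f2"
cmp -s "$f1" "$f2" && echo identical     # prints "identical"
```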
One of the major jobs of the control file is to record the 'age' of the database. It does this by storing the latest checkpoint change number, written there periodically by the CKPT process. You can see this number increment all the time by repeatedly querying the V$DATABASE view, which takes most of its contents directly from the control file:
SQL> select checkpoint_change# from v$database;

CHECKPOINT_CHANGE#
------------------
          629499
A binary backup of the control file therefore also stores this particular 'age of database' marker, perfect and exact at the time the backup was made... but it will be immediately out-of-date, because the database will continue being used. Its checkpoint change number will therefore continue to be incremented, but that stored in the backup remains static. Were you therefore ever to restore the backup control file, your data files would be marked with a checkpoint change number much later than the restored control file claims is legitimate -and that's a conflict which is not easily fixed. (For the record, though I'll be discussing this later anyway in more detail, the fix would be to go through the motions of an incomplete recovery so that you can issue a resetlogs command: recover database using backup controlfile until cancel -> cancel -> alter database open resetlogs).
Using the binary backup of the control file in a recovery situation is therefore a little problematic because it involves forcing a resetlogs on the database. Resetlogs are not good things to do, if you can avoid them, because (at least until 10g) they render all prior backups and archives largely useless and therefore make a new, closed whole database backup fairly essential... which is a lot of downtime and hassle.
Don't let that put you off taking binary backups of the control file, though. They are certainly usable in a lost controlfile recovery scenario, even if the exact recovery process is a little fiddly and potentially expensive. Importantly, only binary backups of the control file protect the RMAN metadata that is recorded in the control file of the database whenever the database as a whole is backed up using RMAN. If you're not using an RMAN catalog, therefore, binary backups of the control are vital.
3.2 Trace File backups
An alternative to taking a binary backup of the control file is to get the instance to generate a script which would re-create a control file. This is usually referred to as the 'trace file backup' of the control file. It is easily done with another simple SQL command:
alter database backup controlfile to trace noresetlogs;
The command simply writes a user trace file (that is, a file in whichever directory is pointed at by the USER_DUMP_DEST initialisation parameter) that contains within itself a SQL statement which, if executed, would re-construct the control file. The name of the trace file produced in this way is a little obscure: it will contain the process ID of the server process you connected to when you began your SQL*Plus session. You can find out that information by issuing this query:
select p.spid from v$process p, v$session s
where s.paddr=p.addr and s.username=(select sys_context('USERENV','CURRENT_USER') from dual);
When I do that, I get this result:
SQL> select p.spid from v$process p, v$session s
  2  where s.paddr=p.addr and
  3  s.username=(select sys_context('USERENV','CURRENT_USER') from dual);

SPID
------------
12297
...and I can therefore be confident that my trace file will be called <SID>_ora_12297.trc -the naming convention may be different for different operating systems, but the presence of the process number in the file name somewhere is guaranteed. It is obviously important to know how to locate the trace file precisely by name, otherwise there's not much point in producing it in the first place!
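If you'd rather not eyeball the dump directory yourself, the file can be found from the SPID with a one-liner. The UDUMP path and SPID here are the ones from this article's example; substitute your own USER_DUMP_DEST setting and query result:

```shell
# Find the trace file for a given server process ID (SPID) in the user
# dump directory. UDUMP and SPID are taken from the article's example;
# plug in your own USER_DUMP_DEST value and the SPID your query returned.
UDUMP=/oracle/admin/lx10/udump
SPID=12297
ls "${UDUMP}"/*_ora_"${SPID}".trc
```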
If you open the trace file in your favourite text editor, you will find that the first 30 or 40 lines are really a lot of old gumph: technical stuff and comments which are not actually very useful. You can actually just delete all of that, since you'll never need it again. The really useful stuff begins with these magic words:
STARTUP NOMOUNT
CREATE CONTROLFILE REUSE DATABASE "HJR" NORESETLOGS  ARCHIVELOG
--  SET STANDBY TO MAXIMIZE PERFORMANCE
    MAXLOGFILES 16
    MAXLOGMEMBERS 2
    MAXDATAFILES 302
    MAXINSTANCES 1
    MAXLOGHISTORY 113
LOGFILE
  GROUP 1 (
    '/mydb/log01a.rdo',
    '/mydb/log01b.rdo'
  ) SIZE 10M,
[and so on]
This is the SQL statement which will re-create the control file. The hard-coded paths to the database's online redo logs and data files are needed so that they can be written back into the freshly-constructed control file. The new control file can therefore properly carry out its essential job of telling the instance where all the database files are located.
The real beauty of this script is that using it in a recovery situation requires no use of the expensive "resetlogs" command: instead of restoring a copy of the control file with an out-of-date checkpoint change number, this script tells us how to find the data files and redo logs that make up the database. Knowing these locations, we can visit each file in turn and ask each of them what checkpoint change number they are aware of. We can then write the highest checkpoint change number found into the newly-created controlfile, and thus the new control file starts out in complete agreement with the rest of the database as to the age of the database. Since there is no disagreement between the data files and the control files, there is no need for a fake recovery, and hence no need for a resetlogs.
The catch, though, is that the trace script isn't truly a backup of the control file, in the sense of being a faithful replica of it. It's merely a recipe for re-creating one... and it happens to be an incomplete recipe, in that nowhere does this script contain instructions on how to re-create the RMAN metadata that might have been stored in the original control file. Whilst it's cheaper to recover using this script, therefore (because there's no resetlogs), it might not bring everything back the way you'd like it to (there are potentially lost RMAN records). On the other hand, if you are manually backing up the control file at all, it rather suggests that you aren't using RMAN in the first place -and therefore this concern may not actually be relevant to you.
But the real point here is that the two user-managed control file backup methods are complementary. Each does something the other does not; each has advantages; each has significant drawbacks. In short, you should be doing both, regularly (I suggest nightly).
I should finally just mention that, being a text file and thus editable using nothing more exciting than vi or notepad, the trace file backup of the control file is extremely useful in situations other than control file recovery. Cloning a database, for example, can make use of the script to create a new control file which points to copies of data files rather than the originals. I explain in detail how you go about doing this in the first part of this cloning a database article.
4.0 RMAN backups of the Control File
If you backup your database as a whole with RMAN, then the rules regarding whether the control file gets included in those backups are quite simple:
·    if you 'touch' file 1 (that is, SYSTEM) the control file will be included in the backup
·    if you don't touch file 1, the control file will NOT be included in the backup unless you explicitly ask it to be
·    if you switch on the autobackup feature (available in 9i and above), the control file will always be included in the backup even without your explicit request
By way of a little demonstration of these rules, I offer this:
[oracle@pertinax ~]$ rman target /

Recovery Manager: Release 10.2.0.1.0 - Production on Thu Aug 31 10:32:30 2006

Copyright (c) 1982, 2005, Oracle.  All rights reserved.

connected to target database: LX10 (DBID=1253134393)

RMAN> backup tablespace system;

Starting backup at 31-AUG-06
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: sid=142 devtype=DISK
channel ORA_DISK_1: starting full datafile backupset
channel ORA_DISK_1: specifying datafile(s) in backupset
input datafile fno=00001 name=/oracle/oradata/lx10/system01.dbf
channel ORA_DISK_1: starting piece 1 at 31-AUG-06
channel ORA_DISK_1: finished piece 1 at 31-AUG-06
piece handle=/oracle/flash_recovery_area/
channel ORA_DISK_1: backup set complete, elapsed time: 00:01:36
channel ORA_DISK_1: starting full datafile backupset
channel ORA_DISK_1: specifying datafile(s) in backupset
including current control file in backupset
including current SPFILE in backupset
channel ORA_DISK_1: starting piece 1 at 31-AUG-06
channel ORA_DISK_1: finished piece 1 at 31-AUG-06
piece handle=/oracle/flash_recovery_area/
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:05
Finished backup at 31-AUG-06
You'll note the 'including current control file in backupset' line: because the backup touched the SYSTEM datafile, the control file was included in the backup (as was the spfile, incidentally). Now let's see what happens when the backup doesn't touch file 1:
RMAN> backup tablespace users;

Starting backup at 31-AUG-06
using channel ORA_DISK_1
channel ORA_DISK_1: starting full datafile backupset
channel ORA_DISK_1: specifying datafile(s) in backupset
input datafile fno=00004 name=/oracle/oradata/lx10/users01.dbf
channel ORA_DISK_1: starting piece 1 at 31-AUG-06
channel ORA_DISK_1: finished piece 1 at 31-AUG-06
piece handle=/oracle/flash_recovery_area/
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:01
Finished backup at 31-AUG-06
You won't find mention of the control file in that example, because I've not included SYSTEM in the backup and neither have I explicitly asked to include the control file. If I wanted to do that, the backup command would look more like this:
RMAN> backup tablespace users include current controlfile;

Starting backup at 31-AUG-06
using channel ORA_DISK_1
channel ORA_DISK_1: starting full datafile backupset
channel ORA_DISK_1: specifying datafile(s) in backupset
input datafile fno=00004 name=/oracle/oradata/lx10/users01.dbf
channel ORA_DISK_1: starting piece 1 at 31-AUG-06
channel ORA_DISK_1: finished piece 1 at 31-AUG-06
piece handle=/oracle/flash_recovery_area/
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:01
channel ORA_DISK_1: starting full datafile backupset
channel ORA_DISK_1: specifying datafile(s) in backupset
including current control file in backupset
channel ORA_DISK_1: starting piece 1 at 31-AUG-06
channel ORA_DISK_1: finished piece 1 at 31-AUG-06
piece handle=/oracle/flash_recovery_area/
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:02
Finished backup at 31-AUG-06
This time, the control file is included (but not, you'll notice, the spfile).
Finally, I mentioned the AUTOBACKUP feature introduced in 9i. It's easily configured, and its effects are pretty obvious:
RMAN> configure controlfile autobackup on;

new RMAN configuration parameters:
CONFIGURE CONTROLFILE AUTOBACKUP ON;
new RMAN configuration parameters are successfully stored

RMAN> backup tablespace users;

Starting backup at 31-AUG-06
using channel ORA_DISK_1
channel ORA_DISK_1: starting full datafile backupset
channel ORA_DISK_1: specifying datafile(s) in backupset
input datafile fno=00004 name=/oracle/oradata/lx10/users01.dbf
channel ORA_DISK_1: starting piece 1 at 31-AUG-06
channel ORA_DISK_1: finished piece 1 at 31-AUG-06
piece handle=/oracle/flash_recovery_area/
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:02
Finished backup at 31-AUG-06

Starting Control File and SPFILE Autobackup at 31-AUG-06
piece handle=/oracle/flash_recovery_area/
Finished Control File and SPFILE Autobackup at 31-AUG-06
Configuring any RMAN setting in 9i or 10g causes that configuration change to be permanently stored in the control file, so once this autobackup option has been configured it will remain in force without further effort on your part -at least, until you configure the option back to its default state of 'OFF'. The effect of the setting is to cause a control file (and spfile) backup to be taken after any backup command is issued, whether or not that backup affected the SYSTEM tablespace. In this case, without my having to ask for it with an 'include current controlfile' clause, backing up USERS was enough to get the control file backed up too.
The names of the backups created by this autobackup mechanism are important: they must include a substitution variable, %F. You can tailor and tweak the backup name any way you like, but that %F must be in the name somewhere. In other words, this command will fail:
RMAN> configure controlfile autobackup
2> format for device type disk to '/home/oracle/controlbk%U.ctl';

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of configure command at 08/31/2006 10:41:38
RMAN-06492: controlfile autobackup format "/home/oracle/controlbk%U.ctl"
            must specify a "%F" format specifier
The %U would generate a unique identifier for each backup, but it's not good enough for the control file autobackup: the %F is compulsory, as the error message tells you. This very slightly different command therefore works fine:
RMAN> configure controlfile autobackup
2> format for device type disk to '/home/oracle/controlbk%U_%F.ctl';

new RMAN configuration parameters:
CONFIGURE CONTROLFILE AUTOBACKUP FORMAT
FOR DEVICE TYPE DISK TO '/home/oracle/controlbk%U_%F.ctl';
new RMAN configuration parameters are successfully stored
Because the essential variable is in the proposed backup name somewhere, it's good enough for RMAN and the configuration command succeeds.
There's no good place to make this next assertion, so I'm just going to drop it in cold at this point: switching on the autobackup control files feature in Oracle 9i or 10g should be considered compulsory, especially if you are running RMAN without a recovery catalog (which is the default way of running RMAN). If you keep reading until Section 6 of this article, you'll see why this statement is true -but it ultimately comes down to this: without autobackup switched on and without a recovery catalog, you cannot always guarantee to be able to recover your database.
One way or another, therefore, getting RMAN to back up your control file is relatively easy. Bear in mind that no matter how many times you've multiplexed your control file, there will only be one backup of it per backup command. Also note that all RMAN backups of the control file are binary in nature: there is no RMAN command to produce the equivalent of the manual trace file backup (short of issuing the obvious sql 'alter database backup controlfile to trace' command from within RMAN).
5.0 Manually Recovering the Control File
If you are unfortunate enough ever to lose a control file, your database will crash. Even if your control file is multiplexed sixteen different ways, the loss (or corruption preventing subsequent successful writes) of one of them will cause your database to crash. Multiplexing is not there to provide redundancy ("if I lose one, I have another 15 that can still be used"): if you say you have 16 control files, you must always have 16 good control files available for read and write access.
The vagaries of Linux and Unix file systems can sometimes make these statements seem false. For example:
SQL> !rm /oracle/oradata/lx10/control01.ctl

SQL> alter system checkpoint;
System altered.

SQL> alter system switch logfile;
System altered.
Here, I delete one of my control files from the operating system, yet the database continues to run, apparently oblivious to the "rule" which says it should crash -even when I force the database to perform operations which require reading from and writing to the control files. This is not, though, a breach of the rule. It's a result of the file system's use of inodes: whilst the file appears to you and me to have disappeared as a result of the rm command, the instance still holds the file's inode open, and therefore as far as it is concerned, it still 'sees' the file and can use it as normal. The fun will start when the instance is forced to release the file and re-acquire it -because when it tries to re-open the file, it will suddenly notice that it's really not there any more:
SQL> startup force
ORACLE instance started.

Total System Global Area  107450720 bytes
Fixed Size                   450912 bytes
Variable Size              79691776 bytes
Database Buffers           25165824 bytes
Redo Buffers                2142208 bytes
ORA-00205: error in identifying controlfile, check alert log for more info
I draw this weird behaviour to your attention so that you understand that when you're testing 'loss of control file' disaster scenarios, you must bounce your instance for the damage you've hoped to inflict to become tangible and evident. On Windows, the problem arises in a different form, because if you try to delete a control file whilst an instance is still running, the delete will fail with a 'sharing violation' warning. Even when your instance has been shut down, you may find sharing violations occur which prevent you from deleting one of the control files. In such cases, you'll have to stop the Windows service which provides the memory and other system resources that an instance on Windows needs.
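The inode behaviour itself is easy to reproduce outside Oracle altogether; a sketch with an ordinary throwaway file and an open file descriptor:

```shell
# A deleted file remains readable through any descriptor that was open
# on it before the rm -- the same reason the instance keeps 'seeing' a
# control file deleted out from under it. Disposable file only.
f=$(mktemp)
echo "still here" > "$f"
exec 3< "$f"     # hold an open descriptor on the file
rm "$f"          # the directory entry disappears...
cat <&3          # ...but the data is still readable: prints "still here"
exec 3<&-        # close the descriptor; only now is the inode released
```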
5.1 Simple Recovery
Anyway, right now, I have been told that the instance can't identify one of my control files -though, unfortunately, it doesn't say which one it can't identify (I know we know which one, because I just deleted it! But in the real world, you will be left guessing, and need to know how to find out which control file is causing trouble). The error message does go on to say, however, that I should check alert log for more info. When I do so, here's what I find:
SQL> show parameter background_dump_dest

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
background_dump_dest                 string      /oracle/admin/lx10/bdump

SQL> !

[oracle@pertinax ~]$ tail -20 /oracle/admin/lx10/bdump/alert_lx10.log
LGWR started with pid=6, OS id=3284
CKPT started with pid=7, OS id=3286
SMON started with pid=8, OS id=3288
RECO started with pid=9, OS id=3290
CJQ0 started with pid=10, OS id=3292
Thu Aug 31 11:22:33 2006
starting up 1 dispatcher(s) for network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'...
MMNL started with pid=12, OS id=3296
MMON started with pid=11, OS id=3294
Thu Aug 31 11:22:33 2006
starting up 1 shared server(s) ...
Thu Aug 31 11:22:34 2006
ALTER DATABASE   MOUNT
Thu Aug 31 11:22:34 2006
ORA-00202: control file: '/oracle/oradata/lx10/control01.ctl'
ORA-27037: unable to obtain file status
Linux Error: 2: No such file or directory
Additional information: 3
Thu Aug 31 11:22:37 2006
ORA-205 signalled during: ALTER DATABASE   MOUNT...
First, I refresh my memory about where my alert log is to be found (always in whatever directory your BACKGROUND_DUMP_DEST parameter points to). Then I have a look at the last few lines of the alert_<SID>.log file found there. The full path and filename of the missing control file is there revealed, and I can plan my recovery strategy accordingly.
In this case, the earlier multiplexing of my control file can serve its intended purpose: although the control01 file is missing, there's nothing wrong with the control02 file... so a simple copy from one to the other should fix things up:
[oracle@pertinax ~]$ cp -p /oracle/oradata/lx10/control02.ctl /oracle/oradata/lx10/control01.ctl
[oracle@pertinax ~]$ exit
exit

SQL> startup force
ORACLE instance started.

Total System Global Area  167772160 bytes
Fixed Size                  1218316 bytes
Variable Size              79694068 bytes
Database Buffers           83886080 bytes
Redo Buffers                2973696 bytes
Database mounted.
Database opened.
...and once the database reports that it is open, you know that the recovery mechanism has worked. This is the entire point of multiplexing control files. Having multiple copies of the thing doesn't mean 'if one is lost, we can keep using the others'. It is not operational redundancy of that sort that multiplexing brings. Rather, it gives you 'if one is lost, there's another I can copy to replace it', and that's the simplest and easiest recovery method it's possible to imagine.
5.2 Total Recovery without Resetlogs
My last recovery was easy because I had multiplexed my control files, and there was one good copy left with which to replace a damaged copy. You might not always get so lucky:
SQL> !rm /oracle/oradata/lx10/control*

SQL> startup force
ORACLE instance started.

Total System Global Area  167772160 bytes
Fixed Size                  1218316 bytes
Variable Size              79694068 bytes
Database Buffers           83886080 bytes
Redo Buffers                2973696 bytes
ORA-00205: error in identifying control file, check alert log for more info
This time, my 'delete file' command used a wild card, and therefore all my control files are lost. That's never supposed to happen, of course -but it has been known to happen when multiplexed copies all end up residing on one hard disk. It's also been known for Oracle's own background processes to corrupt all copies of the control file (though that's very rare).
The alert log won't help you on this one: it will only report that the first of the control files is missing (because, having found one missing, it doesn't bother checking whether the others are there or not). Worse, you now have nothing to replace the missing files with:
[oracle@pertinax ~]$ ls /oracle/oradata/lx10/
redo01.log  redo02.log  redo03.log  sysaux01.dbf  system01.dbf  temp01.dbf  undotbs01.dbf  users01.dbf
Data files and redo logs I have in abundance, but not a control file in sight! But no matter, for earlier I took a trace script backup of the control file, and that trace file is still available to me (in whatever directory happens to be pointed at by the USER_DUMP_DEST initialisation parameter). To convert that trace file into a recovery script, I simply strip out anything before the first line which reads STARTUP MOUNT and anything after the first line which reads end of tempfile additions. You can save the modified trace file with any filename you like, but I tend to call mine createcon.sql. Once the trace script has been knocked into usable shape, you simply execute it as you would any SQL script, making sure to begin running it when no instance already exists:
SQL> shutdown abort
ORACLE instance shut down.
SQL> @/oracle/admin/lx10/udump/createcon.sql
ORACLE instance started.

Total System Global Area  167772160 bytes
Fixed Size                  1218316 bytes
Variable Size              79694068 bytes
Database Buffers           83886080 bytes
Redo Buffers                2973696 bytes

Control file created.

Media recovery complete.

Database altered.


Tablespace altered.
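Incidentally, the trace-file surgery described above is easily scripted. Here is a minimal sketch in shell: it fakes up a tiny trace file so the trimming logic is visible, but in real use the input would be the actual trace file from USER_DUMP_DEST, and the marker strings are the ones mentioned in the text:

```shell
#!/bin/sh
# Sketch only: trim a control file trace backup down to a runnable script.
# The trace file below is faked up for illustration; point TRACE at the
# real file in USER_DUMP_DEST instead.
TRACE=/tmp/lx10_ora_1234.trc        # hypothetical trace file name
OUT=/tmp/createcon.sql

cat > "$TRACE" <<'EOF'
Dump file header junk
STARTUP MOUNT
CREATE CONTROLFILE REUSE DATABASE "LX10" NORESETLOGS ...
RECOVER DATABASE
ALTER DATABASE OPEN;
-- end of tempfile additions
Other trailing commentary
EOF

# Keep everything from the first STARTUP MOUNT line down to (and including)
# the 'end of tempfile additions' marker; throw the rest away.
awk '/STARTUP MOUNT/{keep=1} keep{print} /end of tempfile additions/{exit}' \
    "$TRACE" > "$OUT"
```

The resulting file is the createcon.sql used above; you'd naturally want to eyeball it before running it, since (as discussed below) it may not describe the database you now have.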
I should mention that 10g Release 2 is the first version of Oracle which doesn't display all sorts of nasty error messages across your screen when the modified trace script is run. That's because in 9i (for example), the script tries to comment out lines with a "#", which is not a valid comment character for SQL*Plus. Those comment lines are therefore processed as actual SQL statements, and errors and error messages inevitably arise as a result. In 10g, Oracle finally got the comment characters right and consistent. Don't panic if you see the error messages, though: so long as you see the magic words control file created, the script has worked, no matter what version of Oracle you're using.
Although the message says "control file created", implying only one has been re-constructed, the trace script does actually bring back all copies of the control file mentioned in the init.ora or spfile.ora:
[oracle@pertinax ~]$ ls -l /oracle/oradata/lx10
total 1028336
-rw-r-----  1 oracle oinstall   7389184 Aug 31 11:48 control01.ctl
-rw-r-----  1 oracle oinstall   7389184 Aug 31 11:48 control02.ctl
-rw-r-----  1 oracle oinstall  52429312 Aug 31 11:43 redo01.log
-rw-r-----  1 oracle oinstall  52429312 Aug 31 11:43 redo02.log
-rw-r-----  1 oracle oinstall  52429312 Aug 31 11:48 redo03.log
-rw-r-----  1 oracle oinstall 199237632 Aug 31 11:44 sysaux01.dbf
-rw-r-----  1 oracle oinstall 450895872 Aug 31 11:44 system01.dbf
-rw-r-----  1 oracle oinstall  20979712 Aug 31 11:44 temp01.dbf
-rw-r-----  1 oracle oinstall 209723392 Aug 31 11:44 undotbs01.dbf
-rw-r-----  1 oracle oinstall   5251072 Aug 31 11:44 users01.dbf
The neat thing about the trace script recovery method is that it does very little collateral "damage". That is, if you issue this command:
SQL> archive log list
Database log mode              No Archive Mode
Automatic archival             Disabled
Archive destination            USE_DB_RECOVERY_FILE_DEST
Oldest online log sequence     22
Current log sequence           24
...you can see that the log sequence numbers of my database are still very much in the 20+ range. They haven't been reset to zero, in other words. That means that this recovery has not performed a resetlogs operation, the significance of which will become apparent in a few paragraphs' time.
This command also neatly highlights an unintended consequence of recovering or re-creating a control file, rather than merely copying a good multiplexed copy back to where a lost copy used to be. It shows that my database is in No Archive Mode. Fair enough... but this database had actually been put into archive log mode, as evidenced by this:
[oracle@pertinax ~]$ ls /oracle/flash_recovery_area/LX10/archivelog/2006_08_31/
o1_mf_1_22_2hdgsb48_.arc
That's an archive redo log with sequence number 22, so my database must once have been in archive log mode, though it is no longer. What is going on?! Well, it so happens that my trace script backup of the control file was taken when my database was not in archivelog mode. When the script is used to re-construct a lost control file, it re-creates it with the characteristics that it knew about at the time it was originally produced... and in my case, that means going back to not being in archive log mode.
This is a generic problem with re-creating or bringing back old copies of control files: the control file you get back might not describe the database you now have. It might be missing references to data files and tablespaces which were created after the backup was taken, for example; or, as here, it might be describing an operational aspect of the database which is no longer valid. Archive log mode is one such aspect; whether a file is online or offline, read-only or read-write, is another. Such discrepancies between what you've got and what your recovered control file knows about can be fixed, so long as you remember they might be an issue. In my case, I just have to follow the standard procedures for putting a database back into archive log mode. In the case of a new datafile not recognised by the control file creation script, I could edit the script before executing it. If the recovered control file thinks a file is offline when it is in fact online, you can mount the database and issue an alter database datafile '/path/filename' online command; if it thinks a tablespace is read-only when it is now read-write, an alter tablespace read write once the database is open sorts that out. And so on. The point is, you need to be aware of the issue, and hopefully see that robust multiplexing is highly useful in helping you avoid ever having to address these sorts of issues.
The other avoidance mechanism you can take is this: any time you make a physical alteration to your database (adding a file, dropping a tablespace, making something read-only or read-write), take a new backup of your control file. That way, you stand a good chance of recovering a control file from a trace script or binary backup which does correctly describe the database you actually have, and you won't have to manually knock it into shape.
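Refreshing both styles of backup after such a change takes only a moment in SQL*Plus; a sketch (the binary backup destination here is illustrative, not a recommendation):

```sql
-- Re-take the trace script backup (it lands in USER_DUMP_DEST):
alter database backup controlfile to trace;

-- ...and a fresh binary backup; the path is illustrative only:
alter database backup controlfile to '/home/oracle/controlbkp.ctl' reuse;
```

The reuse keyword simply lets the command overwrite an earlier binary backup of the same name, which is exactly what you want when you're refreshing it after every structural change.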
That doesn't fix the final problem with the trace script recovery mechanism, though:
RMAN> list backup of database;

using target database control file instead of recovery catalog

RMAN> 
Again, it might not seem very remarkable that I don't have RMAN backups of my database -but in fact I did use RMAN to take backups earlier in this very article. They're even still on disk:
RMAN> host;

[oracle@pertinax ~]$ ls /oracle/flash_recovery_area/LX10/backupset/2006_08_31/
o1_mf_ncnnf_TAG20060831T103638_2hdcrs7j_.bkp  o1_mf_nnndf_TAG20060831T103545_2hdcq1tl_.bkp
o1_mf_ncsnf_TAG20060831T103304_2hdco3rw_.bkp  o1_mf_nnndf_TAG20060831T103638_2hdcrqfk_.bkp
o1_mf_nnndf_TAG20060831T103304_2hdcl13w_.bkp  o1_mf_nnndf_TAG20060831T103731_2hdctd4h_.bkp
The trouble here is that without an RMAN recovery catalog, the control file is the only record of what RMAN backups have been taken. Lose your control file, and you lose that record of backups. Re-create the control file, and no record of prior backups is re-created with it. In 10g, this is not terminal, because you can always manually re-catalog prior backup sets... but it's a pain to do, and isn't possible in 9i, so either way you look at it, the loss of RMAN backup records is not good news. But as I mentioned earlier: if you're doing manual trace file recoveries of control files, it is quite likely you aren't using RMAN in the first place. Conversely, RMAN users would probably be looking to perform an RMAN control file recovery. So this issue isn't perhaps as serious as you might imagine it to be.
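For the record, the 10g manual re-cataloguing just mentioned is tedious rather than hard; a sketch, assuming the backup pieces are still sitting in the flash recovery area directory shown above:

```sql
-- Re-register any backup pieces found under the given directory tree
-- in the freshly re-created control file (10g and later only):
CATALOG START WITH '/oracle/flash_recovery_area/LX10/backupset/';
```

RMAN scans the directory tree, lists what it finds, and asks for confirmation before adding the pieces back into the control file's backup records.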
Regardless of the RMAN or 'out of date trace script' issues you might face, however, you shouldn't overlook the essential fact here: reconstructing my entire set of lost control files was simply a matter of doing a bit of text file editing and then running a SQL script. It's a powerful yet relatively simple recovery mechanism in the face of total disaster.
5.3 Total Recovery with Resetlogs
The final manual recovery method for control files involves issuing a resetlogs command, because it depends on the restoration of a binary backup of the control file. The control file thus restored from a prior backup is 'out of date' as far as the rest of the database is concerned: specifically, it will think that the sequence of online redo logs is in the (say) 200, 201, 202 area, when in fact the database has moved on since then and is actually using log sequences 320, 321, 322 and so on. To force the restored control file's idea of redo sequence numbers to match what the rest of the database knows to be actually the case, you have to issue the command alter database open resetlogs.
Unfortunately, you aren't allowed to issue that command unless Oracle believes you have performed some sort of recovery operation on the database as a whole. Since the loss of control files doesn't necessarily damage any data files in any way, there might very well be no recovery truly required, but you nevertheless have to pretend to recover the database, just so you can persuade Oracle that it's OK to issue the 'open resetlogs' command.
Performing a recovery is done by issuing the command recover database (assuming that the loss of the control files hasn't also caused you to lose any of your online or archived redo). But unfortunately yet again, that command in itself won't do, because a restored control file can't be trusted to reliably know what 'age' the database should be at (that is, what checkpoint change numbers should be found in the headers of all the data files). Left to its own devices, the recovery process would believe the data files are 'wrong' and the control file 'right', when we know it's the other way around. You therefore have to add the words using backup controlfile to the standard 'recover database' command, which effectively tells the recovery process 'trust nothing about database ages you find in the control file, because the control file being used is an old one'.
The full sequence of events for the loss of all control files and their recovery by means of binary backups therefore becomes:
·    copy the control file backups into all the locations mentioned in the CONTROL_FILES parameter
·    mount the database, and thus the restored control files, by issuing the command startup mount
·    perform a recovery with the command recover database using backup controlfile (assuming a complete recovery) or recover database using backup controlfile until cancel (for an incomplete recovery due to loss of redo)
·    apply as much necessary redo as you can (if any)
·    open the database with an alter database open resetlogs
As ever, a demo might make things rather clearer. First, I'll have to re-break my database:
SQL> !rm /oracle/oradata/lx10/control*

SQL> startup force
ORACLE instance started.

Total System Global Area  167772160 bytes
Fixed Size                  1218316 bytes
Variable Size              79694068 bytes
Database Buffers           83886080 bytes
Redo Buffers                2973696 bytes
ORA-00205: error in identifying control file, check alert log for more info
Next, I locate my control file binary backups, and copy them to where the instance expects to find the 'live' files:
[oracle@pertinax ~]$ ls
controlbkp.ctl  Desktop  sqlnet.log
[oracle@pertinax ~]$ cp -p controlbkp.ctl /oracle/oradata/lx10/control01.ctl
[oracle@pertinax ~]$ cp -p controlbkp.ctl /oracle/oradata/lx10/control02.ctl
If you simply tried to open the database at this point, you'd fail:
SQL> startup force
ORACLE instance started.

Total System Global Area  167772160 bytes
Fixed Size                  1218316 bytes
Variable Size              79694068 bytes
Database Buffers           83886080 bytes
Redo Buffers                2973696 bytes
Database mounted.
ORA-01589: must use RESETLOGS or NORESETLOGS option for database open
That error comes about because SMON has spotted that whereas the control file thinks the database should be at log sequence X, the online redo logs themselves (the locations of which are stored within the control file itself) are reporting a log sequence of Y. The difference needs to be resolved, and a resetlogs (which forcibly sets the sequence back to 1) is thus called for. You can't, however, simply do what the error message tells you to do:
SQL> alter database open resetlogs;
alter database open resetlogs
*
ERROR at line 1:
ORA-01194: file 1 needs more recovery to be consistent
ORA-01110: data file 1: '/oracle/oradata/lx10/system01.dbf'
This is an absolutely infamous error message -because it's completely and utterly untrue most of the times you ever see it! When attempting to open the database, SMON has now checked the checkpoint change numbers in the headers of each of the data files, and found that they don't match the one stored in the restored control file. We know that it is in fact the control file's checkpoint change number that's out of date, but SMON just sees that the very first data file looks "wrong", and therefore reports that it's all SYSTEM's fault. You can only resolve this issue by telling SMON not to trust what it finds in the control file, and that's the job of the recover database using backup controlfile command:
SQL> recover database using backup controlfile;

ORA-00279: change 435819 generated at 08/31/2006 14:37:23 needed for thread 1
ORA-00289: suggestion :
/oracle/flash_recovery_area/LX10/archivelog/2006_08_31/o1_mf_1_29_%u_.arc
ORA-00280: change 435819 for thread 1 is in sequence #29

Specify log: {=suggested | filename | AUTO | CANCEL}
Here, the database responds with a request to apply redo to the database's data files. In fact, we know that no redo application is necessary, because I've done nothing to damage my data files, but the recovery process doesn't know that and needs to see the redo stream to be certain all is fine. So you play along at this point, pressing [ENTER] every time it wants to inspect a new archive log and letting it read it, before moving on to the next. The archives being inspected are those generated between the time the backup of the control file was taken and the current time of the database, so there could be quite a few of them, in which case just type the word auto at the prompt and let the recovery process automatically scan through each in turn.
Either way, you'll eventually encounter this nasty-looking message:
ORA-00308: cannot open archived log
'/oracle/flash_recovery_area/LX10/archivelog/2006_08_31/o1_mf_1_31_%u_.arc'
ORA-27037: unable to obtain file status
Linux Error: 2: No such file or directory
Additional information: 3

ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below
ORA-01194: file 1 needs more recovery to be consistent
ORA-01110: data file 1: '/oracle/oradata/lx10/system01.dbf'
Suddenly, the recovery process can't find an archive and throws an error in response. There's a very good reason why it can't find this file though: it hasn't yet been produced! Normally, a control file would tell the recovery process what age the database was at, and thus when to stop looking for archives, but the using backup controlfile clause means that can't happen this time. The recovery process therefore just keeps on guessing 'there must be a next archive' until it eventually runs out of luck.
In fact, the redo that we need to complete this recovery is in one of the online redo logs. That's because I started this section of the article by issuing a startup force command, which begins its work by performing a shutdown abort, thus crashing the instance. Instance recoveries always require the contents of the current online redo log to finish properly -but again, the using backup controlfile clause means the recovery process has no idea whether it's supposed to be looking for an online log or an archived one.
So, to complete this recovery, you have to supply SMON with the information it can't get for itself from an old controlfile, like so:
SQL> select member from v$logfile;

MEMBER
--------------------------------------------------------------------------------
/oracle/oradata/lx10/redo03.log
/oracle/oradata/lx10/redo02.log
/oracle/oradata/lx10/redo01.log

SQL> recover database using backup controlfile;
ORA-00279: change 435933 generated at 08/31/2006 14:38:08 needed for thread 1
ORA-00289: suggestion :
/oracle/flash_recovery_area/LX10/archivelog/2006_08_31/o1_mf_1_31_%u_.arc
ORA-00280: change 435933 for thread 1 is in sequence #31

Specify log: {=suggested | filename | AUTO | CANCEL}
/oracle/oradata/lx10/redo03.log

Log applied.
Media recovery complete.
Here, I first find out where my online redo logs are and what they're called. I then re-launch the recovery mechanism, using exactly the same command as before. When the recovery process asks for the next archive, I feed it the path and filename of the first redo log listed by the earlier query, and get extremely lucky because it just happens to be the right log. The recovery process therefore can apply the very last bit of redo from that file and finally declares that media recovery is complete. If it had been the wrong log, I'd have simply had to repeat the procedure, feeding the recovery process each new log path and name in turn, until I happened to stumble upon the right one. If the redo was completely unavailable (because my online logs were damaged by whatever caused me to lose my control file in the first place), I'd have had to stop trying to apply more redo by using the variant command recover database using backup controlfile until cancel, and then immediately issue the command CANCEL (which is standard incomplete recovery stuff).
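For completeness, that 'redo completely unavailable' path would look like this sketch (no transcript shown, since at the prompt you simply type CANCEL rather than a log name):

```sql
-- Incomplete recovery: stop asking for redo and give up cleanly.
recover database using backup controlfile until cancel;
-- ...at the 'Specify log' prompt, type: CANCEL

-- Then force the redo sequence numbers back into line:
alter database open resetlogs;
```

Obviously, cancelling before all redo is applied means losing whatever transactions were in the unavailable logs, which is why this is very much the last resort.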
Once you've got the 'media recovery complete' message, the rest is easy:
SQL> alter database open resetlogs;
Database altered.

SQL> select status from v$instance;

STATUS
------------
OPEN

SQL> archive log list
Database log mode              Archive Mode
Automatic archival             Enabled
Archive destination            USE_DB_RECOVERY_FILE_DEST
Oldest online log sequence     0
Next log sequence to archive   1
Current log sequence           1
I open the database with a resetlogs, and am then able to confirm that the database really is open. But a check on what sequence number my redo logs are up to shows that it is now back to sequence number 1 (if you look carefully at the last archive that the recovery process asked for, you'll notice I was up to sequence 31, so clearly we've gone "backwards"). This is an inevitable consequence of performing a resetlogs operation, for whatever reason. By setting the database's 'clock' back to square one like this, the operation effectively renders all prior backups and archives useless, so the very next thing you do at this point, if you have any sense, is to shut the database down and take a new whole, closed database backup. It is true that this advice is less called for in 10g than in 9i. It is even true that in all Oracle versions back to at least 8.0, prior backups and archives could still be used if you knew what you were doing and all the necessary prerequisites and caveats could be addressed. But the general principle remains, even in 10g: subsequent recoveries are a lot easier and a lot less prone to failure if you take a new backup of the entire database at this point.
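That post-resetlogs backup can be as simple as a closed, whole-database copy; a sketch from SQL*Plus, with an illustrative backup destination:

```sql
shutdown immediate
-- Copy all data files, online logs and control files at the O/S level;
-- the destination directory here is illustrative only:
host cp -p /oracle/oradata/lx10/* /oracle/backup/post_resetlogs/
startup
```

A new RMAN whole-database backup would do just as well, of course; the point is simply that it must be taken after the resetlogs, so that it belongs to the database's new incarnation.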
This recovery method, because of its reliance on the resetlogs command, is therefore a very expensive (and rather tricky) way to recover from the total loss of control files. If you have to do it, you can do it... but trust me: you really don't want to have to do this sort of thing very often at all! Fortunately, if you multiplex your control files in a robust manner, the need for this type of recovery should almost never arise in the first place.
6.0 Recovering the Control File with RMAN
Finally, I'd like to take a look at how you recover from complete loss of all control files if you want to use RMAN to do the job. Note that, even if you love and adore RMAN as a recovery tool and use it to the exclusion of all other recovery mechanisms, if you lose just one of a multiplexed set of control files, the simplest thing to do is just use O/S commands to copy one of the surviving copies into the space previously occupied by the lost or damaged one (see above, Section 5.1). Only when all your copies are lost or damaged would you really start thinking about using RMAN to sort out the mess.
I'll start this part of the article by doing a bit of clearing up. First, I want to make sure that we start RMAN 'clean', with no records of any prior backups. Then I want to take a standard database backup without the autobackup feature for control files playing any part of the game:

