Tuesday, October 6, 2009

The Consequences of putting a tablespace into Hot Backup mode

Why do I have to put a tablespace into a special hot backup mode when I perform a backup with operating system ‘copy’ or ‘cp’ commands? Why can’t I just copy the files?
An operating system doesn’t understand how Oracle works internally. So when you ask it to copy a database file, it has no idea of the significance of the internal organisation of the file it’s copying, and simply copies whichever pieces of it happen to swing under the disk head from time to time. That is unfortunate for two separate reasons.

First, Oracle uses the first database block within every data file (known as the data file header block) to store that file’s checkpoint change number -and it is this number which acts as the ‘timestamp’ for the entire data file. Since a recovery involves applying redo which was generated after the ‘time’ of a data file, it’s important that the checkpoint change number for a data file faithfully records the earliest age of any part of that data file. We can then roll everything forward from that earliest point, and arrive at a totally consistent file as a whole. But if the checkpoint change number happens to state a time which is later than the actual time of even just one part of the data file’s contents, then Oracle will not know to roll that particular piece of the file forward, and it will be ‘left behind’. You’ll have a corrupt data file on your hands, and an unrecoverable tablespace -and if it’s the SYSTEM tablespace we’re talking about, you’ve just lost your entire database.

For example, suppose my data file’s checkpoint change number reads ‘1000’, but there are blocks of data in that file which are actually from time ‘900’. During a recovery, Oracle will think it should apply redo from time 1000 onwards (because that’s what the data file’s header block tells it). The blocks from time 900 therefore never receive the redo generated between times 900 and 1000, and are left badly out-of-synch with the rest of the file.

So the data file’s header block is crucial in telling us the point from which we need to start rolling forward. But if you are doing a hot backup, then the instance (specifically, the CKPT background process) may need to update the contents of that block whilst the file is being copied. That would not be a problem if the operating system was smart enough to know that the data file header block should be the first block copied... but, because it doesn’t understand the internal organisation of an Oracle datafile, it doesn’t know that. It may therefore copy that block after CKPT has updated it... and the backup file is therefore immediately rendered useless for recovery purposes.

Now that is our first problem with just copying hot data files: parts of a data file that are out of synchronisation with each other, together with an unreliable checkpoint change number in the data file header, make the data file as a whole unusable. But the same sort of problem also arises within an Oracle block itself. We tend to think of an Oracle block as a single entity -but it isn’t. Physically, an Oracle block is made up of multiple operating system blocks: for example, an 8K database block usually comprises 16 separate O/S blocks (because most O/S blocks are 512 bytes in size). For an Oracle block to be readable by an instance, all 16 of those O/S blocks must be synchronised -that is, all from the same ‘database time’. Unfortunately, if it’s an operating system utility which is copying the database blocks, that utility has no idea that all 16 O/S blocks need to be kept synchronised: it just copies whichever of them happen to turn up under the disk head at any one moment.

But if people are actually modifying the contents of an Oracle block during the copy operation (which is definitely possible during a hot backup, of course), then you could conceivably end up with a copy of what we think of as a single Oracle block that is actually from 16 different database times. Such a block copy is termed a fractured block, and it’s unusable in a recovery scenario.

So if you just copy your datafiles, whilst they’re in use, you will end up with inconsistent block images, and a datafile header block that does not correctly describe the time of the entire datafile. Such a copy is useless.
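As an aside, you can see this sort of damage for yourself: Oracle’s DBVERIFY utility checks the internal consistency of every block in a data file copy and reports blocks that fail its checks (including blocks that were caught mid-change). A minimal example -the file path is made up, so point it at your own copy and use your own block size:

dbv file=/backup/users01.dbf blocksize=8192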
OK, but what precisely does the begin backup command do to prevent these problems arising?
Two separate things. To deal with the data file header block not accurately describing the earliest time of the contents of the data file, the command causes the header block to be locked against all further updates by the CKPT process. If the data file header block is locked, so that it records the database time from just before the file started to be copied, then it is guaranteed to give a valid point from which to start applying redo in a recovery scenario.

For example, if I lock the header block when it contains a checkpoint change number of ‘1000’, and if I then take half an hour to copy the entire file, then there are bound to be blocks within the resulting backup file from times 1100, 1200, 1400 and so on. But there will not be blocks from a time before 1000, and therefore no bits of the file will get ‘forgotten’ during a recovery.

Second: to deal with the fractured block issue, the begin backup command sets a flag which tells Oracle that the first piece of DML to modify any part of a block whilst its file is being copied will cause the entire block to be written to the redo logs. This ensures that even if the image of the block in the backup file is fractured, and hence useless, there is a clean image of the block in the redo stream -which we can obviously use to replace the fractured version during a recovery operation.
I think I understand. Can you show me these two things happening?
Sure. The locked datafile header block is easiest to demonstrate. First, I issue this piece of SQL:
SELECT T.NAME, H.FILE#, H.CHECKPOINT_CHANGE#
FROM V$TABLESPACE T, V$DATAFILE_HEADER H
WHERE T.TS#=H.TS#;

NAME                                FILE# CHECKPOINT_CHANGE#
------------------------------ ---------- ------------------
SYSTEM                                  1             121923
UNDOTBS                                 2             121923
INDX                                    3             121923
TOOLS                                   4             121923
USERS                                   5             121923
So, all tablespaces (and all data files) start off synchronised with the same checkpoint change number. Now let’s go through the motions of beginning a hot backup and see what happens:
alter tablespace users begin backup;

select t.name, h.file#, h.checkpoint_change#
from v$tablespace t, v$datafile_header h
where t.ts#=h.ts#;

NAME                                FILE# CHECKPOINT_CHANGE#
------------------------------ ---------- ------------------
SYSTEM                                  1             121923
UNDOTBS                                 2             121923
INDX                                    3             121923
TOOLS                                   4             121923
USERS                                   5             121925
Do you notice how the USERS tablespace is now actually slightly ahead of the time the rest of the database has reached? That’s because, as a tablespace is put into hot backup mode, any dirty buffers which belong to that tablespace are flushed by DBWR from the buffer cache back down to disk. CKPT then updates the time in that file’s header block. In short, putting a tablespace into hot backup mode causes that tablespace (alone) to be checkpointed.

That datafile doesn’t stay ahead of the rest of the database for long, however. If we simulate some transactional activity on the database, we can check the timestamps again:
alter system switch logfile;
alter system switch logfile;
alter system switch logfile;
alter system switch logfile;
At each log switch (which would normally be caused by users continuing to perform transactions, of course), a system-wide checkpoint is issued. So if we now check the data file headers again, we should see some changes:
select t.name, h.file#, h.checkpoint_change#
from v$tablespace t, v$datafile_header h
where t.ts#=h.ts#;

NAME                                FILE# CHECKPOINT_CHANGE#
------------------------------ ---------- ------------------
SYSTEM                                  1             121939
UNDOTBS                                 2             121939
INDX                                    3             121939
TOOLS                                   4             121939
USERS                                   5             121925
You might need to look at those numbers carefully to spot the effect, but if you do, you’ll see that file 5 is no longer ahead of the pack. In fact, it’s still stuck very firmly at timestamp 121925, whilst the other files have moved on to later times. Of course, the checkpoint change number reported here for file 5 is a complete lie in a sense: the contents of file 5 have moved on to time 121939 just as much as the rest of the database. But the ‘begin backup’ has locked the header so that it reports the time the file started being backed up, not the time of its contents.
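Incidentally, another way of seeing the same thing is to query the V$BACKUP view, which reports which files are currently in hot backup mode together with the SCN at which backup mode began:

select file#, status, change#
from v$backup;

With the demo above, you’d expect file 5 to show a STATUS of ACTIVE and a CHANGE# of 121925, whilst the other files show NOT ACTIVE.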
OK. The data file header issue I can see. What about the block-sized redo thing?
That one’s a bit harder to demonstrate. We’ll first have to do a little setup so that SQL*Plus can report how much redo a transaction generates. Then we can perform the same transaction when the data file is not being backed up and compare it with when it is.

First, I connect to my database as a normal user -in this case, as Scott. I then run the Oracle-supplied UTLXPLAN.SQL script, which creates the PLAN_TABLE that SQL*Plus’s autotrace facility relies on. And as SYS, I then grant Scott the DBA role as a quick-fix way of giving Scott access to the V$ performance views that autotrace’s statistics report needs:
sqlplus scott/tiger
@%ORACLE_HOME%\rdbms\admin\utlxplan.sql

connect / as sysdba

grant dba to scott;
connect scott/tiger
Now we can test the amount of redo a transaction generates normally:
set autotrace trace stat
update emp set sal=900 where ename='MILLER';

Statistics
----------------------------------------------------------
        339  recursive calls
          3  db block gets
         65  consistent gets
          5  physical reads
        188  redo size
        631  bytes sent via SQL*Net to client
        548  bytes received via SQL*Net from client
          3  SQL*Net roundtrips to/from client
          8  sorts (memory)
          0  sorts (disk)
          1  rows processed
The set autotrace trace stat command means we get to see a short report about what went on inside Oracle to process the update command I issued. As you can see, the update of Miller’s salary generated 188 bytes of redo. That’s what this particular update generates “normally”. Now, let me put the tablespace where the EMP table lives into hot backup and issue exactly the same update:
alter tablespace users begin backup;
update emp set sal=900 where ename='MILLER';

Statistics
----------------------------------------------------------
          0  recursive calls
          2  db block gets
          3  consistent gets
          0  physical reads
       8740  redo size
        632  bytes sent via SQL*Net to client
        548  bytes received via SQL*Net from client
          3  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
          1  rows processed
Suddenly, the exact same transaction now generates 8,740 bytes of redo. That’s an extra 8,552 bytes, or a 4500% increase! The reason is not hard to find:
show parameter db_block_size

NAME                                 TYPE        VALUE
------------------------------------ ----------- -----
db_block_size                        integer     8192
The update in hot backup mode has caused the entire 8K block to be written into the redo stream, not just the paltry 188 bytes it would ordinarily generate.
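You don’t have to rely on autotrace for this measurement, by the way. The session-level ‘redo size’ statistic can be read directly from V$MYSTAT; take a reading before and after the update and subtract. A minimal sketch (it assumes the session can see the V$ views, as Scott can after the DBA grant above):

select s.value as redo_size
from v$mystat s, v$statname n
where s.statistic# = n.statistic#
and n.name = 'redo size';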
That’s all very interesting, from a theory point of view. But are there any practical consequences?
Absolutely. If the ‘begin backup’ command causes block-sized redo to be generated by lots of transactions, then LGWR (and hence ARCH) are going to be working an awful lot harder than they otherwise would. So will DBWR and CKPT (because online redo logs will fill up more quickly, thus causing more frequent log switching, and at every log switch a system-wide checkpoint is issued). The overall level of I/O will go up, which is not good for users trying to get at their data. And if you cycle through your online logs faster than ARCH can archive them, you’re at risk of a database hang. The block-sized redo issue is, potentially, a real show-stopper.
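If you want to see whether this is biting on your own system, one simple barometer is the log switch rate while tablespaces are in hot backup mode. A quick sketch of a query against V$LOG_HISTORY, counting switches per hour (just one way of slicing it; adjust the grouping to taste):

select trunc(first_time, 'HH24') as switch_hour, count(*) as switches
from v$log_history
group by trunc(first_time, 'HH24')
order by 1;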

Practically, it means that you should never put every tablespace into hot backup mode simultaneously, copy the lot, and then take them out of hot backup mode. You should, instead, always take it one tablespace at a time: don’t put tablespace B into hot backup mode until tablespace A is out of it. It makes scripting backups a bit more convoluted (see the sketch below), but the one-at-a-time approach is essential if you care about your database not keeling over under the strain. It’s also important that you minimise the amount of time a tablespace spends in hot backup mode... the longer it’s left that way, the longer the extra redo overhead weighs on the system.
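To make that concrete, here’s a minimal sketch of the one-at-a-time pattern as a SQL*Plus script. The file names, destinations and the Unix-style cp are purely illustrative (on Windows you’d use copy, and your own paths):

-- hypothetical file names and destinations: substitute your own
alter tablespace users begin backup;
host cp /u01/oradata/orcl/users01.dbf /backup/users01.dbf
alter tablespace users end backup;

alter tablespace tools begin backup;
host cp /u01/oradata/orcl/tools01.dbf /backup/tools01.dbf
alter tablespace tools end backup;

-- ...and so on, one tablespace at a time, never two in backup mode at once.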
But hang on! I read that in 10g you can now put the whole database into hot backup mode, with just one command. Why would they invent that, given what you’ve just said?
It’s absolutely true. In 10g, there is now the command alter database begin backup. It puts the entire set of tablespaces into hot backup mode in one hit. So now you can bring your moderately-busy OLTP system to its knees with just one command!

It’s like a lot of Oracle features: they’re there for a particular audience, and just because the corporation gives you a pistol, it doesn’t mean you have to go around shooting yourself in the foot. In this case, the new command is there for those people who take their hot backups by splitting a disk mirror whilst the database is hot, and then copying the now-cold former mirror off somewhere safe. Splitting a mirror takes only a few seconds, so the fact that the entire database is generating block-sized redo for a second or two should not be a complete disaster. Splitting a mirror also requires that every data file be in hot backup mode at the same time. So there is a perfectly valid use for the new feature. If those circumstances apply to you, then be my guest and make use of it. If not, don’t -if you know what’s good for you, that is!
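For what it’s worth, the whole split-mirror sequence is short enough to sketch here; the mirror-split step itself belongs to your storage vendor’s tooling, so it’s just a placeholder comment:

alter database begin backup;
-- split the disk mirror here, using whatever command your storage array provides
alter database end backup;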
So if I’m not splitting mirrors, but I need hot backups, I’m inevitably forced to generate lots of redo for a long time. Is there no way around this?
Sure there is. It’s called RMAN: if you use Oracle’s own backup and recovery utility (Recovery Manager, hence the short-form name) then you don’t have to put a tablespace into hot backup mode to be able to back it up hot. And if you don’t put it into hot backup mode in the first place, then the block-sized redo issue never arises.

The absence of the performance-crippling block-sized redo issue is indeed one of the main arguments for switching to RMAN to do your backups, and abandoning the ‘O/S-managed’ hot backup method altogether.
But how can RMAN avoid the block-sized redo issue?
That’s really getting into a discussion about RMAN (which is fine, but we ought to save that sort of discussion for a paper of its own). So this will have to be the short answer. The simple fact is that RMAN is an Oracle-supplied utility. So it does understand the crucial significance of the data file header block and the checkpoint change number it contains. Therefore, it makes sure it goes and grabs that block before any others when performing a data file backup. Because it understands the internal structures of an Oracle data file, it can back it up intelligently, and make sure the important parts are done first. Therefore, we don’t need the kludgy locking of the header block that the ‘begin backup’ command gives us.

Similarly, RMAN understands what an Oracle block is, in a way the operating system never could. So if it sees an Oracle block fracturing during a backup, it simply skips that block and comes back to it when it’s all settled down again. It can therefore guarantee clean, synchronised copies of Oracle blocks in the backup file, and so doesn’t have to use the block-sized redo trick to achieve the same thing. Therefore, there’s no need for the ‘begin backup’ command to get around the fractured blocks issue.

So: if you don’t need begin backup to lock the header block; and if you don’t need begin backup to sort out the fractured blocks problem... guess what? You don’t need to issue the begin backup command at all, and consequently the block-sized redo side-effects of that command are also avoided. But you only get these excellent results if you elect to use RMAN for all your backups... so make the switch as soon as possible!
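And for comparison, this is roughly all an equivalent hot backup looks like from the RMAN prompt -no begin backup, no end backup, and no block-sized redo (a minimal sketch; connect however suits your environment):

rman target /

RMAN> backup tablespace users;
RMAN> backup database;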
