Tuesday, October 6, 2009

Creating External Tables with Data Pump

1.0 Introduction
External tables first made an appearance with the release of Oracle 9i, so they’re not actually all that new in 10g. They were quite a nice feature that allowed you to issue perfectly normal SQL select statements and yet have the contents of simple text files returned to you by way of response. In effect, you could treat a text file as if it were a table, and thus join its contents with that of other tables, both ‘real’ and external. As such, they were intended to help those database users who routinely perform extraction, transformation and loading (ETL) operations: data cleansing exercises, in short. Using external tables, you could keep ‘dirty’ data outside of the database, and yet use it as one element in the production of clean data within the database.
Since 9i’s implementation of external tables relied entirely on the SQL Loader API (you created the external table itself specifying an ‘access driver’ called ORACLE_LOADER -and the clue was in the name!), it’s not surprising that its external tables were strictly read-only: SQL Loader, as its name suggests, only ever loads data into databases. It doesn’t push data out of them back into text files.
But in 10g, we have the Data Pump API (used, for example, by Export Data Pump) whose job is very much to write things out to disk from within the database –and in 10g, therefore, the new feature is that external tables can be created to use the ORACLE_DATAPUMP access driver, and thereby become read-writeable. In essence, this means you can keep ‘dirty’ data outside the database for even longer than before: instead of joining textual data with other data and incorporating the ‘cleansed’ result inside the database with a Create Table As Select (CTAS) statement as in 9i, you can now issue the same CTAS statement and create another external table. But that is only the half of it: if you can dump the results of joins and complex queries to an external disk file which the database can nevertheless query, you’ve just got yourself a very nice way of porting data between databases.
This short paper will examine how you might go about creating a read-writable external table, as well as investigating why you would want to do so. It will also demonstrate just how truly cross-platform they are.
2.0 Characteristics of Read-Writeable External Tables
Although they are called ‘writeable’ or ‘read-writeable’ external tables, the first thing to be clear about the new flavour of 10g external tables is that you cannot do any inserts, updates or deletes to them. Neither can you create any indexes on them (just as in 9i). They are called ‘writeable’, in fact, only because Oracle can actually create them as part of a CTAS statement -but once they’ve been created, they’re as read-only as external tables ever were in 9i.
You might therefore reasonably wonder what on Earth they are good for! The point is that the external table can be created from within the database as a result of a select statement that you write. That means if you can write a complex query, joining multiple internal and external tables, applying assorted functions to the data, aggregating, grouping and what have you... that can all be captured and stored in an ordinary, flat, operating-system independent file. Put it another way: external tables let you capture the results of simple or complex queries outside of the database. And once the results are outside of the database, they can be shipped and shared with others with ease.
This, then, is the real reason for the new tables’ existence. It’s not to let you add one or two new bits of data to an existing text file: we have vi or notepad for that! It is, however, to allow you to store the results of data scrubbing exercises outside of the database, in preparation for another scrubbing operation. And, as I say, they’re also there to let you easily move data between databases.
Note, incidentally, before we go any further: Oracle has still not provided us with a mechanism to pull data out of an Oracle database into, say, a plain text file or CSV/tab-delimited file: writeable external tables are created in a proprietary, binary format. Creating plain old text files, of course, would make it far too easy to migrate data from an Oracle database to, perish the thought, some other vendor’s database… and we obviously don’t want that sort of thing going on, thank you very much!
You might at this point reasonably ask: since we already have Export Data Pump to extract data from a database, why would anyone need writeable external tables to do the same job? To which the answer is simply that whilst Export DP extracts data rather nicely, it doesn’t let you muck about with what’s extracted. You can’t, in other words, export the resultset of joining three tables and applying two functions and a grouping aggregation. As you generate a read-writeable external table, however, you can apply transformations to the data (anything you can write in a SELECT statement will modify and define the contents of the eventual external table –so, joins with other tables, for example, are easy, as would be applying some function to the data as it is extracted). If you do much ETL work then this is a feature you will very much enjoy.
3.0 Creating a Writeable External Table
As usual, discussing this sort of database functionality is a lot easier when you can see the functionality actually functioning, so a worked example might help! I’ll do what follows as the user Scott (inevitably!), to whom DBA privileges have already been granted. If you were doing it for real, you’d simply need read and write privileges on the directory object through which the extracted data passes on its way to the hard disk. Although my Scott account doesn’t need these privileges granted directly (he’s a DBA, after all), I’ll show the relevant grants here for the sake of clarity.
SQL> connect / as sysdba
Connected.

SQL> create directory ext_dir as 'c:\etl';
Directory created.

SQL> grant read, write on directory ext_dir to scott;
Grant succeeded.
The use of database directory objects here is a nice security touch: it means that control over who can read from, or write to, the location specified in the CREATE DIRECTORY command can be exercised from within the database (by granting and withholding database object privileges) as well as by whatever operating system restrictions may (or may not) be in force for the physical directory referenced by the database object.
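By way of illustration (a hypothetical follow-up, not something the worked example below depends on), that access can later be withdrawn from within the database just as easily as it was granted:

-- remove Scott's ability to write new extraction files via this directory object
revoke write on directory ext_dir from scott;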
SQL> connect scott/tiger
Connected.

SQL> create table ext_emp (
   2 empno, ename, sal, mgr, deptno, loc, dname)
   3 organization external
   4 (type oracle_datapump
   5 default directory ext_dir
   6 location ('external_emp.etl'))
   7 as
   8 select e.empno, initcap(e.ename), e.sal*1.1,
   9 e.mgr, d.deptno, d.loc, d.dname
   10 from emp e, dept d
   11 where e.deptno=d.deptno;

Table created.
Here is the heart of the matter, of course. None of this syntax is particularly different from the way we created external tables in Oracle 9i, except for the inclusion of lines 7 onwards (the ‘Select’ bit of syntax that makes this a form of CTAS) –and even those lines are not exactly conceptually challenging. The only real subtlety here, and the real new feature, of course, is the use of the TYPE ORACLE_DATAPUMP line, instead of the TYPE ORACLE_LOADER as you would have used in 9i (and as you still can, of course, use in 10g if you only need to read ‘ordinary’ external tables).
You’ll also notice from this example that I’ve done a subtle bit of ‘transformation’ here: the extracted table will have its ENAME data all starting with initial capital letters, and the SAL column will have its sums inflated by 10%. This simply proves that applying functions (Oracle’s own in-built ones, or your own: it makes no difference which) to data at the point that it is extracted from the database is just a question of being able to write a decent SELECT statement in SQL. If you can write it, Oracle can pump it to disk! My new external table is therefore, in this simplest of cases, going to capture within it the results of a join between two tables and the application of two different functions to the joined data -and that resultset will end up being written to an associated flat file on my laptop’s normal file system.
3.1 What gets created...
When the database responds that the table has been created, what it actually means is: ‘I have just written a file to your hard disk containing (in this case) the data extracted from two joined tables and transformed as you specified’. You can check the result of that disk writing activity, of course, by dropping to the operating system:
C:\etl>dir

Volume in drive C is W2K
Volume Serial Number is A0E3-CE0E

Directory of C:\etl
08/02/2005  12:12p    <DIR>          .
08/02/2005  12:12p    <DIR>          ..
08/02/2005  12:12p            12,288 EXTERNAL_EMP.ETL
08/02/2005  12:12p                45 EXT_EMP_704_504.log
You’ll notice that a log file is created by the table creation process: that will be populated subsequently whenever any operations are performed on the external table (for example, if someone selects from it). It’s rather like the SQL Loader log file you get when performing ordinary SQL Loads, or the one you got when using external tables in 9i. As I’ll show you later, you can actually take control over where this log file gets created and what its name will be, but the point now is simply that you’ll get an automatically-created one in any case whether you like it or not.
But the really significant file here is, of course, that EXTERNAL_EMP.ETL one. It is that which contains the joined, modified and extracted data: if I ship that to another database, it can be read simply by creating a new external table with a definition that (roughly) matches mine. I’ll show you how easy that all is in just a moment.
I said earlier that the extraction file uses a proprietary Oracle binary format… and it does. So don’t think of opening it in notepad or vi and actually saving it from within those programs, because you’ll render it unusable. But feel free to open it in those sorts of text editors simply to have a look at what’s inside it:

 
 
1
0
3
0
WE8MSWIN1252
AL16UTF16
+11:00
SCOTT
EXT_EMP
  1
  EMPNO
  2
  22
  4
  0
  0
  0
  0
[and on and on in similar vein for many, many pages!]
That looks suspiciously like XML to me! But what I’ve not shown here is the many pages of weird control characters at the beginning of the file that my text editor displayed when trying to digest this: if it’s XML, it’s XML wrapped up inside a binary file so that no mere mortal can get at it (not without an Oracle license to hand, at any rate).
It’s interesting to note that some of the XML tags you see here describe the character set and time zone of the source database, plus the endianness of the data itself. That should give you the clue that this file is extremely portable: take it over to a Unix database, for example, and the different byte ordering of the data (Little Endian versus Big Endian) can be sorted, because the extraction file itself tells us what byte ordering was used to generate it. Take it to a database in another part of the planet, and all TIMESTAMP WITH TIMEZONE data that may be encased within the extraction file can be appropriately re-processed, because we know where it was originally created. You get the idea, I hope: the extraction file is a comprehensively self-describing bundle of data. And being self-describing like this, it needs no further external help to be read by pretty much any other 10g database.
3.2 Verifying Creation
Of course, having created the table, you will want to check that it works as intended:
SQL> select * from ext_emp;

EMPNO  ENAME     SAL MGR   DEPTNO LOC      DNAME
------ ------ ------ ----- ------ -------- --------------
7782   Clark    2695 7839      10 NEW YORK ACCOUNTING
7839   King     5500           10 NEW YORK ACCOUNTING
7934   Miller   1430 7782      10 NEW YORK ACCOUNTING
7369   Smith     880 7902      20 DALLAS   RESEARCH
7876   Adams    1210 7788      20 DALLAS   RESEARCH
7902   Ford     3300 7566      20 DALLAS   RESEARCH
7788   Scott    3300 7566      20 DALLAS   RESEARCH
7566   Jones  3272.5 7839      20 DALLAS   RESEARCH
7499   Allen    1760 7698      30 CHICAGO  SALES
7698   Blake    3135 7839      30 CHICAGO  SALES
7654   Martin   1375 7698      30 CHICAGO  SALES
7900   James    1045 7698      30 CHICAGO  SALES
7844   Turner   1650 7698      30 CHICAGO  SALES
7521   Ward     1375 7698      30 CHICAGO  SALES
And as you can see, just like the external tables in 9i, you query an external table just exactly as if it were a normal table. It can therefore, of course, be joined to yet other tables, both external and internal ones, provided you can write a suitable joining clause. In this way, the results of a partial data cleansing exercise can be written to a ‘holding table’, from where it can be accessed to perform further cleansing runs... yet it nevertheless remains outside the database. Iterative data cleansing operations that don’t actually ‘infest’ the database with a thousand and one partial resultsets are therefore now definitely do-able. Only when you’re absolutely ready would you bring the data inside the database with a perfectly standard CTAS statement.
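As a rough sketch of that final step (and it is only a sketch: EMP_FINAL is a made-up name, and I’m assuming the standard SCOTT demo table SALGRADE is lying around to join to), bringing the cleansed data inside the database at last might look something like this:

-- join the external table to an ordinary internal table and store the result internally
create table emp_final as
select x.empno, x.ename, x.sal, x.dname, s.grade
from   ext_emp x, salgrade s
where  x.sal between s.losal and s.hisal;

Until that last CTAS is issued, the intermediate resultsets stay outside the database in their extraction files.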
3.3 Illegal moves...
Whilst selecting from your new external table is one thing, don’t try any of these sorts of things:
SQL> insert into ext_emp (empno, ename)
  2 values (7385,'ROGERS');
insert into ext_emp (empno, ename)
*
ERROR at line 1:
ORA-30657: operation not supported on external organized table

SQL> create index extidx1 on ext_emp(empno);
create index extidx1 on ext_emp(empno)
*
ERROR at line 1:
ORA-30657: operation not supported on external organized table

SQL> delete from ext_emp;
delete from ext_emp
*
ERROR at line 1:
ORA-30657: operation not supported on external organized table
As I said at the beginning, this new feature is called ‘writeable external tables’ in many books and in Oracle’s own documentation, but that’s clearly not really very true! Perhaps they should better be thought of as ‘wrote-able’: the database wrote them out once, and that’s the end of their writeable-ness. Thereafter, they’re just as read-only as external tables ever were in 9i.
4.0 Transporting Data
If that’s all there was to 10g’s external tables, they probably wouldn’t rate much of a mention by anyone. ETL, after all, is a fairly specialised sort of operation, and you don’t usually do a lot of it day in and day out -so, although the feature is a nice one, it probably wouldn’t set most people’s enthusiasm levels on fire.
But as I’ve already hinted, that’s only part of the story. The other key thing about these external tables is that they capture data in a format that is not locked up within a database, but is in a simple flat file, housed on a regular file system just like any Word document, or spreadsheet file... and, as such, that data can be passed along to any other database that needs it. What’s more, as I indicated when commenting on the inclusion of ENDIANNESS tags in the extract file, it is possible to pass data along to databases which exist on entirely different operating systems than the one where the data originally came from.
To prove that point, I’ll use my external_emp.etl file produced on a Windows 2000 system to load a database running on Red Hat Linux 3 with the original EXT_EMP table. I won’t bore you with the business of how I actually, physically got the extraction file from one machine to another (lots of SAMBA came in handy!), but let’s just take it as read that the file has successfully been passed between machines (and if you’re going to use FTP to move it between machines, make sure your FTP client knows to treat the file as binary, and not text!).
So, on my Linux database, I do as follows:
[oracle@garnet oracle]$ sqlplus system/dizwell

SQL*Plus: Release 10.1.0.2.0 - Production on Tue Feb 8 13:46:07 2005
Copyright (c) 1982, 2004, Oracle. All rights reserved.
Connected to:
Oracle Database 10g Enterprise Edition Release 10.1.0.2.0 - Production
With the Partitioning, OLAP and Data Mining options

SQL> create directory receive_data as
  2  '/home/oracle/ext_data';

Directory created.

SQL> create table ext_emp (
  2  empno number(4),
  3  ename varchar2(15),
  4  sal number(8,2),
  5  mgr number(4),
  6  deptno number(4),
  7  loc varchar2(20),
  8  dname varchar2(20))
  9  organization external
 10  (type oracle_datapump
 11  default directory receive_data
 12  location ('EXTERNAL_EMP.ETL'));

Table created.

SQL> select * from ext_emp;

  EMPNO ENAME     SAL   MGR DEPTNO LOC      DNAME
------ ------ ------ ----- ------ -------- --------------
  7782 Clark    2695  7839     10 NEW YORK ACCOUNTING
  7839 King     5500           10 NEW YORK ACCOUNTING
  7934 Miller   1430  7782     10 NEW YORK ACCOUNTING
  7369 Smith     880  7902     20 DALLAS   RESEARCH
  7876 Adams    1210  7788     20 DALLAS   RESEARCH
  7902 Ford     3300  7566     20 DALLAS   RESEARCH
  7788 Scott    3300  7566     20 DALLAS   RESEARCH
  7566 Jones  3272.5  7839     20 DALLAS   RESEARCH
  7499 Allen    1760  7698     30 CHICAGO  SALES
  7698 Blake    3135  7839     30 CHICAGO  SALES
  7654 Martin   1375  7698     30 CHICAGO  SALES
  7900 James    1045  7698     30 CHICAGO  SALES
  7844 Turner   1650  7698     30 CHICAGO  SALES
  7521 Ward     1375  7698     30 CHICAGO  SALES
Now, I don’t know about you, but that impresses me no end. Here I am, logged on to a database running on a completely different operating system, and as a completely different user from the one who created the extraction file in the first place, and yet I’m able to query the data in a second or two. You’ll have to trust me that it took far longer to sort out the syntactical mysteries of mounting SAMBA shares than it did to re-create the external table in the new database!
4.1 Data Loading Considerations
A couple of things should be noted with this example. First of all (though perhaps obviously), the extraction file was located in a completely different path on the second machine from the one it had occupied on the machine where it was created. That’s clearly fine: so long as a directory object is created which points to the new location, and so long as the default directory clause correctly identifies that new directory object, the thing will work with no complaints.
Second, notice that I did all of the above whilst logged on as SYSTEM: my Linux database doesn’t actually have an account for a user named Scott (strange but true!), and I didn’t have to create one just to access the extraction file. The data in the extraction file is therefore capable of being schema-neutral.
Third, notice how I had to spell out data types and column lengths when creating the external table this time. When I originally created it on the Windows machine, those attributes were deduced automatically from the fact that I supplied a ‘select ... from emp and dept’ clause. Since I’m merely reading the extraction file now, and there is no EMP or DEPT that can be consulted for the information, I have to spell it out.
Finally, I couldn’t actually remember the column-length definitions for the original EMP and DEPT tables, so when I had to spell them out, I simply made them up! Obviously, I over-estimated rather than under-estimated: if the extraction file had data for an employee number ‘9873’ and I’d said that column should only be a number(2), I’d have been in trouble. The real point here is that you don’t have to be awfully precise about it: provided the data can fit into the columns you create for it, then the querying of the external table will work.
4.2 Data Loading Errors
Incidentally, this is what you’ll see when you guess wrong. On this occasion, I did indeed set the EMPNO column to be only two digits long:
SQL> select * from ext_emp;
select * from ext_emp
*
ERROR at line 1:
ORA-29913: error in executing ODCIEXTTABLEFETCH callout
ORA-30653: reject limit reached
ORA-06512: at "SYS.ORACLE_DATAPUMP", line 52
ORA-06512: at line 1
That’s the rather unfriendly error message you get as you actually do the select: there’s no indication there what might be wrong. But if you have a look inside the log file which gets automatically created (in my case, in the same directory as the extraction file itself resides), you’ll see something like this:
KUP-11003: bad data encountered while reading from file /home/oracle/ext_data/EXTERNAL_EMP.ETL
ORA-01438: value larger than specified precision allows for this column
EMPNO : '7782'
...and that makes it abundantly clear what the problem was. My example of ‘dodgy data’ was, obviously, rather crude: all records inevitably failed the test, and thus I saw no records returned at all for my query. However, you could find yourself in a situation where some records are OK, and some are not... in which case, you might not want the entire select blowing up as spectacularly as mine has just done.
To deal with that, you can specify a REJECT LIMIT. You do so right at the end of the CREATE TABLE statement, like so:
create table ext_emp (
empno number(3),
ename varchar2(15),
sal number(8,2),
mgr number(4),
deptno number(2),
loc varchar2(20),
dname varchar2(20))
organization external
(type oracle_datapump
  default directory receive_data
  access parameters
  (logfile 'logme.txt')
  location ('EXTERNAL_EMP.ETL')
)
reject limit 20;
If I’d created my table like this, then up to 20 erroneous records would be ignored, without triggering the catastrophic end of the entire select statement. Since my table only has 14 rows in it, my query would simply have received a polite message that there were no rows selected. Only on selecting the 21st record would things have turned nasty.
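Incidentally, the reject limit doesn’t have to be fixed at creation time: as far as I recall, it is one of the attributes of an external table that can be altered afterwards, so something like this should also work if you decide you never want a query to abort on bad rows at all:

-- raise the limit on an existing external table so bad rows are silently skipped
alter table ext_emp reject limit unlimited;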
Notice also in this example how I can choose the name of my log file (without the ACCESS PARAMETERS clause, Oracle creates one for you, with a name of its own automatic devising). If you have much experience working with SQL Loader or 9i’s external tables, you are probably wondering where the BADFILE and DISCARDFILE clauses are. Well, wonder no longer: you can’t specify those two clauses at all when the oracle_datapump driver is being used. If you could, after all, it would be a simple matter to then also deliberately arrange for bad or discard records to be produced -and then you would have invented a way to generate text-file data from a proprietary Oracle binary file format!
4.3 Column Projection
Following on from a consideration of what happens when your extract file contains dodgy data, you should be aware that external tables using the oracle_datapump access driver behave differently in 10g from their 9i cousins in one significant respect: if you select columns A, D and H from a multi-column table, 9i only parsed the extracted data file for those three columns. If columns B, C, E, F and G were stuffed to the rafters with dodgy data, it made no difference: you weren’t selecting that dodgy data, so the report proceeded to a successful conclusion with no visible signs of a problem. We talk about this behaviour as meaning that the external table “projects” only the columns referenced by a query.
The alternative approach is the one 10g now adopts by default: project all columns in the extract file, regardless of whether they have all been selected by the SQL statement or not. I repeat that this is the default behaviour in 10g -because it can be changed. If you issue this command:
alter table ext_emp project column referenced;
...then 10g reverts to behaving like 9i. You can change things back again with this variation of the same command:
alter table ext_emp project column all;
At first glance, the 10g default sounds a bit daft: ALL means that one dodgy piece of data anywhere in a row will stop a row appearing on a report, even though the report wasn’t querying the flakey column. It’s very restrictive, in other words -and surely REFERENCED makes a lot more sense? Well, it depends on what your expectations are. The danger with the permissiveness of REFERENCED is that two similar select statements will yield completely different results, just because one of them happened to refer to a column with some dodgy data in it, and the other one didn’t. At least with the 10g default of ALL both reports will give consistent answers.
A simple example may help illustrate the issue:
SQL> create table e as select * from emp;

SQL> update e set empno = empno-7000;
14 rows updated.

SQL> update e set empno=empno+empno where ename='KING';
1 row updated.

SQL> commit;

SQL> select empno from e;

     EMPNO
----------
       369
       499
       521
       566
       654
       698
       782
       788
      1678
       844
       876
       900
       902
       934
So all my employees bar one now have three digit employee numbers. Knowing that, let’s try to create a new external table, and then select back from it. Because I’m not bothering to actually transport this data to a new database, I’ll simulate that by dropping the external table once the extraction file is safe, and then re-creating it, just as I did when I physically moved it to a Linux machine:
SQL> create table ext_e
  2  organization external
  3  (type oracle_datapump
  4  default directory ext_dir
  5  location ('tbl_e.etl'))
  6  as
  7  select * from e;

Table created.

SQL> drop table ext_e;
Table dropped.

SQL> create table ext_e
  2  (empno number(3),
  3  ename varchar2(10),
  4  job varchar2(10),
  5  mgr number(4),
  6  hiredate date,
  7  sal number(7,2),
  8  comm number(7,2),
  9  deptno number(2))
 10  organization external
 11  (type oracle_datapump
 12  default directory ext_dir
 13  location ('tbl_e.etl'))
 14  reject limit 20;

Table created.

Notice the oh-so-subtle way I’ve declared EMPNO to be a three-digit field! That will be true for 13 of my employees, but not for one of them. Now let’s query:
SQL> select ename from ext_e;

ENAME
----------
SMITH
ALLEN
WARD
JONES
MARTIN
BLAKE
CLARK
SCOTT
TURNER
ADAMS
JAMES
FORD
MILLER

13 rows selected.

SQL> select empno from ext_e;

     EMPNO
----------
       369
       499
       521
       566
       654
       698
       782
       788
       844
       876
       900
       902
       934

13 rows selected.
So you can see that in either case, I only see records for 13 employees -even though the first query I issued was for the ENAME column alone, and you might have thought that the ENAME for the employee with a 4-digit employee number didn’t need to be suppressed from the report: it’s the EMPNO column we have a sizing issue with, not ENAME.
Well, that’s the effect of ALL projection: it doesn't matter if your query doesn't explicitly "touch" or "reference" the column with the dodgy data in it: all columns of the source file are checked whether you like it or not, and any dodgy data found anywhere causes that row to disappear off all reports. That at least has the effect of consistency: 13 might not actually be the number of employees in my company, but at least I won't be seeing some reports claiming 13 and others 14!
If you alter the projection type to the REFERENCED setting, however, the results are rather different:
SQL> alter table ext_e project column referenced;
Table altered.

SQL> select ename from ext_e;

ENAME
----------
SMITH
ALLEN
WARD
JONES
MARTIN
BLAKE
CLARK
SCOTT
KING
TURNER
ADAMS
JAMES
FORD
MILLER

14 rows selected.

SQL> select empno from ext_e;

     EMPNO
----------
       369
       499
       521
       566
       654
       698
       782
       788
       844
       876
       900
       902
       934

13 rows selected.
Well, which is it? Do I employ 13 people or 14?! The two queries now produce different reports, because one obviously ‘touches’ the dodgy four-digit employee number that cannot actually be displayed, and the other doesn’t and so has no problems dealing with Mr. King’s record at all. When you select REFERENCED projection, in other words, your queries only reject rows with dodgy data in them if they explicitly make reference to them. Rows that possess dodgy data in columns which aren't actually referenced by a particular query are allowed through.
Which is all fascinating stuff, but you might wonder why you have the choice! It actually boils down to speed: referenced projection is quicker, because Oracle only has to check the contents of the specific columns you mention in your query. All projection is much slower, because even if you only query a 1-character column, we have to check every column of every row for the quality of its data. So if you know ahead of time that the external source which has provided the data is reliable, why not go for the faster option and do referenced projection?
On the other hand, if consistency of reporting is your paramount requirement, and you cannot place absolute reliance on the original source of the data being queried, you might well be prepared to accept a speed hit in exchange for guaranteed cleanliness of data, and go for the "all" projection.
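If you ever lose track of which mode a particular external table is currently set to, the data dictionary can tell you. As far as I know, the PROPERTY column of USER_EXTERNAL_TABLES reports either ALL or REFERENCED in 10g (treat the column name as an assumption and check your own dictionary if it differs):

-- show the current column projection setting for the EXT_E table
select table_name, property
from   user_external_tables
where  table_name = 'EXT_E';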
5.0 Data Type Restrictions
Read-writeable external tables cannot contain LONG data types. CLOB and BLOB data types are fine, however. This sort of practical limitation has been pretty standard fare with Oracle since about version 8i, so it just reinforces the message that has been around for a long, long time: don’t start using LONGs and LONG RAWs now. They’re there strictly for backwards compatibility reasons.
If the tables which you wish to extract to an external table (that is, the one that will appear in the initial SELECT part of the CREATE TABLE statement) do contain LONG or LONG RAW columns, then the extraction can nevertheless go ahead, provided that the corresponding columns in the external table are defined as CLOB and BLOB data types.
So, for example, assume a table contains (amongst others) a column, FILM, of data type LONG RAW. This would be acceptable:
create table ext_film (
filmname varchar2(30),
film BLOB)
organization external
(type oracle_datapump
default directory ext_data
location ('FILMDAT.DMP'))
as
select filmname, to_lob(film) from FILMLIB;
Notice that you have to use the TO_LOB function to do the actual conversion of the data type: just because the two columns match up positionally, Oracle won’t volunteer a data conversion for you. A simpler form of that syntax, incidentally, would be:
create table ext_film
organization external
(type oracle_datapump
default directory ext_data
location ('FILMDAT.DMP'))
as
select filmname, to_lob(film) from FILMLIB;
In other words: there’s no real need to define columns in the new table if you’re simply going to copy every column in the source table anyway!
Extracting BFILE data types into an external table is also a bit tricky -indeed rather trickier than dealing with BLOBs and CLOBs. I’ll leave the gory details to the official Oracle documentation, however, since there’s not much I can add to it.
6.0 Parallelism
Pulling and pushing data between the hard disk and the database is not light work. It’s therefore important to know that it is possible to parallelise the operations. You do so by specifying a PARALLEL clause when initially creating the extraction file. For example:
create table ext_emp (
empno, ename, sal, mgr, deptno, loc, dname)
organization external
(type oracle_datapump
default directory ext_dir
location ('external_emp.etl'))
PARALLEL 4
as
select e.empno, initcap(e.ename), e.sal*1.1,
e.mgr, d.deptno, d.loc, d.dname
from emp e, dept d
where e.deptno=d.deptno;
That’s the same example I originally had at the start of this article, with one new line inserted just before the AS clause: PARALLEL 4 means that 4 parallel slaves (and one co-ordinator) will simultaneously perform the write to the extraction file. Bear in mind that you could drop a parallel hint into the SELECT part of the syntax, too, as one method of ensuring parallelism on the select side of things as well as on the file creation side.
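Purely by way of illustration, a variation of that statement with parallel hints on the query side as well might look like this (the degree of 4 in the hints is an arbitrary choice of mine, not something the original example used):

create table ext_emp (
empno, ename, sal, mgr, deptno, loc, dname)
organization external
(type oracle_datapump
default directory ext_dir
location ('external_emp.etl'))
parallel 4
as
-- the hints request parallel scans of the source tables; PARALLEL 4 parallelises the write
select /*+ parallel(e,4) parallel(d,4) */
e.empno, initcap(e.ename), e.sal*1.1,
e.mgr, d.deptno, d.loc, d.dname
from emp e, dept d
where e.deptno=d.deptno;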
There has been some confusion about this parameter in a number of publications, though I’m not sure why since the behaviour hasn’t changed since 9i: the degree of parallelism you specify is completely independent of the number of files you want generated (or read from, if you’re selecting the data back into a database). The official Oracle documentation is clear on this point, as it has been since version 9i. Which is why I don’t understand (to take just one example) this statement in Robert Freeman’s Oracle 10g New Features Oracle Press book (page 145):
Oracle only creates one process for each file that is defined [in the LOCATION clause], so the degree of parallelism is limited to the number of files that are defined for the external table to write to.
That is definitely not true and should be regarded as an editorial slip-up on Oracle Press’s part.
It is, however, likely to be the case that if you’re attempting to parallelise the creation of an extraction file, then having multiple slaves all trying to slam data down on the same disk at the same time is not necessarily a good idea. So, syntactically, you can play about with the LOCATION clause to fix the issue. For example:
SQL> create directory dir1 as 'c:\d1';
Directory created.
SQL> create directory dir2 as 'c:\d2';
Directory created.
SQL> create directory dir3 as 'c:\d3';
Directory created.

SQL> create table ext_emp (
  2  empno, ename, sal, mgr, deptno, loc, dname)
  3  organization external
  4  (type oracle_datapump
  5  default directory ext_dir
  6  location ('dir1:external_emp1.etl','dir2:external_emp2.etl',
  7  'dir3:external_emp3.etl'))
  8  parallel 3
  9  as
 10  select e.empno, initcap(e.ename), e.sal*1.1,
 11  e.mgr, d.deptno, d.loc, d.dname
 12  from emp e, dept d
 13  where e.deptno=d.deptno;

Table created.
Here, I create three directory objects in the database. I have asked for a degree of parallelism of three in the PARALLEL clause. But I have additionally specified three output files to be created in the LOCATION clause. Note the syntax here: the names are specified in the format directory:filename, and a comma-separated list of files is spelled out. Don’t make the mistake of trying to tinker with the DEFAULT DIRECTORY clause to achieve the same sort of thing: it’s only a default, and there can only ever be one such default directory. If you want multiple files, the multiple directories are specified elsewhere!
Finally, I stress again that you can specify parallelism without specifying many output files. When your hardware is a massive RAID array such that everything is striped everywhere anyway, then there’s no real point in worrying about a single file bottle-necking the efficiency of a parallel operation, because the I/O striping in such an array will mean that it won’t be an issue. But when you are extracting to a simple file system, then fair enough: it makes sense to stop the multiple processes fighting over a single location and output file by specifying several of them.


