Tuesday, October 6, 2009

Tuning Your SQL for Maximum Performance

Nothing can make a good application perform badly like poorly tuned SQL.
Suppose that you have a database that is used by the HR and payroll systems to keep track of employee information and payroll transactions. Consider the Employees table as shown below:
create table employees (
employee_id varchar2(9) not null,
first_name varchar2(30) not null,
last_name varchar2(30) not null,
hire_date date not null,
department_id varchar2(10) not null,
salary number(9,2)
);
Now, suppose you want to find out how a simple query that looks up an employee by employee ID executes. To do this, you need to explain the execution plan of the query you are executing. In SQL*Plus, you could do this as follows:
set autotrace traceonly explain;
select first_name, last_name, hire_date
from employees
where employee_id = '123456789';
set autotrace off;
Execution Plan
----------------------------------------------------------

0 SELECT STATEMENT Optimizer=CHOOSE
(Cost=18 Card=1 Bytes=29)
1 TABLE ACCESS (FULL) OF 'EMPLOYEES' (TABLE)
(Cost=18 Card=1 Bytes=29)
This output shows the execution plan for the query.
Line 0 shows important information about the entire query. First, note the portion that says Optimizer=CHOOSE. This indicates that Oracle's cost-based optimizer is choosing the mode in which it runs on the fly. There are two possible modes: on-line transaction processing (OLTP) and data warehousing (DW). An OLTP database, such as an airline reservation system, would likely see a high volume of transactions, while a data warehouse, such as a business process management system, might only be updated nightly, weekly, or monthly. Here, the default configuration of Oracle gives the optimizer the freedom to choose which mode to use based on the characteristics of this particular database.
Next, you will notice that line 0 indicates Cost=18. This is the cost for the entire query, and is a relative indication of how much processing has to be done to execute it. In general, lower cost equals better performance. However, this is a relative measurement, not an absolute measurement of the computing resources needed to run the query. So, a query that has a cost of 16 would run slightly faster than this query does, and a query with a cost of 6000 would run significantly more slowly. The actual execution time and resources needed to run the query depend on many different factors, so the cost should be used to determine whether you are helping or hindering query performance when tuning a query.
Next on this line, you will see Card=1. This value indicates how many rows must be processed in order to execute the query. This is Oracle's best guess, based on the data currently existing in the table.
Finally on line 0, we have Bytes=29. This number indicates the amount of I/O in bytes that must occur between the client and the server to give back the results of the query.
So, for purposes of query tuning, line 0 gives a good idea of the total resources needed to run the query, how many rows will be processed, and how much data must be exchanged between client and server to return the results. Now, on to the meat of the tune-up for this query.
Line 1 shows TABLE ACCESS (FULL) OF 'EMPLOYEES' as step 1 of the execution plan. This means that, to run this query, every row must be read in the EMPLOYEES table until the right one is found. Data is kept in this table in the order in which it was inserted, so this might require one read, 20 reads, or 2 million reads, depending on when the employee ID 123456789 was inserted. In general, a full-table scan is the absolute worst-performing way to get data from a table.
So, how can the query be fixed? By creating a primary key on the table. Create the primary key as follows:
alter table employees
add constraint employees_pk primary key (employee_id);
Now, performing the explanation of the query again gives the following execution plan:
Execution Plan
----------------------------------------------------------

0 SELECT STATEMENT Optimizer=CHOOSE
(Cost=1 Card=1 Bytes=29)
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'EMPLOYEES' (TABLE)
(Cost=1 Card=1 Bytes=29)
2 1 INDEX (UNIQUE SCAN) OF 'EMPLOYEES_PK' (INDEX (UNIQUE)) (
Cost=0 Card=1)
So, what has been accomplished by adding the primary key? On line 2, notice that you have INDEX (UNIQUE SCAN) OF 'EMPLOYEES_PK'. Scanning an index is far quicker than scanning an entire table for data. In this case, a unique scan is used; this means that the Oracle optimizer is able to figure out whether the row exists in the table by doing one lookup on the index. Also, line 1 shows TABLE ACCESS (BY INDEX ROWID) OF 'EMPLOYEES'. When Oracle goes to the primary key to look up the data, it is able to find the row ID of the needed row. The row ID is like a unique street address of this row. Once this value is known, it takes only one access of the table to retrieve the desired data.
Finally, looking at line 0, you can see that the cost has decreased from 18 down to 1. It is easy to see that adding the primary key has improved the query's performance.
Now, on to some more complicated examples.
Searching For Answers
Suppose that you want to perform a search on the Employees table by first or last name. Although there now is a primary key on this table, it will not help in this case, because only the employee ID was included in the primary key. So, running a plan explanation on the following query:
set autotrace traceonly explain;
select first_name, last_name, hire_date
from employees
where last_name like 'Doe%';
set autotrace off;
results in the following familiar and less-than-ideal results:
Execution Plan
----------------------------------------------------------

0 SELECT STATEMENT Optimizer=CHOOSE
(Cost=18 Card=1 Bytes=19)
1 0 TABLE ACCESS (FULL) OF 'EMPLOYEES' (TABLE)
(Cost=18 Card=1 Bytes=19)
How can you speed up this query's performance? The solution is to create an index on the search fields as follows:
create index employees_idx01 on employees (last_name, first_name);
Creating this index yields the following new execution plan:
Execution Plan
----------------------------------------------------------

0 SELECT STATEMENT Optimizer=CHOOSE
(Cost=2 Card=1 Bytes=19)
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'EMPLOYEES' (TABLE)
(Cost=2 Card=1 Bytes=19)
2 1 INDEX (RANGE SCAN) OF 'EMPLOYEES_IDX01' (INDEX)
(Cost=1 Card=1)

The index range scan on line 2 is like your search through the phone book; you must scan through the names to find the address of the row you want. Then, once the address is found, the data is retrieved by index row ID, which is the most efficient way to locate the data in the table.
Suppose you have a table of departments as shown below:
create table departments (
department_id varchar2(10) not null,
department_name varchar2(30)
);

alter table departments
add constraint department_pk primary key (department_id);

alter table employees
add constraint employees_fk1
foreign key (department_id)
references departments (department_id);
Now, suppose you want to find all employees in the Accounting department, as shown in the following query:
select e.employee_id, e.first_name, e.last_name
from employees e, departments d
where d.department_name = 'Accounting' and
d.department_id = e.department_id;
This query has the following execution plan:
Execution Plan
----------------------------------------------------------

0 SELECT STATEMENT Optimizer=CHOOSE
(Cost=22 Card=1 Bytes=49)
1 0 NESTED LOOPS
2 1 NESTED LOOPS
(Cost=22 Card=1 Bytes=49)
3 2 TABLE ACCESS (FULL) OF 'EMPLOYEES' (TABLE)
(Cost=18 Card=4 Bytes=100)

4 2 INDEX (UNIQUE SCAN) OF 'DEPARTMENT_PK' (INDEX (UNIQUE) )
(Cost=0 Card=1)

5 1 TABLE ACCESS (BY INDEX ROWID) OF 'DEPARTMENTS' (TABLE)
(Cost=1 Card=1 Bytes=24)
Obviously, the poorest-performing portion of this query is the full-table scan on the Employees table. Creating an index on this table to optimize a search by department_id as follows:
create index employees_idx02
on employees (department_id, employee_id);
yields the following execution plan:
Execution Plan
----------------------------------------------------------

0 SELECT STATEMENT Optimizer=CHOOSE
(Cost=7 Card=1 Bytes=49)
1 0 NESTED LOOPS
2 1 NESTED LOOPS
(Cost=7 Card=1 Bytes=49)
3 2 VIEW OF 'index$_join$_001' (VIEW)
(Cost=3 Card=4 Bytes=100)
4 3 HASH JOIN
5 4 INDEX (FAST FULL SCAN) OF 'EMPLOYEES_IDX01' (INDEX)
(Cost=1 Card=4 Bytes=100)
6 4 INDEX (FAST FULL SCAN) OF 'EMPLOYEES_IDX02' (INDEX)
(Cost=1 Card=4 Bytes=100)

7 2 INDEX (UNIQUE SCAN) OF 'DEPARTMENT_PK' (INDEX (UNIQUE) )
(Cost=0 Card=1)

8 1 TABLE ACCESS (BY INDEX ROWID) OF 'DEPARTMENTS' (TABLE)
(Cost=1 Card=1 Bytes=24)
In this situation, the department_id field is an example of a foreign key. This means that a value in a table is a key value in another table. In this case, the department_id field is unique on the departments table because it is the table's primary key. It is also an attribute of the Employees table because each employee belongs to a department. Having a foreign key constraint on the Employees table insures the data integrity of the employee data because this constraint will not allow an employee record to contain any value that does not exist in the Departments table.
Having an index that contains the department_id field on the Employees table increases the efficiency of the search, as shown from the decrease in cost after the index is created. Any time that a foreign key relationship exists like the one described here, it is always best to have an index that contains that foreign key.
So, if indexes are such wonderful things for query performance, why not just create an index on every column of every table? In theory, this sounds great, but the convenience of an index comes at a price. Having many indexes on a table will slow down DML operations, such as inserts, updates, and deletes. This is because each time a row in the table is inserted, changed, or removed, the associated indexes also have to be changed to reflect the table data. Also, if there are any foreign key constraints to maintain data integrity as shown in this example, Oracle must verify that any value that is about to be inserted is valid.
This will slow down these operations significantly for tables with many columns that contain many indexes. The definition of "many" indexes is rather subjective. However, the best practice is to create indexes only where foreign key constraints or search functionality calls for them.
Take A Hint
Creating indexes can improve query performance; however, sometimes, even when appropriate indexes are created, Oracle is still not able to figure out the best way to execute a query. Consider the following example.
Suppose you have a payroll transactions table and indexes as follows:
create table payroll_transactions (
transaction_id number(38) not null,
transaction_type varchar2(10) not null,
transaction_date date not null,
employee_id varchar2(9) not null,
amount number(9, 2) not null
);

alter table payroll_transactions
add constraint payroll_transaction_pk primary key
(transaction_id);

create index payroll_transaction_idx01 on
payroll_transactions (transaction_date);

create index payroll_transaction_idx02 on
payroll_transactions (transaction_date, transaction_type);

create index payroll_transaction_idx03 on
payroll_transactions (employee_id);

create index payroll_transaction_idx04 on
payroll_transactions (transaction_date, employee_id);
Now, suppose you want to figure out how much money a particular employee has been paid via automatic deposit since his hire date:
select sum(amount) from payroll_transactions where transaction_date > to_date('01/01/2008','mm/dd/yyyy') and employee_id = '123456789';
This query would use the payroll_transaction_idx03 index because it is the first index on the table containing the employee ID, as shown by the following execution plan:
Execution Plan
----------------------------------------------------------

0 SELECT STATEMENT Optimizer=CHOOSE
(Cost=2 Card=1 Bytes=20)
1 0 SORT (AGGREGATE)
2 1 TABLE ACCESS (BY INDEX ROWID) OF
'PAYROLL_TRANSACTIONS' (TABLE)
(Cost=2 Card=1 Bytes=20)

3 2 INDEX (RANGE SCAN) OF 'PAYROLL_TRANSACTION_IDX03' (IND EX)
(Cost=1 Card=2)
However, the table also has payroll_transaction_idx04, which contains both the transaction date and the employee ID, which is a perfect fit. Luckily, there is a way to show Oracle the preferred way for the query to be executed: by giving the optimizer a hint. Executing the following query:
select /*+INDEX(payroll_transactions payroll_transaction_idx04)*/
sum(amount)
from payroll_transactions
where transaction_date > to_date('01/01/2008',
'mm/dd/yyyy') and
employee_id = '123456789';
yields the following execution plan, which shows that the optimizer is using the desired index:
Execution Plan
----------------------------------------------------------

0 SELECT STATEMENT Optimizer=CHOOSE
(Cost=2 Card=1 Bytes=20)
1 0 SORT (AGGREGATE)
2 1 TABLE ACCESS (BY INDEX ROWID) OF 'PAYROLL_TRANSACTIONS' (TABLE)
(Cost=2 Card=1 Bytes=20)
Although the cost for both queries is the same for the small sample data set used here, consider a payroll transaction table containing data for thousands of employees over the course of many years. The index containing the transaction date and employee ID would be the choice that would contain the most unique information. The hint forces the optimizer to use the correct index with the following:
/*+INDEX(payroll_transactions payroll_transaction_idx04)*/
The hint is always enclosed in /* and */, and must always begin with a plus sign. Then, the hint instructs Oracle to use an index by specifying the INDEX keyword. Finally, the desired table name and index on that table are specified by putting them in parentheses after the index hint. Failing to get the syntax of the hint exactly right will result in Oracle ignoring the hint.
The same thing could be accomplished by telling Oracle to exclude the payroll_transaction_idx03 index specifically by executing the following query:

select
/*+NO_INDEX(payroll_transactions payroll_transaction_idx03)*/
sum(amount) from payroll_transactions where transaction_date > to_date('01/01/2008', 'mm/dd/yyyy') and employee_id = '123456789';
This would result in the same execution plan above where payroll_transaction_idx04 is used.
A word of caution is necessary here concerning table aliases. Aliases are often used as shorthand to make complicated queries a little easier to read. If aliases were to be used, as in the following query:
select /*+INDEX(payroll_transactions payroll_transaction_idx04)*/ b.first_name, b.last_name, sum(a.amount) from payroll_transactions a, employees b where a.transaction_date > to_date('01/01/2008',
'mm/dd/yyyy') and a.employee_id = '123456789' and a.employee_id = b.employee_id;
the hint specified would be ignored. Instead, you must use the alias within the hint in place of the table name:
select /*+INDEX(a payroll_transaction_idx04)*/ sum(amount)
from payroll_transactions where transaction_date > to_date('01/01/2008', 'mm/dd/yyyy') and employee_id = '123456789';
Now, consider another handy hint. As discussed earlier, performing a full-table scan is usually very inefficient, and should be avoided, but this is not always the case. Consider the following query on the Departments table:
select department_id from departments where department_name = 'Accounting';
If the Departments table contains a small number of rows, it may be more efficient to read the entire table than it is to use an index. To force Oracle to bypass the index, you can specify a hint as follows:
select /*+FULL(departments)*/ department_id from departments
where department_name = 'Accounting';
This may seem counterintuitive, but small tables are likely to be cached in memory, and would require no disk reads. Specifying the hint actually cuts down on I/O and speeds up the query.
Another example where a full-table scan might prove to be more efficient than an index is a case where the makeup of the data is not very unique. Consider a Gender column on the Employees table with only M for male or F for female employees. To get a count of the number of male and female employees:
select gender, count(*) from employees group by gender;
Oracle would need to scan all rows in the table to get the counts anyway, so using any index would just add overhead to the process. If an index were used, each row lookup would require that Oracle find it in the index, requiring one I/O operation, and then another to read the data from the table when the row is found from the index. This would nearly double the amount of I/O required to execute the query.
Hints should be used only in cases where a database design change or creation of a well-thought-out index does not help boost performance. Ideally, once a query is tuned using a hint, changes should be put in place so that Oracle's cost-based optimizer can run the query efficiently without the hint. Still, hints can be extremely helpful when solving performance issues, and are still necessary in cases where changes to the database are not possible.
The Numbers Game
So far, the cost value shown in the query execution plan has been used to gauge how well a query will perform. Now, it's time to take a look at how that cost is calculated.
For each table, index, and column in the database, Oracle keeps statistics on the composition of the data.
For tables, this includes:
1. the number of rows in the table
2. the amount of storage required to store all of the table's data
3. the average size of each row in the table.
The number of rows is important when determining, for example, whether it is more efficient to use an index or to scan through a table sequentially. The number of data blocks would give Oracle an idea of how much I/O would be necessary to read through the table's data.
For indexes, Oracle tracks the number of index entries and the uniqueness of these entries. This is how Oracle calculates the cardinality number seen earlier in the query execution plans.
For columns, Oracle tracks the number of distinct values, null values, and the density of the data. For example, consider a Gender column in a table that only contains M or F for male or female. Oracle would use the information about the relative non-uniqueness of this data when determining an execution plan.
All of this information helps Oracle's optimizer to make wise decisions about query execution.
Statistics are usually recomputed periodically by the database administrator based either on a period of time since the last time the statistics were gathered, or on the amount of data that has changed in the table since that time. If statistics become so out-of-date that they no longer reflect the actual makeup of the table's data, Oracle's optimizer may make bad decisions about query execution plans. Likewise, if a significant amount of data has changed over a short period of time, query performance might suddenly take a nose-dive. To address this, most DBAs use scripts to periodically re-analyze tables whose statistics are becoming old. Oracle also keeps track of the amount of data that has changed since the last table analysis, and can flag the statistics as stale if too much activity has occurred since the last analysis.
So, how could stale statistics impact database performance? Suppose the Employees table has only 2 records in it when statistics are created. At that point, the Oracle query optimizer would decide that a full-table scan of Employees is the quickest way to find the desired row. Then, suppose that 100,000 rows are inserted into the table, and statistics are not re-created. The query optimizer would still be working under the assumption that the table is very small, and would recommend a full-table scan, which would obviously perform poorly now that the table contained much more data.
The first way to gather statistics is to have Oracle analyze the complete table data and recompute all of the statistics for the table. The command to do this is as follows:
Analyze table Employees compute statistics;
During this process, Oracle exhaustively looks at all of the data and recalculates all of the statistical values for that table. Although this is likely to yield a very accurate analysis of the table, this process can take a very long time if performed on large tables. The process usually must be performed during a maintenance window because Oracle must lock the table being analyzed, which would hold up any pending transactions. This problem is now somewhat addressed by the implementation of Oracle RAC, a system of clustering databases. Oracle RAC allows the same database to be clusterd on multiple nodes, so maintenance can be done on one node while another is still available to applications. If Oracle RAC is not an option, however, there is another way to compute statistics that takes less time:
Analyze table Employees estimate statistics sample 1000 rows;
Or:
Analyze table Employees estimate statistics sample 10 percent;
This command does recompute statistics for the table based on actual data, but only on a fraction of it. In the first example, Oracle will sample 1000 rows of data; in the second, it will sample 10% of the total data in the table. Thus, even for a large table, the time to complete the analysis can be far less than the full analysis performed by totally recomputing all statistics.
Later versions of Oracle have implemented special packages for the purpose of analyzing databases, schemas, and tables. Using these utility packages offers a bit more flexibility than the analyze commands above. To analyze a table and compute statistics, one can use the following command:
exec DBMS_STATS.gather_table_stats('MYSCHEMA', 'Employees');
To estimate statistics for a table, use the following:
Exec DBMS_STATS.gather_table_stats('MYSCHEMA', 'Employees',
estimate_percent = 10);
It is also possible to set statistics manually. This might be necessary as a stop-gap measure to fix poorly running queries until a more thorough job of gathering statistics can be done. It might also be useful if a table often fluctuates between a small and large amount of data over a short period of time, such as a log table whose data is purged out every night.
It is especially important to be sure that all tables referenced in a query have statistics available to the optimizer. Having one table with statistics and another without them is a recipe for inconsistent query performance. Oracle may attempt to estimate the statistics on-the-fly when this occurs, resulting in a performance hit. Your database administrator should have measures in place to insure that statistics are recalculated regularly on all tables. However, if it is not possible to do this for all tables in a query, you can totally bypass the cost-based optimizer by specifying the /*+RULE*/ hint. This causes Oracle to revert to a much simpler approach that does not use statistics, and relies solely on the primary keys and indexes available on the table.
Conclusion
In this article, you have explored how to get an execution plan for a query, maximize query performance with indexes, use hints to tune queries, and generate statistics for Oracle's cost-based optimizer to use when creating query execution plans. Hopefully, these tools will get you well on your way to writing efficient SQL for your applications.

No comments:

Post a Comment