Oracle data skew & statistics
I recently encountered a classic example of the Oracle Optimizer being fooled by data skew. James Morle, presented at Hotsos recently on how the vast majority of issues are caused by skew and latency, certainly I have seen many bad plans chosen by the Oracle optimizer that were the result of the optimizer not being aware of the distribution of the data.
So I have a view looking like the following:
Name Null? Type -------------------------- ------- -------------- KEY NOT NULL VARCHAR2(255) SUFFIX NOT NULL VARCHAR2(255) INSTANCE NOT NULL NUMBER(10) NS_ID NOT NULL NUMBER(10) CREATED NOT NULL DATE CREATED_BY VARCHAR2(200) REMOVED DATE REMOVED_BY VARCHAR2(200)
The type of query that we were having issues was of the form:
SELECT COUNT(*) FROM NS_ON_DOMAINS WHERE KEY = :B4 AND SUFFIX = :B3 AND INSTANCE = :B2 AND NS_ID = :B1 AND REMOVED IS NULL
First thing to bear in mind, is that key, suffix is pretty selective. ns_id, can be highly selective, but it can also be extremely unselective. I used the technique described by Greg Rahn to determine why the optimizer was choosing a particular plan that was providing a response time of the order of 3 minutes.
--------------------------------------------------------------------------------------------- | Id | Operation |Name|Starts|E-Rows|A-Rows|A-Time|Buffers|Reads | --------------------------------------------------------------------------------------------- | 1 |SORT AGGREGATE | | 1 | 1 | 1 |00:02:07.29 | 19668 | 17459 | |*2 | TABLE ACCESS BY INDEX ROWID| NS_ON_DOMAINS|1 | 1 | 0 |00:02:07.29 |19668 |17459 | |*3 | INDEX RANGE SCAN | IX_NS_ON_DOMAINS_NS | 1 |1 |49495 |00:00:00.32 |178 |177 | ---------------------------------------------------------------------------------------------
So the optimizer has chosen to use the IX_NS_ON_DOMAINS_NS with the NS_ID value, and thinks this will be highly selective. Very often this turns out to be the case, however in a particular case, this value is completely non-selective and the actual numbers of rows is quite huge.
Forcing the optimizer to use the other obvious index, the one on key, suffix, and instance we have the following:
----------------------------------------------------------------------------- | Id | Operation| Name |Starts|E-Rows|A-Rows|A-Time|Buffers|Reads | ------------------------------------------------------------------------------ | 1|SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.01 |4 | 1 | |*2| TABLE ACCESS BY INDEX ROWID| NS_ON_DOMAINS |1 |1 |0 |00:00:00.01 |4 |1 | |*3| INDEX RANGE SCAN | IX_NS_ON_DOMAINS_DOMAIN |1|3|0 |00:00:00.01 |4|1 | -----------------------------------------------------------------------------
Now the optimizer is over estimating the number of rows it thinks it will find via this access path. This tells us why for the particular values of the bind variables we were looking at, that the plan chosen by the optimizer is not the most optimal path to the data. The optimizer is being tricked by the fact we have a very large skew on this data.
In the end we decided to use a hint for this particular query to force it to use the better index.


May 30th, 2008 at 10:14 am
Nice case.
It would be interesting to see what an event 10053 said…