random technical thoughts from the Nominet technical team

Tracing Oracle Procedures submitted via dbms_job

1 Star2 Stars3 Stars4 Stars5 Stars (2 votes, average: 3 out of 5)
Loading ... Loading ...
Posted by patrick on May 2nd, 2008

In order to investigate a performance problem in our Oracle database, I needed to trace one of our batch PL/SQL procedures, which has been submitted via the dbms_job package to execute every morning. I had traced the procedure successfully when executing it manually from a SQL*Plus session: a SQL trace file was produced in the directory defined by the database parameter user_dump_dest. Below is a description of how I then went on to trace the procedure when it was executing as a job submitted via dbms_job.

Assume a job has been submitted to execute every morning at 9am:

variable job_num number
exec dbms_job.submit (:job_num, my_procedure, sysdate, '(trunc(sysdate + 1) + 9/24)');
commit;

From a SQL*Plus session examine dba_jobs_running, to verify that the job has started and obtain its session identifier (sid):

select * from dba_jobs_running;

 SID  JOB FAILURES LAST_DATE LAST_SEC THIS_DATE           THIS_SEC INSTANCE
---- ---- -------- --------- -------- ------------------- -------- --------
 356  467                             02/05/2008 14:37:35 14:37:35        0

Obtain the Oracle serial number (serial#) for this database session:

select sid, serial# from v$session where sid = 356;

SID SERIAL#
--- -------
356     701

Use the dbms_monitor.session_trace_enable procedure to start SQL tracing for this session:

exec dbms_monitor.session_trace_enable (356, 701);

A SQL trace file will be produced in the directory defined by the database parameter background_dump_dest. I was originally mistakenly looking for the file in the user_dump_dest directory, but background_dump_dest is the correct one, as it is an Oracle background process being traced here.

Once the job has completed use the tkprof utility to convert the output to a summarised, readable format:

tkprof livedb_j000_1850.trc sql_trace.txt sys=no explain=username/password

Oracle data skew & statistics

1 Star2 Stars3 Stars4 Stars5 Stars (No Ratings Yet)
Loading ... Loading ...
Posted by jason on Mar 28th, 2008

I recently encountered a classic example of the Oracle Optimizer being fooled by data skew. James Morle, presented at Hotsos recently on how the vast majority of issues are caused by skew and latency, certainly I have seen many bad plans chosen by the Oracle optimizer that were the result of the optimizer not being aware of the distribution of the data.

So I have a view looking like the following:

 Name                       Null?    Type
--------------------------  -------   --------------
 KEY                        NOT NULL VARCHAR2(255)
 SUFFIX                     NOT NULL VARCHAR2(255)
 INSTANCE                   NOT NULL NUMBER(10)
 NS_ID                      NOT NULL NUMBER(10)
 CREATED                    NOT NULL DATE
 CREATED_BY                          VARCHAR2(200)
 REMOVED                             DATE
 REMOVED_BY                          VARCHAR2(200)

The type of query that we were having issues was of the form:

SELECT COUNT(*)
FROM NS_ON_DOMAINS
WHERE KEY = :B4
AND SUFFIX = :B3
AND INSTANCE = :B2
AND NS_ID = :B1
AND REMOVED IS NULL

First thing to bear in mind, is that key, suffix is pretty selective. ns_id, can be highly selective, but it can also be extremely unselective. I used the technique described by Greg Rahn to determine why the optimizer was choosing a particular plan that was providing a response time of the order of 3 minutes.

---------------------------------------------------------------------------------------------
| Id  | Operation   |Name|Starts|E-Rows|A-Rows|A-Time|Buffers|Reads                         |
---------------------------------------------------------------------------------------------
| 1 |SORT AGGREGATE  |  |  1 |     1 |     1 |00:02:07.29 | 19668 | 17459                   |
|*2 | TABLE ACCESS BY INDEX ROWID| NS_ON_DOMAINS|1 | 1 | 0 |00:02:07.29 |19668 |17459       |
|*3 |  INDEX RANGE SCAN          | IX_NS_ON_DOMAINS_NS | 1 |1 |49495 |00:00:00.32 |178 |177 |
---------------------------------------------------------------------------------------------

So the optimizer has chosen to use the IX_NS_ON_DOMAINS_NS with the NS_ID value, and thinks this will be highly selective. Very often this turns out to be the case, however in a particular case, this value is completely non-selective and the actual numbers of rows is quite huge.

Forcing the optimizer to use the other obvious index, the one on key, suffix, and instance we have the following:

-----------------------------------------------------------------------------
| Id  | Operation| Name |Starts|E-Rows|A-Rows|A-Time|Buffers|Reads          |
------------------------------------------------------------------------------
|  1|SORT AGGREGATE             |    |  1 |  1 | 1 |00:00:00.01 |4 |      1 |
|*2| TABLE ACCESS BY INDEX ROWID| NS_ON_DOMAINS |1 |1 |0 |00:00:00.01 |4 |1 |
|*3|  INDEX RANGE SCAN  | IX_NS_ON_DOMAINS_DOMAIN |1|3|0 |00:00:00.01 |4|1  |
-----------------------------------------------------------------------------

Now the optimizer is over estimating the number of rows it thinks it will find via this access path. This tells us why for the particular values of the bind variables we were looking at, that the plan chosen by the optimizer is not the most optimal path to the data. The optimizer is being tricked by the fact we have a very large skew on this data.

In the end we decided to use a hint for this particular query to force it to use the better index.

DBMS_MONITOR causing instance hanging

1 Star2 Stars3 Stars4 Stars5 Stars (No Ratings Yet)
Loading ... Loading ...
Posted by jason on Feb 25th, 2008

Recently we had an experience of a completely hung Oracle instance. This was a RAC database running version 10.2.0.3, only 1 instance in the 2 node cluster was affected, but there was no way of logging onto the affected instance either remotely or even as sysdba locally. Perhaps, we should have done a more thorough root cause analysis to find out why the system was hung using some of the techniques from this excerpt. I had never heard of the -prelim option to sqlplus before and it may have been interesting to perform a systemstate dump, but this was a busy production system and we had a really good idea what had caused it and just wanted to get back up and running again.

The system had been running along fine, until a colleague was examining an issue with a 3rd party piece of software and was using DBMS_MONITOR to obtain trace information of the connecting session. He noticed that every time he traced a connection from this application, it would suddenly make a new connection, which he then promptly would run DBMS_MONITOR on. This went on for around 6 sessions until eventually the instance hung to new connections and the helpdesk phones started to ring.

It was pretty clear we had to stop tracing ASAP, and right enough, once we had worked out which pid’s to kill server side the instance returned back to normal, just killing the sqlplus sessions on the client side was not actually doing the trick. Thankfully DBMS_MONITOR puts the pid number into the filename it uses to output the trace data to. However another way of obtaining which processes to kill would have been to run the following (we knew the traces were being run remotely from a specific client) on the still working node:

SQL> select machine, process
from gv$session;

It seems you can even kill a session running on a different node in a RAC cluster, so we probably could have even tried doing this via the sid, serial# information available from gv$session.

It turns out there is a bug with DBMS_MONITOR in Oracle 10.2.0.3, it is Bug 5485914 available in metalink, the notes for this state it may cause the instance to run slowly/hang. There is a patch available, also hopefully fixed in the imminent 10.2.0.4. I’m quite convinced it was the multiple times DBMS_MONITOR was running that caused the instance hang, so if you are on 10.2.0.3 be careful how many sessions you monitor at once.

Encountering an Oracle CRS-0213 error

1 Star2 Stars3 Stars4 Stars5 Stars (No Ratings Yet)
Loading ... Loading ...
Posted by jason on Feb 20th, 2008

This is really just a quick post about the cause of the Oracle error CRS-0213 which I encountered recently and found while doing a google search as well as looking this error up on metalink that there was very little information about this error out there, and nothing that actually helped me solve the problem.

I encountered this error as I was upgrading a RAC database from 10.2.0.3 to 11.1.0.6 and I was attempting to ensure CRS used the new ORACLE_HOME environment, i.e. the 11g one rather than the old 10g one. To do this for the listener you do the following:

/opt/oracle/product/crs/bin/srvctl remove listener -n nodename

/opt/oracle/product/crs/bin/srvctl add listener -n nodename -o NEW ORACLE_HOME

Note this is using the CRS_HOME rather than the ORACLE_HOME. However, when I ran the add part I encountered the following error:

[oracle@linuxrac1 ~]$ /opt/oracle/product/crs/bin/srvctl add listener -n linuxrac1 -o /opt/oracle/product/11.1.0
CRS-0213: Could not register resource 'ora.linuxrac1.LISTENER_LINUXRAC1.lsnr'.

Now the solution to this problem is not exactly earth shattering, but as I said I thought I would blog it anyway as CRS-0213 is not all well known and it might help someone out of a hole. We had the incorrect ORACLE_HOME, in true OFA style I included db_1 onto the end of the 11g ORACLE_HOME, so the above should read -o /opt/oracle/product/11.1.0/db_1. Running with the correct home works fine.

With having 3 separate Oracle software directories now, one for CRS, one for ASM and one for the RDBMS instance, I highly recommend you all look at implementing OFA if you do not already do so!

Cloning an ORACLE_HOME

1 Star2 Stars3 Stars4 Stars5 Stars (1 votes, average: 5 out of 5)
Loading ... Loading ...
Posted by jason on Feb 8th, 2008

Some time you may have a requirement to copy an ORACLE_HOME from one location to another using OS level commands. However, one thing this will not do is update the Oracle Inventory. Having an incomplete Inventory will screw up any attempts at subsequent patching of this cloned ORACLE_HOME.

oracle@bdb ~]$ opatch lsinventory
Invoking OPatch 10.2.0.4.2

Oracle Interim Patch Installer version 10.2.0.4.2
Copyright (c) 2007, Oracle Corporation.  All rights reserved.

Oracle Home       : /opt/oracle/product/10.2.0.2
Central Inventory : /home/oracle/oraInventory
   from           : /etc/oraInst.loc
OPatch version    : 10.2.0.4.2
OUI version       : 10.2.0.2.0
OUI location      : /opt/oracle/product/10.2.0.2/oui
Log file location : /opt/oracle/product/10.2.0.2/cfgtoollogs/opatch/opatch2008-02-08_11-58-44AM.log

List of Homes on this system:

  Home name= OraDb10g_home1, Location= "/opt/oracle/product/10.2.0"
  Home name= OraDb10g_home2, Location= "/opt/oracle/product/11.1.0"
Inventory load failed... OPatch cannot load inventory for the given Oracle Home.
Possible causes are:
   Oracle Home dir. path does not exist in Central Inventory
   Oracle Home is a symbolic link
   Oracle Home inventory is corrupted
LsInventorySession failed: OracleHomeInventory gets null oracleHomeInfo

OPatch failed with error code 73

Oracle clearly support the ability to clone an ORACLE_HOME as the oui installer has an option to attach a cloned home to the Inventory:

 runInstaller -attachHome ORACLE_HOME="/opt/oracle/product/10.2.0.2" ORACLE_HOME_NAME="ORADB_102"

After you have attached the cloned home you can quite happily apply interim patches and patchsets to your newly cloned ORACLE_HOME.

ORA-00600 error (arguments [13009], [5000], …) on Oracle 10.2.0.2 Database

1 Star2 Stars3 Stars4 Stars5 Stars (1 votes, average: 4 out of 5)
Loading ... Loading ...
Posted by patrick on Jan 7th, 2008

We experienced the following error on one of our Oracle test databases:

ORA-00600: internal error code, arguments: [13009], [5000], [1], [17], [1], [], [], []

The statement which generated this error was a simple select ... for update nowait:

SELECT rowid, key, suffix, status
FROM table1
WHERE KEY = 'example_key'
    AND suffix = 'co.uk'
    AND status = 0
FOR UPDATE NOWAIT;

ERROR at line 2:
ORA-00600: internal error code, arguments: [13009], [5000], [1], [17], [1], [], [], []

However, performing the same operation accessing the row via Oracle rowid was successful:

SELECT rowid, key, suffix, status
FROM table1
WHERE rowid = 'AAAPBRAAPAAAAI2AAk'
FOR UPDATE NOWAIT;

ROWID              KEY           SUFFIX                       STATUS
------------------ ------------- ---------- ------------------------
AAAPBRAAPAAAAI2AAk example-key   co.uk                             0

This led me to believe there must be a problem with the access method used by the first statement, probably a corruption on an index. Performing an explain plan revealed two indexes were used by the first statement:

explain plan for
SELECT rowid, key, suffix, status
FROM table1
WHERE KEY = 'example_key'
    AND suffix = 'co.uk'
    AND status = 0
FOR UPDATE NOWAIT;

---------------------------------------------------------------------------------------
| Id  | Operation                        | Name       | Rows  | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |            |     1 |    12   (9)| 00:00:01 |
|   1 |  FOR UPDATE                      |            |       |            |          |
|   2 |   BITMAP CONVERSION TO ROWIDS    |            |     1 |    12   (9)| 00:00:01 |
|   3 |    BITMAP AND                    |            |       |            |          |
|   4 |     BITMAP CONVERSION FROM ROWIDS|            |       |            |          |
|*  5 |      INDEX RANGE SCAN            | IX1_TABLE1 |     1 |     3   (0)| 00:00:01 |
|   6 |     BITMAP CONVERSION FROM ROWIDS|            |       |            |          |
|   7 |      SORT ORDER BY               |            |       |            |          |
|*  8 |       INDEX RANGE SCAN           | PK_TABLE1  |     1 |     8   (0)| 00:00:01 |
---------------------------------------------------------------------------------------

I decided to validate the structure of the table and indexes, but this revealed no errors:

analyze table table1 validate structure cascade;
Table analyzed.

I next rebuilt the indexes used by the query:

alter index pk_table1 rebuild online;
Index altered.

alter index ix1_table1 rebuild online;
Index altered.

This resolved the problem:

SELECT rowid, key, suffix, status
FROM table1
WHERE KEY = 'example_key'
    AND suffix = 'co.uk'
    AND status = 0
FOR UPDATE NOWAIT;

ROWID              KEY           SUFFIX                       STATUS
------------------ ------------- ---------- ------------------------
AAAPBRAAPAAAAI2AAk example-key   co.uk                             0

Oracle Linguistic Indexes

1 Star2 Stars3 Stars4 Stars5 Stars (No Ratings Yet)
Loading ... Loading ...
Posted by jason on Jan 4th, 2008

Quite a bit of synchronicity occurred recently to the dba team here at Nominet. We have been following the writings of Richard Foote and in particular an article on Linguistic Indexes. I thought the article interesting though somewhat obscure and filed it at the back of mind.

We have been working on an upgrade to our ticketing/service management system and we came across a custom written query that was performing like a dog when run by the application, but producing fast performance when run by other schema owners:

SELECT count (*) FROM   user.domains d WHERE  d.key = 'nominet';

------------------------------------------------------------------------------------
| Id  | Operation         | Name           | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |                |     1 |    14 |     2   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE   |                |     1 |    14 |            |          |
|*  2 |   INDEX RANGE SCAN| PK_DOMAINS_IDX |     1 |    14 |     2   (0)| 00:00:01 |
------------------------------------------------------------------------------------

But when run as the user the application connects as:

SELECT count (*) FROM   user.domains d WHERE  d.key = 'nominet';

--------------------------------------------------------------------------------------------
| Id  | Operation                 | Name           | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |                |     1 |    14 |  5209   (3)| 00:01:03 |
|   1 |  SORT AGGREGATE           |                |     1 |    14 |            |          |
|   2 |   PX COORDINATOR          |                |       |       |            |          |
|   3 |    PX SEND QC (RANDOM)    | :TQ10000       |     1 |    14 |            |          |
|   4 |     SORT AGGREGATE        |                |     1 |    14 |            |          |
|   5 |      PX BLOCK ITERATOR    |                |     1 |    14 |  5209   (3)| 00:01:03 |
|*  6 |       INDEX FAST FULL SCAN| PK_DOMAINS_IDX |     1 |    14 |  5209   (3)| 00:01:03 |
--------------------------------------------------------------------------------------------

We had other examples where a primary key index was being ignored by the application schema user which was doing a full table scan of a large table instead. The obvious conclusion to jump to (and the immediate one which we did), is a difference in the optimizer environments for the two schemas. We checked this using V$sql_optimizer_env, but the different child cursors produced for the query had the same optimizer environment settings. I even did a 10132 event dump of running the query in the 2 schemas, but I could not for the life of me put my finger on what was producing the differing plans.

It had to be something in the environment of the application schema, so we looked at roles & privileges until eventually we looked at logon triggers, and there it was:

begin
	       execute immediate 'alter session set NLS_COMP=LINGUISTIC';
	       execute immediate 'alter session set NLS_SORT=BINARY_CI';
	    end;

There is a good chapter in the Oracle documentation on Linguistic sorting, which you can read for yourself. There are a couple of things that are surprising for me, first these nls changes to a session do not seem to get into the V$sql_optimizer_env or other exposed views of what the optimizer environment was at run time, they clearly can have a large impact on the explain plan generated. Secondly, I don’t understand what was wrong with the traditional function-based index and having UPPER(column) whenever a case-insensitive search is required.

Of course in the interest of full disclosure I should point out that full explain plan had the following hiding after the plan:

       filter(NLSSORT("KEY",'nls_sort=''BINARY_CI''')=HEXTORAW('69747600') )

Be Careful when setting log_archive_max_processes

1 Star2 Stars3 Stars4 Stars5 Stars (1 votes, average: 4 out of 5)
Loading ... Loading ...
Posted by jason on Jan 3rd, 2008

Anyone interested in high-availability in an Oracle setting would do well to read the MAA pages on the Oracle website, there are many good papers and guidelines for optimising your environment contained there. For dataguard, I highly recommend the best practices for Redo Transport & Redo Apply papers. One recommendation contained there is increasing the log_archive_max_processes parameter, which controls the number of arc processes the instance uses, which in turn can help resolve any archive gaps. Unfortunately, I think I may have over egged this one, and as they say nothing is for free. Each server process takes some memory for it’s PGA and Oracle server background processes are no exception.

Lately, our monitoring system has been warning about low free memory on one of our RAC nodes, which seemed to be correlated to the number of oracle processes running. Hence I started looking at how much memory we were using for our pga:

SQL> select * from V$PGASTAT;

NAME		                        VALUE     UNIT
--------------------------------    ---------- ------------
aggregate PGA target parameter        199229440   bytes
aggregate PGA auto target              12451840   bytes
global memory bound                    39845888   bytes
total PGA inuse                      3658247168   bytes
total PGA allocated                  5270902784   bytes
maximum PGA allocated                6450857984   bytes
total freeable PGA memory             320667648   bytes
.
.

So this Instance had allocated 5GB to the pga, it’s interesting to note that the oracle documentation state that the total pga allocated will try to be kept under the aggegrate pga target, but we are an order of magnitude over here. I had thought the pga target parameter was maximum amount that could be allocated to each session, rather than a global total. Certainly my 190MB pga_aggregate_target is being overallocated by a long way.

I then looked at V$process to check what processes were using the most pga:

SQL> SELECT PROGRAM, PGA_USED_MEM, PGA_ALLOC_MEM/(1024*1024) "ALLOC(MB)",
PGA_FREEABLE_MEM FREEABLE, PGA_MAX_MEM
FROM V$PROCESS
order by pga_alloc_mem asc;

PROGRAM 	        PGA_USED_MEM  ALLOC(MB)FREEABLE PGA_MAX_MEM
----------------------  ------------  --------- ----- -------------
.
.
oracle@PROD (ARCg)	24274349     48.0507708   0    50384885
oracle@PROD (ARCf)	24276029     48.0507708   0    50384885
oracle@PROD (ARCe)	24274205     48.0507708   0    50384885
oracle@PROD (ARCd)	24277965     48.0507708   0    50384885
oracle@PROD (ARCc)	24259573     48.0507708   0    50384885
oracle@PROD (ARC5)	24274389     48.0507708   0    50384885
.
.

Basically, every archive background process was consuming nearly 50MB of memory. All 30 of them. That is 1.5GB of RAM. Thankfully, the log_archive_max_processes parameter is dynamic and I could remedy the situation with a simple alter system command, and after a little wait the unrequired arc processes were removed and the system had a lot more free memory.

Testing database locking with autonomous transactions

1 Star2 Stars3 Stars4 Stars5 Stars (2 votes, average: 4.5 out of 5)
Loading ... Loading ...
Posted by chris on Dec 21st, 2007

We recently hit a problem caused by a database row lock. The actual update being made was not particularly important, as it was just a denormalized value pre-calculated for convenience. However, the consequences of the lock were rather nasty, with many processes waiting for the lock to be released. As a result I decided to change the code to allow the update to be deferred. So if the update encountered a lock, it would write its change to a queue table instead, to be applied later. I did this by opening a cursor with FOR UPDATE NOWAIT. This will throw an exception if it can’t get an immediate lock on the relevant rows.

So far so good, but how do you test something like this? The code in question has been around quite a while and is in PL/SQL (these days most new code we write is in Java). This means we use utPLSQL to test this functionality and so the tests themselves are also written in PL/SQL. The answer is the use of autonomous transactions. These allow you to mark a PL/SQL program unit with a special marker PRAGMA AUTONOMOUS_TRANSACTION that makes it run in its own transaction. So if the code we are testing is in the Make_The_Update procedure of the MyPackage package, the test code would look something like this:

.....

    --This Procedure does its thing in another
    --transaction to force an error to occur
    PROCEDURE Update_Externally IS
      PRAGMA AUTONOMOUS_TRANSACTION;
    BEGIN
       MyPackage.Make_The_Update(1000);
       COMMIT;
    END;

    PROCEDURE ut_check_locking IS
    BEGIN

      --Make an update in the first transaction, but don't commit
      MyPackage.Make_The_Update(10);
      --Make an update in the second transaction
      Update_Externally;
      --Check that it wrote to the queue table here
      ......

    END;

.....

For those unfamiliar with utPLSQL, test procedures are named ut_*. The other procedure is just there for convenience. I have left out much of the boilerplate, such as the setup and teardown code and so on.

So let’s look at what it does. The test procedure ut_check_locking makes an update, but doesn’t commit, then calls the other procedure to attempt an update in a different transaction. (Note that if our code wasn’t designed to deal with this situation, this would be a surefire way to generate a deadlock. Luckily Oracle detects these quite quickly, so if you make a mistake you’ll just get an exception after a few seconds). Finally, we check if the relevant rows have been written to our queue table. Doing this in utPLSQL is left as an exercise for the reader.

I initially had some problems with this, because I forgot that any test data needs to be visible from both transactions. Initially the main transaction just created it and then rolled it back at the end. This gave me some odd ORA-01410: Invalid ROWID errors as I was trying to update non-existent rows. Then I realised that I needed to explicitly commit this data in the setup.

I’ve used autonomous transactions in production code for doing things such as logging when you want the main transaction to roll back. But I’ve not seen them used as an aid to test issues around concurrent data access. But they are certainly helpful in that situation too.

DBCA and large filesystems

1 Star2 Stars3 Stars4 Stars5 Stars (3 votes, average: 1.67 out of 5)
Loading ... Loading ...
Posted by jason on Nov 9th, 2007

I never used to like dbca for creating a database, I much prefered the more hardcore (and more configurable) method of creating a script to create the database, and indeed all the production databases created here at nominet have been scripted rather than gui’d into existence. However I have found myself using dbca more and more, particularly since we upgraded to Oracle 10.2 a while back, dbca is much more useable, I used to find this java painfully slow, but 10.2 performs much faster. The reason I’ve been using it more is that our developers want empty databases that they then run the auto schema building tools on to give them the schemas to develop against but without data. I find dbca very useful for creating this type of database.
However it seems like dbca (10.2.0.3) has a limit on the size of filesystem that it can understand and throws an error when attempting to install onto a filesystem that is larger than it understands:

dbca file overflow

There was actually around 16TB of free space available rather than the -1 the gui is suggesting. Thankfully clicking ignore allowed the installation to proceed successfully.

« Prev - Next »

Recent Posts

Highest Rated

Categories

Archives

Meta: