random technical thoughts from the Nominet technical team

Nameservers and very large zones

1 Star2 Stars3 Stars4 Stars5 Stars (1 votes, average: 5 out of 5)
Loading ... Loading ...
Posted by jay on Jun 2nd, 2008

There comes a point in a zone’s life when it gets too big to be held in memory. For TLDs we really only have .com that has reached this scale, but ENUM zones with this problem are numerous.

So if you want to run an authoritative nameserver for these zones you basically have these options:

  1. Use a DB plugin to an existing nameserver like BIND. However the performance from this is so poor that this is not a reasonable option in most scenarios.
  2. Buy an off-the-shelf nameserver like ANS from Nominum. However this is only a real option for the fantastically rich or those with such a controlled network they can make do with just two nameserver instances.
  3. Use a service provider that runs the zones for you and has their own database back end technology. We use UltraDNS who have this functionality.
  4. Write your own. Of course this is all Nominum and UltraDNS have done and more recently CommunityDNS, so how hard can it be.

Common misconceptions about a database back end

So how exactly do you go about designing the database back end for a nameserver? Well, in my opinion most people start back to front and continue that way.

What gets most people excited about using databases as a back end is three things. Each of which is quite wrong and can be dismissed in turn.

Use the main registration database

If you start using databases then why not have the nameserver run off the main registration database. That is already a fault tolerant cluster and it would mean that the nameserver was always instantly up to date.

This is madness for several reasons. First, no database has 100% uptime but a nameserver cluster should. To be clear, that does not mean every nameserver available 100% of the time, but at least one nameserver can be accessed 100% of the time.

Then there is the issue of zone file serial numbers. Are you going to update the serial number for every single update of the database (yes you have to)? What happens when things go wrong and need to be unwound? What happens if you need to restore from a backup?

There are also the performance issues from the way the data is stored. It is likely to be optimised for the registration system, not the nameserver system that pulls it off.

Finally this limits you to nameservers that are connected to the database by a reliable, fast and secure channel. Maybe possible in a single enterprise but not to be tried across the public Internet.

In essence running it this way is just too brittle and should not be considered. To be honest I don’t think any experienced people would think of that, but I wanted to make sure I covered all the options.

Multi-headed nameserver

The second cause for excitement is the possibility of a multi-headed nameserver. In other words, lots of front ends all dealing with the same database back end (which we assume is separate from the source database). The reasoning for this is that databases are large, expensive beasts, optimised for handling data requests from many clients. Whereas the front ends are much lighter, simpler beasts that are optimised around network processing. So fitting the two together seems to be a natural fit.

It also means that updates are only processed by one machine, the database, not directly impacting the others. Fewer updates need to be sent and there is less chance of inconsistency of the data.

In order to see why this is wrong (in most cases) we need to think about what kind of database do we really need for a nameserver. It turns out that we don’t actually need most of the features found in a modern RDBMS. For example we don’t need views, stored procedures, pluggable indexes etc. All we need is a simple, fast and efficient database, which doesn’t require hardware of a different nature to the front end.

The next point is to examine where the bottlenecks of performance are in that setup. The most obvious one is whenever the front end needs to ask the database for the data rather than use its cache. It starts with the network transfer then the contention for the lookup, it may have to go to disk, then the network transfer between the front end and back end. I would contend that with a multi-headed setup these bottlenecks are much worse than a simple 1-1 configuration of one front end and one database both on the same machine.

Note that one of the bottlenecks is not the processing of updates. The time spent processing updates is about one thousandth the time spent handling requests (or less) and so has no impact of the overall performance unless it is very badly implemented. Having more updates as a result of the 1-1 config is not going to outweigh the benefits.

Database replication

The third cause for excitement is the possibility of using database replication. It brings up a vision of nameservers magically being up to date across a whole cluster without anyone doing anything. After all databases are supposed to be good at replication.

This is the easiest point to rebut. Database replication is general purpose and proprietary. Whereas DNS already has an open standard, tried and tested and DNS specific database replication mechanism - AXFR and IXFR. Yes I hear you say, but using database replication means that you don’t need to serialise the data into DNS and back again and translate between DNS packet structure and the database structure.

Well for a start, the database structure should be very close to the DNS packet structure, not much point in having it any other way. Then of course even database replication has to do the serialisation at some point or another, so is there any thing actually gained by it?

The real killer though is the impact on caches. A sensible nameserver implementation will have multiple levels of data store, with a small, fast, pre-compiled cache first in line to check, then the database next which will aim for an in-memory hit first and finally a disk hit if all else fails. If you allow data to enter the nameserver directly at the database level then you need a mechanism for the database to signal the cache that it may have invalid data, and then have the cache refresh itself.

In my view this is simply too complex and will just slow down performance. Far easier to do away with it and have all updates come in the front door, where the nameserver knows they are and so can update its cache accordingly.

Ideal nameserver design

So putting this altogether gives me an ideal design for the database back end in a nameserver. As you’ve probably guessed, the big emphasis for me has been on performance, but that’s because adding a database is bound to be slower than an in-memory representation and so everything possible needs to be done to compensate for that.

In this design then the database is a lightweight embedded database in every single instance of the nameserver. It is stripped of all unnecessary features and the data tables and indexes are optimised solely for great DNS performance.

All access to this database is through the front end - either IXFR, AXFR, DDNS or a control process. No round-the-back access to the database is possible, it is hermetically sealed.

You may even be lucky and find a database used like this obviates the need for a separate cache.

Toad for Oracle: Illegible PL/SQL Comments

1 Star2 Stars3 Stars4 Stars5 Stars (No Ratings Yet)
Loading ... Loading ...
Posted by patrick on May 16th, 2008

I found after I upgraded Toad for Oracle from 8.6.1 to 9.6.1, I could no longer read comments in PL/SQL code. They appeared as illegible, nonsense, unreadable text.

This is actually a problem with displaying the comments as italics. To avoid this, change the displaying of comments to be non-italicised as described below.

From the menu select:
View –> Toad Options –> Editor –> Behavior

In the Languages box, ensure PL/SQL is present in the drop down list, then select “Syntax highlighting”.

On the Highlighting tab, select “Comment” from the Styles list. Untick “Italic” in the Font style box and then press the Apply button.

PL/SQL comments should now be displayed legibly.

Tracing Oracle Procedures submitted via dbms_job

1 Star2 Stars3 Stars4 Stars5 Stars (1 votes, average: 5 out of 5)
Loading ... Loading ...
Posted by patrick on May 2nd, 2008

In order to investigate a performance problem in our Oracle database, I needed to trace one of our batch PL/SQL procedures, which has been submitted via the dbms_job package to execute every morning. I had traced the procedure successfully when executing it manually from a SQL*Plus session: a SQL trace file was produced in the directory defined by the database parameter user_dump_dest. Below is a description of how I then went on to trace the procedure when it was executing as a job submitted via dbms_job.

Assume a job has been submitted to execute every morning at 9am:

variable job_num number
exec dbms_job.submit (:job_num, my_procedure, sysdate, '(trunc(sysdate + 1) + 9/24)');
commit;

From a SQL*Plus session examine dba_jobs_running, to verify that the job has started and obtain its session identifier (sid):

select * from dba_jobs_running;

 SID  JOB FAILURES LAST_DATE LAST_SEC THIS_DATE           THIS_SEC INSTANCE
---- ---- -------- --------- -------- ------------------- -------- --------
 356  467                             02/05/2008 14:37:35 14:37:35        0

Obtain the Oracle serial number (serial#) for this database session:

select sid, serial# from v$session where sid = 356;

SID SERIAL#
--- -------
356     701

Use the dbms_monitor.session_trace_enable procedure to start SQL tracing for this session:

exec dbms_monitor.session_trace_enable (356, 701);

A SQL trace file will be produced in the directory defined by the database parameter background_dump_dest. I was originally mistakenly looking for the file in the user_dump_dest directory, but background_dump_dest is the correct one, as it is an Oracle background process being traced here.

Once the job has completed use the tkprof utility to convert the output to a summarised, readable format:

tkprof livedb_j000_1850.trc sql_trace.txt sys=no explain=username/password

Oracle data skew & statistics

1 Star2 Stars3 Stars4 Stars5 Stars (No Ratings Yet)
Loading ... Loading ...
Posted by jason on Mar 28th, 2008

I recently encountered a classic example of the Oracle Optimizer being fooled by data skew. James Morle, presented at Hotsos recently on how the vast majority of issues are caused by skew and latency, certainly I have seen many bad plans chosen by the Oracle optimizer that were the result of the optimizer not being aware of the distribution of the data.

So I have a view looking like the following:

 Name                       Null?    Type
--------------------------  -------   --------------
 KEY                        NOT NULL VARCHAR2(255)
 SUFFIX                     NOT NULL VARCHAR2(255)
 INSTANCE                   NOT NULL NUMBER(10)
 NS_ID                      NOT NULL NUMBER(10)
 CREATED                    NOT NULL DATE
 CREATED_BY                          VARCHAR2(200)
 REMOVED                             DATE
 REMOVED_BY                          VARCHAR2(200)

The type of query that we were having issues was of the form:

SELECT COUNT(*)
FROM NS_ON_DOMAINS
WHERE KEY = :B4
AND SUFFIX = :B3
AND INSTANCE = :B2
AND NS_ID = :B1
AND REMOVED IS NULL

First thing to bear in mind, is that key, suffix is pretty selective. ns_id, can be highly selective, but it can also be extremely unselective. I used the technique described by Greg Rahn to determine why the optimizer was choosing a particular plan that was providing a response time of the order of 3 minutes.

---------------------------------------------------------------------------------------------
| Id  | Operation   |Name|Starts|E-Rows|A-Rows|A-Time|Buffers|Reads                         |
---------------------------------------------------------------------------------------------
| 1 |SORT AGGREGATE  |  |  1 |     1 |     1 |00:02:07.29 | 19668 | 17459                   |
|*2 | TABLE ACCESS BY INDEX ROWID| NS_ON_DOMAINS|1 | 1 | 0 |00:02:07.29 |19668 |17459       |
|*3 |  INDEX RANGE SCAN          | IX_NS_ON_DOMAINS_NS | 1 |1 |49495 |00:00:00.32 |178 |177 |
---------------------------------------------------------------------------------------------

So the optimizer has chosen to use the IX_NS_ON_DOMAINS_NS with the NS_ID value, and thinks this will be highly selective. Very often this turns out to be the case, however in a particular case, this value is completely non-selective and the actual numbers of rows is quite huge.

Forcing the optimizer to use the other obvious index, the one on key, suffix, and instance we have the following:

-----------------------------------------------------------------------------
| Id  | Operation| Name |Starts|E-Rows|A-Rows|A-Time|Buffers|Reads          |
------------------------------------------------------------------------------
|  1|SORT AGGREGATE             |    |  1 |  1 | 1 |00:00:00.01 |4 |      1 |
|*2| TABLE ACCESS BY INDEX ROWID| NS_ON_DOMAINS |1 |1 |0 |00:00:00.01 |4 |1 |
|*3|  INDEX RANGE SCAN  | IX_NS_ON_DOMAINS_DOMAIN |1|3|0 |00:00:00.01 |4|1  |
-----------------------------------------------------------------------------

Now the optimizer is over estimating the number of rows it thinks it will find via this access path. This tells us why for the particular values of the bind variables we were looking at, that the plan chosen by the optimizer is not the most optimal path to the data. The optimizer is being tricked by the fact we have a very large skew on this data.

In the end we decided to use a hint for this particular query to force it to use the better index.

DBMS_MONITOR causing instance hanging

1 Star2 Stars3 Stars4 Stars5 Stars (No Ratings Yet)
Loading ... Loading ...
Posted by jason on Feb 25th, 2008

Recently we had an experience of a completely hung Oracle instance. This was a RAC database running version 10.2.0.3, only 1 instance in the 2 node cluster was affected, but there was no way of logging onto the affected instance either remotely or even as sysdba locally. Perhaps, we should have done a more thorough root cause analysis to find out why the system was hung using some of the techniques from this excerpt. I had never heard of the -prelim option to sqlplus before and it may have been interesting to perform a systemstate dump, but this was a busy production system and we had a really good idea what had caused it and just wanted to get back up and running again.

The system had been running along fine, until a colleague was examining an issue with a 3rd party piece of software and was using DBMS_MONITOR to obtain trace information of the connecting session. He noticed that every time he traced a connection from this application, it would suddenly make a new connection, which he then promptly would run DBMS_MONITOR on. This went on for around 6 sessions until eventually the instance hung to new connections and the helpdesk phones started to ring.

It was pretty clear we had to stop tracing ASAP, and right enough, once we had worked out which pid’s to kill server side the instance returned back to normal, just killing the sqlplus sessions on the client side was not actually doing the trick. Thankfully DBMS_MONITOR puts the pid number into the filename it uses to output the trace data to. However another way of obtaining which processes to kill would have been to run the following (we knew the traces were being run remotely from a specific client) on the still working node:

SQL> select machine, process
from gv$session;

It seems you can even kill a session running on a different node in a RAC cluster, so we probably could have even tried doing this via the sid, serial# information available from gv$session.

It turns out there is a bug with DBMS_MONITOR in Oracle 10.2.0.3, it is Bug 5485914 available in metalink, the notes for this state it may cause the instance to run slowly/hang. There is a patch available, also hopefully fixed in the imminent 10.2.0.4. I’m quite convinced it was the multiple times DBMS_MONITOR was running that caused the instance hang, so if you are on 10.2.0.3 be careful how many sessions you monitor at once.

Encountering an Oracle CRS-0213 error

1 Star2 Stars3 Stars4 Stars5 Stars (No Ratings Yet)
Loading ... Loading ...
Posted by jason on Feb 20th, 2008

This is really just a quick post about the cause of the Oracle error CRS-0213 which I encountered recently and found while doing a google search as well as looking this error up on metalink that there was very little information about this error out there, and nothing that actually helped me solve the problem.

I encountered this error as I was upgrading a RAC database from 10.2.0.3 to 11.1.0.6 and I was attempting to ensure CRS used the new ORACLE_HOME environment, i.e. the 11g one rather than the old 10g one. To do this for the listener you do the following:

/opt/oracle/product/crs/bin/srvctl remove listener -n nodename

/opt/oracle/product/crs/bin/srvctl add listener -n nodename -o NEW ORACLE_HOME

Note this is using the CRS_HOME rather than the ORACLE_HOME. However, when I ran the add part I encountered the following error:

[oracle@linuxrac1 ~]$ /opt/oracle/product/crs/bin/srvctl add listener -n linuxrac1 -o /opt/oracle/product/11.1.0
CRS-0213: Could not register resource 'ora.linuxrac1.LISTENER_LINUXRAC1.lsnr'.

Now the solution to this problem is not exactly earth shattering, but as I said I thought I would blog it anyway as CRS-0213 is not all well known and it might help someone out of a hole. We had the incorrect ORACLE_HOME, in true OFA style I included db_1 onto the end of the 11g ORACLE_HOME, so the above should read -o /opt/oracle/product/11.1.0/db_1. Running with the correct home works fine.

With having 3 separate Oracle software directories now, one for CRS, one for ASM and one for the RDBMS instance, I highly recommend you all look at implementing OFA if you do not already do so!

Cloning an ORACLE_HOME

1 Star2 Stars3 Stars4 Stars5 Stars (No Ratings Yet)
Loading ... Loading ...
Posted by jason on Feb 8th, 2008

Some time you may have a requirement to copy an ORACLE_HOME from one location to another using OS level commands. However, one thing this will not do is update the Oracle Inventory. Having an incomplete Inventory will screw up any attempts at subsequent patching of this cloned ORACLE_HOME.

oracle@bdb ~]$ opatch lsinventory
Invoking OPatch 10.2.0.4.2

Oracle Interim Patch Installer version 10.2.0.4.2
Copyright (c) 2007, Oracle Corporation.  All rights reserved.

Oracle Home       : /opt/oracle/product/10.2.0.2
Central Inventory : /home/oracle/oraInventory
   from           : /etc/oraInst.loc
OPatch version    : 10.2.0.4.2
OUI version       : 10.2.0.2.0
OUI location      : /opt/oracle/product/10.2.0.2/oui
Log file location : /opt/oracle/product/10.2.0.2/cfgtoollogs/opatch/opatch2008-02-08_11-58-44AM.log

List of Homes on this system:

  Home name= OraDb10g_home1, Location= "/opt/oracle/product/10.2.0"
  Home name= OraDb10g_home2, Location= "/opt/oracle/product/11.1.0"
Inventory load failed... OPatch cannot load inventory for the given Oracle Home.
Possible causes are:
   Oracle Home dir. path does not exist in Central Inventory
   Oracle Home is a symbolic link
   Oracle Home inventory is corrupted
LsInventorySession failed: OracleHomeInventory gets null oracleHomeInfo

OPatch failed with error code 73

Oracle clearly support the ability to clone an ORACLE_HOME as the oui installer has an option to attach a cloned home to the Inventory:

 runInstaller -attachHome ORACLE_HOME="/opt/oracle/product/10.2.0.2" ORACLE_HOME_NAME="ORADB_102"

After you have attached the cloned home you can quite happily apply interim patches and patchsets to your newly cloned ORACLE_HOME.

ORA-00600 error (arguments [13009], [5000], …) on Oracle 10.2.0.2 Database

1 Star2 Stars3 Stars4 Stars5 Stars (1 votes, average: 4 out of 5)
Loading ... Loading ...
Posted by patrick on Jan 7th, 2008

We experienced the following error on one of our Oracle test databases:

ORA-00600: internal error code, arguments: [13009], [5000], [1], [17], [1], [], [], []

The statement which generated this error was a simple select ... for update nowait:

SELECT rowid, key, suffix, status
FROM table1
WHERE KEY = 'example_key'
    AND suffix = 'co.uk'
    AND status = 0
FOR UPDATE NOWAIT;

ERROR at line 2:
ORA-00600: internal error code, arguments: [13009], [5000], [1], [17], [1], [], [], []

However, performing the same operation accessing the row via Oracle rowid was successful:

SELECT rowid, key, suffix, status
FROM table1
WHERE rowid = 'AAAPBRAAPAAAAI2AAk'
FOR UPDATE NOWAIT;

ROWID              KEY           SUFFIX                       STATUS
------------------ ------------- ---------- ------------------------
AAAPBRAAPAAAAI2AAk example-key   co.uk                             0

This led me to believe there must be a problem with the access method used by the first statement, probably a corruption on an index. Performing an explain plan revealed two indexes were used by the first statement:

explain plan for
SELECT rowid, key, suffix, status
FROM table1
WHERE KEY = 'example_key'
    AND suffix = 'co.uk'
    AND status = 0
FOR UPDATE NOWAIT;

---------------------------------------------------------------------------------------
| Id  | Operation                        | Name       | Rows  | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |            |     1 |    12   (9)| 00:00:01 |
|   1 |  FOR UPDATE                      |            |       |            |          |
|   2 |   BITMAP CONVERSION TO ROWIDS    |            |     1 |    12   (9)| 00:00:01 |
|   3 |    BITMAP AND                    |            |       |            |          |
|   4 |     BITMAP CONVERSION FROM ROWIDS|            |       |            |          |
|*  5 |      INDEX RANGE SCAN            | IX1_TABLE1 |     1 |     3   (0)| 00:00:01 |
|   6 |     BITMAP CONVERSION FROM ROWIDS|            |       |            |          |
|   7 |      SORT ORDER BY               |            |       |            |          |
|*  8 |       INDEX RANGE SCAN           | PK_TABLE1  |     1 |     8   (0)| 00:00:01 |
---------------------------------------------------------------------------------------

I decided to validate the structure of the table and indexes, but this revealed no errors:

analyze table table1 validate structure cascade;
Table analyzed.

I next rebuilt the indexes used by the query:

alter index pk_table1 rebuild online;
Index altered.

alter index ix1_table1 rebuild online;
Index altered.

This resolved the problem:

SELECT rowid, key, suffix, status
FROM table1
WHERE KEY = 'example_key'
    AND suffix = 'co.uk'
    AND status = 0
FOR UPDATE NOWAIT;

ROWID              KEY           SUFFIX                       STATUS
------------------ ------------- ---------- ------------------------
AAAPBRAAPAAAAI2AAk example-key   co.uk                             0

Oracle Linguistic Indexes

1 Star2 Stars3 Stars4 Stars5 Stars (No Ratings Yet)
Loading ... Loading ...
Posted by jason on Jan 4th, 2008

Quite a bit of synchronicity occurred recently to the dba team here at Nominet. We have been following the writings of Richard Foote and in particular an article on Linguistic Indexes. I thought the article interesting though somewhat obscure and filed it at the back of mind.

We have been working on an upgrade to our ticketing/service management system and we came across a custom written query that was performing like a dog when run by the application, but producing fast performance when run by other schema owners:

SELECT count (*) FROM   user.domains d WHERE  d.key = 'nominet';

------------------------------------------------------------------------------------
| Id  | Operation         | Name           | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |                |     1 |    14 |     2   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE   |                |     1 |    14 |            |          |
|*  2 |   INDEX RANGE SCAN| PK_DOMAINS_IDX |     1 |    14 |     2   (0)| 00:00:01 |
------------------------------------------------------------------------------------

But when run as the user the application connects as:

SELECT count (*) FROM   user.domains d WHERE  d.key = 'nominet';

--------------------------------------------------------------------------------------------
| Id  | Operation                 | Name           | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |                |     1 |    14 |  5209   (3)| 00:01:03 |
|   1 |  SORT AGGREGATE           |                |     1 |    14 |            |          |
|   2 |   PX COORDINATOR          |                |       |       |            |          |
|   3 |    PX SEND QC (RANDOM)    | :TQ10000       |     1 |    14 |            |          |
|   4 |     SORT AGGREGATE        |                |     1 |    14 |            |          |
|   5 |      PX BLOCK ITERATOR    |                |     1 |    14 |  5209   (3)| 00:01:03 |
|*  6 |       INDEX FAST FULL SCAN| PK_DOMAINS_IDX |     1 |    14 |  5209   (3)| 00:01:03 |
--------------------------------------------------------------------------------------------

We had other examples where a primary key index was being ignored by the application schema user which was doing a full table scan of a large table instead. The obvious conclusion to jump to (and the immediate one which we did), is a difference in the optimizer environments for the two schemas. We checked this using V$sql_optimizer_env, but the different child cursors produced for the query had the same optimizer environment settings. I even did a 10132 event dump of running the query in the 2 schemas, but I could not for the life of me put my finger on what was producing the differing plans.

It had to be something in the environment of the application schema, so we looked at roles & privileges until eventually we looked at logon triggers, and there it was:

begin
	       execute immediate 'alter session set NLS_COMP=LINGUISTIC';
	       execute immediate 'alter session set NLS_SORT=BINARY_CI';
	    end;

There is a good chapter in the Oracle documentation on Linguistic sorting, which you can read for yourself. There are a couple of things that are surprising for me, first these nls changes to a session do not seem to get into the V$sql_optimizer_env or other exposed views of what the optimizer environment was at run time, they clearly can have a large impact on the explain plan generated. Secondly, I don’t understand what was wrong with the traditional function-based index and having UPPER(column) whenever a case-insensitive search is required.

Of course in the interest of full disclosure I should point out that full explain plan had the following hiding after the plan:

       filter(NLSSORT("KEY",'nls_sort=''BINARY_CI''')=HEXTORAW('69747600') )

Be Careful when setting log_archive_max_processes

1 Star2 Stars3 Stars4 Stars5 Stars (1 votes, average: 4 out of 5)
Loading ... Loading ...
Posted by jason on Jan 3rd, 2008

Anyone interested in high-availability in an Oracle setting would do well to read the MAA pages on the Oracle website, there are many good papers and guidelines for optimising your environment contained there. For dataguard, I highly recommend the best practices for Redo Transport & Redo Apply papers. One recommendation contained there is increasing the log_archive_max_processes parameter, which controls the number of arc processes the instance uses, which in turn can help resolve any archive gaps. Unfortunately, I think I may have over egged this one, and as they say nothing is for free. Each server process takes some memory for it’s PGA and Oracle server background processes are no exception.

Lately, our monitoring system has been warning about low free memory on one of our RAC nodes, which seemed to be correlated to the number of oracle processes running. Hence I started looking at how much memory we were using for our pga:

SQL> select * from V$PGASTAT;

NAME		                        VALUE     UNIT
--------------------------------    ---------- ------------
aggregate PGA target parameter        199229440   bytes
aggregate PGA auto target              12451840   bytes
global memory bound                    39845888   bytes
total PGA inuse                      3658247168   bytes
total PGA allocated                  5270902784   bytes
maximum PGA allocated                6450857984   bytes
total freeable PGA memory             320667648   bytes
.
.

So this Instance had allocated 5GB to the pga, it’s interesting to note that the oracle documentation state that the total pga allocated will try to be kept under the aggegrate pga target, but we are an order of magnitude over here. I had thought the pga target parameter was maximum amount that could be allocated to each session, rather than a global total. Certainly my 190MB pga_aggregate_target is being overallocated by a long way.

I then looked at V$process to check what processes were using the most pga:

SQL> SELECT PROGRAM, PGA_USED_MEM, PGA_ALLOC_MEM/(1024*1024) "ALLOC(MB)",
PGA_FREEABLE_MEM FREEABLE, PGA_MAX_MEM
FROM V$PROCESS
order by pga_alloc_mem asc;

PROGRAM 	        PGA_USED_MEM  ALLOC(MB)FREEABLE PGA_MAX_MEM
----------------------  ------------  --------- ----- -------------
.
.
oracle@PROD (ARCg)	24274349     48.0507708   0    50384885
oracle@PROD (ARCf)	24276029     48.0507708   0    50384885
oracle@PROD (ARCe)	24274205     48.0507708   0    50384885
oracle@PROD (ARCd)	24277965     48.0507708   0    50384885
oracle@PROD (ARCc)	24259573     48.0507708   0    50384885
oracle@PROD (ARC5)	24274389     48.0507708   0    50384885
.
.

Basically, every archive background process was consuming nearly 50MB of memory. All 30 of them. That is 1.5GB of RAM. Thankfully, the log_archive_max_processes parameter is dynamic and I could remedy the situation with a simple alter system command, and after a little wait the unrequired arc processes were removed and the system had a lot more free memory.

Next »

Recent Posts

Highest Rated

Categories

Archives

Meta: