In his comment on the last post, John Beresniewicz mentioned the following property of DB CPU :-
"… ‘DB time’
fully contains ‘DB CPU’ and both are measured values (which means
in particular that ‘DB CPU’ is not inflated by runq time)"
I’d like to talk about the end of that sentence in particular. About a year and a half ago, I wrote a paper (that’s a PDF link) about building response time profiles for sessions that use Parallel Execution. One of the problems I highlighted in that paper was that PX jobs are prone to overload the available CPU resources simply because of the number of PX slave processes that are running concurrently. It’s mentioned first in the section on "More Lost Time" on page 4 but most of the rest of the paper goes on to discuss this particular problem. I wasn’t very successful coming up with any solutions beyond my usual approach of combining Oracle and O/S tools.
Essentially, as the queue of runnable processes that are waiting to execute on a CPU grows (I’ll just refer to it as the run queue from now on), the Oracle server connection processes appear to lose time because they’re unable to get on to a CPU in order to record the end time of an event. In the Response Time profilers I’ve seen, this time is recorded as unaccounted-for and it’s what JB’s referring to when he talks about runq time. Actually, time spent in the run queue is not the only cause of unaccounted-for time but it’s the one I’ll focus on here. (Here’s another plug for Cary and Jeff’s book - other reasons are discussed in chapter 7 on Oracle Kernel Timings.)
As JB stated, DB CPU does not include this time spent waiting in the run queue but the ASH samples for the session are aware that the server connection process is on the CPU or in the run queue. As a consequence, you can see apparently contradictory graphs in the various OEM (DB Control or Grid Control) Performance pages. Although I’ve tried this with the PX tests and confirmed the same behaviour, I’m going to use Swingbench for the examples here to continue the approach so far and so that you can try them out yourself if you want. (I might post PX examples later.)
Swingbench is supplied with several example applications, one of which is the Sales History (SH) application which uses the SH example schema supplied with recent versions of Oracle. The key attribute of this test for my purpose is that it’s read only and CPU intensive, so that the limited I/O bandwidth on a home PC isn’t the bottleneck.
I ran 8 concurrent sessions of the SH application for a total duration of 5 minutes with no delay between transactions, which was pretty unrealistic but I was only interested in over-loading the single dual-core CPU in the PC. Soon after the test started, this was how the database instance’s home page looked.

CPU is at 100% and there are about 7 sessions active in the database on average, just under 2 of which are running on the CPU, while the rest are waiting. That looks about right. Now, if we look at the performance page at the end of the test, the results are quite different.

It looks like there were only ever about 2 sessions active on average! However, there are a couple of hints as to what’s going on. If we look at the graph above as well as the Average Active Sessions graph, we can see that the Host’s run queue is far in excess of the two available CPU cores.

Things become even more interesting if we look at the Top Activity screen for the same period.

In that view, the sessions waiting in the run queue are shown, so there’s an average of about 7 active sessions during the test. Take a good look at the difference in the key text for the two graphs as well. The Average Active Sessions graph on the Performance Home page shows ‘CPU Used’, whereas the Top Activity page shows ‘CPU’. It’s a small but significant and appropriate difference. The first graph shows you the average number of sessions that are actually using CPU and is related to DB CPU in the Time Model Stats, but the Top Activity screen is really showing ‘CPU + Wait for CPU’, which is the event you’ll see throughout an ASH report covering the same period. e.g.
Top Sessions
- ‘# Samples Active’ shows the number of ASH samples in which the session
was found waiting for that particular event. The percentage shown
in this column is calculated with respect to wall clock time
and not total database activity.
- ‘XIDs’ shows the number of distinct transaction IDs sampled in ASH
when the session was waiting for that particular event
- For sessions running Parallel Queries, this section will NOT aggregate
the PQ slave activity into the session issuing the PQ. Refer to
the ‘Top Sessions running PQs’ section for such statistics.
| Sid, Serial# |
% Activity |
Event |
% Event |
User |
Program |
# Samples Active |
XIDs |
| 136, 2716 |
12.75 |
CPU + Wait for CPU |
12.16 |
SH |
java.exe |
227/300 [ 76%] |
0 |
| 137, 120 |
12.64 |
CPU + Wait for CPU |
12.32 |
SH |
java.exe |
230/300 [ 77%] |
0 |
| 140, 164 |
12.59 |
CPU + Wait for CPU |
12.00 |
SH |
java.exe |
224/300 [ 75%] |
0 |
| 144, 240 |
12.53 |
CPU + Wait for CPU |
11.94 |
SH |
java.exe |
223/300 [ 74%] |
0 |
| 133, 29 |
12.37 |
CPU + Wait for CPU |
11.62 |
SH |
java.exe |
217/300 [ 72%] |
0 |
The more I think about this, the more I think that there’s probably a Rule Of Thumb here (I’m joking, really, but maybe it’s something I’ll look for during further tests). Maybe when DB Time is much bigger than the some of the wait times on various timed events plus CPU time, it’s indicative of CPU overload? Then again, maybe it’s just easier to check the process states and run queue? 
Finally, I couldn’t resist checking what ADDM makes of all this, specially for Graham Wood (that’s another PDF link). Well, in this case, ADDM is smart enough to work out what’s going on.
DETAILED ADDM REPORT FOR TASK 'TASK_46' WITH ID 46 --------------------------------------------------
Analysis Period: 28-AUG-2008 from 23:22:54 to 23:27:56 Database ID/Instance: 1533705496/1 Database/Instance Names: TEST1020/TEST1020 Host Name: DESKTOP Database Version: 10.2.0.4.0 Snapshot Range: from 26 to 27 Database Time: 2006 seconds Average Database Load: 6.6 active sessions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
FINDING 1: 100% impact (2006 seconds)-------------------------------------Host CPU was a bottleneck and the instance was consuming 98% of the host CPU. All wait times will be inflated by wait for CPU.
RECOMMENDATION 1: Host Configuration, 100% benefit (2006 seconds) ACTION: Consider adding more CPUs to the host or adding instances serving the database on other hosts. ACTION: Also consider using Oracle Database Resource Manager to prioritize the workload from various consumer groups.
ADDITIONAL INFORMATION: Host CPU consumption was 99%. CPU runqueue statistics are not available from the host's OS. This disables ADDM's ability to estimate the impact of this finding. The instance spent significant time on CPU. However, there were no predominant SQL statements responsible for the CPU load.