I’ve been working with Audit Vault 10.2.3 on AIX recently and ran into a problem that someone else might one day. I decided to capture the output as best I can and blog about it later. (As I don’t have the same system available any more, the formatting might leave a bit to be desired and there are one or two little steps missing that I would liked to have shown but I don’t think they’re crucial.
It was just a Pre-Production system that we could play around with but I noticed that there were no recent audit trail entries when I tried some tests. On closer inspection, the most recent audit record was a couple of days old, so it looked like the job which inserts new audit records into the Audit Vault Data Warehouse wasn’t running properly. The first step was to connect to that database instance and check the status of the job.
SQL> select owner, job_name, elapsed_time from all_scheduler_running_jobs;
OWNER JOB_NAME ------------------------------ ------------------------------ ELAPSED_TIME --------------------------------------------------------------------------- AVSYS REFRESH_WAREHOUSE_DATA +002 21:06:37.19
OK, so the Warehouse Refresh job has been running for a few days (2 days, 21 hours and 6 minutes). When I looked at the session for that job, it was waiting on "enq: PS - contention" which is (trumpet salute) a parallel execution wait event.
1 select sid, username, last_call_et, event, wait_time, seconds_in_wait, state, blocking_session, blocking_session_status 2 from v$session 3* where sid=201 SQL> /
SID USERNAME LAST_CALL_ET ---------- ------------------------------ ------------ EVENT WAIT_TIME ---------------------------------------------------------------- ---------- SECONDS_IN_WAIT STATE BLOCKING_SESSION BLOCKING_SE --------------- ------------------- ---------------- ----------- 201 AVSYS 249575 enq: PS - contention 0 48361 WAITING 169 VALID
It’s a bit of a red herring because it’s really waiting on the PX slave, so the real event should be in that session 169.
SQL> c/201/169 3* where sid=169 SQL> /
SID USERNAME LAST_CALL_ET ---------- ------------------------------ ------------ EVENT WAIT_TIME ---------------------------------------------------------------- ---------- SECONDS_IN_WAIT STATE BLOCKING_SESSION BLOCKING_SE --------------- ------------------- ---------------- ----------- 169 AVSYS 249250 cursor: pin X 0 0 WAITING UNKNOWN
So, looking at that made me Google around and come up with this. Which implies that I should look at bug number 5908030 which is stated as a duplicate of 5476091. Both RDBMS bugs and both will be fixed in (gulp) 11.2.
The next job was to kill the currently running refresh job and restart it, particularly as this was just Pre-Prod and the lack of audit records in the Warehouse was holding up testing. I tried various approaches to stopping the job, running from the highest level more elegant approaches, down to brute force.
SQL> exec dbms_scheduler.stop_job('AVSYS.REFRESH_WAREHOUSE_DATA');
BEGIN dbms_scheduler.stop_job('AVSYS.REFRESH_WAREHOUSE_DATA'); END;
*
ERROR at line 1:
ORA-27365: job has been notified to stop, but failed to do so immediately
ORA-06512: at "SYS.DBMS_ISCHED", line 164
ORA-06512: at "SYS.DBMS_SCHEDULER", line 483
ORA-06512: at line 1
Tried that a few times and decided it was probably well and truly stuck, particularly when the bugs were referring to systemstate dumps, so I tried killing the stuck slave, first
1 select sid, username, last_call_et, event, wait_time, seconds_in_wait, state, blocking_session, blocking_session_status 2 from v$session 3* where sid=201
SQL> /
SID USERNAME LAST_CALL_ET ---------- ------------------------------ ------------ EVENT WAIT_TIME ---------------------------------------------------------------- ---------- SECONDS_IN_WAIT STATE BLOCKING_SESSION BLOCKING_SE --------------- ------------------- ---------------- ----------- 201 AVSYS 251777 enq: PS - contention 0 50563 WAITING 169 VALID SQL> 3 3* where sid=201 SQL> c/201/169 3* where sid=169 SQL> /
SID USERNAME LAST_CALL_ET ---------- ------------------------------ ------------ EVENT WAIT_TIME ---------------------------------------------------------------- ---------- SECONDS_IN_WAIT STATE BLOCKING_SESSION BLOCKING_SE --------------- ------------------- ---------------- ----------- 169 AVSYS 251446 cursor: pin X 0 0 WAITING UNKNOWN
SQL> select sid, serial# from v$session where sid=169; SID SERIAL# ---------- ---------- 169 1585
SQL> alter system kill session '169,1585'; alter system kill session '169,1585' * ERROR at line 1: ORA-00031: session marked for kill
I waited and tried a few more times, but that slave really didn’t want to die …
SQL> l 1 select sid, username, last_call_et, event, wait_time, seconds_in_wait, state, blocking_session, blocking_session_status 2 from v$session 3* where sid=169 SQL> /
SID USERNAME LAST_CALL_ET ---------- ------------------------------ ------------ EVENT WAIT_TIME ---------------------------------------------------------------- ---------- SECONDS_IN_WAIT STATE BLOCKING_SESSION BLOCKING_SE --------------- ------------------- ---------------- ----------- 169 AVSYS 251725 cursor: pin X 0 0 WAITING UNKNOWN
So I tried try killing the process
SQL> select sid, serial#, paddr, osuser, process, status 2 from v$session where sid=169;
SID SERIAL# PADDR OSUSER ---------- ---------- ---------------- ------------------------------ PROCESS STATUS ------------ -------- 169 1585 070000005A383348 oracle 962750 KILLED
SQL> !ps -ef|grep 962750 oracle 962750 1 1 Jul 13 - 11:49 ora_p006_AV01TEST oracle 1519634 1757348 2 12:11:27 pts/1 0:00 grep 962750 SQL> !kill -9 962750 SQL> !ps -ef|grep 962750 oracle 1519638 1757348 1 12:11:38 pts/1 0:00 grep 962750
Now it’s gone, see if parent session for refresh job is still there
1 select sid, username, last_call_et, event, wait_time, seconds_in_wait, state, blocking_session, blocking_session_status 2 from v$session 3* where sid=201 SQL> /
no rows selected
No, so the job should be gone too
SQL> select * from all_scheduler_running_jobs;
OWNER JOB_NAME ------------------------------ ------------------------------ JOB_SUBNAME SESSION_ID SLAVE_PROCESS_ID SLAVE_OS_PRO ------------------------------ ---------- ---------------- ------------ RUNNING_INSTANCE RESOURCE_CONSUMER_GROUP ---------------- -------------------------------- ELAPSED_TIME --------------------------------------------------------------------------- CPU_USED --------------------------------------------------------------------------- AVSYS REFRESH_WAREHOUSE_DATA 122 19 692392 1 +000 00:01:25.48 +000 00:00:18.38
Well, it had restarted and been running for a minute or so. The audit trail entries started to appear in the warehouse, anyway. As I won’t be working with this again for a while, I’m counting on colleagues to keep me posted about whether this keeps happening!
Continue reading “Hanging Audit Vault Warehouse Refresh Job”