Wednesday, August 19, 2009

Finished OCM 11g

Got the result of OCM 11g exam (OCM 11g upgrade exam) after few weeks waiting. By clearing this one more OCM is added to my list of OCMs OCM(9i,10g and 11g) :)
It was pretty similar to 10g OCM exam in terms of effort and preparation required, only difference it is just one day as it is an upgrade exam for 10g OCMs.
Was great to know that was the first run of this exam, which is not available in production.
Topics for OCM 11g upgrade exam is already available in Oracle website. As per website,Exam also expected to available soon for public.

Wednesday, July 1, 2009

Data Dictionary corruption on noarchive log mode database - Is restore the only solution?

For any corruption case the first solution given( by even support) is to do a restore and recover or do an Rman Block media recovery(BMR). BMR is a feature I like very much !. In fact I have co-authored a note on metalink (Note 342972.1How to Do block media recovery when backups are not taken by rman) It was an out come of my experiments with RMAN and Backup and recovery in initial days in Oracle support- Co-authored with Ramkumar).

Ok all these things are good, if we are lucky to have backup and archive logs. What if we hit a data dictionary corruption when we are not that lucky to have them? Is it end of world? Definitely not. Here is a case I encountered last day.
Got a call from a friend and their 400GB database (noarchive log mode) got a block corruption. Issue was clarified as below after few iterations.
1) A query on one table fails with ORA-01578. The block which is being reports as corrupted is part of dependency$.
2) The ORA-1578 happens only when they use a synonym to access table.
3)Directly querying the table works well.
4)Attempt to drop the synonym also fails.
Ok so that make it clear that the corrupted block of dependency$ holds information for the synonym. Application is a packaged one and it is not possible to change the synonym it uses. (Otherwise simply we could have created another synonym and used that).

So what is the next solution? They are OK for any workaround. The DB will be refreshed from a production copy weeks later. So as long as the current issue is solved any further risk is acceptable. Good, then we can take any risk :)
Where is the name of synonym stored? Can we change it just there so that we would be able to create a new synonym with same original name?
Yes , possible. This is what we did finally..
Updated table obj$ to change name of synonym to originalname_old.
(Its a data dictionary update, which need to be followed by a commit and shutdown Abort)
That did the trick. They were able to create synonym with original name and application started functioning smoothly!!.

Tuesday, May 26, 2009

DBA Round Table.

Today delivered a small case study/presentation in a DBA Round Table event organised by Oracle singapore.
Curx of the Session was
1)How sessions will show idle wait event even when users are waiting for response from the same session. Talks about problems with shared server configuration.
2) Bind Peeking happens even without histogram. Demonstrated case showed one example- Dynamic Sampling.

Tuesday, April 21, 2009

DB Performance is bad due to slow IO. Still Sysadmin need proof!!

Performance problems:
Got into another performance issue trouble shooting. Started with just like any other perf issue. Application team reported that processing is slow!. A quick look on AWR reports revealed that the average time for following timed events has increased from few ms(5-6) to few thousand ms (7000-8000). A usuall clear case of IO subsystem issue.

db file sequential read 121,032 3,441 28 60.8 User I/O
db file parallel write 123 2,928 23,808 51.8 System I/O---->avg 23808!!!
db file scattered read 32,800 866 26 15.3 User I/O
free buffer waits 38,301 381 10 6.7 Configuration
local write wait 368 359 974 6.3 User I/O

(Incidently, few words about "local write wait event".- local write wait event can occur during concurrent truncate operations.Basically what happens here is a session is waiting for its own dirtly buffers to be written to disk. Remember during truncate all dirty buffers of the table that is being truncated must be written before truncate completes. However the presence of the local write wait event in this scenario should be directly related to overall IO slowness.
There is a metalink note that gives details of local write wait that occur during multiple concurrent truncate operations. Note:334822.1)

Syadmins came into picture... After doing a "complete" check he declared there is nothing wrong at system side. The responsibility of proving it is an IO issue again back to DBAs head!!.
Took SQLStats for one of the SQL degraded. It clearly shows there is no plan change, there is no change in the number of blocks query touches etc... Only increase in the component "User IO".
However sysadmin is still sure there is no issue at System level. Again since DBA is claiming it is IO issue , and sysadmin is denying any chance of IO problem management want DBA to "prove" it is an IO issue. Suprisingly nobody is ready to belive Oracle AWR and other reports.

Ok. no choice , but to do troubleshooting at IO level (which sysadmin should do).
The direct way was to measure the performance using "dd" command.
Just did a read and write test using dd.

Write test:
# time dd if=/dev/zero of=sample.data bs=8192 count=131072
real 0m1.352s
user 0m0.002s
sys 0m1.168s
It took less than 2 sec on another server.
Now ran the same test on database server.

It did not came out even in next 15 mins !!!!.
sighhhhh.. finally proved.!!

Hmm sysadmin still came up with next arguement!! bs(blocksize) matters a lot when you use dd. Need to check what is the block size of file system.!!
God please forgive the ignorance !!!
He could not survive long with this arguement. Soon management suggested them to failover the database to another machine in the cluster. Application started the job agian once db was started on new server.. According to them job was "flying"!!!!
I did n't see the sysadmin roaming there after this!!!... Later I was told that there were some issue in IO subsystem, which was identified by some vendor !!

Crux: DBA need to be equiped with OS tools that can prove to sysadmin in a language he understands(I dont know if some of them understand anything!!)
References:
db file sequential read
'A sequential read is a single-block read'.Single block I/Os are usually the result of using indexes. Rarely, full table scan calls could get truncated to a single block call due to extent boundaries, or buffers already present in the buffer cache. These waits would also show up as 'db file sequential read'.

Tuesday, February 3, 2009

Reproduce error, the best way to learn !!

There are serveral Oracle errors which are intermittent. Learning to reproduce this error teaches hows to prevent them too..

Recently I ecountered the following error in a critical production system.

ORA-28547: connection to server failed, probable Oracle Net admin error

The error was very intermittent and disappeared by its own after an hour. Till that time all connection attempts waited for a sometime and gave the above error.



The oerr description of error was not helpful as it talked about persistent issues.


Following observations were made.
The database in shared server mode.
All shared servers were busy in executing SQLs when the error happened.

I wanted to reproduce the problem to make sure that it has something to do with Shared server.

Did the following.
Set my test database to shared server mode.

Set max_shared_servers to 1

made a connection to the database via listener.
Verified the connection is made via listener.

SQL> select server from v$session where userNAME='SYSTEM' 2 ;
SERVER
---------
NONE

(inactive shared server connections shows "none" for v$session.server).


Next plan was to make the available shared server busy.

Just ran a small PL/SQL block like the following.

SQL> begin 2 for i in 1..1000000 loop 3 insert into t values(i); 4 commit; 5 end loop; 6 end; 7 /

Verified the session is actually taking the help of shared server.

SQL> select server from v$session where userNAME='SYSTEM' 2 ;SERVER
---------
SHARED

Ok, in my system the only shared server I have is busy in serving above command.

Attempted another connection via listener..

Wow !! session waited for 60 sec. and then gave
ORA-28547: connection to server failed, probable Oracle Net admin error.


A possible explanation:
Once the connection request reaches listener , it will be handed off. For a shared server connection , shared server must be available to do the authentication. If , all available shared servers are busy, then the connection request can time out with this error.


ORA-28547: connection to server failed, probable Oracle Net admin error.

Once the shared servers are free , the situation will be over..
A quick test to make sure that we are hitting on shared server issue, can attempt a connection request with (SERVER=DEDICATED) or attempt a local connection. This should work even when the shared server connections are failing.