Tuesday, April 21, 2009

DB Performance is bad due to slow IO. Still Sysadmin need proof!!

Performance problems:
Got into another performance issue trouble shooting. Started with just like any other perf issue. Application team reported that processing is slow!. A quick look on AWR reports revealed that the average time for following timed events has increased from few ms(5-6) to few thousand ms (7000-8000). A usuall clear case of IO subsystem issue.

db file sequential read 121,032 3,441 28 60.8 User I/O
db file parallel write 123 2,928 23,808 51.8 System I/O---->avg 23808!!!
db file scattered read 32,800 866 26 15.3 User I/O
free buffer waits 38,301 381 10 6.7 Configuration
local write wait 368 359 974 6.3 User I/O

(Incidently, few words about "local write wait event".- local write wait event can occur during concurrent truncate operations.Basically what happens here is a session is waiting for its own dirtly buffers to be written to disk. Remember during truncate all dirty buffers of the table that is being truncated must be written before truncate completes. However the presence of the local write wait event in this scenario should be directly related to overall IO slowness.
There is a metalink note that gives details of local write wait that occur during multiple concurrent truncate operations. Note:334822.1)

Syadmins came into picture... After doing a "complete" check he declared there is nothing wrong at system side. The responsibility of proving it is an IO issue again back to DBAs head!!.
Took SQLStats for one of the SQL degraded. It clearly shows there is no plan change, there is no change in the number of blocks query touches etc... Only increase in the component "User IO".
However sysadmin is still sure there is no issue at System level. Again since DBA is claiming it is IO issue , and sysadmin is denying any chance of IO problem management want DBA to "prove" it is an IO issue. Suprisingly nobody is ready to belive Oracle AWR and other reports.

Ok. no choice , but to do troubleshooting at IO level (which sysadmin should do).
The direct way was to measure the performance using "dd" command.
Just did a read and write test using dd.

Write test:
# time dd if=/dev/zero of=sample.data bs=8192 count=131072
real 0m1.352s
user 0m0.002s
sys 0m1.168s
It took less than 2 sec on another server.
Now ran the same test on database server.

It did not came out even in next 15 mins !!!!.
sighhhhh.. finally proved.!!

Hmm sysadmin still came up with next arguement!! bs(blocksize) matters a lot when you use dd. Need to check what is the block size of file system.!!
God please forgive the ignorance !!!
He could not survive long with this arguement. Soon management suggested them to failover the database to another machine in the cluster. Application started the job agian once db was started on new server.. According to them job was "flying"!!!!
I did n't see the sysadmin roaming there after this!!!... Later I was told that there were some issue in IO subsystem, which was identified by some vendor !!

Crux: DBA need to be equiped with OS tools that can prove to sysadmin in a language he understands(I dont know if some of them understand anything!!)
References:
db file sequential read
'A sequential read is a single-block read'.Single block I/Os are usually the result of using indexes. Rarely, full table scan calls could get truncated to a single block call due to extent boundaries, or buffers already present in the buffer cache. These waits would also show up as 'db file sequential read'.