问题描述:
10月25日上午滨州网通的工程师报告oss应用系统运行缓慢,具体操作是通过oss系统查询话单,很长时间才能返回结果,严重影响了客户的正常使用。
问题处理:
1.登陆数据库主机,用sar命令看到idle的值持续为0,cpu的资源已经耗尽:
bz_db1# sar 2 4
sunos kest 5.8 generic_108528-19 sun4u 10/26/04
10:56:46 %usr %sys %wio %idle
10:56:48 1 4 95 0
10:56:50 1 5 94 0
10:56:52 0 6 93 0
10:56:54 1 6 93 0
average 1 5 94 0
2.使用top命令看到有两个明显占用cpu利用率过高的进程,以下是top命令的结果:
bz_db1# top
last pid: 1664;load averages: 3.26, 3.24, 3.69
159 processes: 152 sleeping, 2 running, 2 zombie, 1 stopped, 2 on cpu
cpu states: 1.5% idle, 72.5% user, 17.9% kernel, 8.0% iowait, 0.0% swap
memory: 2.0g real, 233m free, 2.0g swap in use, 3.4g swap free
pid username thr pr nce size res state time flts cpu command
27420 oracle 1 10 0 1.3g 1.2g cpu01 22.9h 2 31.94% oracle
27418 oracle 1 10 0 1.3g 1.2g run 23.0h 6 26.86% oracle
5943 oracle 1 59 0 1.3g 1.2g sleep 25:26 37 4.92% oracle
6295 oracle 1 55 0 1.3g 1.2g run 25:14 74 4.90% oracle
7778 oracle 1 43 0 1.3g 1.2g sleep 11:43 110 4.86% oracle
13270 oracle 1 59 0 1.3g 1.2g sleep 210.6h 0 0.96% oracle
13056 oracle 1 48 0 1.3g 1.2g sleep 303:30 0 0.37% oracle
10653 root 1 58 0 2560k 1624k cpu00 0:00 0 0.32% top
18827 oracle 1 58 0 1.3g 1.2g sleep 18.4h 0 0.31% oracle
12748 oracle 258 58 0 1.3g 1.2g sleep 555:14 0 0.21% oracle
10634 oracle 1 59 0 1.3g 1.2g sleep 0:01 0 0.21% oracle
28458 oracle 1 58 0 1.3g 1.2g sleep 535:02 0 0.18% oracle
13075 oracle 1 59 0 1.3g 1.2g sleep 326:33 0 0.15% oracle
13173 oracle 1 58 0 1.3g 1.2g sleep 593:07 0 0.13% oracle
4927 oracle 1 59 0 1.3g 1.2g sleep 33.4h 0 0.11% oracle
可以看到这两个进程号分别是27420和27418.
3.捕获占用cpu利用率过高的sql语句:
以下用到了我总结的sql语句:
sql>set line 240
sql>set verify off
sql>column sid format 999
sql>column pid format 999
sql>column s_# format 999
sql>column username format a9 heading “ora user”
sql>column program format a29
sql>column sql format a60
sql>column osname format a9 heading “os user”
sql>select p.pid pid,s.sid sid,p.spid spid,s.username username,
s.osuser osname,p.serial# s_#,p.terminal,p.program program,
p.background,s.status,rtrim(substr(a.sql_text, 1, 80)) sql
from v$process p, v$session s,v$sqlarea a where p.addr = s.paddr
and s.sql_address = a.address (+) and p.spid like %&1%;
enter value for 1: 27420(注意这里应输入占用cpu最高的进程对应的pid)
得到以下sql语句:
select nvl(sum(localcharge),0),nvl(sum(usage),0) from localusage where to_char(endtime,yyyymmdd)=20041016
and localcharge>0 and caller like 0543886%;
27418进程对应的sql语句如下:
select nvl(sum(localcharge),0) from localusage where to_char(endtime,yyyymmdd)=20041016 and caller like 0543888%;
4.使用相关用户连接到数据库,检查其执行计划:
sql>connect wacos/oss
connected.
sql>@?/rdbms/admin/utlxplan.sql
table created.
sql>set autotrace on
sql>set timing on
sql>select nvl(sum(localcharge),0),nvl(sum(usage),0) from localusage where to_char(endtime,yyyymmdd)=20041016
and localcharge>0 and caller like 0543886%;
nvl(sum(localcharge),0) nvl(sum(usage),0)
———————– —————–
0 0
elapsed: 00:02:56.37
execution plan
———————————————————-
0 select statement optimizer=choose (cost=13435 card=1 bytes=5
3)
1 0 sort (aggregate)
2 1 partition range (all)
3 2 table access (full) of localusage (cost=13435 card=1
81 bytes=9593)
statistics
———————————————————-
258 recursive calls
0 db block gets
88739 consistent gets
15705 physical reads
0 redo size
580 bytes sent via sql*net to client
651 bytes received via sql*net from client
2 sql*net roundtrips to/from client
8 sorts (memory)
0 sorts (disk)
1 rows processed
发现对localusage表做了全表扫描,什么记录也没有返回居然用了2分多钟。
sql> select nvl(sum(localcharge),0) from localusage where to_char(endtime,yyyymmdd)=20040816 and caller like 0543888%;
nvl(sum(localcharge),0)
———————–
27.6
elapsed: 00:03:56.46
execution plan
———————————————————-
0 select statement optimizer=choose (cost=13435 card=1 bytes=4
0)
1 0 sort (aggregate)
2 1 partition range (all)
3 2 table access (full) of localusage (cost=13435 card=3
615 bytes=144600)
statistics
———————————————————-
0 recursive calls
0 db block gets
88588 consistent gets
15615 physical reads
0 redo size
507 bytes sent via sql*net to client
651 bytes received via sql*net from client
2 sql*net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
这个sql语句有结果返回,发现也是对localusage表做了全表扫描,但速度也很慢,用了3分多钟。
sql> select count(*) from localusage;
count(*)
———-
5793776
该表有579万多条记录,数据量很大,全表扫描已经不再适合。
5.检查该表的类型:
sql> select index_name, table_name, status, partitioned from user_indexes where table_name=localusage;
index_name table_name status par
—————————— —————————— ——– —
i_localusage_sid localusage n/a yes
ui_localusage_st_seq localusage n/a yes
sql> select index_name,table_name,locality from user_part_indexes where table_name=localusage;
index_name table_name locali
—————————— —————————— ——
i_localusage_sid localusage local
ui_localusage_st_seq localusage local
发现该表是分区表,并在serviceid,startime和cdrsequence列上建立了分区索引,索引类型是local索引。
6.查看分区索引的索引键值:
sql> select index_name,column_name,index_owner from dba_ind_columns where table_name=localusage;
index_name column_name index_owner
——————– ——————– ——————————
i_localusage_sid serviceid wacos
ui_localusage_st_seq starttime wacos
ui_localusage_st_seq cdrsequence wacos
发现在endtime和caller列上都没有建立索引,这也是导致sql语句做全表扫描的最终原因。
7.决定创建新的分区索引以消除全表扫描:
(1).首先查看localusage表分区情况:
sql> select partition_name,tablespace_name from user_tab_partitions where table_name=localusage;
partition_name tablespace_name
—————————— ——————————
localusage_200312 wacos
localusage_200401 wacos
localusage_200402 wacos
localusage_200404 wacos
localusage_200405 wacos
localusage_200406 wacos
localusage_200407 wacos
localusage_200409 wacos
localusage_200410 wacos
localusage_200411 wacos
localusage_200403 wacos
localusage_200408 wacos
localusage_200412 wacos
13 rows selected.
(2).在caller列上创建local分区索引:
sql>set timing on
sql>create index i_localusage_caller on localusage(caller)
local
(
partition localusage_200312,
partition localusage_200401,
partition localusage_200402,
partition localusage_200404,
partition localusage_200405,
partition localusage_200406,
partition localusage_200407,
partition localusage_200409,
partition localusage_200410,
partition localusage_200411,
partition localusage_200403,
partition localusage_200408,
partition localusage_200412
)
tablespace wacos
storage(
initial 6553600
next 6553600
maxextents unlimited
pctincrease 0)
pctfree 5
nologging;
index created.
elapsed: 00:06:27.90 (由于数据量比较大,耗时6分钟)
8.再次查看执行计划:
sql>select nvl(sum(localcharge),0),nvl(sum(usage),0) from localusage where to_char(endtime,yyyymmdd)=20041016
and localcharge>0 and caller like 0543886%;
nvl(sum(localcharge),0) nvl(sum(usage),0)
———————– —————–
0 0
elapsed: 00:00:03.00
execution plan
———————————————————-
0 select statement optimizer=choose (cost=22 card=1 bytes=53)
1 0 sort (aggregate)
2 1 partition range (all)
3 2 table access (by local index rowid) of localusage (cost=22 card=181 bytes=9593)
4 3 index (range scan) of i_localusage_caller (non-unique) (cost=14 card=65063)
statistics
———————————————————-
0 recursive calls
0 db block gets
16813 consistent gets
569 physical reads
0 redo size
580 bytes sent via sql*net to client
651 bytes received via sql*net from client
2 sql*net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
这次走了索引后速度明显快多了,用了3秒钟就返回了结果。
sql>select nvl(sum(localcharge),0) from localusage where to_char(endtime,yyyymmdd)=20040816 and caller like 0543888%;
nvl(sum(localcharge),0)
———————–
27.6
elapsed: 00:00:24.73
execution plan
———————————————————-
0 select statement optimizer=choose (cost=22 card=1 bytes=40)
1 0 sort (aggregate)
2 1 partition range (all)
3 2 table access (by local index rowid) of localusage (cost=22 card=3615 bytes=144600)
4 3 index (range scan) of i_localusage_caller (non-unique) (cost=14 card=65063)
statistics
———————————————————-
0 recursive calls
0 db block gets
129336 consistent gets
7241 physical reads
0 redo size
507 bytes sent via sql*net to client
651 bytes received via sql*net from client
2 sql*net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
这个sql语句走了索引,用了24秒钟返回结果,性能明显提高了很多。