HPUNIX调整文件打开数.docx

资源描述

HPUNIX调整文件打开数.docx

《HPUNIX调整文件打开数.docx》由会员分享，可在线阅读，更多相关《HPUNIX调整文件打开数.docx（9页珍藏版）》请在冰点文库上搜索。

HPUNIX调整文件打开数.docx

HPUNIX调整文件打开数

一天，在一个业务高峰期，某省的一个数据库突然宕机了，根据alertlog中发现的报错：

MonJan1209:

49:

422009

ARC0:

Evaluatingarchive log5thread1sequence142032

ARC0:

Beginningtoarchivelog5thread1sequence142032

CreatingarchivedestinationLOG_ARCHIVE_DEST_1:

'/oraarchlog/arch_1_142032.arc'

MonJan1209:

49:

562009

ARC0:

Completedarchiving log5thread1sequence142032

MonJan1209:

57:

572009

Errorsinfile/oracle/app/oracle/admin/orasid/bdump/orasid_lgwr_8267.trc:

ORA-00313:

openfailedformembersofloggroup1ofthread1

ORA-00312:

onlinelog1thread1:

'/dev/vg_ora3/rredo_256m_21'

ORA-27041:

unabletoopenfile

HP-UXError:

23:

Filetableoverflow

Additionalinformation:

ORA-00312:

onlinelog1thread1:

'/dev/vg_ora2/rredo_256m_11'

ORA-27041:

unabletoopenfile

HP-UXError:

23:

Filetableoverflow

Additionalinformation:

ORA-00312:

onlinelog1thread1:

'/dev/vg_ora1/rredo_256m_01'

ORA-27041:

unabletoopenfile

HP-UXError:

23:

Filetableoverflow

Additionalinformation:

MonJan1209:

57:

572009

Errorsinfile/oracle/app/oracle/admin/orasid/bdump/orasid_lgwr_8267.trc:

ORA-00313:

openfailedformembersofloggroup1ofthread1

ORA-00312:

onlinelog1thread1:

'/dev/vg_ora3/rredo_256m_21'

ORA-27041:

unabletoopenfile

HP-UXError:

23:

Filetableoverflow

Additionalinformation:

ORA-00312:

onlinelog1thread1:

'/dev/vg_ora2/rredo_256m_11'

ORA-27041:

unabletoopenfile

HP-UXError:

23:

Filetableoverflow

Additionalinformation:

看来是打开文件数太多，导致了此次数据库宕机。

通过监控打开文件数发现，其实在前一天的时候，打开文件数已经到98%，在上午8点多的时候，打开文件数已经到99%，但是一直都没有引起监控人员的注意，终于导致打开文件数被撑爆。

我们通过sar-v可以发现打开文件数的总数：

HP-UXmy_db01B.11.11U9000/800 01/17/09

01:

08:

52text-sz ov proc-sz ov inod-sz ov file-sz ov

01:

08:

54 N/A N/A1139/4096 0 2678/348160 113494/2000100

01:

08:

56 N/A N/A1139/4096 0 2678/348160 113494/2000100

01:

08:

58 N/A N/A1135/4096 0 2675/348160 113487/2000100

01:

09:

00 N/A N/A1135/4096 0 2675/348160 113487/2000100

01:

09:

02 N/A N/A1135/4096 0 2675/348160 113487/2000100

其中的file-sz就是指打开文件数。

其中20多万的那个值是打开文件数的阈值，前者是当前实际打开的文件数。

这阈值是有核心参数nfile决定的。

我们在设置nfile的时候，建议是（15*NPROC+2048）；提醒值是oracle的process数乘以datafile数再加2048；极限值，在存数据库环境应该是os的极限进程数乘以数据文件数，即每个进程都打开所有的数据文件。

Nfile

MaxNumberofsimultaneouslyOpen

filessystem-wideatanygiventime.Total

numberofslotsitthefiledescriptortable,

default=16*（nproc+16+maxusers）/10+32+2*（npty+nstrpty）

recommended:

tousedefault.

>=Oracle9i=（15*NPROC+2048）

ForOracleinstallationswithahighnumberofdatafilesthismightbenotenough,thanusethe

>=（（numberoforacleprocesses）*（numberofOracledatafiles）+2048）

Nproc

MaxNumberofProcessesthatcanexist

simultaneouslyinthesystem,

default=（20+8*MAXUSERS）,

influencesninode,nfile.

recommended:

>=Oracle9i=4096

根据计算，如果根据建议值，那么nfile将设置成63488，如果根提醒值，将有141万，如果根据极限值，将有386万：

SQL> !

oracle@my_db01:

/oracle>kmtune|grepnproc

nproc 4096 - 4096

oracle@my_db01:

/oracle>exit

SQL>select15*4096+2048fromdual;

15*4096+2048

------------

63488

SQL>showparameterprocess

NAME TYPE VALUE

-----------------------------------------------------------------------------

processes integer 1500

SQL>

SQL>selectcount（*）fromv$datafile;

COUNT（*）

----------

944

SQL>select1500*944+2048fromdual;

1500*944+2048

-------------

1418048

SQL>select4096*944fromdual;

4096*944

----------

3866624

SQL>

对于设置6万，肯定是完全不适合我们这个系统；如果设置141万或者设置386万，这似乎又太大了。

到底是什么进程需要打开那么多文件数？

在sar-v的命令中，我们只看到了一个总数，能具体的去看什么进程打开多少文件数吗？

在这里，HP提供了一个工具：

crashinfo。

可以查看各个进程的打开文件数：

crashinfo

用法是：

在root权限下：

my_db01#[/tmp]chmod+xcrashinfo

my_db01#[/tmp]./crashinfo-ofiles

此时能看到各个进程的打开文件数情况：

my_db01#[/tmp]./crashinfo-ofiles

crashinfo（4.74）

Note:

HPCONFIDENTIAL

libp4（9.305）:

Opening/stand/vmunix/dev/kmem

Loadingsymbolsfrom/stand/vmunix

KernelTEXTpagesnotrequestedincrashconf

Willuseanartificialmappingfroma.outTEXTpages

Commandline:

./crashinfo-ofiles

crashinfo（4.74）

pid p_maxofofilesp_comm

------------------------

1 4096 2 init

2011 4096 4 getty

94 4096 4 evmmirr

62 4096 4 emcpdaemon

63 4096 4 emcpProcd

64 4096 4 emcpd

65 4096 4 emcpdfd

66 4096 4 emcpwdd

67 4096 4 emcpstratd

68 4096 4 MpAsyncIoDaemo

69 4096 4 MpTestDaemon

70 4096 4 MpPirpNotDoneD

71 4096 4 MpPeriodicDaem

72 4096 4 MpDeferredDaem

73 4096 4 MpGeneralReque

74 4096 4 MpcAsyncIoDaem

75 4096 4 MpcResumeIoDae

76 4096 4 MpcPeriodicCal

129864096 671 oracle

130054096 671 oracle

132914096 10 oracle

129914096 671 oracle

……

224664096 51 oracle

123004096 847 oracle

128984096 9 oracle

123024096 979 oracle

130034096 671 oracle

130094096 670 oracle

130124096 459 oracle

130144096 49 oracle

Total=116775

my_db01#[/tmp]ps-ef|grep12298

oracle12298 1 0 Jan15 ?

31ora_dbw0_orasid

root1246911867 102:

06:

39pts/te 0:

00grep12298

my_db01#[/tmp]ps-ef|grep12300

oracle12300 1 0 Jan15 ?

32ora_dbw1_orasid

root1336311867 002:

16:

09pts/te 0:

00grep12300

my_db01#[/tmp]

通过ps-ef|grep打开文件数最多的几个进程号，发现打开文件数最多的进程是dbwr，smon，lgwr，chkpt。

每个进程基本上800～900个打开文件数，且dbwr进程有8个，打开文件数那就更加多了。

其他打开文件数较多的进程是LOCAL=NO的进程，通过观察spid关联sql_text，发现是对历史表的操作。

历史表是大表，表空间有300多个数据文件，因此，这些进程的打开文件数，也在每个600左右。

此时忽然记起，前段时间该省做优化时，当时DBA将dbwr进程数从2调整到了8，dbwr数增加了4倍，导致打开文件数也相应的增加，这应该就是造成nfile在近期慢慢增长到撑爆为止的原因了吧。

找到了原因，问题就比较好解决了。

找了一个停机的时间，把db_writer_processes调整回2，且为了以防万一，再将nfile值改为40万。

附：

crashinfo命令的其他用法：

Usage:

crashinfo[options...][coredir|kernelcore]

Default:

coredir="."if"INDEX"filepresentelse

kernel="/stand/vmunix"core="/dev/kmem"

Options:

-h|-help[flag,flag...]

flags:

detail

-u|-update

-v|-verbose

-c|-continue

-H|-Html

-e|-email[,flag,flag...]

flags:

file=

from=

callid=

-t|-trace[flag,flag...]

flags:

args

regs （PAOnly）

Rregs （PAOnly）

locals （IA64Only）

frame （IA64Only）

mems （IA64Only）

bsp （IA64Only）

ss （IA64Only）

-s|-syscall

-f|-full_comm

-S|-Sleep

-l|-listonly[flag,flag...]

flags:

pid=

-top[count]

-p|-proctree[flag,flag...]

flags:

tids

-i|-ioscan

-ofiles[pid][,flag,flag...]

flags:

tids

-signals[pid]

-seminfo

-unlinked[all]

-m|-msginfo[flag,flag...]

flags:

skip

id=

msg=

type=

-flocks[flag,flag...]

flags:

sleep

vnode=

inode=

-sgfr[version]

-fwlog[cpu]

-vmtrace[flag,flag...]

flags:

bucket=

arena=

count=

leak

cor

log

parse

如：

my_db01#[/tmp]ps-ef|grepckpt

oracle12304 1 0 Jan15 ?

29ora_ckpt_orasid

root1350911867 002:

18:

35pts/te 0:

00grepckpt

my_db01#[/tmp]

my_db01#[/tmp]./crashinfo-ofiles12304

crashinfo（4.74）

Note:

HPCONFIDENTIAL

libp4（9.305）:

Opening/stand/vmunix/dev/kmem

Loadingsymbolsfrom/stand/vmunix

KernelTEXTpagesnotrequestedincrashconf

Willuseanartificialmappingfroma.outTEXTpages

Commandline:

./crashinfo-ofiles12304

crashinfo（4.74）

pid p_maxofofilesp_comm

------------------------

123044096 961 oracle

my_db01#[/tmp]

·【文章发布信息】发表于：

2009-01-17@02:

22:

05·||分类：

..experience,Workingcase

·【分享或收藏】：

5条评论»

1.David.Guo于2009-01-17@19:

15:

27留言：

nfile的使用要监控起来，我们这边一般到85就开始要走应急流程处理了。

85，是生命线呀

2.小荷于2009-01-17@23:

40:

53留言：

reDavid.Guo：

我们也是有监控的，不过没引起监控人员的注意。

唉，失败呀～～～

不过更失败的是SA的serviceguard没配好，没monitor到oracle进程，数据库宕机后竟然没自动切换到B机。

3.dhhb于2009-01-19@12:

50:

56留言：

我们这里nfile是100W.呵呵

我记得lsof可以看打开文件数的。

4.oracle.dba于2009-02-25@21:

51:

17留言：

最近也遇到这个问题了，但是没有工具进行分析查看。

能否把这个crashinfo的工具共享下，网站上下载不下来，

发我邮箱也可以

ora315@

谢谢！

5.make于2009-05-14@17:

06:

27留言：

我也遇到这个问题，按照metalink的提示增加了

nproc=MaxNumberofProcesses

nfile=MaxNumberofOpenFiles

nflocks=MaxNumberofFileLocks

的值，但是file-sz的值一直在涨（要是股票就好了），最后又down了，再增大，再dwon，我晕了。

请何大师指点

展开阅读全文