Today, on an AIX 10g RAC database, the alert log on node 1 reported the following message, caused by node 2 restarting: LMS 0: 8069 GCS shadows traversed, 4001 replayed

 

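Afterwards I looked up some material online: changing the system time can also produce the message above and cause the machine to reboot.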

 

Reposting an article by kamus from itpub:

 

 

Apart from Windows and Linux, will changing the OS time on any RAC after 10.2.0.2 cause the OS to reboot?

While testing Oracle 10.2.0.3 RAC, I found that if the system time on a node is changed by more than 1.5 seconds, that node is automatically rebooted.

What a harsh way to handle it ......

For the detailed mechanism, see the Internal Only Metalink Note 308051.1.

The OPROCD executable sets a signal handler for the SIGALRM handler and sets the interval timer based on the to-millisec parameter provided.  The alarm handler gets the current time and checks it against the time that the alarm handler was last entered.  If the difference exceeds (to-millisec + margin-millisec), it will fail; the production version will cause a node reboot. 
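
To make the check in the note concrete, here is a rough Python sketch of the comparison it describes. This is only a toy model, not Oracle's code: the class and function names are made up for illustration, and the split of 1.5 seconds into a 1000 ms timeout plus a 500 ms margin is my assumption. The clock readings are passed in as plain numbers so the logic can be followed without a real timer.

```python
from dataclasses import dataclass


@dataclass
class MarginWatchdog:
    """Toy model of the alarm-handler check; names are hypothetical."""
    timeout_ms: int            # stands in for to-millisec
    margin_ms: int             # stands in for margin-millisec
    last_entry_ms: float = 0.0

    def on_alarm(self, now_ms: float) -> bool:
        """Return True if the gap since the last alarm is within bounds."""
        elapsed = now_ms - self.last_entry_ms
        self.last_entry_ms = now_ms
        return elapsed <= self.timeout_ms + self.margin_ms


def simulate(clock_readings_ms, timeout_ms=1000, margin_ms=500):
    """Feed the clock value seen at each alarm.

    Returns the index of the first reading that fails the check,
    or None if every reading passes. The defaults are assumed values.
    """
    dog = MarginWatchdog(timeout_ms, margin_ms, last_entry_ms=clock_readings_ms[0])
    for i, now in enumerate(clock_readings_ms[1:], start=1):
        if not dog.on_alarm(now):
            return i  # the real daemon in fatal mode would reboot here
    return None


# Alarms arrive every second, but the clock is pushed forward 5 s before the third one.
first_failure = simulate([0, 1000, 7000, 8000])
```

In this model a forward change of the system time is indistinguishable from the process not being scheduled for that long, which would fit the reboot seen in the test.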

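I tried changing the OPROCD settings in /etc/init.cssd, setting DISABLE_OPROCD to TRUE, and then rebooted the system. The oprocd process was no longer among the system processes, yet after I changed the system time the machine was still rebooted.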

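Elsewhere the document says that if OPROCD is started in non fatal mode, it only writes a log entry and does not reboot the machine. Note:265769.1 also describes how to switch to non fatal mode, but I did not try it.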

In fatal mode, OPROCD will reboot the node if it detects excessive wait. In Non Fatal mode, it will write an error message out to the file <hostname>.oprocd.log in one of the following directories.

What finally worked was disabling the cssd process entirely, which avoids the reboot triggered by changing the system time.

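Lately I have found Oracle 10g CRS to be rather heavy-handed. In the previous test, pulling the cable from the Private IP network card made the OS reboot; this time even changing the system time reboots the system. Does it really think these machines are Windows? Rebooting a UNIX server is a big deal, yet CRS treats it as casually as having a meal and reboots at the drop of a hat.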

The material below describes the conditions under which the three Oracle CRS processes will reboot the machine.

Oracle clusterware has the following three daemons which may be responsible for panicking the node. It is possible that some other external entity may have rebooted the node. In the context of this discussion, we will assume that the reboot/panic was done by an Oracle clusterware daemon.

* Oprocd  - Cluster fencing module
* Cssd - Cluster synchronization module which manages node membership
* Oclsomon - Cssd monitor which will monitor for cssd hangs

OPROCD: This is a daemon that only gets activated when there is no vendor clusterware present on the OS. This daemon is also not activated to run on Windows/Linux. This daemon runs a tight loop and if it is not scheduled for 1.5 seconds, will reboot the node.
CSSD: This daemon pings the other members of the cluster over the private network and Voting disk. If this does not get a response for Misscount seconds and Disktimeout seconds respectively, it will reboot the node.
Oclsomon: This daemon monitors the CSSD to ensure that CSSD is scheduled by the OS, if it detects any problems it will reboot the node.
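
The CSSD rule above pairs two heartbeat paths with two thresholds, and a small Python sketch may make the "respectively" easier to read. The function name and the numbers in the usage line are hypothetical and are not Oracle defaults. It only illustrates the comparison the excerpt describes: private network silence is checked against Misscount, voting disk silence against Disktimeout.

```python
def cssd_reboot_reasons(net_silence_s, disk_silence_s, misscount_s, disktimeout_s):
    """Return which heartbeat paths have exceeded their threshold.

    An empty list means neither threshold was crossed. All arguments are
    in seconds; the thresholds must be supplied by the caller because this
    sketch does not assume any default values.
    """
    reasons = []
    if net_silence_s > misscount_s:
        reasons.append("network heartbeat")   # private network, Misscount
    if disk_silence_s > disktimeout_s:
        reasons.append("voting disk")         # voting disk, Disktimeout
    return reasons


# Hypothetical thresholds: 30 s for the network, 200 s for the voting disk.
reasons = cssd_reboot_reasons(net_silence_s=31, disk_silence_s=10,
                              misscount_s=30, disktimeout_s=200)
```

With these made-up numbers only the network path is over its limit, which matches the earlier test where pulling the private network cable led to a reboot.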

I need to find a way to disable these reboot behaviors. A reboot does not solve the problem anyway, so why all the fuss.

 
