1 Starting the cluster failed with the following error:
[root@testdb2 ~]# /u01/app/11.2.0/grid/bin/crsctl start crs
CRS-4124: Oracle High Availability Services startup failed.
CRS-4000: Command Start failed, or completed with errors.
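2 Based on the error message and following guides found online, I removed the /var/tmp/.oracle directory (which holds npohasd), but this did not solve the problem: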
[root@testdb2 ~]# cd /var/tmp
[root@testdb2 tmp]# rm -rf .oracle
[root@testdb2 ~]# /u01/app/11.2.0/grid/bin/crsctl start crs
CRS-4124: Oracle High Availability Services startup failed.
CRS-4000: Command Start failed, or completed with errors.
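Removing the directory only helps if something recreates npohasd afterwards. A quick way to see whether it exists, and whether it is a named pipe (FIFO), is a check like the one below. This is a minimal sketch: the function name check_npohasd is hypothetical, and the assumption that npohasd should be a FIFO comes from how 11.2 Grid Infrastructure normally uses it, not from anything shown in this post.

```bash
# Report what kind of object sits at the npohasd path.
# Default path is the one from this post; pass another path to test elsewhere.
check_npohasd() {
    local path="${1:-/var/tmp/.oracle/npohasd}"
    if [ -p "$path" ]; then
        echo "fifo"      # named pipe present
    elif [ -e "$path" ]; then
        echo "other"     # something exists, but it is not a pipe
    else
        echo "missing"   # nothing there (ENOENT, as in the strace below)
    fi
}
```

For example, `check_npohasd` prints `missing` on a host in the state traced in step 3.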
3 Next, I traced the relevant process:
[root@testdb2 ~]# ps -ef|grep crsctl
root 32511 31666 0 17:34 pts/0 00:00:00 /u01/app/11.2.0/grid/bin/crsctl.bin start crs
root 36355 36109 0 17:36 pts/1 00:00:00 grep --color=auto crsctl
[root@testdb2 ~]# strace -p 32511
strace: Process 32511 attached
restart_syscall(<... resuming interrupted nanosleep ...>) = 0
open("/proc/self/status", O_RDONLY) = 3
read(3, "Name:\tcrsctl.bin\nUmask:\t0022\nSta"..., 4096) = 1334
close(3) = 0
access("/usr/lib64/qt-3.3/bin/crsctl.bin", F_OK) = -1 ENOENT (No such file or directory)
access("/usr/local/sbin/crsctl.bin", F_OK) = -1 ENOENT (No such file or directory)
access("/usr/local/bin/crsctl.bin", F_OK) = -1 ENOENT (No such file or directory)
access("/sbin/crsctl.bin", F_OK) = -1 ENOENT (No such file or directory)
access("/bin/crsctl.bin", F_OK) = -1 ENOENT (No such file or directory)
access("/usr/sbin/crsctl.bin", F_OK) = -1 ENOENT (No such file or directory)
access("/usr/bin/crsctl.bin", F_OK) = -1 ENOENT (No such file or directory)
access("/root/bin/crsctl.bin", F_OK) = -1 ENOENT (No such file or directory)
brk(NULL) = 0xefe000
brk(0xf3e000) = 0xf3e000
brk(NULL) = 0xf3e000
brk(0xfbe000) = 0xfbe000
brk(NULL) = 0xfbe000
brk(0xff6000) = 0xff6000
.........
41999 0.000019 access("/var/tmp/.oracle/npohasd", F_OK) = -1 ENOENT (No such file or directory) <0.000007>
41999 0.000020 access("/var/tmp/.oracle/npohasd", F_OK) = -1 ENOENT (No such file or directory) <0.000008>
41999 0.000019 access("/var/tmp/.oracle/npohasd", F_OK) = -1 ENOENT (No such file or directory) <0.000007>
41999 0.000020 access("/var/tmp/.oracle/npohasd", F_OK) = -1 ENOENT (No such file or directory) <0.000008>
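Deleting the /var/tmp/.oracle directory several times did not help, and the trace above shows that /var/tmp/.oracle/npohasd cannot be accessed: crsctl.bin keeps calling access() on it and gets ENOENT every time.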
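The lines prefixed with PID 41999, a relative timestamp and a trailing `<duration>` look like output from a separate strace run using `-f`, `-r` and `-T` (the command shown in this step is plain `strace -p`, so the exact options are an assumption). To confirm that crsctl.bin is only polling for the pipe, the trace can be written to a file and the failed probes counted. A minimal sketch; count_npohasd_polls is a hypothetical helper name:

```bash
# Count failed access() probes of the npohasd pipe in strace output on stdin.
# Example capture (options assumed, see above):
#   strace -f -r -T -o crsctl.trace -p <pid of crsctl.bin>
#   count_npohasd_polls < crsctl.trace
count_npohasd_polls() {
    # -F: match the text literally; -c: print the number of matching lines
    grep -cF 'access("/var/tmp/.oracle/npohasd", F_OK) = -1 ENOENT'
}
```

A large and growing count with no other activity suggests crsctl is simply waiting for something else to create the pipe.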
So I searched Oracle's support site and found the note "Linux: OS "init" process does not start init.ohasd in inittab (Doc ID 1591775.1)". It explains that crsctl cannot start the cluster because the ohasd process fails to start.
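4 Based on that note, I concluded that the failure was caused by the ohasd service not starting. Because Oracle 11g is not well supported on Red Hat 7, I suspected that the ohasd service I had created myself was misbehaving and preventing the database cluster from starting.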
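The unit file itself is not shown in this post. The status output that follows gives its path (/usr/lib/systemd/system/ohas.service) and the process command line `/bin/sh /etc/init.d/init.ohasd run >/dev/null 2>&1 Type=simple`, which suggests `Type=simple` ended up on the ExecStart line. systemd does not run ExecStart through a shell, so `>/dev/null 2>&1` and `Type=simple` there are passed to init.ohasd as extra arguments instead of acting as a redirection and a setting. The sketch below writes a unit with those items separated; the function name write_ohas_unit and the `Restart=always` choice are assumptions, not the author's file.

```bash
# Write a minimal ohas.service unit to the given path
# (default: the path shown by "systemctl status" in this post).
write_ohas_unit() {
    local dest="${1:-/usr/lib/systemd/system/ohas.service}"
    cat > "$dest" <<'EOF'
[Unit]
Description=Oracle High Availability Services

[Service]
Type=simple
# No shell is involved here, so no ">/dev/null 2>&1" redirection
ExecStart=/etc/init.d/init.ohasd run
Restart=always

[Install]
WantedBy=multi-user.target
EOF
}
```

After writing the file, `systemctl daemon-reload`, `systemctl enable ohas.service` and `systemctl start ohas.service` load and start it.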
Checking the status of ohas.service shows that although the service is running, the log reports that ohasd.bin processes died:
[root@testdb2 tmp]# systemctl status ohas.service
● ohas.service - Oracle High Availability Services
Loaded: loaded (/usr/lib/systemd/system/ohas.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2021-10-22 08:41:05 CST; 50min ago
Main PID: 26360 (init.ohasd)
Tasks: 1
CGroup: /system.slice/ohas.service
└─26360 /bin/sh /etc/init.d/init.ohasd run >/dev/null 2>&1 Type=simple
Oct 22 08:41:25 testdb2 clsecho[29462]: /etc/init.d/init.ohasd: ohasd.bin process 9443 died while waiting to move.
Oct 22 08:41:25 testdb2 init.ohasd[26360]: /etc/init.d/init.ohasd: ohasd.bin process 9443 died while waiting to move.
Oct 22 08:59:27 testdb2 clsecho[42022]: /etc/init.d/init.ohasd: 4999 > /sys/fs/cgroup/cpu,cpuacct/tasks
Oct 22 08:59:27 testdb2 init.ohasd[26360]: /etc/init.d/init.ohasd: 4999 > /sys/fs/cgroup/cpu,cpuacct/tasks
Oct 22 08:59:27 testdb2 init.ohasd[26360]: /bin/echo: write error: No such process
Oct 22 08:59:27 testdb2 clsecho[42025]: /etc/init.d/init.ohasd: 4999 > /sys/fs/cgroup/systemd/system.slice/oracle-ohasd.service/tasks
Oct 22 08:59:27 testdb2 init.ohasd[26360]: /etc/init.d/init.ohasd: 4999 > /sys/fs/cgroup/systemd/system.slice/oracle-ohasd.service/tasks
Oct 22 08:59:27 testdb2 init.ohasd[26360]: /bin/echo: write error: No such process
Oct 22 08:59:27 testdb2 clsecho[42032]: /etc/init.d/init.ohasd: ohasd.bin process 4999 died while waiting to move.
Oct 22 08:59:27 testdb2 init.ohasd[26360]: /etc/init.d/init.ohasd: ohasd.bin process 4999 died while waiting to move.
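In these messages init.ohasd tries to write the ohasd.bin PID into cgroup `tasks` files, and the write fails with "No such process" because that PID no longer exists; each attempt then ends with "died while waiting to move". (The cgroup path in the log names oracle-ohasd.service, while the unit shown above is ohas.service.) To see how many distinct ohasd.bin processes died, the journal text can be filtered as below. A minimal sketch; ohasd_died_pids is a hypothetical helper, and reading the messages via `journalctl -u ohas.service` is an assumption about where they are logged.

```bash
# Print the distinct PIDs of ohasd.bin processes reported as
# "died while waiting to move", reading log text from stdin, e.g.:
#   journalctl -u ohas.service --no-pager | ohasd_died_pids
ohasd_died_pids() {
    # clsecho and init.ohasd log the same message twice; sort -u removes duplicates
    grep -oE 'ohasd\.bin process [0-9]+ died while waiting to move' \
        | awk '{ print $3 }' \
        | sort -un
}
```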
5 Following note Doc ID 1591775.1 ("Linux: OS "init" process does not start init.ohasd in inittab"), I modified the host configuration and started the database cluster again; this time it started normally.
The change was as follows:
[root@testdb2 tmp]# cat /etc/inittab |grep -v "#"
htfa:35:respawn:/etc/init.d/init.tfa run >/dev/null 2>&1 </dev/null
h1:35:respawn:/etc/init.d/init.ohasd run
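The post does not show how /etc/inittab was edited. If the change is to be scripted, the sketch below appends the h1 entry only when no h1 entry exists yet, so running it twice does not duplicate the line. The function name ensure_ohasd_inittab is hypothetical, and the entry text is copied exactly from the output above. Back up /etc/inittab before editing it.

```bash
# Append the ohasd respawn entry to an inittab-format file unless an
# entry with id "h1" is already present (idempotent).
ensure_ohasd_inittab() {
    local file="${1:-/etc/inittab}"
    local entry='h1:35:respawn:/etc/init.d/init.ohasd run'
    if ! grep -q '^h1:' "$file"; then
        printf '%s\n' "$entry" >> "$file"
    fi
}
```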
After more than ten test runs, the following command starts the stack reliably, and the CRS-4124 and CRS-4000 errors have not recurred:
[root@testdb2 tmp]# /u01/app/11.2.0/grid/bin/crsctl start crs
CRS-4123: Oracle High Availability Services has been started.
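Repeating the start test a dozen times is easier if each result is classified automatically. The sketch below maps `crsctl start crs` output to a one-word result using only the message codes seen in this post (CRS-4123 for success, CRS-4124/CRS-4000 for failure). The function name classify_crs_start is hypothetical.

```bash
# Classify "crsctl start crs" output read from stdin by its CRS message code.
classify_crs_start() {
    local out
    out=$(cat)
    case "$out" in
        *CRS-4123*)           echo started ;;  # stack started
        *CRS-4124*|*CRS-4000*) echo failed ;;  # the errors from step 1
        *)                    echo unknown ;;
    esac
}
```

For example, after each restart: `/u01/app/11.2.0/grid/bin/crsctl start crs 2>&1 | classify_crs_start`.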
