Tuesday, February 5, 2019

How to resolve CRS-4535 "Cannot communicate with Cluster Ready Services" and CRS-4534 "Cannot communicate with Event Manager"?

PROBLEM:

We had an error while trying to start Cluster Ready Services (CRS) for the database.

Application connections were getting TNS errors because the listener could not identify the requested service:

ORA-12514: TNS:listener does not currently know of service requested in connect descriptor
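
As a quick side check (not part of the original notes), you can confirm from a cluster node whether the database service is still registered with the listener; the listener name below is an assumption and will vary by environment:

# Run as the grid user; LISTENER is the default local listener name.
lsnrctl services LISTENER
# If the database service is missing from the output, the ORA-12514 errors
# seen by the applications are consistent with CRS being unable to manage
# and register the services.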

A cluster check as the root user showed errors on all nodes, and CRS would return an error when starting.

[test0b.test.com: trace]# crsctl check cluster -all
 **************************************************************
 test0a:
CRS-4535: Cannot communicate with Cluster Ready Services
 CRS-4529: Cluster Synchronization Services is online
 CRS-4533: Event Manager is online
 **************************************************************
 test0b:
 CRS-4535: Cannot communicate with Cluster Ready Services
 CRS-4529: Cluster Synchronization Services is online
 CRS-4533: Event Manager is online
 **************************************************************
test0c:
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4529: Cluster Synchronization Services is online
CRS-4534: Cannot communicate with Event Manager
 **************************************************************
 test0d:
 CRS-4535: Cannot communicate with Cluster Ready Services
 CRS-4529: Cluster Synchronization Services is online
 CRS-4533: Event Manager is online
 **************************************************************
 test0e:
 CRS-4535: Cannot communicate with Cluster Ready Services
 CRS-4529: Cluster Synchronization Services is online
 CRS-4533: Event Manager is online
 **************************************************************
 test0f:
 CRS-4535: Cannot communicate with Cluster Ready Services
 CRS-4529: Cluster Synchronization Services is online
 CRS-4533: Event Manager is online
 **************************************************************

[test0a.test.com: bin]# ./crsctl start res ora.crsd -init
 CRS-2672: Attempting to start 'ora.crsd' on 'test0a'
 CRS-2674: Start of 'ora.crsd' on 'test0a' failed
 CRS-2679: Attempting to clean 'ora.crsd' on 'test0a'
 CRS-2681: Clean of 'ora.crsd' on 'test0a' succeeded
 CRS-4000: Command Start failed, or completed with errors.

crsctl status res -t -init showed the EVM daemon in an INTERMEDIATE state on test0c:
ora.evmd      1        ONLINE  INTERMEDIATE test0c               STABLE
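
To see which of the init-level daemons is unhealthy on a given node, the individual resources can also be queried one at a time; a short sketch, using the Grid home path from this environment (adjust for yours):

# As root on the affected node.
GRID_HOME=/u01/app/12.2.0.1/grid
$GRID_HOME/bin/crsctl status res ora.crsd -init
$GRID_HOME/bin/crsctl status res ora.evmd -init
$GRID_HOME/bin/crsctl status res ora.asm -init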

grid@test0c.test.com:$ crsctl query crs activeversion;

CRS-6750: unable to get the active version
CRS-6752: Active version query failed.

crsctl query crs activeversion --> this command was working fine on the other nodes
ocrcheck --> this command was working fine on all nodes except node "C"
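
For reference, ocrcheck needs to be run as root to perform the full integrity check; a minimal sketch, assuming the 12.2 Grid home path used elsewhere in this post:

# As root; a healthy node reports the OCR location and a successful
# integrity check, while node "C" returned CRS-1013.
/u01/app/12.2.0.1/grid/bin/ocrcheck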

The issue started on node "C". When we tried to start CRS on this node, we saw the following errors in the CRS alert log (/u01/app/grid/diag/crs/test0c/crs/trace/alert.log):

Note: In 11g the cluster logs are under the Grid home instead, for example:

/u01/app/11.2.0.4/grid/log/test08/alerttest8.log

cd /u01/app/11.2.0.4/grid/log/test08/client

crsctl_root.log

crswrapexece.log

emcrsp.log

crsctl_orarom.log

olsnodes.log


In /u01/app/grid/diag/crs/test0c/crs/trace/alert.log:

2019-01-04 01:00:43.320 [ORAROOTAGENT(6205)]CRS-5019: All OCR locations are on ASM disk groups [OCR], and none of these disk groups are mounted. Details are at "(:CLSN00140:)" in "/u01/app/grid/diag/crs/test0c/crs/trace/ohasd_orarootagent_root.trc".

2019-01-04 01:03:45.723 [OCRCHECK(67613)]CRS-1013: The OCR location in an ASM disk group is inaccessible.

2019-01-04 01:03:45.723 [OCRCHECK(67613)]CRS-1013: The OCR location in an ASM disk group is inaccessible. Details in /u01/app/grid/diag/crs/test0c/crs/trace/ocrcheck_67613.trc.
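
Because the alert log points at the OCR disk group not being mounted, it is worth confirming the ASM disk group state on the affected node before touching anything else; a minimal sketch, assuming the grid user's environment is set for the local ASM instance:

# As the grid user on test0c, with ORACLE_SID/ORACLE_HOME set for +ASM.
asmcmd lsdg                    # lists only the disk groups currently mounted
# Or query the ASM instance directly:
sqlplus -s / as sysasm <<'EOF'
select name, state from v$asm_diskgroup;
EOF
# If the OCR disk group is missing or shows DISMOUNTED, CRSD cannot read
# the OCR and CRS-4535 is the expected symptom.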

SOLUTION:

There were errors on multiple nodes, but the general fix was to clear stale files that Clusterware uses to track process state, and then restart.

The files cleared were in the following directory: $GRID_HOME/crs/init

grid@test0a.test.com:$ pwd
/u01/app/12.2.0.1/grid/crs/init
grid@test0a.test.com:$ ls -lt
total 72
-rw-r--r-- 1 root root 0 Jan 4 05:36 test0a
-rw-r--r-- 1 root root 6 Jan 4 05:36 test0a.pid
-rw-r--r-- 1 root root 6939 Sep 8 16:44 afd
-rw-r--r-- 1 root root 7193 Sep 8 16:44 afd.sles
-rw-r--r-- 1 root root 11878 Sep 8 16:44 init.ohasd
-rw-r--r-- 1 root root 12199 Sep 8 16:44 init.ohasd.sles
-rw-r--r-- 1 root root 7394 Sep 8 16:44 ohasd
-rw-r--r-- 1 root root 7715 Sep 8 16:44 ohasd.sles
-rw-r--r-- 1 root root 4347 Sep 8 16:44 oka
-rw-r--r-- 1 root root 564 Sep 8 16:44 olfs
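
If you need to do the same, a safer variant is to move the node-specific state files aside rather than deleting them outright; a sketch, assuming test0a's files are the stale ones (substitute your own node name and Grid home):

# As root, on each node in turn, ideally with the local CRS stack stopped.
cd /u01/app/12.2.0.1/grid/crs/init
mkdir -p /tmp/crs_init_backup
mv test0a test0a.pid /tmp/crs_init_backup/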

After the files were removed, we were able to restart the cluster services without error (this needed to be done on all nodes for a clean start). A typical per-node restart sequence is sketched below.
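
The restart itself is the standard per-node stop and start of the stack (a sketch only; the exact commands were not captured in the original notes):

# As root on each node, using the 12.2 Grid home from this environment.
/u01/app/12.2.0.1/grid/bin/crsctl stop crs -f      # force-stop any half-started stack
/u01/app/12.2.0.1/grid/bin/crsctl start crs
/u01/app/12.2.0.1/grid/bin/crsctl check cluster -all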

Additionally, we cleared some other files on test0c.

We cleared all the files under /var/tmp/.oracle/*, /tmp/.oracle/*, /usr/tmp/.oracle/*, /u01/app/12.2.0.1/grid/ctss/init, and /u01/app/grid/crsdata/test0c/output (see the sketch after this paragraph).
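
A sketch of that cleanup, assuming the clusterware stack is fully down on test0c before the socket files are touched (the contents of these directories are recreated when the stack starts):

# As root on test0c only, with CRS stopped on this node.
rm -rf /var/tmp/.oracle/* /tmp/.oracle/* /usr/tmp/.oracle/*
# Move (rather than delete) the remaining files referenced above:
mkdir -p /tmp/crs_cleanup_backup
mv /u01/app/12.2.0.1/grid/ctss/init/* /tmp/crs_cleanup_backup/ 2>/dev/null
mv /u01/app/grid/crsdata/test0c/output/* /tmp/crs_cleanup_backup/ 2>/dev/null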

Finally, after the extra files were moved and the command below was run to start the CRSD resource, everything came back clean.

crsctl start res ora.crsd -init 
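
After the start command succeeds, the same checks from the beginning of the post can be used to confirm the fix:

# As root: every node should now report CRS, CSS and EVM online.
crsctl check cluster -all
# As grid: the active version query should now succeed on node "C" as well.
crsctl query crs activeversion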

