Erman Arslan's Oracle Forum › RAC

start cluster error

Roshan

start cluster error

RAC(ASM) 11g
Hi Erman,

I tried to start the cluster on one node, but the command completes with errors:
ORA11g_DB2>./crsctl start cluster
CRS-2672: Attempting to start 'ora.cssd' on 'rhis-cr-0613-04'
CRS-2672: Attempting to start 'ora.diskmon' on 'rhis-cr-0613-04'
CRS-2676: Start of 'ora.diskmon' on 'rhis-cr-0613-04' succeeded
CRS-2674: Start of 'ora.cssd' on 'rhis-cr-0613-04' failed
CRS-2679: Attempting to clean 'ora.cssd' on 'rhis-cr-0613-04'
CRS-2681: Clean of 'ora.cssd' on 'rhis-cr-0613-04' succeeded
CRS-5804: Communication error with agent process
CRS-2672: Attempting to start 'ora.cssd' on 'rhis-cr-0613-04'
CRS-2672: Attempting to start 'ora.diskmon' on 'rhis-cr-0613-04'
CRS-2676: Start of 'ora.diskmon' on 'rhis-cr-0613-04' succeeded
CRS-2674: Start of 'ora.cssd' on 'rhis-cr-0613-04' failed
CRS-2679: Attempting to clean 'ora.cssd' on 'rhis-cr-0613-04'
CRS-2681: Clean of 'ora.cssd' on 'rhis-cr-0613-04' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'rhis-cr-0613-04'
CRS-2672: Attempting to start 'ora.diskmon' on 'rhis-cr-0613-04'
CRS-5804: Communication error with agent process
CRS-2676: Start of 'ora.diskmon' on 'rhis-cr-0613-04' succeeded
CRS-2674: Start of 'ora.cssd' on 'rhis-cr-0613-04' failed
CRS-2679: Attempting to clean 'ora.cssd' on 'rhis-cr-0613-04'
CRS-2681: Clean of 'ora.cssd' on 'rhis-cr-0613-04' succeeded
CRS-5804: Communication error with agent process
CRS-4000: Command Start failed, or completed with errors.
ORA11g_DB2>id

I have checked the ocssd log file:

2017-06-27 11:51:13.112: [ CSSD][1100745024]clssnmvDHBValidateNCopy: node 1, rhis-cr-0613-03, has a disk HB, but no network HB, DHB has rcfg 266426551, wrtcnt, 126894370, LATS 1054544, lastSeqNo 126894369, uniqueness 1469772618, timestamp 1498549872/3050568238

ORA11g_DB2>./crsctl start crs
CRS-4640: Oracle High Availability Services is already active
CRS-4000: Command Start failed, or completed with errors.

Please advise,

Regards

ErmanArslansOracleBlog

Re: start cluster error

Administrator

Normally, if ohasd.bin is already up, CRS-4640 is reported when another startup attempt is made.
In this case, however, you clearly have a communication problem. Since the CSSD log shows a network-related error (the other node has a disk heartbeat but no network heartbeat), it is better to solve that first. (The problem is visible in the "crsctl start cluster" output as well.)
You are failing in almost the first stage of a Grid Infrastructure start: the communication with cssdagent, the agent responsible for spawning CSSD, is failing, which means CSSD itself cannot be started.

Do you have a problem with your private interconnect interfaces?
Please check them (check that they are up, and that they are up with the correct IP addresses).
Check this note as well: GI Fails to Start as no Private Network Interface is Available (Doc ID 1481176.1)

Send me the logs and the state of your private network interfaces for further diagnostics.

Note that your cssdagent may also not be running, or may not be able to start. If that is the case, you should check the OHASD logs as well (OHASD spawns the cssdagent).
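
As a rough aid for the log check described above, the following sketch lists every node that CSSD reports as having a disk heartbeat but no network heartbeat. The function name nodes_missing_network_hb is made up for this example, and the pattern is based only on the single ocssd.log line quoted in this thread, so other releases may word the message differently.

```shell
# Print each node that CSSD reports as "has a disk HB, but no network HB".
# Reads ocssd.log content from the file(s) given as arguments,
# or from standard input when no argument is given.
# NOTE: the message format is taken from the one log line quoted in this
# thread (11.2.0.3); treat the pattern as an assumption for other versions.
nodes_missing_network_hb() {
    sed -n 's/.*clssnmvDHBValidateNCopy: node [0-9]*, \([^,]*\), has a disk HB, but no network HB.*/\1/p' "$@" | sort -u
}
```

Possible usage, assuming the usual 11.2 log location under the Grid home (adjust the path to your installation): nodes_missing_network_hb "$GRID_HOME/log/$(hostname -s)/cssd/ocssd.log". If the other node's name is printed, the local CSSD sees that node's voting disk heartbeat but receives nothing from it over the interconnect.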

Roshan

Re: start cluster error

Hi,

Please find attached some tests I ran on both the working node and the issue node:
privateIPtests.txt

From working node
ping 192.168.124.61
PING 192.168.124.61 (192.168.124.61) 56(84) bytes of data.
64 bytes from 192.168.124.61: icmp_seq=1 ttl=64 time=0.168 ms
64 bytes from 192.168.124.61: icmp_seq=2 ttl=64 time=0.196 ms
--- 192.168.124.61 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.168/0.182/0.196/0.014 ms
[root@RHIS-CR-0613-03 ~]# traceroute 192.168.124.61
traceroute to 192.168.124.61 (192.168.124.61), 30 hops max, 40 byte packets
1 rac2-priv (192.168.124.61) 0.126 ms 0.098 ms 0.088 ms

From issue node:
# ping 192.168.124.60
PING 192.168.124.60 (192.168.124.60) 56(84) bytes of data.
64 bytes from 192.168.124.60: icmp_seq=1 ttl=64 time=0.095 ms
64 bytes from 192.168.124.60: icmp_seq=2 ttl=64 time=0.165 ms

traceroute 192.168.124.60
traceroute to 192.168.124.60 (192.168.124.60), 30 hops max, 40 byte packets
1 rac1-priv (192.168.124.60) 0.213 ms 0.173 ms 0.160 ms

Roshan

Re: start cluster error

Extract from log file - OHASD

2017-06-27 13:57:03.768: [ CRSPE][1150110016] {0:48:4} CRS-2676: Start of 'ora.cssdmonitor' on 'rhis-cr-0613-04' succeeded

2017-06-27 13:57:03.768: [ CRSPE][1150110016] {0:48:4} PE Command [ Resource State Change (ora.cssdmonitor 1 1) : 0x19e54410 ] has completed
2017-06-27 13:57:03.769: [ AGFW][1139603776] {0:48:4} Agfw Proxy Server received the message: CMD_COMPLETED[Proxy] ID 20482:2540
2017-06-27 13:57:03.769: [ AGFW][1139603776] {0:48:4} Agfw Proxy Server replying to the message: CMD_COMPLETED[Proxy] ID 20482:2540
2017-06-27 13:57:03.769: [ AGFW][1139603776] {0:48:4} Agfw received reply from PE for resource state change for ora.cssdmonitor 1 1
2017-06-27 13:57:03.774: [ AGFW][1139603776] {0:52:2} Received the reply to the message: RESTYPE_ADD[ora.cssd.type] ID 8196:2533 from the agent /oracle/grid11g/grid_infra/11.2.0/bin/cssdagent_root
2017-06-27 13:57:03.776: [ AGFW][1139603776] {0:52:2} Received the reply to the message: RESOURCE_ADD[ora.cssd 1 1] ID 4356:2534 from the agent /oracle/grid11g/grid_infra/11.2.0/bin/cssdagent_root
2017-06-27 13:57:03.776: [ AGFW][1139603776] {0:43:4} Received the reply to the message: RESOURCE_CLEAN[ora.cssd 1 1] ID 4100:2535 from the agent /oracle/grid11g/grid_infra/11.2.0/bin/cssdagent_root
2017-06-27 13:57:03.776: [ AGFW][1139603776] {0:43:4} Agfw Proxy Server sending the reply to PE for message:RESOURCE_CLEAN[ora.cssd 1 1] ID 4100:2512
2017-06-27 13:57:03.777: [ CRSPE][1150110016] {0:43:4} Received reply to action [Clean] message ID: 2512
2017-06-27 13:57:03.777: [ AGFW][1139603776] {0:43:4} Received the reply to the message: RESOURCE_CLEAN[ora.cssd 1 1] ID 4100:2535 from the agent /oracle/grid11g/grid_infra/11.2.0/bin/cssdagent_root
2017-06-27 13:57:03.778: [ AGFW][1139603776] {0:43:4} Agfw Proxy Server sending the last reply to PE for message:RESOURCE_CLEAN[ora.cssd 1 1] ID 4100:2512
2017-06-27 13:57:03.778: [ CRSPE][1150110016] {0:43:4} Received reply to action [Clean] message ID: 2512
2017-06-27 13:57:03.778: [ CRSPE][1150110016] {0:43:4} RI [ora.cssd 1 1] new internal state: [STABLE] old value: [CLEANING]
2017-06-27 13:57:03.778: [ CRSPE][1150110016] {0:43:4} CRS-2681: Clean of 'ora.cssd' on 'rhis-cr-0613-04' succeeded

2017-06-27 13:57:03.778: [ AGFW][1139603776] {0:43:4} Agfw Proxy Server received the message: AGENT_SHUTDOWN_REQUEST[Proxy] ID 20486:22
2017-06-27 13:57:03.779: [ AGFW][1139603776] {0:43:4} Shutdown request received from /oracle/grid11g/grid_infra/11.2.0/bin/cssdagent_root
2017-06-27 13:57:03.779: [ AGFW][1139603776] {0:43:4} Agfw Proxy Server replying to the message: AGENT_SHUTDOWN_REQUEST[Proxy] ID 20486:22
2017-06-27 13:57:03.779: [ CRSPE][1150110016] {0:43:4} Sequencer for [ora.cssd 1 1] has completed with error: CRS-5804: Communication error with agent process

2017-06-27 13:57:03.779: [ CRSPE][1150110016] {0:43:4} Starting resource state restoration for: START of [ora.cssd 1 1] on [rhis-cr-0613-04] : local=1, unplanned=00x1a0d31d0
2017-06-27 13:57:03.779: [ CRSPE][1150110016] {0:43:4} PE Command [ Resource State Change (ora.cssdmonitor 1 1) : 0x2aaab806abd0 ] has completed
2017-06-27 13:57:03.779: [ AGFW][1139603776] {0:43:4} Agfw Proxy Server received the message: CMD_COMPLETED[Proxy] ID 20482:2546
2017-06-27 13:57:03.779: [ AGFW][1139603776] {0:43:4} Agfw Proxy Server replying to the message: CMD_COMPLETED[Proxy] ID 20482:2546
2017-06-27 13:57:03.780: [ AGFW][1139603776] {0:43:4} Agfw received reply from PE for resource state change for ora.cssdmonitor 1 1
2017-06-27 13:57:14.087: [ CRSCOMM][1102092608][FFAIL] Ipc: Couldnt clscreceive message, no message: 11
2017-06-27 13:57:14.087: [ CRSCOMM][1102092608] Ipc: Client disconnected.
2017-06-27 13:57:14.087: [ CRSCOMM][1102092608][FFAIL] IpcL: Listener got clsc error 11 for memNum. 52
2017-06-27 13:57:14.087: [ CRSCOMM][1102092608] IpcL: connection to member 52 has been removed
2017-06-27 13:57:14.087: [CLSFRAME][1102092608] Removing IPC Member:{Relative|Node:0|Process:52|Type:3}
2017-06-27 13:57:14.087: [CLSFRAME][1102092608] Disconnected from AGENT process: {Relative|Node:0|Process:52|Type:3}
2017-06-27 13:57:14.087: [ CRSPE][1150110016] {0:0:712} Disconnected from server:
2017-06-27 13:57:14.087: [ AGFW][1139603776] {0:0:715} Agfw Proxy Server received process disconnected notification, count=1
2017-06-27 13:57:14.088: [ AGFW][1139603776] {0:0:715} /oracle/grid11g/grid_infra/11.2.0/bin/cssdagent_root disconnected.
2017-06-27 13:57:14.088: [ AGFW][1139603776] {0:0:715} Agent /oracle/grid11g/grid_infra/11.2.0/bin/cssdagent_root[10887] stopped!
2017-06-27 13:57:14.088: [ CRSCOMM][1139603776] {0:0:715} IpcL: removeConnection: Member 52 does not exist.
2017-06-28 09:20:14.829: [UiServer][1085446464] CS(0x2aaab8055db0)set Properties ( root,0x1a09f8c0)
2017-06-28 09:20:14.829: [UiServer][1085446464] SS(0x19f7ecc0)Accepted client connection: saddr =(ADDRESS=(PROTOCOL=ipc)(DEV=700)(KEY=OHASD_UI_SOCKET))daddr = (ADDRESS=(PROTOCOL=ipc)(KEY=OHASD_UI_SOCKET))
2017-06-28 09:20:14.841: [UiServer][1152211264] {0:0:719} processMessage called
2017-06-28 09:20:14.842: [UiServer][1152211264] {0:0:719} Sending message to PE. ctx= 0x19f671e0, Client PID: 18030
2017-06-28 09:20:14.842: [UiServer][1152211264] {0:0:719} Sending command to PE: 3
2017-06-28 09:20:14.842: [ CRSPE][1150110016] {0:0:719} Processing PE command id=131. Description: [Stat Resource : 0x19f4aec0]
2017-06-28 09:20:14.857: [UiServer][1152211264] {0:0:719} Done for ctx=0x19f671e0
2017-06-28 09:20:14.869: [UiServer][1085446464] Closed: remote end failed/disc.

Regards.
Roshan

ErmanArslansOracleBlog

Re: start cluster error

Administrator

Please send ocssd.log from both of the nodes.
I will check whether the CSSD is picking up the correct interface.

ErmanArslansOracleBlog

Re: start cluster error

Administrator

Also, send me the following:

1) The output of "ping <private_ip_address>" from both nodes.

2) The output of "ethtool eth3" from both nodes.
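
For the ethtool check, a small helper such as the one below can turn the output into a yes/no answer. This is only a sketch: link_is_up is a hypothetical name, and it relies on ethtool printing a "Link detected: yes" line, which is the usual format on Linux.

```shell
# Succeeds (exit status 0) only if the ethtool output on standard input
# contains the line "Link detected: yes".
# link_is_up is a hypothetical helper name used for this example.
link_is_up() {
    grep -q 'Link detected: yes'
}
```

Possible usage on each node, with eth3 being the private interface named above: ethtool eth3 | link_is_up && echo "eth3 link is up" || echo "eth3 link is down or unknown".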

ErmanArslansOracleBlog

Re: start cluster error

Administrator

Send me your OS vendor+version and database+GRID version as well.

Roshan

Re: start cluster error

Hi,

please find attached.
node52 folder is the working node
node53 folder is the issue node
crestelRAC.rar

./crsctl query crs activeversion
Oracle Clusterware active version on the cluster is [11.2.0.3.0]

Red Hat Enterprise Linux Server release 5.8 (Tikanga)

Database version: 11.2.0.3.0

Roshan

Re: start cluster error

From issue node:
# ping 192.168.124.60
PING 192.168.124.60 (192.168.124.60) 56(84) bytes of data.
64 bytes from 192.168.124.60: icmp_seq=1 ttl=64 time=0.095 ms
64 bytes from 192.168.124.60: icmp_seq=2 ttl=64 time=0.165 ms

traceroute 192.168.124.60
traceroute to 192.168.124.60 (192.168.124.60), 30 hops max, 40 byte packets
1 rac1-priv (192.168.124.60) 0.213 ms 0.173 ms 0.160 ms

#From working node
ping 192.168.124.61
PING 192.168.124.61 (192.168.124.61) 56(84) bytes of data.
64 bytes from 192.168.124.61: icmp_seq=1 ttl=64 time=0.168 ms
64 bytes from 192.168.124.61: icmp_seq=2 ttl=64 time=0.196 ms
--- 192.168.124.61 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.168/0.182/0.196/0.014 ms
[root@RHIS-CR-0613-03 ~]# traceroute 192.168.124.61
traceroute to 192.168.124.61 (192.168.124.61), 30 hops max, 40 byte packets
1 rac2-priv (192.168.124.61) 0.126 ms 0.098 ms 0.088 ms

ErmanArslansOracleBlog

Re: start cluster error

Administrator

Well, the nodes can ping each other through the private interfaces.
However, your firewall may be running and blocking other traffic on the private interfaces; a successful ping does not rule that out.

Please check it and disable it on both nodes:

service iptables stop
service ip6tables stop

Then permanently disable it:

chkconfig iptables off
chkconfig ip6tables off

Update me with the outcome.

Reference: 11gR2 Grid: root.sh Fails to Start the Clusterware on the Second Node Due to Firewall on Private Network (Doc ID 981357.1)
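
Before and after stopping the services, it can help to confirm whether any filter rules are actually loaded. The sketch below counts the rule lines in "iptables -L -n" output; count_iptables_rules is a made-up helper name, and the parsing assumes the standard listing layout with "Chain ..." and "target ..." header lines.

```shell
# Count the rule lines in "iptables -L -n" output read from standard input.
# Header lines ("Chain ..." and "target ...") and blank lines are skipped.
# count_iptables_rules is a hypothetical helper name for this example.
count_iptables_rules() {
    awk 'NF && $1 != "Chain" && $1 != "target" { n++ } END { print n + 0 }'
}
```

Possible usage as root on each node: iptables -L -n | count_iptables_rules. A result of 0 means no rules are loaded in the filter table. Also look at the chain policies in the listing, because a policy of DROP blocks traffic even when no rules are present.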

Roshan

Re: start cluster error

It should normally work. The system admin has disabled cluster services on this node. I think this is why it is not starting.

ErmanArslansOracleBlog

Re: start cluster error

Administrator

So, is the issue solved?

ErmanArslansOracleBlog

Re: start cluster error

Administrator

What do you mean by "System admin disabled the cluster services"?

ErmanArslansOracleBlog

Re: start cluster error

Administrator

Update please.

Bhupendra yadav

how to start cluster on all nodes

Hi,

I am trying to start cluster on all nodes from node1.

Below are the steps I am following:

crsctl enable crs
crsctl start has
crsctl start cluster -all

But the above works only on node1, not on the other nodes.

Please help.

ErmanArslansOracleBlog

Re: how to start cluster on all nodes

Administrator

Why?

1) Are those services configured to run only on the first node?
2) Or are those services getting errors while starting on the second node?
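
One thing worth ruling out while answering these questions: "crsctl start cluster -all" can only start the stack on nodes where OHASD is already running, and "crsctl start crs" acts on the local node only. The sketch below prints (it does not run) a per-node start plan. It is an illustration under assumptions, not a confirmed fix for this case: print_start_plan is a made-up name, and root ssh access plus crsctl being in the remote PATH are both assumed.

```shell
# Print, without executing, the commands for bringing the stack up on
# every node named on the command line.
# Assumptions: root ssh access to each node and crsctl in the remote PATH.
# print_start_plan is a hypothetical helper name for this example.
print_start_plan() {
    for node in "$@"; do
        # "crsctl start crs" starts OHASD and the stack on the local node only,
        # so it has to be issued once per node.
        printf 'ssh root@%s crsctl start crs\n' "$node"
    done
    # Afterwards, check the stack status on all nodes from one place.
    printf 'crsctl check cluster -all\n'
}
```

Possible usage: print_start_plan node1 node2, then review the printed commands and run them by hand as root.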