Re: Redhat linux - issue analysis
Posted by ErmanArslansOracleBlog on Aug 13, 2020; 7:43am
URL: http://erman-arslan-s-oracle-forum.124.s1.nabble.com/Redhat-linux-issue-analysis-tp8621p8623.html
This seems to be related to Redhat.
Your cluster processes were failing during that time period.
It all starts with the following:
Aug 11 19:21:06 corosync[1970]: [TOTEM ] A processor failed, forming new configuration.
Aug 11 19:21:08 corosync[1970]: [TOTEM ] A new membership was formed. Members left: 2
Aug 11 19:21:08 corosync[1970]: [TOTEM ] Failed to receive the leave message. failed: 2
Aug 11 19:21:08 corosync[1970]: [QUORUM] Members[1]: 1
Then you lost your connection to node 2 in the cluster.
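If you want to pull these membership events out of a larger messages file, a minimal Python sketch like the one below may help. It is only an illustration: the sample lines are the ones quoted above, the function names (parse_corosync, members_left) are my own, and the regular expression assumes the syslog layout shown here (an optional hostname before "corosync" is tolerated). Adjust it to your actual log format.

```python
import re

# Sample lines copied from the log excerpt above.
LOG = """\
Aug 11 19:21:06 corosync[1970]: [TOTEM ] A processor failed, forming new configuration.
Aug 11 19:21:08 corosync[1970]: [TOTEM ] A new membership was formed. Members left: 2
Aug 11 19:21:08 corosync[1970]: [TOTEM ] Failed to receive the leave message. failed: 2
Aug 11 19:21:08 corosync[1970]: [QUORUM] Members[1]: 1
"""

# Assumed layout: "<Mon> <day> <HH:MM:SS> [hostname] corosync[pid]: [SUBSYS] message"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?:\S+\s+)?"
    r"corosync\[\d+\]:\s+\[(?P<sub>[A-Z]+)\s*\]\s+(?P<msg>.*)$"
)


def parse_corosync(text):
    """Return (timestamp, subsystem, message) for every corosync line in text."""
    events = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            events.append((m.group("ts"), m.group("sub"), m.group("msg")))
    return events


def members_left(events):
    """Return (timestamp, [node ids]) for each TOTEM 'Members left' message."""
    left = []
    for ts, sub, msg in events:
        m = re.search(r"Members left:\s*(.+)$", msg)
        if sub == "TOTEM" and m:
            left.append((ts, m.group(1).split()))
    return left


if __name__ == "__main__":
    for ts, nodes in members_left(parse_corosync(LOG)):
        print(ts, "nodes left:", ", ".join(nodes))
```

Running it against the full log should give you a quick list of when each node dropped out of the membership, which you can then line up with the load and I/O messages.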
Then you have more process errors and you have a call stack there.
It goes on and on..
This looks like node fencing. The root cause may be server load, a connection loss in the network, or the totem or quorum configuration, but it needs to be investigated.
One final note: you say there was no high CPU usage at that time, but we see the following in the log ->
crmd[2258]: notice: High CPU load detected:
The CPU load starts increasing more and more after that time..
It seems your processes couldn't do I/O either.
From the call trace, I see the following ->
down_read+0x20/0x30..
Your processes tried to do I/O for some period of time and they couldn't, as they were blocked.
This may be related to the device drivers or to multipathd.
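To see whether tasks are still blocked while the problem is happening, one option is to list the processes in uninterruptible sleep (state D), which is where tasks waiting like this usually show up. The Python sketch below is only an illustration, not a Redhat tool: it reads /proc/<pid>/stat directly, and the function names (parse_stat, blocked_tasks) are my own.

```python
import glob


def parse_stat(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the LAST ')'.
    lparen = line.index("(")
    rparen = line.rindex(")")
    pid = int(line[:lparen].strip())
    comm = line[lparen + 1:rparen]
    state = line[rparen + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc_root="/proc"):
    """Return a sorted list of (pid, comm) for processes currently in state D."""
    blocked = []
    for path in glob.glob(f"{proc_root}/[0-9]*/stat"):
        try:
            with open(path) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or the file was unreadable; skip it
        if state == "D":
            blocked.append((pid, comm))
    return sorted(blocked)


if __name__ == "__main__":
    for pid, comm in blocked_tasks():
        print(pid, comm)
```

If the same multipathd or application processes stay in this list for a long time, that is useful evidence to attach to the SR.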
This is a pure Redhat issue. Please open an SR with Redhat. A full check is necessary, for both the cluster and the OS.
Thanks.