Redhat linux - issue analysis

Redhat linux - issue analysis

satish
Dear Erman,

We are running a Red Hat cluster on our 2-node EBS instance.

2-node R12.2.5
OS version: Red Hat Linux 7

On 11 Aug 2020 at around 07:30 PM IST, our application servers stopped responding. There was no high load or CPU usage on the server, but some CPU events were logged in the /var/log/messages file. I have also checked the sar output for that period, and it shows no high CPU usage.

I have attached the messages file and the sar output for your reference. Could you please help us find the root cause?

Thanks,
Satish

messagesbkp11.messagesbkp11hsar.txt

Re: Redhat linux - issue analysis

ErmanArslansOracleBlog
Administrator
This seems to be related to Red Hat.
The cluster processes were failing during that time period.

It all starts with the following:

Aug 11 19:21:06 corosync[1970]: [TOTEM ] A processor failed, forming new configuration.
Aug 11 19:21:08 corosync[1970]: [TOTEM ] A new membership  was formed. Members left: 2
Aug 11 19:21:08 corosync[1970]: [TOTEM ] Failed to receive the leave message. failed: 2
Aug 11 19:21:08 corosync[1970]: [QUORUM] Members[1]: 1
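
To locate the same sequence in a large messages file, the corosync lines can be pulled out with a small script. This is only a rough sketch: the regular expression is modelled on the four lines quoted above (it treats the hostname field as optional, since the pasted lines do not show one), and the names `corosync_events` and `members_left` are made up for this example.

```python
import re
from typing import Iterable, List, NamedTuple


class ClusterEvent(NamedTuple):
    timestamp: str
    subsystem: str
    message: str


# Syslog timestamp, optional hostname, then "corosync[pid]: [SUBSYS ] message".
# The layout is an assumption based on the lines quoted in this thread.
_COROSYNC_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?:\S+\s+)?corosync\[\d+\]:\s+"
    r"\[(?P<subsys>\w+)\s*\]\s+(?P<msg>.*)$"
)


def corosync_events(lines: Iterable[str]) -> List[ClusterEvent]:
    """Return the corosync entries found in the given log lines."""
    events = []
    for line in lines:
        m = _COROSYNC_RE.match(line.strip())
        if m:
            events.append(ClusterEvent(m["ts"], m["subsys"], m["msg"]))
    return events


def members_left(events: Iterable[ClusterEvent]) -> List[int]:
    """Collect node IDs reported in 'Members left:' messages."""
    ids: List[int] = []
    for ev in events:
        found = re.search(r"Members left:\s*([\d\s]+)", ev.message)
        if found:
            ids.extend(int(n) for n in found.group(1).split())
    return ids


# The four lines quoted in this post, used as sample input.
SAMPLE = [
    "Aug 11 19:21:06 corosync[1970]: [TOTEM ] A processor failed, forming new configuration.",
    "Aug 11 19:21:08 corosync[1970]: [TOTEM ] A new membership  was formed. Members left: 2",
    "Aug 11 19:21:08 corosync[1970]: [TOTEM ] Failed to receive the leave message. failed: 2",
    "Aug 11 19:21:08 corosync[1970]: [QUORUM] Members[1]: 1",
]

if __name__ == "__main__":
    events = corosync_events(SAMPLE)
    for ev in events:
        print(ev.timestamp, ev.subsystem, ev.message)
    print("nodes that left:", members_left(events))
```

To run it against the real file, pass an open file handle instead of `SAMPLE`, for example `corosync_events(open("/var/log/messages"))`.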

Then you lost your connection to node 2 in the cluster.
After that there are more process errors, along with a call stack.
It goes on and on.

This looks like node fencing. The root cause may be server load, a connection loss in the network, or the totem or quorum configuration, but it needs to be investigated.

One final note: you say there was no high CPU usage at that time, but we see the following in the log:

crmd[2258]:  notice: High CPU load detected:

The CPU load starts increasing more and more after that time.

It seems your processes couldn't do I/O either.
In the call trace, I see the following:

down_read+0x20/0x30..

Your processes tried to do I/O for some period of time, but they couldn't because they were blocked.

This may be related to the device drivers or to multipathd.

This is purely a Red Hat issue. Please open an SR with Red Hat. A full check is necessary, for both the cluster and the OS.

Thanks.



Re: Redhat linux - issue analysis

satish
Hi Erman,

You are absolutely correct. There was a cable problem on the connection to the storage. But one thing I don't understand is the high CPU messages in the /var/log/messages file. Initially the network team claimed that some application processes must have consumed high CPU. I am sure that this is not the case, but I cannot prove it, because they point to the high CPU messages in the messages log file.

Can these high CPU messages in the log file be a result of the cable failure?

Thanks

Re: Redhat linux - issue analysis

ErmanArslansOracleBlog
Administrator
Yes, it might cause that.
Maybe not high CPU usage, but high load.
A process waiting on I/O is in uninterruptible sleep. So processes whose I/Os are hanging will be in D state, and D state processes increase the load.
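
If the problem happens again, one way to show this to the network team is to count processes per state while it is going on. On Linux the load average counts tasks that are runnable (R) as well as tasks in uninterruptible sleep (D), so many D state processes raise the load even when the CPUs are idle. The following is a minimal sketch that parses the output of `ps -eo state,pid,comm`; the function names and the sample process list are invented for illustration and are not taken from the attached logs.

```python
from collections import Counter


def count_states(ps_output: str) -> Counter:
    """Count processes per state letter in `ps -eo state,pid,comm` output."""
    counts: Counter = Counter()
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        # Skip blank lines and the header row (its PID column is not numeric).
        if len(fields) < 2 or not fields[1].isdigit():
            continue
        counts[fields[0][0]] += 1
    return counts


def load_contributors(counts: Counter) -> int:
    """Tasks that count toward the Linux load average: runnable plus D state."""
    return counts["R"] + counts["D"]


# Hypothetical ps output, for illustration only.
SAMPLE_PS = """\
S   PID COMMAND
S     1 systemd
D  2301 oracle
D  2302 oracle
R  2400 sar
"""

if __name__ == "__main__":
    states = count_states(SAMPLE_PS)
    print("per state:", dict(states))
    print("contributing to load:", load_contributors(states))
```

On a live server the input could come from `subprocess.run(["ps", "-eo", "state,pid,comm"], capture_output=True, text=True).stdout`. A high D count together with low CPU usage in sar would support the storage explanation.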

On the other hand, this kind of problem may increase the CPU usage of some processes as well. It depends on the code of those apps/processes, on their exception handling, and on the cluster software. So a deep investigation is necessary there.

Glad that you found the cause.