Re: RAC instance crash /ORA-29770: global enqueue process
Posted by ErmanArslansOracleBlog on Nov 26, 2019; 1:53pm
URL: http://erman-arslan-s-oracle-forum.124.s1.nabble.com/RAC-instance-crash-tp7856p7890.html
* NEXT TIME, please upload your trace files. Don't copy/paste them.
LMHB is the Global Cache/Enqueue Service Heartbeat Monitor process.
It monitors LMON, LMD, and LMSn processes to ensure they are running normally without blocking or spinning.
It seems your IPC0 process is hung:
"IPC0 (ospid: 368909) has not moved for 108 sec (1573815101.1573814993)"
"IPC0 (ospid: 368909) has not moved for 123 sec (1573815116.1573814993)"
This is where it breaks..
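The numbers in parentheses look like a pair of epoch timestamps (current time, then the time IPC0 last moved). That reading is an assumption on my side, but the difference matches the reported seconds in both messages. A minimal Python sketch to check this on a trace line (the helper name parse_hung_line is made up for illustration):

```python
import re
from datetime import datetime, timezone

# Pattern for the heartbeat messages quoted above.
# Assumption: the pair in parentheses is (now, last_progress) in epoch seconds.
HUNG_RE = re.compile(
    r"(?P<proc>\w+) \(ospid: (?P<pid>\d+)\) has not moved for "
    r"(?P<secs>\d+) sec \((?P<now>\d+)\.(?P<last>\d+)\)"
)


def parse_hung_line(line):
    """Return a dict describing one 'has not moved' message, or None."""
    m = HUNG_RE.search(line)
    if m is None:
        return None
    now, last = int(m["now"]), int(m["last"])
    return {
        "process": m["proc"],
        "ospid": int(m["pid"]),
        "reported_secs": int(m["secs"]),
        # Should equal reported_secs if the timestamp reading is right.
        "computed_secs": now - last,
        "last_progress_utc": datetime.fromtimestamp(last, tz=timezone.utc),
    }


if __name__ == "__main__":
    lines = [
        "IPC0 (ospid: 368909) has not moved for 108 sec (1573815101.1573814993)",
        "IPC0 (ospid: 368909) has not moved for 123 sec (1573815116.1573814993)",
    ]
    for line in lines:
        info = parse_hung_line(line)
        print(info["process"], info["reported_secs"], info["computed_secs"],
              info["last_progress_utc"].isoformat())
```

If the assumption holds, the second timestamp is the moment IPC0 last made progress, which is the time to look at in your OS-level monitoring data.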
LGWR is killed at this point.
"Forcibly terminated 'oracle@exa-hddbadm01.core.iamas.gov.az (LGWR)' process"
Look at the call stack for the kill ->
"Abridged Call Stack Trace -----
ksedsts()+346<-kjzduptcctx()+868<-kjzdicrshnfy()+1113<-ksuitm_opt()+1678<-kjgcr_KillInstance()+170<-kjgcr_RunCallback()+1242<-kjgcr_DoAction()+186<-kjgcr_servicecssg()+856<-kjgcr_Main()+418<-ksbrdp()+1079<-opirip()+609<-opidrv()+602<-sou2o()+145<-opimai_real()+202
<-ssthrdmain()+417<-main()+262<-__libc_start_main()+253"
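The abridged stack is written most recent call first, with "<-" pointing to the caller. If you want to read it top-down, here is a small Python sketch that splits it into frames (parse_short_stack is a hypothetical helper, not an Oracle tool):

```python
import re

# The abridged call stack quoted above, joined into one string.
STACK = (
    "ksedsts()+346<-kjzduptcctx()+868<-kjzdicrshnfy()+1113<-ksuitm_opt()+1678"
    "<-kjgcr_KillInstance()+170<-kjgcr_RunCallback()+1242<-kjgcr_DoAction()+186"
    "<-kjgcr_servicecssg()+856<-kjgcr_Main()+418<-ksbrdp()+1079<-opirip()+609"
    "<-opidrv()+602<-sou2o()+145<-opimai_real()+202"
    "<-ssthrdmain()+417<-main()+262<-__libc_start_main()+253"
)

# One frame looks like: function_name()+offset
FRAME_RE = re.compile(r"(?P<func>[\w$]+)\(\)\+(?P<offset>\d+)")


def parse_short_stack(stack):
    """Split a '<-' separated stack into (function, offset) tuples.

    The order of the input is kept: first element is the most recent call.
    """
    frames = []
    for part in stack.split("<-"):
        m = FRAME_RE.fullmatch(part.strip())
        if m:
            frames.append((m["func"], int(m["offset"])))
    return frames


if __name__ == "__main__":
    # Print from process entry point down to the most recent call.
    for depth, (func, offset) in enumerate(reversed(parse_short_stack(STACK))):
        print(f"{'  ' * depth}{func}+{offset}")
```

Read this way, kjgcr_KillInstance sits a few frames above the dump routine at the top of the stack, which matches the Inst-kill action in the trace.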
After LGWR is killed, the instance is terminated by LMHB.
"LMHB (ospid: 368993): terminating the instance due to error 29770"
That's why your instance gets killed -> kjfmGCR_HBdisambig: action=Inst-kill
Look, your LOAD AVERAGE is very high here ->
===[ System Load State ]===
CPU Total 96 Raw 96 Core 48 Socket 2
Load high: Cur 123801 Highmark 122880 (483.59 480.00)
Note that Oracle probably gets the load information from "cpu_sup - A CPU Load and CPU Utilization Supervisor Process", which reports load averages multiplied by 256. That is why the Cur and Highmark values have to be divided by 256: 123801 / 256 is about 483.6 (printed as 483.59) and 122880 / 256 = 480.00.
But why is that? Probably the background processes can't get CPU cycles.
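To make the division explicit, here is a small Python sketch. The factor 256 is the assumption discussed above, and the function names are only for illustration:

```python
SCALE = 256  # assumed scaling factor between raw values and load averages


def scaled_to_load(raw, scale=SCALE):
    """Convert a raw 'Cur' / 'Highmark' value to a load average."""
    return raw / scale


def load_per_cpu(load, cpus):
    """Load average per logical CPU ('CPU Total' in the trace)."""
    return load / cpus


if __name__ == "__main__":
    cpus = 96  # 'CPU Total 96' from the System Load State section
    for name, raw in (("Cur", 123801), ("Highmark", 122880)):
        load = scaled_to_load(raw)
        print(f"{name}: raw={raw} load={load:.2f} "
              f"per_cpu={load_per_cpu(load, cpus):.2f}")
```

With these numbers the Highmark works out to a load of 5 per CPU (480 / 96), and Cur is just above it.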
Did you check what is causing that load?
It is a sudden high load, a peak. Look -> loadavg : 406.77 117.41 42.64 (the 1-minute value is far above the 5- and 15-minute values)
At this point, I strongly recommend analyzing your environment for this load. Currently I think that the crash is caused by your high load.
Take a look at this note as well:
ORA-29770: global enqueue process LGWR (ospid: xxxxxx) is hung for more than 70 seconds (Doc ID 2457765.1)
The patches provided in the above note do not directly address your issue, but you may still consider applying them.
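For the load analysis recommended above, a rough Python sketch that flags this kind of sudden peak from a loadavg line. It assumes the three values are the usual 1-, 5-, and 15-minute averages; the threshold of 2.0 and the function names are arbitrary choices for illustration:

```python
import re

# Matches lines such as: loadavg : 406.77 117.41 42.64
LOADAVG_RE = re.compile(r"loadavg\s*:\s*([\d.]+)\s+([\d.]+)\s+([\d.]+)")


def parse_loadavg(line):
    """Return (avg1, avg5, avg15) as floats, or None if the line does not match."""
    m = LOADAVG_RE.search(line)
    if m is None:
        return None
    return tuple(float(v) for v in m.groups())


def is_sudden_peak(avgs, ratio=2.0):
    """True if the 1-minute average is at least `ratio` times the 15-minute one."""
    one, _five, fifteen = avgs
    # Guard against division by zero on an idle system.
    return fifteen > 0 and one / fifteen >= ratio


if __name__ == "__main__":
    avgs = parse_loadavg("loadavg : 406.77 117.41 42.64")
    print(avgs, "sudden peak" if is_sudden_peak(avgs) else "steady load")
```

A steady overload would show three similar values; here the 1-minute value is many times the 15-minute one, so whatever caused it started only minutes before the crash.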