Batch performance

classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

Batch performance

Roshan
Oracle DB 19c

Hello Erman,


we are having a batch performance issue where duration should have taken 4 hours compared to 7 hours. I have attached sample AWR reports. To note that DB is running on EXADATA. The redo log switch per hour is quite high (25 per hour). We have 10 redo log groups each 1 redo log 10G size.
awrrpt_1_11716_11717.html
awrrpt_1_11715_11716.html

Previously gather stats was running and we disabled the job. The runtime improved by 45 mins (7.45 hours compared to 7 hours)
awrrpt_1_26Jan24.html
Kindly advise if it is because of the redo logs

Thanks,
Roshan

Reply | Threaded
Open this post in threaded view
|

Re: Batch performance

ErmanArslansOracleBlog
Administrator
I took a quick look, and I can say that you have lots of log file sync waits, and they take time..
You have also db link waits, and for that we can't do anything actually.
You have some heavy SQL workloads. These SQLs can not even complete their run in this AWR period. So you need to check them.. (check if they are necessary to be executed -- See AWR SQL Statistics section for this..)

Send me an AWR report while that problematic batch is running. (Note that; there should be a maximum of 1 hour between the AWR snapshots you use when creating the AWR report. Because, as you increase the time period for the AWR reports, it becomes harder to detect problems because the scale expands.)

Also send me a part of query (first part maybe) of the problematic batch.. (it should be there in AWR with all the wait metrics associated with it..)
Reply | Threaded
Open this post in threaded view
|

Re: Batch performance

Roshan
Hello Erman,

we are getting timeouts when doing perf tests.

Please find attached AWR report.

awrrpt_1_11777_11778.html
Reply | Threaded
Open this post in threaded view
|

Re: Batch performance

Roshan
Do you think some tunings need to be done at storage level
Reply | Threaded
Open this post in threaded view
|

Re: Batch performance

ErmanArslansOracleBlog
Administrator
*Send me LGWR Trace.
*Send me  Cellsrv - alert.log
*Send me the Alert Log (not the whole of it, just the related part)
*Your DB version - 4 digit, Your Exadata Image version and Your Exadata Generation and model (For instance X8-2M)
*Any IORM Plan in place / active?
*Also go through the MOS Note Exadata: Troubleshooting High Log File Parallel Write Waits (Doc ID 2634755.1), and send me your findings.
*Also review the ADDM findings:

Findings
--------------------
For Read I/O Storage seems good. See TBS IO Stats .. Most of the time, you re doing your I/Os in less than 2 ms.
But! For the LGWR IO, there may be something.. Actually it is in us second level, not bad but it is related with the waits.

Regarding the log file sync events;

LGWR writes commit records and writes enough information to recover transactions.
LGWR performance may impact negatively on OLTP response times.
For instance; in case of a commit operation, that is when  a session wants to commit, it posts the LGWR to write the redo information of its transactions and commit record, as well.
In this state, the session waits the LGWR and its wait event will be Log file Sync.. The session waits in this wait while LGWR is writing the redo information in to the redologs..
In details we can say that;

When a user session commits or rolls back , Oracle/LGWR needs to flush the contents of log buffer to the Redolog files. So, Log file sync occcurs in these operations.
When we wait for log file sync, we actually wait for; LGWR to wake up if idle, gather the redo to be written, issue IO (this IO is a log file parallel write, can be monitored using log file parallel wait event), process after the IO and acknowledge user session/foreground..
In addition to that, we wait for user session/foreground to wake up lastly.

*If avg time per log file sync operation << avg wait time per log file parallel write operation -> IO is bottleneck.
*If the avg wait time per log file sync operation >> avg wait time per log -> CPU bottlenecek , Run queue problem or bug. In such a situation, check CPU consumption..

you got 162.95us for the average log file parallel write..
you got 210.09us for the average log file sync

Also check the top blocking sessions event ->

Top Blocking Sessions .. There is LGWR there, and in the top session you also got LGWR.
That top LGWR session is waited on log parallel write mostly.



Also , your Flash cache hit is low and you got something with the parses.
You re mostly on the CPU, but there are also log file sync events.

For the CPU, you need to check the CPU related things. Start by analyzing Top SQL SQL ordered by CPU Time. Tuning a little thing may have e big effect on that, because you got lots of executions and those executions in total makes that CPU time.
You may check these sqls ->
hgb9kh19dvzz
1f3n8fuf5sy5r
19x1189chq3xd

Also someone runs a query using SQL Developer.. Why? -> buw2v54n4sjrb

ADDM Findings (related part)
------------------------------
Finding 2: Commits and Rollbacks
Impact is .21 active sessions, 8.3% of total activity.
------------------------------------------------------
Waits on event "log file sync" while performing COMMIT and ROLLBACK operations
were consuming significant database time.

   Recommendation 1: Application Analysis
   Estimated benefit is .21 active sessions, 8.3% of total activity.
   -----------------------------------------------------------------
   Action
      Investigate application logic for possible reduction in the number of
      COMMIT operations by increasing the size of transactions.
   Rationale
      The application was performing 57801 transactions per minute with an
      average redo size of 3221 bytes per transaction.

   Recommendation 2: Host Configuration
   Estimated benefit is .21 active sessions, 8.3% of total activity.
   -----------------------------------------------------------------
   Action
      Investigate the possibility of improving the performance of I/O to the
      online redo log files.
   Rationale
      The average size of writes to the online redo log files was 3 K and the
      average time per write was 0 milliseconds.
   Rationale
      The total I/O throughput on redo log files was 7.8 M per second for
      reads and 3.3 M per second for writes.
   Rationale
      The redo log I/O throughput was divided as follows: 0% by RMAN and
      recovery, 29% by Log Writer, 0% by Archiver, 0% by Streams AQ and 70% by
      all other activity.

   Symptoms That Led to the Finding:
   ---------------------------------
      Wait class "Commit" was consuming significant database time.
      Impact is .2 active sessions, 7.91% of total activity
Reply | Threaded
Open this post in threaded view
|

Re: Batch performance

Roshan
t24muppd1_lgwr_179850.trc
alert_cell.log
lfsdiag_t24mu_Jan30_1230.out

Alert log:
2024-01-29T19:01:50.247794+04:00
ARC1 (PID:284997): Archived Log entry 37951 added for T-1.S-19080 ID 0x59710513 LAD:1
2024-01-29T19:06:49.667786+04:00
Thread 1 advanced to log sequence 19082 (LGWR switch),  current SCN: 1850015957454
  Current log# 3 seq# 19082 mem# 0: +RECOC1/T24MUPPD/ONLINELOG/group_3.9998.1159317833
2024-01-29T19:06:50.267504+04:00
ARC2 (PID:284999): Archived Log entry 37953 added for T-1.S-19081 ID 0x59710513 LAD:1
2024-01-29T19:11:49.742574+04:00
Thread 1 advanced to log sequence 19083 (LGWR switch),  current SCN: 1850017400928
  Current log# 4 seq# 19083 mem# 0: +RECOC1/T24MUPPD/ONLINELOG/group_4.25214.1159317921
2024-01-29T19:11:50.244057+04:00
ARC3 (PID:285001): Archived Log entry 37955 added for T-1.S-19082 ID 0x59710513 LAD:1
2024-01-29T19:16:49.786120+04:00
Thread 1 advanced to log sequence 19084 (LGWR switch),  current SCN: 1850019678281
  Current log# 2 seq# 19084 mem# 0: +RECOC1/T24MUPPD/ONLINELOG/group_2.20885.1159319189
2024-01-29T19:16:50.487459+04:00
ARC0 (PID:284989): Archived Log entry 37957 added for T-1.S-19083 ID 0x59710513 LAD:1
2024-01-29T19:21:49.825434+04:00
Thread 1 advanced to log sequence 19085 (LGWR switch),  current SCN: 1850022603933
  Current log# 1 seq# 19085 mem# 0: +RECOC1/T24MUPPD/ONLINELOG/group_1.8494.1159318507
2024-01-29T19:21:50.709712+04:00
ARC1 (PID:284997): Archived Log entry 37959 added for T-1.S-19084 ID 0x59710513 LAD:1
2024-01-29T19:26:49.873426+04:00
Thread 1 advanced to log sequence 19086 (LGWR switch),  current SCN: 1850025333557
  Current log# 5 seq# 19086 mem# 0: +RECOC1/T24MUPPD/ONLINELOG/group_5.8787.1159318765
2024-01-29T19:26:50.793209+04:00
ARC2 (PID:284999): Archived Log entry 37961 added for T-1.S-19085 ID 0x59710513 LAD:1
2024-01-29T19:31:49.904342+04:00
Thread 1 advanced to log sequence 19087 (LGWR switch),  current SCN: 1850027318572
  Current log# 3 seq# 19087 mem# 0: +RECOC1/T24MUPPD/ONLINELOG/group_3.9998.1159317833
2024-01-29T19:31:51.281191+04:00
ARC3 (PID:285001): Archived Log entry 37963 added for T-1.S-19086 ID 0x59710513 LAD:1
2024-01-29T19:36:49.985037+04:00
Thread 1 advanced to log sequence 19088 (LGWR switch),  current SCN: 1850029142654
Reply | Threaded
Open this post in threaded view
|

Re: Batch performance

ErmanArslansOracleBlog
Administrator
*I couldn't see anything that supports your theory here..
LGWR I/O perf looks good.
However; you do lots of commits.. ( transaction per second and user commits per seconds are almost equal)
Frequent commits in between each transaction/DML can cause the foreground session to wait for "log file sync" event.
Just tell your developers to remove the unnecessary commits from their PL/SQL blocks..

Also from MOS (Ref : Oracle) ->

If there are lots of short duration transactions, see if it is possible to group transactions together so there are fewer distinct COMMIT operations. Since it is mandatory for each commit to receive confirmation that the relevant REDO is on disk, additional commits can add significantly to the overhead. Although commits can be "piggybacked" by Oracle, reducing the overall number of commits by batching transactions can have a very beneficial effect.
See if any of the processing can use the COMMIT NOWAIT option (be sure to understand the semantics of this before writing application code to use it).
See if any activity can safely be done with NOLOGGING / UNRECOVERABLE options.



*You got high log file switch count.. Investigate it.. If necessary you may consider increasing your redo sizes. (having lots of log switches will have an affect on the time foreground process will need to wait for 'log file sync' waits to complete because the overhead of initiating the sync operation more than necessary)

*Check ->
Exadata:

Page 11 of following pertains to not putting redo on SSD/Flash storage except with Engineered Systems.

http://www.oracle.com/technetwork/database/exadata/exadata-smart-flash-cache-366203.pdf

General Ref.:
Troubleshooting: 'Log file sync' Waits (Doc ID 1376916.1)
How to interpret AWR when there are Log File Sync Waits due to high number of COMMITS (Doc ID 2277523.1)
Reply | Threaded
Open this post in threaded view
|

Re: Batch performance

Roshan
Hello Erman,

yes, the load is very low on both production and UAT. Average wait time is in microseconds. Kindly advise what could you slow performance in batch? Is it network?
Reply | Threaded
Open this post in threaded view
|

Re: Batch performance

Roshan
I did a netstat
        0 connections with ECN capability
        0 times responded to ECN
        6540 embryonic connections dropped
        276452712 segments updated rtt (of 276459408 attempts)
        0 segments with congestion window reduced bit set
        0 segments with congestion experienced bit set
        0 resends due to path MTU discovery
        0 path MTU discovery terminations due to retransmits
        17318 retransmit timeouts
                7 connections dropped by rexmit timeout
        185 fast retransmits
                1 when congestion window less than 4 segments
        8195 newreno retransmits
        285 times avoided false fast retransmits
        0 persist timeouts
                0 connections dropped due to persist timeout
        2978 keepalive timeouts
                31 keepalive probes sent
                2947 connections dropped by keepalive
        0 times SACK blocks array is extended
        0 times SACK holes array is extended
        0 packets dropped due to memory allocation failure
        0 connections in timewait reused
        0 delayed ACKs for SYN
        0 delayed ACKs for FIN
        0 send_and_disconnects
        0 spliced connections
        0 spliced connections closed
        0 spliced connections reset
        0 spliced connections timeout
        0 spliced connections persist timeout
        0 spliced connections keepalive timeout
        0 TCP checksum offload disabled during retransmit
        15282 Connections dropped due to bad ACKs
        0 Connections dropped due to duplicate SYN packets
        0 fastpath loopback connections
        0 fastpath loopback sent packets (0 bytes)
        0 fastpath loopback received packets (0 bytes)
        0 fake SYN segments dropped
        0 fake RST segments dropped
        3 data injection segments dropped
        0 TCPTR max connections dropped
        0 TCPTR connections dropped for no memory
        0 TCPTR maximum per host connections dropped
        0 connections dropped due to max assembly queue depth
        0 exhausted ephemeral ports errors
Reply | Threaded
Open this post in threaded view
|

Re: Batch performance

ErmanArslansOracleBlog
Administrator
In reply to this post by ErmanArslansOracleBlog
What kinds of timeouts do you get?
Send me a screenshot of those timeouts related errors.
And when you get those timeouts what do you get in the Oracle Session and do you have anything ORA- or TNS- related error in the alert log?