Bug Note 42470 - Reclaim GC gem aborting while reclaiming dead can corrupt repository

Critical

GemStone/S 64 Bit

3.1.0.1, 3.1, 3.0.1, 3.0, 2.4.5.3, 2.4.5.2, 2.4.5.1, 2.4.5, 2.4.4.8, 2.4.4.7, 2.4.4.6, 2.4.4.5, 2.4.4.4, 2.4.4.3, 2.4.4.2, 2.4.4.1, 2.4.4, 2.4.3, 2.4.2.1, 2.4.2, 2.4.1, 2.4, 2.2.6, 2.2.5.4.2, 2.2.5.4.1, 2.2.5.3, 2.2.5.2, 2.2.5.1, 2.2.5, 2.2.x

All Platforms

3.2, 3.1.0.2, 2.4.6, 2.2.6.1

Reclaim GC gem aborting while reclaiming dead can corrupt repository

There is a small risk that if a reclaim GC gem is reclaiming dead objects and aborts in response to a sigAbort, the internal tables that track the dead objects will become corrupted. This can cause a wide range of page-level and object-level corruption, including page cache errors, "object does not exist" errors, and corrupted application data.

Workaround

We *highly* recommend that customers upgrade as soon as possible to a version containing the fix (3.2, 3.1.0.2, 2.4.6, 2.2.6.1, or later).

Until you upgrade, you can reduce the likelihood that a reclaim GC gem receives a sigAbort by avoiding the following:

1.  Generating a commit record (CR) backlog that exceeds the STN_CR_BACKLOG_THRESHOLD setting.  This is difficult to avoid in practice, which is the main reason we encourage you to upgrade as soon as possible.  In the interim, you should develop a monitoring task that detects when the CR backlog approaches the threshold and shuts down the reclaim GC gems.  See the workaround code for an example.

2.  Running any garbage collection operation that reports new dead objects, such as MFC, markGcCandidates, and Epoch GC.  At the completion of these operations, a vote is taken on the possible dead objects, and sessions can be sent sigAborts to get their attention.

To run any garbage collection operation safely, you should either shut down the reclaim GC gems or reconfigure them so that they do not process dead objects.  They should remain in this state until the vote is completed and the statmon stone statistic GcVoteState returns to zero.

You can shut them down by doing:

1.  Run:  System stopAllReclaimGcSessions.

Restore by doing:

1.  Run:  System startAllReclaimGcSessions.

If you would like to keep the reclaim GC gems running to perform shadowed page reclaim but not process dead objects, do the following:

1.  Run:  System stopAllReclaimGcSessions.
2.  As GcUser:  set UserGlobals at: #reclaimDeadEnabled to false.
3.  Run:  System startAllReclaimGcSessions.

Restore by doing:

1.  Run:  System stopAllReclaimGcSessions.
2.  As GcUser:  set UserGlobals at: #reclaimDeadEnabled to true.
3.  Run:  System startAllReclaimGcSessions.
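
The disable steps above can be sketched as a Topaz script.  This is only a sketch under stated assumptions: the login commands are omitted, the "set ... to false" step is assumed to mean an at:put: on UserGlobals, and the change is assumed to need a commit before the reclaim GC gems are restarted so that they see it.  Use `put: true` in the middle step to restore.

```smalltalk
! Sketch only: login/logout commands are omitted.
! Step 1: logged in as an administrative user, stop the reclaim GC gems.
run
System stopAllReclaimGcSessions.
%
! Step 2: logged in as GcUser, turn off dead-object processing.
! Assumption: the change is committed so restarted gems will see it.
run
UserGlobals at: #reclaimDeadEnabled put: false.
System commitTransaction
%
! Step 3: logged in as the administrative user again, restart the gems.
run
System startAllReclaimGcSessions.
%
```

If the commit in the second step answers false, the setting was not saved, and the gems should not be restarted until it succeeds.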

! BUG 42470 Support Code
!
! Example code for stopping Reclaim GC gems when CR backlog threshold
! is exceeded, and restarting them when it falls back below.
!
! Useful as a temporary workaround for part of bug 42470.
! Note that you also need to disable Epoch GC and disable
! reclaim GC gems when performing MFC (see bugnote for details).
!
! Note that depending on settings of sleepTime, shutdownOffset,
! and restartOffset and the behavior of your system,  you *still* could 
! have reclaim GC gems running when the CR backlog threshold is crossed 
! and risk hitting bug 42470.
!
run
| crb crbThreshold sleepTime reclaimRunning shutdownOffset restartOffset |

"Configure as appropriate for your site"
sleepTime := 1.  "in seconds"
shutdownOffset := 10. "to allow for lag time before reclaim gems can shutdown"
restartOffset :=  20.
reclaimRunning := true.

"Make us transactionless so we respond to sigAborts (if needed)"
System transactionMode: #transactionless.

"Get the stone configuration STN_CR_BACKLOG_THRESHOLD"
crbThreshold := System configurationAt: #StnCrBacklogThreshold.

[ true ] whileTrue: [
    System abortTransaction.
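    "Read the CR backlog from cache statistics slot 1 at index 160;
     confirm this index for your GemStone version before relying on it"
    crb := (System cacheStatistics: 1) at: 160.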
    (reclaimRunning and: [crb > (crbThreshold - shutdownOffset)])
    ifTrue: [
        GsFile stdout log: 
            (DateTime now asString) , ': ' ,
            'CR backlog exceeded: shutting down reclaim gems'.
        System stopAllReclaimGcSessions.
        reclaimRunning := false ].
    ((reclaimRunning not) and: [crb < (crbThreshold - restartOffset)])
    ifTrue: [
        GsFile stdout log: 
            (DateTime now asString) , ': ' ,
            'CR backlog cleared: restarting reclaim gems'.
        System startAllReclaimGcSessions.
        reclaimRunning := true ].
    System sleep: sleepTime ].
%
