Bug 26972

Critical

GemStone/S

6.1.6, 6.1.5, 6.1.x, 6.0.2, 6.0.1, 6.0

All

6.2

Remote page cache termination can lead to cache corruption

A new mechanism was introduced in 6.0 to fix BUG#13329 (the stone can
hang for up to 9 minutes if a machine with a remote page cache
crashes or is disconnected from the network).  This mechanism uses
the system configuration parameter STN_REMOTE_CACHE_PGSVR_TIMEOUT
to determine how long the stone waits for a response from a remote
page cache before deciding it's dead, killing all sessions connected
to that remote cache, and disconnecting the remote cache from the
stone.

Under rare conditions, however, this recovery operation can leave the
primary cache on the stone machine in an inconsistent state.

This is indicated by the following entries in the stone log:

    Timeout waiting for response from cache pgsvr on host 10.20.30.40

    All gems on this remote cache will now be terminated.

    Session will be killed by stone, SessionId = 1
      reason = RepRDbfFailureHandler, remote shared cache lost.

    Session will be killed by stone, SessionId = 2
      reason = RepRDbfFailureHandler, remote shared cache lost.

    ....

followed by an error message indicating some type of page cache
inconsistency (there are many possible).
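
To check a stone log for the recovery event described above, a small
script can look for the timeout line and collect the sessions that were
killed afterward.  The following is a minimal sketch, not a supported
tool: the function name find_remote_cache_timeouts is hypothetical, and
the patterns are assumptions based only on the log excerpt shown in
this note, so the wording may differ in other versions.

```python
import re

# Patterns assumed from the log excerpt in this bug note only.
TIMEOUT_RE = re.compile(
    r"Timeout waiting for response from cache pgsvr on host (\S+)")
KILL_RE = re.compile(r"Session will be killed by stone, SessionId = (\d+)")
LOST_REASON = "remote shared cache lost"


def find_remote_cache_timeouts(log_text):
    """Return one dict per timeout event: {"host": ..., "sessions": [...]}.

    A session is recorded only when its kill entry is followed by a
    reason line that mentions the lost remote shared cache.
    """
    events = []
    current = None          # event currently being filled in
    pending_session = None  # session id waiting for its reason line
    for line in log_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            current = {"host": match.group(1), "sessions": []}
            events.append(current)
            pending_session = None
            continue
        match = KILL_RE.search(line)
        if match:
            pending_session = int(match.group(1))
            continue
        if pending_session is not None and LOST_REASON in line:
            if current is not None:
                current["sessions"].append(pending_session)
            pending_session = None
    return events
```

A match only shows that the timeout recovery ran.  Whether the cache
was actually left inconsistent still has to be judged from the error
messages that follow these entries in the log.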

These errors will usually crash the stone before any damage is done
to persistent data in the repository.  We nevertheless recommend
running a page audit and an object audit, if possible, to confirm
that the repository is intact.

Note that this new mechanism is only activated under rare conditions
(network disconnect or crash of the remote machine), and that cache
inconsistencies appear to occur in only a small number of such cases.

Workaround

No workaround.  A page audit and object audit should be run to
confirm that no data corruption has occurred.


Last updated: 12/3/07