SDP-341

cgeen
Closed
Critical recreate_db_checkpoint.sh bug with shared /hxdepots.

This bug won't impact many customers because it requires an unlikely
sequence of events.  But the impact is high if it hits, and addressing
the issue is critical.

== Background ==

This bug was introduced in SDP 2016.2.21193 (December 2, 2016) and
affects every version from that release through 2018.1.23583
(February 8, 2018), inclusive.  Older and newer versions are
unaffected, and the Windows SDP is unaffected.

The issue is in the recreate_db_checkpoint.sh script.  The default SDP
crontab calls this script only twice a year, and the call is sometimes
disabled entirely because it is optional.  The script replaces live
databases in P4ROOT with fresh, regenerated-from-a-checkpoint databases
from the offline_db tree maintained by the SDP.

The default crontab calls recreate_db_checkpoint.sh twice per year, on
the first Saturday in January and July at 6:05 PM in the master
server's time zone.
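
Standard cron cannot express "first Saturday of the month" in its
fields alone, because restricting both day-of-month and day-of-week
matches a day that satisfies either one.  A schedule like the one above
is therefore commonly paired with a date test.  The sketch below shows
one way such a test could look.  The function name is hypothetical and
this is not necessarily how the SDP crontab implements the schedule.

```shell
# Hypothetical helper: succeed only on the first Saturday of January or July.
# Arguments: month (1-12), day of month (1-31), and ISO weekday
# (1=Monday .. 7=Sunday, as printed by `date +%u`).
is_first_sat_jan_jul() {
    month=${1#0}    # strip one leading zero, e.g. "07" becomes "7"
    day=${2#0}
    dow=$3
    case $month in
        1|7) ;;
        *) return 1 ;;
    esac
    # The first Saturday always falls within days 1 through 7.
    [ "$dow" -eq 6 ] && [ "$day" -le 7 ]
}
```

A wrapper could call it as
`is_first_sat_jan_jul "$(date +%m)" "$(date +%d)" "$(date +%u)"` and
proceed only when it succeeds.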

The issue only occurs when all of the following are true:
* The storage volume used for archive files is shared (e.g. via NFS
or SAN) across a master and its HA server.
* A failover from the master server to the HA replica has been done.
* The recreate_db_checkpoint.sh script runs accidentally on the
out-of-commission master (e.g. via a forgotten cron job that is
still running on the old master).

The negative impact occurs in a failover-then-failback situation,
when the script is run on the old master but, due to the shared
storage, rotates database symlinks on the new master server machine.
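
The mechanism can be illustrated with a toy simulation in which one
temporary directory stands in for the shared volume.  Everything here
is an assumption made for illustration: the link names, the target
directories, and the idea that the rotation is a simple swap of two
symlinks.  It is not the actual logic of recreate_db_checkpoint.sh.

```shell
# Toy simulation only.  In a real deployment the link targets would be
# host-local database directories; here they live in a temp directory.

# Swap the targets of two symlinks inside a directory.
swap_links() {
    dir=$1
    a=$(readlink "$dir/root")
    b=$(readlink "$dir/offline_db")
    rm -f "$dir/root" "$dir/offline_db"
    ln -s "$b" "$dir/root"
    ln -s "$a" "$dir/offline_db"
}

shared=$(mktemp -d)                      # stands in for the shared volume
mkdir "$shared/db_live" "$shared/db_fresh"
ln -s "$shared/db_live"  "$shared/root"
ln -s "$shared/db_fresh" "$shared/offline_db"

# The "old master" rotates the links in the shared directory.
swap_links "$shared"

# The "new master" reads the same directory, so its root link has
# changed underneath it even though it never ran the script.
readlink "$shared/root"
```

Because both machines resolve the same links on the same volume, a
rotation performed by either one is immediately visible to the other.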

It is not likely to hit many customers, but when it does, the impact
is an outage and the need to recover from a checkpoint and journal.
(Luckily, those are always available with the SDP.)

=== A QUICK FIX ===

Customers should DELETE these two scripts from the installation:

/p4/common/bin/recreate_db_checkpoint.sh
/p4/common/bin/recreate_db_sync_replica.sh

Then remove any calls to these two scripts from the crontab of the
OS account under which Perforce runs on any and all Perforce server
machines.  This OS account is typically 'perforce' or 'p4admin'.
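
As a minimal sketch of the crontab cleanup, the filter below drops
every line that mentions either script name.  The function name is
hypothetical, and the filter also drops commented-out lines that
mention the scripts, so review the result before installing it.

```shell
# Hypothetical helper: read a crontab on stdin and write it to stdout
# without any line that references either of the two scripts.
strip_recreate_calls() {
    # grep exits non-zero when nothing is left; treat that as success.
    grep -v -E 'recreate_db_(checkpoint|sync_replica)\.sh' || true
}

# Possible usage, run as the Perforce OS account (not executed here):
#   crontab -l > crontab.before
#   strip_recreate_calls < crontab.before > crontab.after
#   diff crontab.before crontab.after    # review the removed lines
#   crontab crontab.after                # install the cleaned crontab
```

Keeping the before and after files makes the change easy to review
and to roll back.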

If you are not comfortable with the SDP, this is a fast, safe, easy
fix.  It only requires login access to the machine and OS file
permissions sufficient to delete the scripts.  It can be applied
immediately by anyone with login access to the Perforce server machine.
It does NOT require an SDP update.

After making this change, the HA replica that shares archives with
its master server must be reseeded from the latest checkpoint on
the master.

This quick fix will remove the capability to occasionally replace live
databases with fresh ones regenerated from a checkpoint.  That
functionality is non-critical to most customers.

=== THE QUICK SDP PATCH ===

A quick SDP patch has been released that simply deletes this script
and references to it in the crontab and documentation.  (A fixed
version of the script will likely re-appear in a future release.)

=== A BETTER, MORE SOPHISTICATED FIX ===

For customers who want to preserve the capability to routinely replace
live databases with fresh ones regenerated from a checkpoint, a
workaround is to change the SDP structure rather than delete the
recreate_db_checkpoint.sh script.

The solution outlined below has been proven to work.  If you are
comfortable with the SDP, this is the best fix.

Details:  Since the early days of the SDP in 2007, it has been
structured so that the /p4 directory was on the root volume (/), and
the individual SDP instance-specific directories, e.g. /p4/1, were on
the storage volume used for archive files (often named /hxdepots or
/depotdata, but can be different at any given customer site).  The
instance-specific directories contained a mix of regular directories
(for things stored on the archive files volume) and symlinks.

To fix this issue, restructure it so that the /p4 directory and
instance-specific directories like /p4/1 are ALL on the root
volume (/).  The instance-specific directories contain only
symlinks and .p4tickets/.p4trust files in this structure.
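
The restructured layout can be spot-checked with a sketch like the
one below.  Both function names are hypothetical, and the `find`
options `-mindepth` and `-maxdepth` are assumed to be available (they
are in GNU and BSD find).

```shell
# Hypothetical helper: list entries directly inside an instance
# directory that are neither symlinks nor .p4tickets/.p4trust files.
# No output means the directory matches the layout described above.
unexpected_entries() {
    find "$1" -mindepth 1 -maxdepth 1 \
        ! -type l ! -name .p4tickets ! -name .p4trust
}

# Hypothetical helper: print the filesystem that holds a path, taken
# from the first column of POSIX-format `df` output.
volume_of() {
    df -P "$1" | awk 'NR == 2 { print $1 }'
}

# Possible usage for instance 1 (not executed here):
#   [ "$(volume_of /p4/1)" = "$(volume_of /)" ] || echo "/p4/1 is not on /"
#   unexpected_entries /p4/1
```

Since `df` resolves symlinks, a /p4/1 that is itself a symlink onto
the archive volume reports that volume rather than the root volume.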

This fix can be applied manually and does not require an SDP
upgrade.  Further, it will work with future versions of the
SDP: this change to the symlink and directory structure of the
/p4 and /p4/N directories was already on track for inclusion in
a future SDP release for performance reasons before this bug was
detected.  (The performance benefit is that access to the
latency-sensitive /p4/N/root does not pay a high latency tax by
going through a /p4/N symlink on a shared storage volume.)

=== FUTURE FIX ===

A future SDP release will provide a fix that preserves the
capability to routinely replace live databases with fresh ones
regenerated from a checkpoint.  Customers will need to update to
the latest SDP to get the new version when it is available.

Status: Closed
Project: perforce-software-sdp
Severity: A
Reported By: cgeen
Reported Date:
Modified By: tom_tyler
Modified Date:
Owned By: tom_tyler
Component: core-unix
Type: Bug