Helix Failover

This is the home page of the Helix Failover system.

A Very Brief Failover History

The architecture of the Perforce Helix Versioning Engine has always included a robust internal real-time journaling mechanism. That core technology enabled development of High Availability (HA) and Disaster Recovery (DR) solutions back in the 1990s. Over the years, enterprise customers took advantage of the built-in enabling technology to develop comprehensive failover solutions.

Starting in 2009, there was a move by Perforce to make such endeavours easier, as described in this early 2010 blog article, Perforce 2009.2: High Availability & Disaster Recovery: Closer to Commodity.

Today, in 2015, the technology is further down the path of commoditizing failover.

Over the years, customers have developed custom failover solutions, at times engaging Perforce Consulting Services to build them on the Perforce Server Deployment Package, a functioning sample implementation of best practices for managing various Helix topology components (p4d, p4broker, Swarm, p4ftpd, p4dtg, p4web, etc.).

This Helix Failover page is intended to give a sense of what a custom solution might look like, based on past solutions. Despite significant advances in commoditization of the underlying technology, fully automated failover remains a custom endeavor.

What It Looks Like

<pre> Failover Mode:

p4failover.sh {-s|-u} {-ha|-dr|-local} -i <#> [-L <log>] [-si] [-v<n>] [-D]

or, Check Mode:

p4failover.sh -check [-fix|-FIX] [-i <#>] [-L <log>] [-v<n>] [-D]

or, Help Mode:

p4failover.sh [-h|-man]

OPTIONS: Check Mode

-check Specifies 'check' mode. In this mode, no failover will be executed. Instead, ssh checks are done to ensure that machines that could be involved in failover are accessible.

Check mode also reports on whether the 'p4.master' files
on the machines are in the expected state, i.e. 'true' on
the master server and 'false' on the backup.

-fix If '-fix' is specified and the 'p4.master' values are not as expected based on the p4failover.cfg file, then the p4failover.cfg file is updated to match the p4.master files.

If Helix server instances are up and running on the
"wrong" machine according to the instance's p4failover.cfg
file, the '-fix' option will refuse to operate.  To force it,
use '-FIX' instead.

-i Specify a comma-delimited list of Helix server instances to check. Or, specify the special value 'ALL' (or 'all') to check all instances specified by P4F_ALL_INSTANCES in /p4/common/bin/p4failover.cfg. The '-i' argument is optional in Check Mode; omitting it is equivalent to specifying '-i ALL'.

Regardless of the instance specified, the main log for
p4failover.sh appears in /tmp.
Supplemental logs may appear under specific instance log
directories.

OPTIONS: Failover Mode

-s Specifies a Scheduled Failover.

-u Specifies an Unscheduled Failover.

-ha Specifies an HA failover from the primary server to the HA backup server.

-dr Specifies a DR failover from the primary server to the DR backup server. Implies '-u'.

-local Specifies a failover on the primary server, but using Offline databases. Implies '-u'.

-i Specify a comma-delimited list of Perforce server instances to fail over. Or, specify the special value 'ALL' (or 'all') to fail over all instances specified by P4F_ALL_INSTANCES in /p4/common/bin/p4failover.cfg. Only instances mastered on the current machine (with a "true" value in /p4/x/bin/p4.master) are failed over. The '-i' argument is required in Failover Mode.

OPTIONS: General (applicable to Check and Failover modes)

-v<n> Set verbosity 1-5 (-v1 = quiet, -v5 = highest).

-L <log> Specify the path to a log file, or the special value 'off' to disable logging. By default, all output (stdout and stderr) goes to: /tmp/p4failover.20150723-201210.log.

NOTE: This script is self-logging.  That is, output displayed on the
screen is simultaneously captured in the log file.  Do not run this
script with redirection operators like '> log' or '2>&1', and do not
use it with 'tee'.

-si Operate silently. All output (stdout and stderr) is redirected to the log only; no output appears on the terminal. This cannot be used with '-L off'.

-D Set extreme debugging verbosity.

HELP OPTIONS:

-h Display short help message.

-man Display man-style help message.

DESCRIPTION:

Failover is about executing a transition of the Perforce service, to minimize downtime of Perforce. There are different types and modes of failover which apply in different failure scenarios.

The goals are generally to minimize both downtime and data loss in a variety of failure scenarios.

Failover Modes:

There are two failover modes: Scheduled and Unscheduled.

In Scheduled failover, all server machines are assumed to be online and operating normally, and all Perforce databases healthy. Perforce is shut down and the service is transitioned smoothly.

An Unscheduled failover is the result of something going wrong, such as power failures that might corrupt databases, a hardware failure on the primary server machine, or a disaster scenario that affects an entire site.

Failover Types:

There are three types of Failover: Local, HA, and DR.

A Local failover keeps the Perforce service on the same machine, and simply makes use of offline databases. For example, say there is a sudden power failure that shuts down both the Master and Backup server machines. Power is restored 10 minutes later. There is no reason to suspect that primary server hardware is damaged, but there is a risk that the live databases are corrupt due to the sudden power loss. A Local Failover can be executed in this case.

In Local Failover, live databases are moved aside, and the offline databases are moved into the live directory and refreshed with the latest journal file prior to starting p4d.

In a High Availability (HA) Failover, Perforce is restarted on the Backup server machine. This can be Scheduled (e.g. for planned maintenance, like adding more RAM) or Unscheduled (e.g. CPU failure).

So long as all hardware is in good working order, it is possible to execute HA Failover bi-directionally between the Primary server and the Backup server, which reverses roles after an HA failover.

In a Disaster Recovery (DR) Failover, the Perforce service is transitioned to a Disaster Recovery site. This can only be Unscheduled.

A DR failover is a one-way transition. Returning to service after a DR failover is a manual process, outside the scope of this p4failover.sh command.

CONFIG FILES:

Instance-specific p4failover.n.cfg files define the roles of various hosts in the deployment architecture for that instance. A global p4failover.cfg file in /p4/common/bin/ defines configuration common to all instances.

This script uses a central configuration, /p4/common/config/p4failover.cfg, and a set of instance-specific config files /p4/common/config/p4failover.n.cfg.

The central config file defines global settings. One setting is the definition of which Perforce server instances are affected when '-i ALL' is specified.

UNDER THE HOOD:

This script operates from any of several key server machines, and uses ssh commands to communicate with other Perforce server machines.

USAGE EXAMPLES:

Basic system sanity check: p4failover.sh -check

Planned HA Failover to allow hardware maintenance on the primary server: p4failover.sh -s -ha -i all

After a power outage, if the primary server hardware is deemed OK, a Local Failover should be done due to the risk of data corruption: p4failover.sh -u -local -i all

After a CPU failure on the Primary server machine, an Unscheduled HA Failover can be initiated: p4failover.sh -u -ha -i all

After decision to execute a DR failover (assuming the broker machine survives the calamity): p4failover.sh -u -dr -i all

SYSTEM DEPENDENCIES:

Following are dependencies of this system:

The Perforce Server Deployment Package (SDP) must be installed on key machines and configured for HA.
A system to keep the broker.conf files updated on the primary and backup machines.
SSH keys for the 'perforce' user must exist between any machines involved in the failover. This includes:
- Master Perforce Server
- Backup Perforce Server
- Perforce DR Server </pre>
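
The Check Mode behavior described above can be illustrated with a small bash sketch. This is not code from p4failover.sh; it is an assumption-laden illustration of two ideas from the help text: expanding the '-i' value (including the special value 'ALL'), and comparing a 'p4.master' file against the expected 'true' or 'false' value. The function names `expand_instances` and `check_master_flag`, the `P4_ROOT_DIR` override, and the comma-delimited format of `P4F_ALL_INSTANCES` are all hypothetical.

```shell
#!/bin/bash
# Illustrative sketch only; NOT the actual p4failover.sh implementation.

# Expand the value given to '-i' into one instance name per line.
# 'ALL' or 'all' expands to P4F_ALL_INSTANCES, which is ASSUMED here to be
# comma-delimited, like the '-i' argument itself.
expand_instances() {
    local arg="$1"
    if [[ "$arg" == "ALL" || "$arg" == "all" ]]; then
        arg="$P4F_ALL_INSTANCES"
    fi
    tr ',' '\n' <<< "$arg"
}

# Report whether an instance's p4.master file holds the expected value
# ('true' on the master, 'false' on the backup).
# P4_ROOT_DIR is a HYPOTHETICAL override of the /p4 prefix, so that the
# function can be tried against a scratch directory.
# Returns 0 on match, 1 on mismatch, 2 if the file cannot be read.
check_master_flag() {
    local instance="$1" expected="$2"
    local file="${P4_ROOT_DIR:-/p4}/$instance/bin/p4.master"
    local actual
    if [[ ! -r "$file" ]]; then
        echo "instance $instance: cannot read $file"
        return 2
    fi
    # Ignore surrounding whitespace and the trailing newline.
    actual=$(tr -d '[:space:]' < "$file")
    if [[ "$actual" == "$expected" ]]; then
        echo "instance $instance: OK ($actual)"
    else
        echo "instance $instance: MISMATCH (expected $expected, found $actual)"
        return 1
    fi
}
```

On the master machine, a loop such as `for i in $(expand_instances all); do check_master_flag "$i" true; done` would report each instance in turn. The help text says the real script performs its checks across machines over ssh, so a production version would presumably run the equivalent test remotely rather than on local files.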
