Helix Failover
===

This is the home page of the Helix Failover system.

A Very Brief Failover History
===

The architecture of the Perforce Helix Versioning Engine has always included
a robust internal real-time journaling mechanism.  That core technology
enabled development of High Availability (HA) and Disaster Recovery (DR) solutions
back in the 1990s.  Over the years, enterprise customers took advantage of the
built-in enabling technology to develop comprehensive _failover_ solutions.

Starting in 2009, there was a move by Perforce to make such endeavours easier, as described in this early 2010 blog article, [_Perforce 2009.2: High Availability & Disaster Recovery: Closer to Commodity_](http://www.perforce.com/blog/100121/perforce-20092-high-availability-disaster-recovery-closer-commodity).

Today in 2015, the technology is farther down the path of commoditizing
failover.

Over the years, customers have developed custom failover solutions, at times
engaging Perforce Consulting Services to develop custom solutions based on
the [Perforce Server Deployment Package](https://swarm.workshop.perforce.com/projects/perforce-software-sdp), a functioning sample implementation of best practices for managing various Helix topology components (p4d, p4broker, Swarm, p4ftpd, p4dtg, p4web, etc.).

This Helix Failover page is intended to give a sense of what a custom solution might look like, based on past solutions.  Despite significant advances in commoditization of the underlying technology, fully automated failover remains a custom endeavour.

What It Looks Like
===
<pre>
Failover Mode:

p4failover.sh {-s|-u} {-ha|-dr|-local} -i <#> [-L <log>] [-si] [-v<n>] [-D]

or, Check Mode:

p4failover.sh -check [-fix|-FIX] [-i <#>] [-L <log>] [-v<n>] [-D]

or, Help Mode:

p4failover.sh [-h|-man]

**OPTIONS: Check Mode**
 
-check	Specifies 'check' mode.  In this mode, no failover will be
 	executed.  Instead, ssh checks are done to ensure that
	machines that could be involved in failover are accessible.

	Check mode also reports on whether the 'p4.master' files
	on the machines are in the expected state, i.e. 'true' on
	the master server and 'false' on the backup.

 -fix	If '-fix' is specified and the 'p4.master' values are not
 	as expected based on the p4failover.cfg file, then the
	p4failover.cfg file is updated to match the p4.master files.

	If Helix server instances are up and running on the
	"wrong" machine according to the instance's p4failover.cfg
	file, the '-fix' option will refuse to operate.  To force it,
	use '-FIX' instead.

 -i	Specify a comma-delimited list of Helix server instances to
	check.  Or, specify the special value 'ALL' (or 'all') to
	check all instances specified by P4F_ALL_INSTANCES in
	/p4/common/bin/p4failover.cfg.  The '-i' argument is optional in Check
	Mode; omitting it is equivalent to specifying '-i ALL'.

	Regardless of the instance specified, the main log for
	p4failover.sh appears in /tmp.
	Supplemental logs may appear under specific instance log
	directories.

**OPTIONS: Failover Mode**

 -s	Specifies a Scheduled Failover.

 -u	Specifies an Unscheduled Failover.

 -ha	Specifies an HA failover from the primary server to the HA backup
 	server.

 -dr	Specifies a DR failover from the primary server to the DR backup
 	server.  Implies '-u'.

 -local Specifies a failover on the primary server, but using Offline
 	databases. Implies '-u'.

 -i	Specify a comma-delimited list of Perforce server instances to
	fail over.  Or, specify the special value 'ALL' (or 'all') to
	fail over all instances specified by P4F_ALL_INSTANCES in
	/p4/common/bin/p4failover.cfg.  Only instances mastered on the
	current machine (with a "true" value in /p4/x/bin/p4.master)
	are failed over.  The '-i' argument is required in Failover
	Mode.

**OPTIONS: General (applicable to Check and Failover modes)**

 -v<n>	Set verbosity 1-5 (-v1 = quiet, -v5 = highest).

 -L <log>
	Specify the path to a log file, or the special value 'off' to disable
	logging.  By default, all output (stdout and stderr) goes to a
	timestamped file in /tmp, e.g. /tmp/p4failover.20150723-201210.log.

	NOTE: This script is self-logging.  That is, output displayed on the
	screen is simultaneously captured in the log file.  Do not run this
	script with redirection operators like '> log' or '2>&1',
	and do not use it with 'tee'.

-si	Operate silently.  All output (stdout and stderr) is redirected to
	the log only; no output appears on the terminal.  This cannot be
	used with '-L off'.

 -D     Set extreme debugging verbosity.

**HELP OPTIONS:**
 -h	Display short help message
 -man	Display man-style help message

**DESCRIPTION:**

Failover transitions the Perforce service from one server or set of
databases to another.  There are different types and
modes of failover, which apply in different failure scenarios.

The goals are generally to minimize both downtime and data loss in
a variety of failure scenarios.

_Failover Modes_:

There are two failover modes: Scheduled and Unscheduled.

In _Scheduled_ failover, all server machines are assumed to be online
and operating normally, and all Perforce databases healthy. Perforce
is shut down and the service is transitioned smoothly.

An _Unscheduled_ failover is the result of something going wrong, such as
power failures that might corrupt databases, a hardware failure on the
primary server machine, or a disaster scenario that affects an
entire site.

_Failover Types_:

There are three _types_ of Failover: Local, HA, and DR.

A **Local** failover keeps the Perforce service on the same machine,
and simply makes use of offline databases.  For example, say there is
a sudden power failure that shuts down both the Master and Backup
server machines.  Power is restored 10 minutes later.  There is no
reason to suspect that primary server hardware is damaged, but there
is risk the live databases might be corrupt due to the sudden power
loss.  A Local Failover can be executed in this case.

In Local Failover, live databases are moved aside, and the offline
databases are moved into the live directory and refreshed with the
latest journal file prior to starting p4d.

In a **High Availability (HA) Failover**, Perforce is restarted on the
Backup server machine.  This can be Scheduled (e.g. for planned
maintenance, like adding more RAM) or Unscheduled (e.g. CPU failure).

So long as all hardware is in good working order, it is possible to
execute HA Failover bi-directionally between the Primary server
and the Backup server; the two reverse roles after each HA failover.

In a **Disaster Recovery (DR) Failover**, the Perforce service is transitioned
to a Disaster Recovery site.  This can only be Unscheduled.

A DR failover is a one-way transition.  Returning to service after a DR failover is a manual process, outside the scope of
this p4failover.sh command.

**CONFIG FILES:**

Instance-specific p4failover.n.cfg files define the roles of various hosts
in the deployment architecture for that instance.  A global p4failover.cfg
file in /p4/common/bin/ defines configuration common to all instances.

This script uses a central configuration, /p4/common/config/p4failover.cfg, and a set of
instance-specific config files /p4/common/config/p4failover._n_.cfg.

The central config file defines global settings. One setting is the
definition of which Perforce server instances are affected
when '-i ALL' is specified.

**UNDER THE HOOD:**

This script can be run from any of several key server machines, and uses ssh commands to communicate with the other Perforce server machines.

**USAGE EXAMPLES:**

Basic system sanity check:
   p4failover.sh -check

Planned HA Failover to allow hardware maintenance on the primary server:
   p4failover.sh -s -ha -i all

After a power outage, if the primary server hardware is deemed OK,
a Local Failover should be done due to the risk of data
corruption:
   p4failover.sh -u -local -i all

After a CPU failure on the Primary server machine, an Unscheduled HA
Failover can be initiated:
   p4failover.sh -u -ha -i all

After decision to execute a DR failover (assuming the broker machine survives the calamity):
   p4failover.sh -u -dr -i all

**SYSTEM DEPENDENCIES:**

Following are dependencies of this system:

* The Perforce Server Deployment Package (SDP) must be installed on key machines and configured for HA.
* A system to keep the broker.conf files updated on the primary and backup machines.
* SSH keys for the 'perforce' user must exist between any machines involved in the failover.  This includes:
 * Master Perforce Server
 * Backup Perforce Server
 * Perforce DR Server
</pre>
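
The Failover Mode rule that only instances with a "true" value in their `p4.master` file are failed over can be sketched in bash as follows. This is a minimal illustration rather than the actual p4failover.sh code: the function names `is_master` and `mastered_instances` and the `P4_BASE` variable are hypothetical, and the `/p4/<instance>/bin/p4.master` layout is taken from the help text above.

```shell
#!/bin/bash
# Hypothetical sketch; not the actual p4failover.sh implementation.
# P4_BASE is an assumed variable naming the directory that holds the
# per-instance trees (/p4 in the help text above).
P4_BASE="${P4_BASE:-/p4}"

# Succeed if the instance's p4.master file is readable and contains "true".
is_master() {
    local flag_file="$P4_BASE/$1/bin/p4.master"
    [[ -r "$flag_file" ]] || return 1
    [[ "$(tr -d '[:space:]' < "$flag_file")" == "true" ]]
}

# Given a comma-delimited instance list, print the instances that are
# mastered on this machine, one per line.
mastered_instances() {
    local -a instances
    local instance
    IFS=',' read -r -a instances <<< "$1"
    for instance in "${instances[@]}"; do
        if is_master "$instance"; then
            echo "$instance"
        fi
    done
}
```

With this sketch, `mastered_instances 1,2,3` prints only those of instances 1, 2 and 3 whose `p4.master` file reads `true`; instances with a missing or `false` file are skipped.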
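
The self-logging behaviour described under the '-L' option, where output shown on screen is simultaneously captured in a timestamped log, is commonly achieved in bash with `exec` and `tee`. The sketch below shows that general idiom, assuming the default log name follows the `p4failover.YYYYMMDD-HHMMSS.log` pattern seen in the help text. The function names are hypothetical, and the real script may implement this differently.

```shell
#!/bin/bash
# Hypothetical sketch of a self-logging script; not taken from p4failover.sh.

# Build a default log path such as /tmp/p4failover.20150723-201210.log.
default_log_path() {
    echo "/tmp/p4failover.$(date +'%Y%m%d-%H%M%S').log"
}

# Send all further stdout and stderr of the current shell both to its
# existing stdout (normally the terminal) and to the given log file.
start_self_logging() {
    local log="$1"
    exec > >(tee -a "$log") 2>&1
}

# Typical use near the top of a script:
#   start_self_logging "$(default_log_path)"
#   echo "This line appears on screen and in the log."
```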