Perforce Helix Management System (HMS)
===

Introduction
---

As Perforce Helix evolves to meet ever-growing enterprise demands, sophisticated global deployment architectures have become commonplace. There has been a corresponding rise in deployment complexity as more Helix installations take advantage of:

* Proxies
* Brokers
* Replicas
* Replication Filtering
* Edge servers
* Swarm
* Git Fusion
* P4Search
* P4DTG
* P4Web
* Custom automation like CBD (built largely on the broker)

Helix Management System to the Rescue
---

That's a lot of complexity to manage! Fear not! The Helix Server and the Server Deployment Package (SDP) are already used for this purpose in custom solutions. The Helix Management System evolves and codifies manual practices currently used by Perforce Consultants to help customers manage sophisticated enterprise environments.

HMS Goals
---

Simply put: Routine Helix administration tasks should be consistent and simple. HMS will help with:

* **Knowing What You Have** - Knowing which Helix components, and which versions of those components, exist in your topology. HMS v1.0 will not do any form of automated *discovery* of the topology, but will provide a well-defined way of *defining* and tracking all components in use, where they are, and various details of how they are configured.

* **Consistent Start/Stop/Status** - Managing the various Helix Server instances in your environment, including system start, stop, and "Down for Maintenance" modes. The mechanical steps to start and stop a Helix Server can vary based, for example, on whether or not a broker is in place for that instance, whether or not SSL is enabled, etc. HMS abstracts those details to get it down to simple Start/Stop/Status operations.

* **Health Status Ping** - For each Helix topology component, v1.0 will quickly show whether or not it is running; a minimal sketch of such a check appears after this list. (Later, this may expand to include aspects of health monitoring, going beyond mere up/down status.)

* **Upgrading** - Upgrades in a global topology are straightforward and well understood, but there are a lot of moving parts and a lot of commands to type to make them happen. HMS will make this easy by knowing which components have a newer version and/or a newer patch of the same version available. It will be easy to upgrade all Helix topology components, individually or in groups, with the click of a button. In sophisticated topologies involving edge servers and replicas, there will be built-in awareness of the order in which components must be upgraded, without relying on a human admin to know those details. When it comes to topology-wide upgrades, enterprises need it all -- control, flexibility (which introduces a degree of complexity), and operational simplicity. They may want to apply a P4D patch to one instance but not others, or upgrade all instances at once. We can present admins with options rather than have them figure out custom upgrade procedures for updating executables and tweaking symlinks to get it right.

* **Human-Initiated Failover** - HMS will execute the steps to achieve a failover by calling a single `p4_failover.sh` script. Stretch Goal: Failover addresses mechanics comprehensively, even including things that are often outside the scope of Perforce administrators but which are truly necessary to achieve failover in some environments: things like DNS updates, Virtual IP configuration changes, etc. Note that *fault detection* and *automated failover initiation* are explicitly outside the scope of this project. This project's more humble goal is simply to clarify and simplify the mechanics of executing a pre-defined failover path.

* **Comprehending the Parts** - Knowing every detail will help Perforce admins understand the many moving parts that keep a Helix environment happy. Details like:
    * Seeing OS and hardware details
    * Knowing crontab files for key users on every machine
    * Checking firewall configurations to ensure key ports are open
    * Knowing what OS user accounts the server products run as
    * Knowing what LDAP servers are used
    * Knowing what ports each server process is listening on
    * Capturing `p4 info -s` from each Helix Server instance

    All this and much more is needed, and it should be visible from a source more dynamic and reliably updated than a human-maintained wiki page. The data will be gathered centrally for the human administrator, but kept current automatically.
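As a minimal illustration of the Health Status Ping and the `p4 info -s` capture described above, a loop like the following could be run from the HMS server. The instance names and port values are placeholders; HMS would drive such a check from its topology definition rather than a hard-coded list.

```bash
#!/bin/bash
# Minimal health-ping sketch. The instance names and P4PORT values below are
# examples only, not part of a defined HMS interface.

# Hypothetical map of Helix Server instances to their P4PORT values.
declare -A INSTANCES=(
  [fgs]="helix-01.example.com:1666"
  [fgs_edge]="helix-02.example.com:1666"
)

for instance in "${!INSTANCES[@]}"; do
  port="${INSTANCES[$instance]}"
  # 'p4 info -s' is a lightweight query, well suited to a quick up/down check.
  if p4 -p "$port" info -s >/dev/null 2>&1; then
    echo "UP    $instance ($port)"
  else
    echo "DOWN  $instance ($port)"
  fi
done
```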
HA/DR and Failover
---

With Helix Server 2015.2, High Availability (HA) and Disaster Recovery (DR) solutions are closer to being commoditized than ever before. But they are still not quite a commodity. HMS will capture and codify what Perforce Consultants have done for individual customers in the past with custom solutions, automating all the wiring under a big red Failover button.

The HMS Failover system will provide the following options for failing over (illustrative invocations appear after this list):

* **Scheduled vs. Unscheduled Failover** - A *scheduled failover* is a planned event, not a reaction to a problem. In a scheduled failover, assumptions can safely be made about the state of things. This might occur, for example, to allow master Server A to be powered down for several hours to add RAM, with Server B coming online to avoid downtime of the Helix ecosystem of more than a few minutes. Nothing is broken, so this type of failover can be nearly transparent. An *unscheduled failover* occurs as a decision by a human administrator, in reaction to something breaking.

* **Local Failover** - A *local failover* is a failover to an offline copy of the Perforce databases on the same machine. This is useful for scenarios where the database integrity is in question for some reason, but there is no reason to suspect the hardware is damaged. For example, this might be the case after a sudden power loss, or after a human admin error (such as removing live databases by accident).

* **HA Failover** - An HA failover involves failing over to another server in the same data center, optionally sharing storage with the master for archive files. Little or no data loss is acceptable for an HA failover, and downtime should be minimal.

* **DR Failover** - A DR failover involves failing over to another data center. Some data loss is expected, and higher downtime is deemed acceptable (as it is unavoidable). DR failover can be further classified as:
    * **Short-Range** - The DR machine is near the master, perhaps in a building across the street. This defends against a fire in one building and, being close, can be expected to have minimal data loss.
    * **Medium-Range** - The DR machine is in the same metro area as the master, ideally on a different power grid. This provides a mix of minimal data loss and good protection against regional disasters, like large fires or floods.
    * **Long-Range** - The DR machine is very far from the master, across the continent or even across the planet. This is intended to provide business continuity in the face of terrible disasters, such as wars, large-scale earthquakes, and perhaps small asteroid strikes.
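To make these options concrete, the sketch below shows how the failover types might be requested via the `p4_failover.sh` script and the `p4 hms` broker commands mentioned in this document. The flag names for `p4_failover.sh` are assumptions for illustration, not a finalized interface.

```bash
# Illustrative invocations only: the p4_failover.sh flags shown here are
# assumptions for the sake of example, not a finalized HMS interface.

# Scheduled local failover of instance fgs to its offline databases:
p4_failover.sh -i fgs -t local -s

# Unscheduled HA failover of instance fgs to its standby in the same data center:
p4_failover.sh -i fgs -t ha

# The equivalent request issued through the HMS broker commands:
p4 hms failover fgs -ha
```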
HMS Product Road Map
===

Version | Target Timeframe | Features Included/Excluded
------- | ---------------- | ---------------------------
1.0 | Late Q3 2016 | SDP Version Management, Topology Definition, Failover Support for **p4d** and **p4broker**, Replication Normalization, Simplified Replica/Edge Server Setup
1.1 | Q4 2016 | Failover Support for **Swarm**
1.2 | Q4 2016 | Helix Component Status Visualization
1.3 | Q4 2016 | Status Monitoring
1.5 | Unscheduled | Failover Support for **Git Fusion**

Feature Descriptions
---

* **SDP Version Management** - A necessary step on the path to successful automation across multiple machines is to solve basic version management challenges for the SDP itself. Essentially, Perforce itself is used to version the SDP on all HMS hosts -- hosts on which any Perforce server process executes, be it a proxy, broker, master server, Swarm, P4Search, etc. The familiar SDP `/p4/common` structure is managed on all hosts from a single server, the HMS server. That server in turn uses Helix DVCS features to vastly simplify the process of updating the SDP from The Workshop, taking advantage of Perforce merging power to merge updates to the stock public SDP into the local environment, which will have customizations versioned locally.

* **Topology Definition** - HMS provides a data format to describe various aspects of the Helix server components in use.

* **Failover Support for ...** - HMS will provide Scheduled and Unscheduled Failover for the specified components.

* **Replication Normalization** - HMS will not support all possible replication options. Part of its value is that it will support only the most useful replica types and usage options, based on Consulting experience working with customers, distilling the myriad potential options (all of which are useful in some context). The normalization includes defining naming conventions for server specs that readily convey their usage.

* **Simplified Replica/Edge Server Setup** - All the complexities of creating and seeding a replica are distilled down to a command like this example, which creates a forwarding replica: `mkrep.sh -t fr -s bos -r bos-helix-02`

HMS System Components (for HMS 1.0)
===

* The SDP, on which HMS builds. The SDP includes scripts to manage and synchronize many small details -- synchronizing the initialization scripts, ensuring no unversioned changes to backup scripts go unnoticed, tracking changes to crontab files for the perforce user, and much more. This provides greater visibility into all the tiny details that matter and need to be considered when doing an upgrade -- things like disabling crontabs for backups during maintenance windows, and turning them back on later.

* A dedicated Helix Server that models the global topology, knows every machine that contains Helix topology components of any kind, and knows about all the Helix Server _instances_ (or _data sets_). This server will maintain an internal map of the global topology that can be updated from the various Helix instances, each of which is aware of its own topology (maintained with **p4 server** specs).

* A broker, which adds new commands, e.g. **p4 hms start _fgs_**, **p4 hms failover _fgs_ -ha**, **p4 hms main start _fgs_**, and more.

* Working SSH keys, so that the HMS server can operate on each of the managed hosts.

Custom Extension
===

Failover in an enterprise environment may always involve some degree of customization. HMS will capture everything that can be understood based on how the various Perforce software technologies work, and provide clear injection points for handling things likely to be topology-specific, such as redirecting traffic via DNS or Virtual IP changes.
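As an illustration of what such an injection point might look like, the sketch below shows a hypothetical site-specific hook that HMS could invoke after promoting a new master. The script name, arguments, and call point are assumptions, not a defined HMS interface.

```bash
#!/bin/bash
# Hypothetical site-specific failover hook. The name, arguments, and the point
# at which HMS would call it are assumptions for illustration only.
set -euo pipefail

INSTANCE="$1"      # SDP instance name, e.g. fgs
NEW_MASTER="$2"    # host that has just been promoted to master

# Topology-specific steps that fall outside Perforce itself, for example:
#   * repoint a DNS alias used in P4PORT values at $NEW_MASTER
#   * move a virtual IP address to $NEW_MASTER
#   * notify monitoring systems that a failover has occurred
echo "Redirecting client traffic for instance $INSTANCE to $NEW_MASTER (site-specific steps go here)."
```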
Project Status
===

This project started active development in mid-2016. As of August 4, 2016, it is in the design and early development phase.