Good News! The HMS Project is being blended into the standard SDP! Follow the Server Deployment Package (SDP) project for updates.
This has several benefits. As a result, this HMS project will cease to be updated as a standalone project in The Workshop.
As Perforce Helix evolves to meet ever-growing enterprise demands, sophisticated global deployment architectures have become commonplace.
There has been a corresponding rise in deployment complexity as more Helix installations take advantage of features such as replication, edge servers, brokers, proxies, and SSL.
That's a lot of complexity to manage! Fear not! The Helix Versioning Engine and the Server Deployment Package (SDP) are well suited to managing it. The Helix Management System (HMS) evolves and codifies manual best practices used by Perforce Consultants and enterprise site admins to help customers manage sophisticated enterprise environments.
Simply put: Routine Helix administration tasks should be consistent and simple.
HMS will help with:
Knowing What you Have - Knowing what components, and what versions of those components, exist in your topology. HMS v1.0 will not do any form of automated topology discovery, but will provide a well-defined way of declaring and tracking all components in use, where they are, and various details of how they are configured.
Consistent Start/Stop/Status - Managing various Helix Server instances in your environment, including system start, stop, and "Down for Maintenance" modes. The mechanical steps to start and stop a Helix Server can vary based, for example, on whether a broker is in place for that instance or whether SSL is enabled. HMS abstracts those details down to Start/Stop/Status.
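As a rough illustration of that abstraction, the per-instance variation can be reduced to data, with the start and stop sequences derived from it. The component commands and ordering below are illustrative assumptions, not HMS's actual implementation:

```python
# Illustrative sketch only: how a start/stop abstraction might derive
# ordered steps from instance configuration. Command names are stand-ins.

def stop_steps(instance):
    """Stop the outermost component (broker) first, then the server."""
    steps = []
    if instance.get("has_broker"):
        steps.append("p4broker stop")
    steps.append("p4d stop")
    return steps

def start_steps(instance):
    """Start in the reverse order: server first, then broker."""
    steps = ["p4d start"]
    if instance.get("has_broker"):
        steps.append("p4broker start")
    return steps
```

The admin sees only Start/Stop/Status; whether a broker is part of the sequence is resolved from the instance's recorded configuration.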
Health Status Ping - For v1.0, each Helix topology component reports whether or not it is running. (Later, this may expand to include aspects of health monitoring, going beyond mere up/down status.)
Upgrading - Upgrades in a global topology are straightforward and well understood, but there are a lot of moving parts and a lot of commands to type. HMS will make this easy by knowing which components have a newer version or a newer patch of the current version available. It will be easy to upgrade all Helix topology components, individually or in groups, with the click of a button. In sophisticated topologies involving edge servers and replicas, there will be built-in awareness of the order in which components must be upgraded, without relying on a human admin to know those details.
When it comes to topology-wide upgrades, enterprises need it all -- control, flexibility (which introduces a degree of complexity), and operational simplicity. They may want to apply a P4D patch to one instance but not others, or upgrade all instances at once. We can present admins with options rather than have them figure out custom upgrade procedures, updating executables and tweaking symlinks to get it right.
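That ordering awareness could be sketched as a simple walk of the server topology. This assumes, for illustration, that downstream servers (replicas and edge servers) are upgraded before the servers they pull from, ending with the commit server; the names and structure here are hypothetical:

```python
# Illustrative sketch of topology-aware upgrade ordering (not HMS code).
# Assumption for this example: downstream servers are upgraded before
# the upstream server they pull from, ending with the commit server.

def upgrade_order(topology):
    """topology maps each server to its upstream target (None for the
    commit server). Returns servers ordered deepest-downstream first."""
    def depth(server):
        d = 0
        while topology[server] is not None:
            server = topology[server]
            d += 1
        return d
    return sorted(topology, key=depth, reverse=True)

topology = {
    "commit": None,             # commit (master) server
    "edge_tokyo": "commit",     # edge server pulling from commit
    "replica_ha": "commit",     # HA replica of commit
    "ro_replica": "edge_tokyo", # read-only replica of the edge
}
```

With this topology, `ro_replica` comes first and `commit` last, so no server is ever newer than a server replicating from it mid-upgrade.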
Human-Initiated Failover - HMS will execute the steps to achieve a failover with a single command.
Stretch Goal: Failover addresses mechanics comprehensively, including things that are often outside the scope of Perforce administrators but are truly necessary to achieve failover in some environments, such as DNS updates and Virtual IP configuration changes.
Hardware fault detection and automated failover initiation are explicitly outside the scope of this project. This project's more humble goal is simply to clarify and simplify the mechanics of executing a failover. That said, these are necessary first steps to those loftier goals.
Comprehending the Parts - Knowing every detail will help Perforce admins understand the many moving parts that keep a Helix environment happy. Details like:
`p4 info -s` from each Helix Server instance
All this and much more is needed and should be visible from a source more dynamic and reliably updated than a human-maintained wiki page. The data will be gathered centrally for the human administrator, but kept current automatically.
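For instance, the output of `p4 info -s` is line-oriented `Key: value` text that is straightforward to gather into a structured central store. The parsing sketch and sample output below are illustrative; actual fields vary by server version and configuration:

```python
# Sketch of turning `p4 info -s` output into structured data a central
# HMS store could track. The sample output is abbreviated and illustrative.

def parse_p4_info(text):
    """Parse 'Key: value' lines from p4 info output into a dict."""
    info = {}
    for line in text.splitlines():
        if ": " in line:
            key, _, value = line.partition(": ")
            info[key.strip()] = value.strip()
    return info

sample = """\
Server address: ssl:perforce.example.com:1666
Server version: P4D/LINUX26X86_64/2015.2/1234567 (2015/12/01)
Case Handling: sensitive
"""
```

Collected periodically from every instance, records like this replace the hand-maintained wiki page with data that stays current automatically.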
With Helix Server 2015.2, High Availability (HA) and Disaster Recovery (DR) solutions are closer to being commoditized than ever before. But they are still not quite a commodity. HMS captures and codifies what Perforce Consultants have done for individual customers with custom solutions, automating all the wiring under a big red Failover button.
A set of pre-defined, pre-configured failover options is defined with HMS. At the time of execution, the administrator selects from a short list of options. Based on the type of option selected (Local, HA, DR), failover occurs to a pre-defined target machine for that type of failover.
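A minimal sketch of what such a pre-configured option table might look like, with hypothetical target machine names:

```python
# Hypothetical representation of pre-configured failover options.
# The option types (Local, HA, DR) come from the text; the target
# machines and data structure are illustrative assumptions.

FAILOVER_OPTIONS = {
    "Local": {"target": "p4master-01", "desc": "offline databases, same machine"},
    "HA":    {"target": "p4master-02", "desc": "standby in the same data center"},
    "DR":    {"target": "p4dr-01",     "desc": "standby in a remote data center"},
}

def select_failover(option):
    """Resolve a chosen failover type to its pre-defined target machine."""
    if option not in FAILOVER_OPTIONS:
        raise ValueError(f"Unknown failover option: {option}")
    return FAILOVER_OPTIONS[option]["target"]
```

Because the targets are decided ahead of time, the moment of crisis requires only choosing a type, not designing a procedure.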
Planned Failover - A planned failover is a scheduled event, not a reaction to a problem. In a planned failover, assumptions can safely be made about the state of things. This might occur, for example, to allow master Server A to be powered down for several hours to add RAM, with Server B coming online to avoid downtime of the Helix ecosystem for more than a few minutes. Nothing is broken, so this type of failover can be nearly transparent.
Unscheduled Failover - An unscheduled failover occurs as a decision by a human administrator, in reaction to something breaking. The human administrator must determine the nature of the problem, decide whether failover is needed, and if so, select the best failover option.
Following is the list of potential failover options that can be configured:
Local Failover - Local failover is a failover to an offline copy of the Perforce databases on the same machine. This is useful for scenarios where the database integrity is in question for some reason, but there's no reason to suspect the hardware is damaged. For example, this might be the case after a sudden power loss, or error on the part of a human administrator (like removing live databases by accident -- yes, it happens to the best of us).
HA Failover - HA failover involves failing over to another server in the same data center, optionally sharing storage with the master for archive files. Little or no data loss is acceptable for an HA failover, and downtime should be minimal.
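The core move in the Local Failover option described above is setting aside the suspect live databases and promoting the offline copy. A minimal sketch, assuming SDP-style directory names (`P4ROOT`, `offline_db`), which are used here purely for illustration:

```python
# Illustrative sketch of a local failover's database swap (not HMS code).
# Assumes the server is already stopped and paths follow SDP conventions.

import os
import time

def local_failover(p4root, offline_db):
    """Set aside the suspect live databases and promote the offline copy."""
    suspect = f"{p4root}.suspect.{int(time.time())}"
    os.rename(p4root, suspect)      # preserve the suspect databases for analysis
    os.rename(offline_db, p4root)   # promote the offline copy to live
    return suspect
```

Keeping the suspect databases aside (rather than deleting them) preserves evidence for diagnosing what went wrong, at the cost of temporary disk usage.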
See: HMS Product Road Map.md.
Failover in an enterprise environment may always involve some degree of customization. HMS will capture everything that can be understood based on how the various Perforce software technologies work, and provide clear injection points for handling things likely to be topology-specific, such as redirecting traffic via DNS or Virtual IP changes.
As of September 28, 2016, the HMS Project has been cancelled as a standalone project. But Fear Not! The goodness of HMS is becoming part of the stock SDP. A new `hms` script will appear in the next release of the SDP.
| Rev | Change | User | Description |
|-----|--------|------|-------------|
| #6 | 25290 | tom_tyler | Rev. HMS/Linux/2016.1/20740 (2016/09/28). |
| #5 | 20741 | tom_tyler | Published note about blending into the SDP. |
| #3 | 20079 | tom_tyler | Updated and refactored. |
| #2 | 20041 | tom_tyler | Lots of updates; added preliminary roadmap. |
| #1 | 19373 | tom_tyler | Populate -o //guest/tom_tyler/hms/main/README.md //guest/perforce_software/hms/main/README.md. |
| #1 | 17234 | tom_tyler | Added README file for Helix Management System. |