Perforce Helix Management System (HMS)
===

Introduction
---

As Perforce Helix evolves to meet ever-growing enterprise demands, sophisticated global deployment architectures have become commonplace. There has been a corresponding rise in deployment complexity as more Helix installations take advantage of:

* Proxies
* Brokers
* Replicas
* Replication Filtering
* Edge servers
* Swarm
* Git Fusion
* P4Search
* P4DTG
* P4Web
* Custom automation like CBD (built largely on the broker)

Helix Management System to the Rescue
---

That's a lot of complexity to manage! Fear not! The [Helix Versioning Engine](https://www.perforce.com/helix#versioning-engine) and the [Server Deployment Package (SDP)](https://swarm.workshop.perforce.com/projects/perforce-software-sdp) are well suited for this purpose. The Helix Management System evolves and codifies the manual best practices used by Perforce Consultants and enterprise site admins to help customers manage sophisticated enterprise environments.

HMS Goals
---

Simply put: Routine Helix administration tasks should be consistent and simple. HMS will help with:

* **Knowing What You Have** - Knowing what components, and what versions of those components, exist in your topology. HMS v1.0 will not do any form of automated *discovery* of the topology, but it will provide a well-defined way of *defining* and tracking all components in use, where they are, and various details of how they are configured.

* **Consistent Start/Stop/Status** - Managing the various Helix Server instances in your environment, including system start, stop, and "Down for Maintenance" modes. The mechanical steps to start and stop a Helix Server vary based, for example, on whether or not a broker is in place for that instance, whether or not SSL is enabled, etc. HMS abstracts those details away, reducing the operations to Start/Stop/Status. (A sketch of what such an abstraction might look like appears at the end of this section.)

* **Health Status Ping** - For each Helix topology component, v1.0 will quickly show whether or not it is running. (Later, this may expand to include deeper health monitoring, going beyond mere up/down status.)

* **Upgrading** - Upgrades in a global topology are straightforward and well understood, but there are a lot of moving parts and a lot of commands to type to make them happen. HMS will make this easy by knowing which components have a newer version and/or a newer patch of the current version available. It will be easy to upgrade all Helix topology components, individually or in groups, with the click of a button. In sophisticated topologies involving edge servers and replicas, HMS will have built-in awareness of the order in which components must be upgraded, without relying on a human admin to know those details. When it comes to topology-wide upgrades, enterprises need it all -- control, flexibility (which introduces a degree of complexity), and operational simplicity. They may want to apply a P4D patch to one instance but not others, or upgrade all instances at once. We can present admins with options rather than have them work out custom upgrade procedures, updating executables and tweaking symlinks to get it right.

* **Human-Initiated Failover** - HMS will execute the steps to achieve a failover by calling a single `p4_failover.sh` script. Stretch Goal: Failover addresses the mechanics comprehensively, even including things that are often outside the scope of Perforce administrators but are truly necessary to achieve failover in some environments, such as DNS updates and Virtual IP configuration changes.

    *Hardware fault detection* and *automated failover initiation* are explicitly outside the scope of this project. This project's more humble goal is simply to clarify and simplify the mechanics of executing a failover.

* **Comprehending the Parts** - Knowing every detail will help Perforce admins understand the many moving parts that keep a Helix environment happy. Details like:
    * Seeing OS and hardware details
    * Knowing the crontab files for key users on every machine
    * Checking firewall configurations to ensure key ports are open
    * Knowing what OS user accounts the server products run as
    * Knowing what LDAP servers are used
    * Knowing what ports each server process is listening on
    * Capturing `p4 info -s` from each Helix Server instance

All this and much more is needed, and it should be visible from a source more dynamic and reliably updated than a human-maintained wiki page. The data will be gathered centrally for the human administrator, but kept current automatically.
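To illustrate the Start/Stop/Status abstraction, here is a minimal sketch in shell, in the spirit of the SDP's scripts. The script name, variable names, and paths are illustrative assumptions, not actual HMS interfaces; in HMS, per-instance settings would come from central configuration data.

```bash
#!/bin/bash
# Minimal sketch of a start/stop/status wrapper. All names and paths
# (hms_instance.sh, /p4/<instance>/root, the broker config) are
# illustrative assumptions, not actual HMS interfaces.

INSTANCE="${1:?Usage: hms_instance.sh <instance> start|stop|status}"
ACTION="${2:?Usage: hms_instance.sh <instance> start|stop|status}"

# Per-instance settings; HMS would supply these from its own
# configuration data rather than hard-coding them.
P4ROOT="/p4/${INSTANCE}/root"
P4PORT="ssl:localhost:1666"   # ssl: prefix only if SSL is enabled,
                              # and assumes 'p4 trust' is established.
BROKER_CFG="/p4/${INSTANCE}/broker.cfg"
USE_BROKER=1                  # 1 if a broker fronts this instance.

case "$ACTION" in
   start)
      # Start the server daemon, then the broker (if any).
      p4d -r "$P4ROOT" -p "$P4PORT" -d
      [[ "$USE_BROKER" -eq 1 ]] && p4broker -c "$BROKER_CFG" -d
      ;;
   stop)
      # Stop the broker first so no new commands arrive, then stop
      # the server (requires a valid super-user login).
      [[ "$USE_BROKER" -eq 1 ]] && pkill -f "p4broker -c $BROKER_CFG"
      p4 -p "$P4PORT" admin stop
      ;;
   status)
      # 'p4 info -s' gives shortened output and is a cheap up/down probe.
      if p4 -p "$P4PORT" info -s > /dev/null 2>&1; then
         echo "$INSTANCE: UP"
      else
         echo "$INSTANCE: DOWN"
      fi
      ;;
   *)
      echo "Unknown action: $ACTION" >&2
      exit 1
      ;;
esac
```

The point of the abstraction is that the admin types the same Start/Stop/Status command everywhere; whether a broker or SSL is involved is resolved from configuration, not remembered by the human.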
Failover Options
---

With Helix Server 2015.2, High Availability (HA) and Disaster Recovery (DR) solutions are closer to being commoditized than ever before. But they are still not quite commodity. HMS captures and codifies what Perforce Consultants have done for individual customers with custom solutions, automating all the wiring under a big red Failover button.

A set of pre-defined, pre-configured failover options is defined with HMS. At the time of execution, the administrator selects from a short list of options. Based on the type of option selected (Local, HA, or DR), failover occurs to a pre-defined target machine for that type of failover.

Failover Classification: Planned vs. Unplanned Failover
---

**Planned Failover** - A *planned failover* is a planned, generally scheduled event, not a reaction to a problem. In a planned failover, assumptions can safely be made about the state of things. This might occur, for example, to allow master Server A to be powered down for several hours to add RAM, with Server B coming online to avoid downtime of the Helix ecosystem of more than a few minutes. Nothing is broken, so this type of failover can be nearly transparent.

**Unplanned Failover** - An *unplanned failover* occurs as a decision by a human administrator, in reaction to something breaking. The human administrator must determine the nature of the problem, decide whether failover is needed, and if so, choose the best failover option.

Failover Options
---

The following failover options can be configured (a hypothetical invocation sketch follows this list):

* **Local Failover** - A *local failover* is a failover to an offline copy of the Perforce databases on the same machine. This is useful for scenarios where the database integrity is in question for some reason, but there is no reason to suspect the hardware is damaged. For example, this might be the case after a sudden power loss, or an error on the part of a human administrator (like removing live databases by accident -- yes, it happens to the best of us).

* **HA Failover** - HA failover involves failing over to another server in the same data center, optionally sharing storage with the master for archive files. Little or no data loss is acceptable in an HA failover, and downtime should be minimal.

* **DR Failover** - DR failover involves failing over to another data center. Some data loss is expected, and greater downtime is deemed acceptable (as it is unavoidable). DR failover can be further classified as:
    * **Short-Range** - The DR machine is near the master, perhaps in a building across the street. This defends against a fire in one building and, being close, should have the lowest risk of data loss.
    * **Medium-Range** - The DR machine is in the same metro area as the master, ideally on a different power grid. This provides a mix of minimal data loss and good protection against regional disasters, like large fires or floods.
    * **Long-Range** - The DR machine is very far from the master, across the continent or even across the planet. This is intended to provide business continuity in the face of terrible disasters, such as wars, large-scale earthquakes, and perhaps small asteroid strikes.
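Since the `p4_failover.sh` interface is still being designed, the following sketch is purely illustrative; the argument names and option list are assumptions. It shows the intended shape of the interaction: the administrator picks one of the pre-defined, pre-configured options rather than improvising a procedure under pressure.

```bash
#!/bin/bash
# Hypothetical sketch only: the actual p4_failover.sh interface is not
# yet defined, so the argument and option names below are assumptions.

FAILOVER_TYPE="${1:?Usage: p4_failover.sh <local|ha|dr>}"

# Each option maps to a target defined ahead of time in configuration,
# not chosen at failover time.
case "$FAILOVER_TYPE" in
   local) TARGET="offline database copy on this machine" ;;
   ha)    TARGET="standby server in the same data center" ;;
   dr)    TARGET="standby server in the DR data center" ;;
   *)     echo "Unknown failover type: $FAILOVER_TYPE" >&2; exit 1 ;;
esac

echo "Executing $FAILOVER_TYPE failover to: $TARGET"
# The mechanics for the selected option -- stopping services, promoting
# the target, redirecting clients -- would be executed here.
```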
HMS Product Road Map
===

See: [HMS Product Road Map](https://swarm.workshop.perforce.com/projects/perforce_software-hms/files/main/RoadMap.md).

HMS v1.0 System Components
===

See: [SystemComponents.md](https://swarm.workshop.perforce.com/projects/perforce_software-hms/files/main/SystemComponents.md).

Custom Extension
===

Failover in an enterprise environment may always involve some degree of customization. HMS will capture everything that can be understood based on how the various Perforce software technologies work, and will provide clear injection points for handling things likely to be topology-specific, such as redirecting traffic via DNS or Virtual IP changes. (A sketch of one possible injection-point mechanism appears at the end of this document.)

Project Status
===

This project started active development in mid-2016. As of August 4, 2016, it is in the design and early development phase.
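To make the injection-point idea from the Custom Extension section concrete, here is one possible mechanism, sketched under the assumption that site-specific hooks live in a well-known directory. The directory and hook names are illustrative, not actual HMS conventions.

```bash
#!/bin/bash
# Illustrative sketch of site-specific injection points. The hook
# directory and hook names are assumptions, not HMS conventions.

SITE_HOOKS="/p4/common/site/hooks"

run_hook () {
   # Run a site-supplied hook script if one exists; otherwise no-op,
   # so sites without customizations need do nothing.
   local hook="$SITE_HOOKS/$1"
   shift
   if [[ -x "$hook" ]]; then
      "$hook" "$@"
   fi
}

run_hook pre_failover    # e.g., update DNS or reassign a Virtual IP
# ... core, product-aware failover mechanics happen here ...
run_hook post_failover   # e.g., verify traffic reaches the new master
```

With a pattern like this, the product-aware mechanics stay in HMS while topology-specific steps live in small scripts owned by the site.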