Perforce Helix Management System (HMS)
===

Introduction
---

As Helix evolves to meet growing enterprise demands, sophisticated global deployment architectures have become commonplace. There has been a corresponding rise in deployment complexity as more Helix installations take advantage of:

* Proxies
* Brokers
* Replicas and replication filtering
* Edge servers
* Swarm
* Git Fusion
* P4Web
* P4DTG
* Custom automation like CBD (built largely on the broker)

Helix Management System to the Rescue
---

That's a lot of complexity to manage! Fear not! The Helix Server and the Server Deployment Package (SDP) are already used for this purpose in custom solutions. The Helix Management System evolves and codifies manual practices currently used by Perforce Consultants to help customers manage sophisticated enterprise environments.

HMS Goals
---

Simply put: Routine Helix administration tasks should be consistent and simple. HMS will help with:

* **Knowing What You Have** - Knowing what components, and what versions of those components, exist in your topology. HMS v1.0 is not contemplating any form of automated _discovery_ of the topology, just a well-defined way of defining and tracking all components in use.

* **Consistent Start/Stop/Status** - Managing the various Helix Server instances in your environment, including system start, stop, and "Down for Maintenance" modes. The mechanical steps to start and stop a Helix Server can vary based on (for example) whether or not a broker is in place for that instance, whether or not SSL is enabled, etc. HMS will abstract those details away, getting it down to Start/Stop/Status.

* **Health Status Ping** - For each Helix topology component, we'll want to quickly see whether or not it is running; that is the scope for v1.0. (Later, this may expand to include aspects of health monitoring, going beyond mere up/down status.) A minimal sketch of such a check appears at the end of this section.

* **Upgrading** - Upgrades in a global topology are straightforward and well understood, but there are a lot of moving parts and a lot of commands to type to make them happen. HMS will endeavour to make this easy by knowing which components have a newer version and/or a newer patch of the same version available. Later, it will make it easy to upgrade all Helix topology components, individually or in groups, with the click of a button. In sophisticated topologies involving edge servers and replicas, there will be a built-in awareness of the order in which components must be upgraded, without relying on a human admin to know those details. When it comes to topology-wide upgrades, enterprises need it all -- control, flexibility, and simplicity. They may want to apply a P4D patch to one instance but not others, or upgrade all instances at once. We can present admins with options rather than have them figure out custom upgrade procedures for updating executables and tweaking symlinks to get it right.

* **Human-Initiated Failover** - HMS will be able to execute the steps to achieve a failover with a click of a button (or maybe a few clicks). Stretch Goal: Failover should address failover mechanics comprehensively, even including things that are often outside the scope of Perforce administrators but which are truly necessary to achieve failover in some environments -- things like DNS or Virtual IP configuration changes. Note that _fault detection_ and _automated failover initiation_ are outside the scope of this project. Those advanced capabilities are the domain of Helix Cluster Manager. This project's more humble goal is simply to clarify and simplify the mechanics of executing a failover.

* **Comprehending the Parts** - Knowing every detail will help Perforce admins understand the many moving parts that keep a Helix environment happy. Details like:
  * Seeing OS and hardware details
  * Knowing crontab files for key users on every machine
  * Checking firewall configurations to ensure key ports are open
  * Knowing what OS user accounts the server products run as
  * Knowing what LDAP servers are used
  * Knowing what ports each app is running on
  * Capturing p4 info from each Helix Server instance

All this and much more is needed, and it should be visible from a source more dynamic and reliably updated than a human-maintained wiki page. The data needs to be gathered centrally for the human administrator, and kept current.
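As an illustration of the up/down check described under **Health Status Ping**, here is a minimal sketch of how such a ping might be scripted against a list of known components. The component names, ports, and file location are hypothetical placeholders; a real implementation would draw this list from the HMS topology data rather than a hand-maintained file.

```bash
#!/bin/bash
# Minimal health-ping sketch: check whether each known Helix component
# answers "p4 info" on its configured port.
#
# Expected format of the (hypothetical) components file, one per line:
#   <component-name> <P4PORT>
# e.g.:
#   fgs-master   ssl:p4master:1999
#   fgs-broker   p4master:1666
#   fgs-edge-lon p4edge-lon:1666
#
# Notes:
#   * For ssl: ports, 'p4 trust' must already be established on this machine.
#   * For proxies and brokers, a successful 'p4 info' also implies their
#     target server is reachable, since the command is forwarded.

COMPONENTS_FILE=/p4/common/etc/hms_components.txt   # hypothetical location

while read -r name p4port; do
    # Skip blank lines and comments.
    [[ -z "$name" || "$name" == \#* ]] && continue

    if p4 -p "$p4port" info > /dev/null 2>&1; then
        echo "UP   $name ($p4port)"
    else
        echo "DOWN $name ($p4port)"
    fi
done < "$COMPONENTS_FILE"
```

A production version would need to cope with slow connection timeouts for unreachable hosts and report results back to the central HMS server, but the basic up/down determination is no more than this.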
HA/DR and Failover
---

With Helix Server 2015.2, High Availability (HA) and Disaster Recovery (DR) solutions are closer to being commoditized than ever before. But it's still not quite a commodity. HMS will capture and codify what Perforce Consultants have done for individual customers in the past with custom solutions, automating all the wiring under a big red Failover button.

The HMS Failover system will provide the following options for failing over:

* **Scheduled vs. Unscheduled** - _Scheduled_ failover is easier to handle, because you can safely make assumptions about the state of things. _Unscheduled_ (or _hard_) failover occurs when something breaks, and involves different mechanics.

* **Local** - Failover to an offline copy of the databases on the same machine. Useful for scenarios where the databases are not trusted, but there's no reason to suspect the hardware is damaged (e.g. sudden power loss).

* **HA** - Failover to another server in the same data center (optionally sharing storage with the master for archive files). No data loss is acceptable, and downtime should be minimal.

* **DR** - Failover to another data center. Some data loss is expected, and higher downtime is deemed acceptable.

System Components
---

HMS v1.0 will consist of:

* A dedicated Helix Server that models the global topology, knows every machine that contains Helix topology components of any kind, and knows about all the Helix Server _instances_ (or _data sets_). The server will maintain an internal map of the global topology that can be updated from the various Helix instances, each of which is aware of its own topology (maintained with **p4 server** specs; see the server spec sketch after this list).

* A broker, which adds new commands (e.g. **p4 hms start _fgs_**, **p4 hms failover _fgs_ -ha**, **p4 hms main start _fgs_**, and more). A sketch of how such commands might be wired in appears after this list.

* The SDP, on which HMS builds. The SDP includes scripts to manage and synchronize many small details -- synchronizing the initialization scripts, ensuring no unversioned changes to backup scripts go unnoticed, tracking changes to crontab files for the perforce user, and much more. This provides greater visibility into all the tiny details that matter and need to be considered when doing an upgrade -- things like disabling crontabs for backups during maintenance windows, and turning them back on later.

* SSH keys, which need to work across the machines involved.
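To make the topology modeling above a little more concrete, here is a sketch of the kind of **p4 server** spec an individual instance might already maintain for one of its components. The ServerID, address, and description are illustrative placeholders, not part of the HMS design.

```
# Hypothetical 'p4 server' spec describing an edge server, as one example
# of the per-instance topology data HMS could aggregate. All values are
# illustrative placeholders.

ServerID:    edge-lon
Type:        server
Services:    edge-server
Address:     ssl:p4edge-lon.example.com:1666
Description:
	Edge server for the London office, replicating from the commit
	server in the primary data center.
```

Similarly, the new **p4 hms** commands could plausibly be added using the Helix Broker's standard command-filter mechanism, much as custom automation like CBD builds on the broker today. The following configuration fragment is only a sketch: the host names, paths, and the hms_dispatch.sh script are hypothetical, and the real command set and dispatch logic are still to be designed.

```
# Hypothetical p4broker configuration fragment for the HMS broker.

target      = p4hms.example.com:1666;    # the dedicated HMS Helix Server
listen      = 1667;                      # port that receives 'p4 hms ...' commands
directory   = /p4/hms/root;
logfile     = /p4/hms/logs/broker.log;
admin-name  = "HMS Admin";
admin-email = "hms-admin@example.com";
admin-phone = "555-0100";

# Intercept the 'hms' pseudo-command and hand it to a dispatch script,
# which would implement actions such as start, stop, status, and failover.
command: ^hms$
{
    action  = filter;
    execute = /p4/common/bin/hms_dispatch.sh;
}
```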
Custom Extension
---

Failover in an enterprise environment may always involve some degree of customization. HMS will capture everything that can be understood based on how the various Perforce software technologies work, and provide clear injection points for handling things likely to be topology-specific, such as redirecting traffic via DNS or Virtual IP changes.

Project Status
---

This project is in the design phase.

Road Map
---

TBD. Certainly not all of the functionality envisioned above will be in the first release.

Schedule
---

TBD. A v1.0 should be ready in Q2 2016, though it's not yet clear exactly what it will include.