Some customers have asked about our plans to eliminate the need for human involvement in the failover process. The short version is that automating the _execution_ of a specific failover plan is straightforward, but automating the _initiation_ of failover, i.e. making the decision to do so, is more complex.
At this point in history, we do not plan to remove humans from High Availability (HA) and/or Disaster Recovery (DR) failover operations at the application layer. Instead, we're investing our automation efforts in active monitoring of the things most likely to cause outages, and in mitigating them before they cause an outage. We also provide extensive guidance on preparation for failover, such as having HA and/or DR systems with real-time replication in place, as well as duplication of critical databases on the master server machine, to provide a plethora of options for recovering from the myriad things that might go bump in the night. Those practices are already standard and are illustrated in our Battle School Workshop Lab Environment.
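To give a flavor of what that preventive monitoring looks like, here is a minimal, purely illustrative sketch of a check that alerts a human before a filling disk becomes an outage. The volume paths and threshold are hypothetical and would be site-specific:

```python
# Illustrative only: a preventive check that warns a human well before a full
# disk causes an outage. It alerts; it never takes corrective action itself.
import shutil

def check_free_space(path, min_free_fraction=0.20):
    """Return a warning message if free space on 'path' drops below the threshold."""
    usage = shutil.disk_usage(path)
    free_fraction = usage.free / usage.total
    if free_fraction < min_free_fraction:
        return (f"WARNING: only {free_fraction:.0%} free on {path}; "
                "investigate before this becomes an outage.")
    return None

# Hypothetical volumes a P4 Server might depend on; adjust per site.
for volume in ("/p4/1/root", "/p4/1/logs", "/hxdepots"):
    message = check_free_space(volume)
    if message:
        print(message)  # In practice this would page or notify an admin.
```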
A proper solution intended to remove humans from the "pushing the FAILOVER button" role, however, needs at least Expert Systems (an early form of Artificial Intelligence) to determine if and when to break the glass and push the big red FAILOVER button. That is not a current priority, but read on.
We have had customers who thought they were doing something good by having automated systems initiate failover without human intervention, to remove the delays associated with getting a trained human to a keyboard to execute a failover. While such a thing is feasible, it is not easy. Solutions built on the assumption that it is easy carry a higher risk of a robot with an itchy trigger finger hitting the big red FAILOVER button, causing an unnecessary outage and possibly even data loss.
The challenge is building in enough sophistication to match the inherent complexity of determining whether a failover should be executed, and of choosing the best way to fail over given the options available and an understanding of the problem at hand. Proper reaction to faults of various kinds requires understanding the fault and its implications, and figuring out the best course of action. For example, if the fault is due to a network issue, a failover will lead to no good outcome, and could make things worse by causing a "split brain" scenario.
We have done a fair amount of work on systems to make certain types of failover fully automated. Doing it right means developing Expert Systems based on comprehensive Fault Tree Diagrams, or perhaps using more advanced AI solutions. These start with an initial stimulus, an indication that something may be wrong, such as a bad ping response or a heartbeat failure. The reported fault is then investigated against each of the possible root causes that could produce the observed symptom. Only once the root cause is understood well enough can a decision be made to wait it out or to take corrective action. Executing an HA/DR failover is only one of many potential corrective actions.
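To make the shape of such a system concrete, here is a deliberately simplified sketch of a fault-tree-driven decision for one symptom. This is not our implementation; the probe functions are hypothetical placeholders for real diagnostics:

```python
# Simplified sketch of a fault-tree-driven decision, not a production design.
# Each probe is a hypothetical placeholder for a real diagnostic check.
from dataclasses import dataclass
from typing import Callable

@dataclass
class RootCause:
    name: str
    probe: Callable[[], bool]  # Returns True if evidence supports this cause.
    action: str                # Recommended corrective action; failover is only one.

def diagnose(symptom, causes):
    """Walk possible root causes; recommend action only when a cause is confirmed."""
    for cause in causes:
        if cause.probe():
            return f"{symptom}: likely cause '{cause.name}' -> recommend: {cause.action}"
    return f"{symptom}: root cause not confirmed -> escalate to a human; do NOT fail over blindly"

# Placeholder probes; real ones would query monitoring, cloud status, and the server itself.
def network_maintenance_in_progress():
    return False  # e.g. check a provider status feed or ask the network team

def p4d_process_down_but_host_healthy():
    return False  # e.g. check the service and overall host health

def master_host_down_but_replicas_healthy():
    return True   # e.g. probe the HA/DR replicas directly

causes_for_bad_ping = [
    RootCause("known network issue", network_maintenance_in_progress,
              "wait it out and monitor; failover risks split brain"),
    RootCause("p4d process crashed", p4d_process_down_but_host_healthy,
              "restart the service on the master"),
    RootCause("master host down, replicas healthy", master_host_down_but_replicas_healthy,
              "consider an HA failover from the prepared options"),
]

print(diagnose("bad ping response to master", causes_for_bad_ping))
```

Note that "wait it out" and "restart the service" deliberately come before failover in the list; failover is the last resort, not the first reflex.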
Developing Expert Systems to imitate what a trained human admin would do, following those Fault Tree Diagrams to determine the root cause and then executing the appropriate corrective action based on the options available (such as having HA and/or DR servers available), is VERY FUN STUFF. But it is also a lot of work, and something to be started with eyes and mind wide open to the scope of what is involved. In addition to the substantial initial work to set it up, it requires continuous ongoing effort to maintain, as any change in the topology of the application, or even a software upgrade, requires revisiting the entire system.
We have no current plans to pursue fully automated _initiation_ of failover. We routinely work with customers to automate _execution_ of failover plans. To that end we have developed the P4 Management System (P4MS). Among other things, this system distills the complexity of failover options and methods to a short list of failover options that a routine data center technician, not necessarily a P4 Admin or expert, could execute with reasonable hope of a good outcome (an illustrative sketch of this idea follows the list below). The P4MS is a platform for delivery of custom consulting services, not a fully supported product. Each deployed solution has common elements but entails custom and site-specific elements based on various factors. Some sample custom elements (just a few of many) are:

* how user traffic is routed to servers
* what the P4 Server topology is, as captured in a [P4Topology file](https://workshop.perforce.com/projects/p4ms/files/dev/p4ms/dev/p4/common/site/config/P4Topology.cfg.sample)
* what failover options are deemed online and ready (as opposed to ones for which preparations are incomplete)
* local organizational boundaries, e.g. whether P4 Admins have access and authority to initiate DNS changes
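As a purely illustrative sketch of the "short list of failover options" idea (this is not the actual P4MS interface; the option names and fields are invented for this example):

```python
# Hypothetical illustration of presenting only prepared, vetted failover options
# to a data center technician; names and fields do not reflect the real P4MS.
from dataclasses import dataclass

@dataclass
class FailoverOption:
    key: str
    description: str
    ready: bool  # Only options whose preparations are complete are offered.

OPTIONS = [
    FailoverOption("ha-local",  "Fail over to the local HA standby (same site)", True),
    FailoverOption("dr-remote", "Fail over to the remote DR replica (other site)", True),
    FailoverOption("dr-cold",   "Rebuild from offsite backups (preparations incomplete)", False),
]

def show_menu():
    print("Available failover options (run only with a confirmed root cause):")
    for opt in OPTIONS:
        if opt.ready:
            print(f"  [{opt.key}] {opt.description}")

show_menu()
```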
We love automation of things! However, Rule #1 of Automation is: automate only what you thoroughly understand. (Exception: coding to learn for R&D, but that does not apply on production systems.) Understanding _how_ to execute a failover plan is well understood, and lends itself well to automation. But understanding _if_ a failover _should_ be executed is not so well understood, and is thus not a good fit for automation at this point in history. At least, not until Artificial Intelligence technologies such as Machine Learning can be brought to bear.
For example, say there is an indication of a bad ping response to the P4 Server. What's the right thing to do? Well, a human would pick up the phone or check the AWS status page to see if there's a known network issue, and find the ETA. If the problem is a known network issue expected to be fixed by the networking crew in 15 minutes, what is the correct response for the application administrator? Most likely, DO ABSOLUTELY NOTHING! Just monitor the situation and wait the 15 minutes. What if the reason for the bad ping response is a DDoS attack, whether by a bad actor or by a good guy whose well-intended automation went rogue and attacked the server? Executing a failover would simply give that errant automation a new target to destroy! Failover is a bad choice in that situation.
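The automation posture we prefer around this scenario is to gather the context a human would gather and recommend a course of action, but never pull the trigger automatically. A hedged sketch, with the inputs assumed to come from monitoring and provider status feeds:

```python
# Hypothetical triage helper: collects the context an on-call human would gather
# and recommends a course of action, but never initiates failover itself.
def triage_bad_ping(known_network_issue, eta_minutes, ddos_suspected):
    if known_network_issue and eta_minutes is not None and eta_minutes <= 30:
        return f"Known network issue, ETA {eta_minutes} min: monitor and wait."
    if ddos_suspected:
        return "Possible DDoS: failover just gives the attack a new target; engage security."
    return "Cause unclear: page the P4 Admin on call with the collected diagnostics."

# Example inputs; in practice these would come from monitoring and status feeds.
print(triage_bad_ping(known_network_issue=True, eta_minutes=15, ddos_suspected=False))
```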
Eliminating human intervention from the failover process is a worthy but lofty goal. For now, we prefer having trained humans available to evaluate each failure scenario and react as quickly as the situation allows. The most immediate task is fully understanding the root cause of the situation. In our internal Battle School Workshop training, we prepare for such situations.
When an actual outage occurs, deciding what reaction to take is NOT well understood, except in the simplest failure scenarios (which are historically quite rare in most environments). Modern hardware infrastructure makes the most obvious failures (e.g. a sudden disk failure) so transparent to the application layer that we are often entirely unaware of them, other than potential performance glitches.
If you would like us to pursue fully automated failover, that is something we are open to discussing and excited to collaborate on. We would likely explore AI solutions related to handling systems failures. An ironic challenge is that actual systems failures on modern hardware are so rare that good data for them is hard to come by, so getting the data needed to train an AI to handle smart failover would require simulation. While using simulated data is common in the AI world, it does not give the same degree of confidence that real situations would.