= Server Deployment Package (SDP) for Perforce Helix: SDP Failover Guide (for Unix and Windows)
Perforce Professional Services <consulting@perforce.com>
:revnumber: v2019.3
:revdate: 2020-08-05
:doctype: book
:icons: font
:toc:
:xrefstyle: short

== Preface

The Server Deployment Package (SDP) is the implementation of Perforce's recommendations for operating and managing a production Perforce Helix Core Version Control System. It is intended to provide the Helix Core administration team with tools to help with:

* High Availability (HA)
* Disaster Recovery (DR)

This guide provides instructions for failover in an SDP environment using built-in Helix Core features. For more details see:

* http://www.perforce.com/manuals/p4sag/Content/P4SAG/failover.html#Failover[Sysadmin Guide - Failover]

*Please Give Us Feedback*

Perforce welcomes feedback from our users. Please send any suggestions for improving this document or the SDP to consulting@perforce.com.

:sectnums:
== Overview

We need to consider planned vs unplanned failover. Planned failover may be due to upgrading the core Operating System or some other dependency in your infrastructure, or a similar activity.

Unplanned failover covers the risks you are seeking to mitigate with failover:

* loss of a machine, or some machine related hardware failure
* loss of a VM cluster
* failure of storage
* loss of a data center or machine room
* etc...

See also https://www.perforce.com/manuals/cmdref/Content/CmdRef/p4_failover.html[p4 failover in Command Reference Guide]

=== Planning

HA failover should not require a P4PORT change for end users. Depending on your topology, you can avoid changing P4PORT by having users set P4PORT to an alias, e.g. `perforce.p4demo.com` (or just `perforce`), or `perforce_syd.p4demo.com` (or just `perforce_syd`). During failover, the alias would be retargeted by doing something like:

* Changing a DNS alias so `perforce.p4demo.com` points to the backup machine.
* Changing the Perforce broker configuration to target the backup machine.
* Changing the http://en.wikipedia.org/wiki/Virtual_IP_address[Virtual IP] configuration.
* Changing the http://en.wikipedia.org/wiki/Anycast[anycast] routing configuration.

== Planned Failover

In this instance you can run `p4 failover` with the active participation of its upstream server.

The examples below make the following assumptions:

* serverid `master_1` is the current commit server, running on machine `bos-helix-01`
* serverid `p4d_ha_bos` is the HA server
* DNS alias `perforce` is set to `bos-helix-01`

=== Prerequisites

You need to ensure:

. you are running p4d 2018.2 or later for your commit and all replica instances, preferably 2020.1

    source /p4/common/bin/p4_vars 1
    p4 info | grep version

. your failover target server instance is of type `standby` or `forwarding-standby`
+
On HA machine:

    p4 info
    :
    ServerID: p4d_ha_bos
    Server services: standby
    Replica of: perforce:1999
    :

. its server spec has `Options:` set to `mandatory`

    p4 server -o p4d_ha_bos | grep Options
    Options: mandatory

. you have a valid `license` installed in `/p4/1/root` (<instance> root)
+
On HA machine:

    cat /p4/1/root/license

. Monitoring is enabled, so the following works:

    p4 monitor show -al

. DNS changes are possible so that downstream replicas can seamlessly connect to the HA server

. Current `pull` status is valid

    p4 pull -lj

=== Failing over

The actions are:

. Run `p4 failover` in reporting mode on the HA machine:

    p4 failover
+
Successful output looks like:

    Checking if failover might be possible ...
    Checking for archive file content not transferred ...
    Verifying content of recently update archive files ...
    After addressing any reported issues that might prevent failover, use --yes or -y to execute the failover.

. Perform failover:

    p4 failover --yes
+
Output should be something like:

    Starting failover process ...
    Refusing new commands on server from which failover is occurring ...
    Giving commands already running time to complete ...
    Stalling commands on server from which failover is occurring ...
    Waiting for 'journalcopy' to complete its work ...
    Waiting for 'pull -L' to complete its work ...
    Waiting for 'pull -u' to complete its work ...
    Checking for archive file content not transferred ...
    Verifying content of recently updated archive files ...
    Stopping server from which failover is occurring ...
    Moving latest journalcopy'd journal into place as the active journal ...
    Updating configuration of the failed-over server ...
    Restarting this server ...
+
During this time, if you run commands against the master, you may see:

    Server currently in failover mode, try again after failover has completed

. Change the DNS entries so downstream replicas (and users) will connect to the new master (that was previously the HA server)

. Validate that your downstream replicas are communicating with your new master
+
On each replica machine:

    p4 pull -lj
+
Or against the new master:

    p4 servers -J
+
Check output of `p4 info`:

    :
    Server address: bos-helix-02
    :
    ServerID: master_1
    Server services: commit-server
    :

. Make sure the old server spec (`p4d_ha_bos`) has had its `Options:` field correctly set to `nomandatory` (otherwise all replication would stop!)

== Unplanned Failover

In this case there is no active participation of the upstream server, so there is an increased risk of lost data.

We assume we are still failing over to the HA machine, so:

* Failover target is `standby` or `forwarding-standby`
* Server spec still has `Options:` set to `mandatory`
* Original master is not running

The output of `p4 failover` on the HA machine might be:

    Checking if failover might be possible ...
    Server ID must be specified in the '-s' or --serverid' argument for a failover without the participation of the server from which failover is occurring.
    Checking for archive file content not transferred ...
    Verifying content of recently update archive files ...
    After addressing any reported issues that might prevent failover, use --yes or -y to execute the failover.

. Execute `p4 failover` with the extra parameter to specify the server we are failing over from:

    p4 failover --serverid master_1 --yes

Expected output is somewhat shorter than for planned failover:

    Starting failover process ...
    Waiting for 'pull -L' to complete its work ...
    Checking for archive file content not transferred ...
    Verifying content of recently updated archive files ...
    Moving latest journalcopy'd journal into place as the active journal ...
    Updating configuration of the failed-over server ...
    Restarting this server ...

== Old style failover

This does not use the `p4 failover` command (so it is valid for pre-2018.2 p4d versions).

See: https://community.perforce.com/s/article/2495[KB - Failing over to a Replica]
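
Separately from the old-style procedure, two of the prerequisite checks listed for planned failover (server type and the `Options:` field) lend themselves to scripting. The following is a minimal sketch, not part of the SDP: the function names `is_standby_type` and `has_mandatory_option` are hypothetical, and the patterns assume the `p4 info` and `p4 server -o` output looks like the samples shown in this guide.

```shell
#!/bin/bash
# Hypothetical helper functions (not shipped with the SDP) that parse the
# output of the prerequisite commands shown in this guide.

# Succeeds if `p4 info` output on stdin reports a standby-type server.
is_standby_type() {
    grep -Eq '^Server services:[[:space:]]+(standby|forwarding-standby)[[:space:]]*$'
}

# Succeeds if `p4 server -o <serverid>` output on stdin has Options: mandatory.
# A value of "nomandatory" does not match.
has_mandatory_option() {
    grep -Eq '^Options:[[:space:]]+mandatory[[:space:]]*$'
}

# Example usage on the HA machine (assumes instance 1 and serverid p4d_ha_bos):
#   source /p4/common/bin/p4_vars 1
#   p4 info | is_standby_type || echo "Target is not a standby server"
#   p4 server -o p4d_ha_bos | has_mandatory_option || echo "Options is not mandatory"
```

Because the functions read from standard input, they can be tried against saved command output before being pointed at a live server. They do not replace running `p4 failover` in reporting mode, which remains the authoritative check.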
# | Change | User | Description | Committed
---|---|---|---|---
#17 | 30383 | C. Thomas Tyler | Updated rev{number,date} fields in adoc files for release. |
#16 | 30000 | C. Thomas Tyler | Refined Release Notes and top-level README.md file in preparation for coming 2023.2 release. Adjusted Makefile in doc directory to also generate top-level README.html from top-level README.md file so that the HTML file is reliably updated in the SDP release process. Updated :revnumber: and :revdate: docs in AsciiDoc files to indicate that the are still current. Avoiding regen of ReleaseNotes.pdf binary file since that will need at least one more update before shipping SDP 2023.2. |
#15 | 29728 | Robert Cowham | Add note re establishing p4 trust and logging in commit service user when background submits enabled |
#14 | 29608 | C. Thomas Tyler | Doc updates as part of release cycle. |
#13 | 29414 | Robert Cowham | Added a failback section |
#12 | 29236 | C. Thomas Tyler | Updated all doc rev numbers for supported and unsupported docs to 2022.2 as prep for SDP 2022.2 release. |
#11 | 28837 | C. Thomas Tyler | Updated docs for r22.1 release. |
#10 | 28374 | C. Thomas Tyler | Updated :revnumber: and :revdate: fields for *.adoc files for release. |
#9 | 28099 | Robert Cowham | Updated Failover Guide to add a couple of post failover checks. |
#8 | 28083 | C. Thomas Tyler | Updated :revnumber: to 2021.1 for remaining docs for upcoming SDP r21.1 release. |
#7 | 27722 | C. Thomas Tyler | Refinements to @27712: * Resolved one out-of-date file (verify_sdp.sh). * Added missing adoc file for which HTML file had a change (WorkflowEnforcementTriggers.adoc). * Updated revdate/revnumber in *.adoc files. * Additional content updates in Server/Unix/p4/common/etc/cron.d/ReadMe.md. * Bumped version numbers on scripts with Version= def'n. * Generated HTML, PDF, and doc/gen files: - Most HTML and all PDF are generated using Makefiles that call an AsciiDoc utility. - HTML for Perl scripts is generated with pod2html. - doc/gen/*.man.txt files are generated with .../tools/gen_script_man_pages.sh. #review-27712 |
#6 | 27709 | Robert Cowham | Note check for serverlocks. Fix typo in path in failover. |
#5 | 26851 | Robert Cowham | Fix typo in tmpfs /etc/fstab entry which stopped it working in the doc. Mention in pre-requisites for failover and failover guide the need to review OS Config for your failover server. Document Ubuntu 2020.04 LTS and CentOS/RHEL 8 support. Note performance has been observed to be better with CentOS. Document pull.sh and submit.sh in main SDP guide (remove from Unsupported doc). Update comments in triggers to reflect that they are reference implementations, not just examples. No code change. |
#4 | 26747 | Robert Cowham | Update with some checklists for failover to ensure valid. Update to v2020.1 Add Usage sections where missing to Unix guide Refactor the content in Unix guide to avoid repetition and make things read more sensibly. |
#3 | 26727 | Robert Cowham | Add section on server host naming conventions Clarify HA and DR, and update links across docs Fix doc structure for Appendix numbering |
#2 | 26660 | Robert Cowham | New common Failover Guide - removed old one. This is based on 'p4 failover' so requires 2018.2 or later. |
#1 | 26654 | Robert Cowham | First draft of new Failover Guide using "p4 failover" Linked from SDP Unix Guide |