# Disaster Recovery Failover Plan

## Unix Standby Replica Server

---

# Overview

The DR server is replicated using Perforce's built-in server-to-server replication via the `p4 journalcopy` command, plus an rsync script that replicates checkpoints. This document describes the steps to fail over from the main server to the DR server.

> **Note:** This is a generic document that needs to be customized for the specific setup in your environment.

**For an unplanned failover, start at [Section 2.4: Stop the Replica](#24-stop-the-replica).**

---

# Failover Procedure

## 2.1 Check Replica Status

Before starting the failover, verify that replication is up to date.

On the replica machine, run `p4 pull -lj` to check the status of the replication. The command should return two matching journal sequence numbers.
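This comparison can be scripted. The following is a minimal sketch, assuming `p4 pull -lj` prints one `Journal N, Sequence M` pair for the replica and one for the master. The function name `journal_states_match` is hypothetical, and the assumed output format should be checked against your server version.

```bash
# Hypothetical helper: reads `p4 pull -lj` output on stdin and succeeds only
# if it contains exactly two "Journal N, Sequence M" pairs that are identical.
# The output format is an assumption; verify it on your own server.
journal_states_match() {
    # Reduce each matching line to "N M", then check count and equality.
    sed -n 's/.*Journal \([0-9][0-9]*\),[[:space:]]*Sequence \([0-9][0-9]*\).*/\1 \2/p' |
        awk 'NR == 1 { first = $0 }
             $0 != first { differ = 1 }
             END { exit !(NR == 2 && !differ) }'
}

# Usage on the replica (commented out so the sketch has no side effects):
# p4 pull -lj | journal_states_match && echo "replica is up to date"
```

Comparing the journal number together with the sequence number guards against the unlikely case where the sequence matches but the replica is still on an older journal.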

## 2.2 Stop Access to Master Server (Done on Master Server)

Change the protect table to block access from everyone except the admin user and the service user. Make the last three lines of the protect table look like this:

```
list user * * -//...
super user service * //...
super user perforce * //...
```
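If you prefer to script the protections change rather than edit the table interactively, the following is a minimal sketch. It assumes that `Protections:` is the last field in the output of `p4 protect -o` and that entries are tab-indented; the function name `append_lockdown_lines` and the backup path are hypothetical.

```bash
# Hypothetical filter: reads a protections spec on stdin, drops blank lines
# (so the new entries stay attached to the Protections field), and appends
# the three lockdown lines from this section, each indented with a tab.
append_lockdown_lines() {
    # grep exits non-zero when it prints nothing; ignore that case.
    grep -v '^[[:space:]]*$' || true
    printf '\t%s\n' \
        'list user * * -//...' \
        'super user service * //...' \
        'super user perforce * //...'
}

# Usage as the admin user on the master (commented out on purpose):
# p4 protect -o > /tmp/protect.before_failover
# append_lockdown_lines < /tmp/protect.before_failover | p4 protect -i
```

Keeping the saved copy of the original spec makes it straightforward to restore the protections table after a planned failover.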

Next, run:

```bash
/p4/1/bin/p4d_1 -r /p4/1/root -J /p4/1/logs/journal -jj /p4/1/checkpoints/p4_1
```

> **Note:** This command is run on the master. It causes all of the replicas to rotate their journals, so that a replica that has to be reset after the failover has only a very small journal to catch up from on the new master.

## 2.3 Check Replica Status

On the replica machine, run `p4 pull -lj` to check the status of the replication. Wait for the replica to fully catch up before shutting down the master and replica instances.
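Waiting for the replica to catch up can be automated with a simple polling loop. The sketch below is generic: `wait_until` is a hypothetical helper that retries whatever check command you give it, and the check itself (for example, a script that wraps `p4 pull -lj`) is left to you.

```bash
# Hypothetical helper: run a check command up to MAX_TRIES times, sleeping
# DELAY seconds between attempts. Returns 0 as soon as the check succeeds,
# or 1 if it never does.
# Arguments: <max tries> <delay in seconds> <check command> [args...]
wait_until() {
    max_tries=$1
    delay=$2
    shift 2
    try=1
    while [ "$try" -le "$max_tries" ]; do
        if "$@"; then
            return 0
        fi
        try=$((try + 1))
        # Skip the sleep after the final failed attempt.
        if [ "$try" -le "$max_tries" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Usage; replica_caught_up stands for your own check command:
# wait_until 60 5 replica_caught_up || echo "replica still behind after 5 minutes"
```

Bounding the number of attempts keeps the failover from hanging indefinitely if replication has stalled.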

Once you have verified the replication is up to date, stop the master:

```bash
/p4/1/bin/p4d_1_init stop
```

## 2.4 Stop the Replica

Stop the replica by running:

```bash
/p4/1/bin/p4d_1_init stop
```

Next, find the highest-numbered `p4_1.jnl.XXX` file in `/p4/1/journals.rep` and move it to `/p4/1/logs/journal`, replacing the empty journal that is there.
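Picking the highest-numbered journal by eye is error-prone because a plain directory listing sorts `p4_1.jnl.9` after `p4_1.jnl.10`. The following sketch selects the file by numeric suffix instead; the function name `latest_rotated_journal` is hypothetical, and it assumes the rotated journals are uncompressed and named `p4_1.jnl.N`.

```bash
# Hypothetical helper: print the path of the highest-numbered p4_1.jnl.N
# file in the given directory, comparing N numerically. Returns 1 if no
# such file exists.
latest_rotated_journal() {
    dir=$1
    highest=$(
        for f in "$dir"/p4_1.jnl.*; do
            [ -e "$f" ] || continue
            # Keep only the text after the last dot (the journal number).
            printf '%s\n' "${f##*.}"
        done | grep -E '^[0-9]+$' | sort -n | tail -n 1
    )
    [ -n "$highest" ] || return 1
    printf '%s/p4_1.jnl.%s\n' "$dir" "$highest"
}

# Usage on the stopped replica (commented out on purpose):
# newest=$(latest_rotated_journal /p4/1/journals.rep) &&
#     mv "$newest" /p4/1/logs/journal
```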

Check the `/p4/1/root/state` and `/p4/1/root/statejcopy` files. If the sequence numbers in the files do not match, then replay the journal into the root folder:

```bash
/p4/1/bin/p4d_1 -r /p4/1/root -jr -f /p4/1/logs/journal
```
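The state-file check above can also be scripted. This is a minimal sketch that treats any difference in content between the two files as a mismatch, which is an assumption about how the sequence numbers are stored. The function name `journal_replay_needed` is hypothetical.

```bash
# Hypothetical helper. Exit status:
#   0  the files differ, so the journal should be replayed
#   1  the files are identical, so no replay is needed
#   2  one of the files is missing or unreadable
journal_replay_needed() {
    state_file=$1
    jcopy_file=$2
    if [ ! -r "$state_file" ] || [ ! -r "$jcopy_file" ]; then
        return 2
    fi
    if cmp -s "$state_file" "$jcopy_file"; then
        return 1
    fi
    return 0
}

# Usage on the stopped replica (commented out on purpose):
# if journal_replay_needed /p4/1/root/state /p4/1/root/statejcopy; then
#     /p4/1/bin/p4d_1 -r /p4/1/root -jr -f /p4/1/logs/journal
# fi
```

A status of 2 is kept separate from "identical" so that a missing state file is never mistaken for a replica that is in sync.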

**If this was an unplanned failover where the master died, run:**

```bash
/p4/1/bin/p4d_1 -r /p4/1/root -J /p4/1/logs/journal -jj /p4/1/checkpoints/p4_1
```

## 2.5 Change the Replica to Become the Master

1. Change the DNS entry or move the VIP from the master to the new master if you are using a VIP address.

2. On the replica server, edit `/p4/<INSTANCE>/root/server.id`, replace its contents with the server ID of the MASTER, and save the file.

3. Edit `/p4/<INSTANCE>/root/sdp_server_type.txt` and change the type to `p4d_master`. (If you are failing over an edge server, the server type would be `p4d_edge`.)

4. Restart the server as the master:
   ```bash
   /p4/1/bin/p4d_1_init start
   ```

5. Edit the protect table and remove the `list user * * -//...` line that was added in Section 2.2 (planned failover only).

   After this is done, the former replica is now the master.

6. Run the following cleanup commands:
   ```bash
   rm /p4/1/root/state
   rm /p4/1/root/statejcopy
   rm /p4/1/root/rdb.lbr
   ```
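Steps 2 and 3 above can be combined into one small helper. This is a sketch only: it assumes that `server.id` and `sdp_server_type.txt` each hold a single line, and the function name `set_server_role` and the server ID `master.1` are placeholders rather than names from your environment.

```bash
# Hypothetical helper: rewrite server.id and sdp_server_type.txt under the
# given server root, keeping a .bak copy of each original file.
# Arguments: <root dir> <new server id> <new SDP server type>
set_server_role() {
    root=$1
    new_id=$2
    new_type=$3
    for name in server.id sdp_server_type.txt; do
        if [ -f "$root/$name" ]; then
            cp -p "$root/$name" "$root/$name.bak"
        fi
    done
    printf '%s\n' "$new_id" > "$root/server.id"
    printf '%s\n' "$new_type" > "$root/sdp_server_type.txt"
}

# Usage on the replica while p4d is stopped; master.1 is a placeholder ID:
# set_server_role /p4/1/root master.1 p4d_master
```

The `.bak` copies record the previous role, which helps if the failover has to be backed out.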

## 2.6 Update the Crontab

> **Note:** Run all commands as the perforce user.

**On the new master server** (originally the replica):

```bash
crontab /p4/p4.crontab
```

**On the original master server** (now the replica):

```bash
crontab /p4/p4.crontab.replica
```

> **Tip:** It is a good idea to save the current crontab settings before making changes:
> ```bash
> # On master
> crontab -l > p4.crontab
>
> # On replica
> crontab -l > p4.crontab.replica
> ```
> You can copy these files to the other machine to ensure you load the current settings.

## 2.7 Check the Edge and Replica Server Status

On each edge and replica, make sure that the servers are replicating properly by running:

```bash
p4 pull -lj
```

If they are not replicating, perform the following on the edge/replica:

```bash
# Stop the server
/p4/1/bin/p4d_1_init stop

# Log in to the new master
/p4/common/bin/p4master_run 1 p4 -p <master_server:port> login < /p4/common/config/.p4passwd.p4_1.admin
/p4/common/bin/p4master_run 1 p4 -p <master_server:port> login service

# Clean up state files
rm -f /p4/1/root/state /p4/1/root/rdb.lbr

# These only exist if the replica is using journalcopy rather than pull
rm -f /p4/1/root/statejcopy /p4/1/journals.rep/*

# Start the server
/p4/1/bin/p4d_1_init start
```

## 2.8 Change the Original Master to Become the Replica

1. On the original master, edit `/p4/1/root/server.id`, replace its contents with the server ID of the REPLICA, and save the file.

2. Edit `/p4/<INSTANCE>/root/sdp_server_type.txt` and change the type to `p4d_standby`. (If you are failing over an edge server, the server type would be `p4d_edgerep`.)

3. Move the journal file:
   ```bash
   mv /p4/1/logs/journal /p4/1/logs/journal.orig.master
   ```

4. Delete the following files if they exist:
   ```bash
   rm -f /p4/1/root/state
   rm -f /p4/1/root/statejcopy
   rm -f /p4/1/root/rdb.lbr
   ```

5. Log in to the master server as the service user so that replication can run:
   ```bash
   # Only if your site is using SSL
   /p4/common/bin/p4master_run <instance> p4 -p ssl:master:port -u service trust

   # Log in
   /p4/common/bin/p4master_run <instance> p4 -p ssl:master:port login < /p4/common/config/.p4passwd.p4_1.admin
   /p4/common/bin/p4master_run <instance> p4 -p ssl:master:port login service
   ```

6. Start the server as a replica:
   ```bash
   /p4/1/bin/p4d_1_init start
   ```

7. Check replication:
   ```bash
   p4 login < /p4/common/config/.p4passwd.p4_1.admin
   p4 pull -lj
   ```

   You should see two matching journal sequence numbers. `p4 pull -lj` succeeds only when run on a replica pulling from a master, so a successful result confirms that the roles have been switched.

8. Run the `sync_replica.sh` script to verify that it works properly:
   ```bash
   /p4/common/bin/p4master_run 1 /p4/common/bin/sync_replica.sh
   ```

---

# Important Notes

## Note 1: Master Server Down for More Than 7 Days

If the original master server has been down for more than 7 days, you will have to reset the replica before starting it. The reason for the 7-day value is that the number of old checkpoints and journals to keep is set to 7 in the `p4_vars` file. The replication uses the old journals to catch up from where it last stopped, so if the old ones have rotated off, you have to reset the replica with a new checkpoint and updated versioned files.
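The retention rule amounts to simple arithmetic on timestamps. The sketch below illustrates it under stated assumptions: `reset_required` is a hypothetical helper, both timestamps are epoch seconds that you supply, and the default of 7 days mirrors the value this document cites from `p4_vars`.

```bash
# Hypothetical helper: succeed if the outage is longer than the retention
# window, meaning the replica must be reset before it is started.
# Arguments: <down-since epoch seconds> <now epoch seconds> [keep days]
reset_required() {
    down_since=$1
    now=$2
    keep_days=${3:-7}
    # Compare in seconds so that partial days are not rounded away.
    [ $((now - down_since)) -gt $((keep_days * 86400)) ]
}

# Usage; the first timestamp is a made-up example of when the server went down:
# reset_required 1700000000 "$(date +%s)" && echo "reset the replica first"
```

If your site keeps a different number of journals, pass that number as the third argument.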

### To Reset the Replica

If your servers are set up to rsync without a password, you can reset the replica by running:

```bash
/p4/common/bin/p4master_run 1 /p4/common/bin/recreate_db_sync_replica.sh
```

The `recreate_db_sync_replica.sh` script performs the following steps:

```bash
# Sync checkpoints from master
rsync -avz --delete perforce@master_server:/p4/1/checkpoints/ /p4/1/checkpoints

# Clean up old database files
rm -rf /p4/1/root/db.*
rm -rf /p4/1/offline_db/db.*
rm -f /p4/1/logs/journal

# Check for the highest numbered checkpoint
ls -lah /p4/1/checkpoints
# Note the highest numbered p4_1.ckp.#.gz file (assume 10 for this example)

# Recover the checkpoint
/p4/1/bin/p4d_1 -r /p4/1/root -jr -z /p4/1/checkpoints/p4_1.ckp.10.gz

# Log in and start
/p4/1/bin/p4_1 -p master_server:port -u service login
/p4/1/bin/p4d_1_init start

# Recover to offline_db
/p4/1/bin/p4d_1 -r /p4/1/offline_db -jr -z /p4/1/checkpoints/p4_1.ckp.10.gz
```

Optionally, you may want to sync depot files:

```bash
rsync -avz --delete perforce@master_server:/p4/1/depots/ /p4/1/depots
```

Or verify the depot files are up to date:

```bash
/p4/common/bin/p4master_run 1 /p4/common/bin/p4verify.sh
```

## Note 2: Master Server is Inoperable

If the master server is inoperable, follow these steps:

1. On the replica server, run:
   ```bash
   /p4/1/bin/p4d_1_init stop
   ```

2. Continue with the steps in [Section 2.4: Stop the Replica](#24-stop-the-replica) above.
