# Disaster Recovery Failover Plan

## Unix Standby Replica Server

---

# Overview

The DR server is replicated using Perforce's built-in server-to-server replication, via the `p4 journalcopy` command, plus an rsync script that replicates checkpoints. This document provides the steps to fail over from the main server to the DR server.

> **Note:** This is a generic document that needs to be customized for the specific setup in your environment.

**For an unplanned failover, start at [Section 2.4: Stop the Replica](#24-stop-the-replica).**

---

# Failover Procedure

## 2.1 Check Replica Status

Before starting the process, verify that replication is up to date.

On the replica machine, run `p4 pull -lj` to check the status of the replication. The command should return two matching journal sequence numbers.
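
The comparison can be scripted. The following is a minimal sketch, assuming `p4 pull -lj` prints lines of the form `Current replica journal state is: Journal N, Sequence M.` and `Current master journal state is: Journal N, Sequence M.`; check that wording against the output of your server version. The function name `replica_in_sync` is hypothetical.

```bash
# Hypothetical helper: reads `p4 pull -lj` output on stdin and succeeds only
# if the replica and master lines report the same journal and sequence numbers.
# Assumes the "Current replica/master journal state is:" wording described above.
replica_in_sync() {
    awk '
        # Reduce a status line to just its numbers, e.g. " 2 6510347 ".
        function numbers(line) {
            sub(/^[^:]*:/, "", line)
            gsub(/[^0-9]+/, " ", line)
            return line
        }
        /replica journal state/ { replica = numbers($0) }
        /master journal state/  { master  = numbers($0) }
        END { exit !(replica != "" && replica == master) }
    '
}
```

For example, `p4 pull -lj | replica_in_sync && echo "replica is caught up"` prints the message only when both lines agree.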

## 2.2 Stop Access to Master Server (Done on Master Server)

Change the protect table to block access from everyone except the admin user and the service user. Make the last three lines of the protect table look like this:

```
list user * * -//...
super user service * //...
super user perforce * //...
```
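
If you prefer to script this edit, the sketch below appends the three lines to a protections spec read from stdin. It assumes that `Protections:` is the final field of the spec produced by `p4 protect -o`, and the function name `append_lockdown` is hypothetical. If the two `super` lines already exist in your table, append only the `list` line by hand instead.

```bash
# Hypothetical filter: reads a protections spec on stdin and writes it back
# with the three lockdown entries appended as the last lines.
# Assumes Protections is the final field of the spec.
append_lockdown() {
    # Drop trailing blank lines so the new entries stay inside the field.
    awk '
        { line[NR] = $0 }
        END {
            n = NR
            while (n > 0 && line[n] ~ /^[ \t]*$/) n--
            for (i = 1; i <= n; i++) print line[i]
        }
    '
    # Spec field values are indented with a tab.
    printf '\t%s\n' \
        'list user * * -//...' \
        'super user service * //...' \
        'super user perforce * //...'
}
```

One way to use it is to save the current table first with `p4 protect -o > protect.before`, then run `append_lockdown < protect.before | p4 protect -i`. The saved file also makes it easy to restore the original table after the failover.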

Next, run:

```bash
/p4/1/bin/p4d_1 -r /p4/1/root -J /p4/1/logs/journal -jj /p4/1/checkpoints/p4_1
```

> **Note:** This command is run on the master. Its purpose is to cause all of the replicas to rotate their journal. That way, if you have to reset a replica after the failover, it has a very small journal to catch up from on the new master.

## 2.3 Check Replica Status

On the replica machine, run `p4 pull -lj` to check the status of the replication. Wait for the replica to fully catch up before shutting down the master and replica instances.
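
Waiting for the replica to catch up can be automated with a small polling loop. This is a generic sketch rather than part of the SDP; `poll_until` is a hypothetical name, and the check command you pass to it is whatever test you use on the `p4 pull -lj` output.

```bash
# Hypothetical helper: run a check command repeatedly until it succeeds.
# Usage: poll_until MAX_TRIES INTERVAL_SECONDS COMMAND [ARGS...]
poll_until() {
    max_tries=$1
    interval=$2
    shift 2
    try=1
    while [ "$try" -le "$max_tries" ]; do
        if "$@"; then
            return 0
        fi
        # Sleep between attempts, but not after the last one.
        [ "$try" -lt "$max_tries" ] && sleep "$interval"
        try=$((try + 1))
    done
    echo "gave up after $max_tries attempts" >&2
    return 1
}
```

A bounded retry count is used so that a replica that never catches up produces a visible failure instead of an endless wait.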

Once you have verified the replication is up to date, stop the master:

```bash
/p4/1/bin/p4d_1_init stop
```

## 2.4 Stop the Replica

Stop the replica by running:

```bash
/p4/1/bin/p4d_1_init stop
```

Now look in `/p4/1/journals.rep` for the highest numbered `p4_1.jnl.XXX` file in the folder and move that file to `/p4/1/logs/journal`, replacing the empty one that is there.
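
Picking the highest numbered file by eye is error-prone because a plain `ls` sorts `p4_1.jnl.9` after `p4_1.jnl.10`. The sketch below compares the numeric suffixes instead; `latest_journal` is a hypothetical helper name.

```bash
# Hypothetical helper: print the file in DIR named PREFIX.jnl.N with the largest N.
# Usage: latest_journal DIR PREFIX
latest_journal() {
    dir=$1
    prefix=$2
    best_file=
    best_num=-1
    for f in "$dir/$prefix".jnl.*; do
        num=${f##*.}
        # Skip anything whose suffix is not purely numeric (including an unmatched glob).
        case $num in
            ''|*[!0-9]*) continue ;;
        esac
        if [ "$num" -gt "$best_num" ]; then
            best_num=$num
            best_file=$f
        fi
    done
    [ -n "$best_file" ] || return 1
    printf '%s\n' "$best_file"
}
```

For instance 1 this could be used as `j=$(latest_journal /p4/1/journals.rep p4_1) && mv "$j" /p4/1/logs/journal`, which moves nothing if no numbered journal is found.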

Check the `/p4/1/root/state` and `/p4/1/root/statejcopy` files. If the sequence numbers in the files do not match, then replay the journal into the root folder:

```bash
/p4/1/bin/p4d_1 -r /p4/1/root -jr -f /p4/1/logs/journal
```
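
The check that decides whether the replay above is needed can be sketched as follows. It assumes that each state file records its journal position on the first line and that comparing those lines is sufficient; inspect the files on your own server before relying on it. `state_files_match` is a hypothetical name.

```bash
# Hypothetical helper: succeed if the two state files record the same position.
# Assumes the position is on the first line of each file.
# Returns 0 on a match, 1 on a mismatch, 2 if either file is unreadable.
state_files_match() {
    [ -r "$1" ] && [ -r "$2" ] || return 2
    [ "$(head -n 1 "$1")" = "$(head -n 1 "$2")" ]
}
```

Called as `state_files_match /p4/1/root/state /p4/1/root/statejcopy`, a return code of 1 means the journal needs to be replayed, while 2 means a file is missing and the situation needs a manual look.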

**If this was an unplanned failover where the master died, run:**

```bash
/p4/1/bin/p4d_1 -r /p4/1/root -J /p4/1/logs/journal -jj /p4/1/checkpoints/p4_1
```

## 2.5 Change the Replica to Become the Master

1. Redirect clients to the new master: change the DNS entry, or move the VIP if you are using a VIP address.

2. On the replica server, edit `/p4/<INSTANCE>/root/server.id`, change the name to the MASTER's server id, and save the file.

3. Edit `/p4/<INSTANCE>/root/sdp_server_type.txt` and change the type to `p4d_master`. (If you are failing over an edge server, the server type would be `p4d_edge`.)

4. Restart the server as the master:
   ```bash
   /p4/1/bin/p4d_1_init start
   ```

5. If this is a planned failover, edit the protect table and remove the `list user * * -//...` line that was added in Section 2.2.

After this is done, the former replica is now the master.

6. Run the following cleanup commands:
   ```bash
   rm /p4/1/root/state
   rm /p4/1/root/statejcopy
   rm /p4/1/root/rdb.lbr
   ```
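
The two file edits in steps 2 and 3 can be scripted. This sketch assumes that `server.id` and `sdp_server_type.txt` each hold a single line, and it keeps a `.bak` copy of the previous contents. `set_server_role` is a hypothetical helper, not an SDP script.

```bash
# Hypothetical helper: record a server's new identity and SDP role.
# Usage: set_server_role ROOT_DIR SERVER_ID SERVER_TYPE
# Assumes each file holds a single line; keeps a .bak copy of the old contents.
set_server_role() {
    root=$1
    server_id=$2
    server_type=$3
    for f in server.id sdp_server_type.txt; do
        [ -f "$root/$f" ] && cp -p "$root/$f" "$root/$f.bak"
    done
    printf '%s\n' "$server_id"   > "$root/server.id"
    printf '%s\n' "$server_type" > "$root/sdp_server_type.txt"
}
```

With the server stopped, promoting instance 1 would look like `set_server_role /p4/1/root YOUR_MASTER_ID p4d_master`, where `YOUR_MASTER_ID` stands for the master's server id at your site.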

## 2.6 Update the Crontab

> **Note:** Run all commands as the perforce user.

**On the new master server** (originally the replica):

```bash
crontab /p4/p4.crontab
```

**On the original master server** (now the replica):

```bash
crontab /p4/p4.crontab.replica
```

> **Tip:** It is a good idea to save the current crontab settings before making changes:
> ```bash
> # On master
> crontab -l > p4.crontab
>
> # On replica
> crontab -l > p4.crontab.replica
> ```
> You can copy these files to the other machine to ensure you load the current settings.
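
Saving and loading can be combined so that the backup is never skipped. A minimal sketch, with the hypothetical name `load_crontab` and a backup path of your choosing:

```bash
# Hypothetical helper: back up the active crontab, then load a new one.
# Usage: load_crontab NEW_CRONTAB_FILE BACKUP_FILE
load_crontab() {
    new_file=$1
    backup_file=$2
    if [ ! -r "$new_file" ]; then
        echo "cannot read $new_file" >&2
        return 1
    fi
    # `crontab -l` fails when no crontab is installed; an empty backup is fine then.
    crontab -l > "$backup_file" 2>/dev/null || true
    crontab "$new_file"
}
```

On the new master this might be run as `load_crontab /p4/p4.crontab "$HOME/crontab.before_failover"`. The readability check comes first so that a missing file is reported before anything is changed.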

## 2.7 Check the Edge and Replica Server Status

On each edge and replica, make sure that the servers are replicating properly by running:

```bash
p4 pull -lj
```

If they are not replicating, perform the following on the edge/replica:

```bash
# Stop the server
/p4/1/bin/p4d_1_init stop

# Log in to the new master (substitute your master's host and port for master_server:port)
/p4/common/bin/p4master_run 1 p4 -p master_server:port login < /p4/common/config/.p4passwd.p4_1.admin
/p4/common/bin/p4master_run 1 p4 -p master_server:port login service

# Clean up state files
rm -f /p4/1/root/state /p4/1/root/rdb.lbr

# These only exist if the replica is using journalcopy rather than pull
rm -f /p4/1/root/statejcopy /p4/1/journals.rep/*

# Start the server
/p4/1/bin/p4d_1_init start
```
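
With several edges and replicas, the `p4 pull -lj` check can be looped over a list of server ports. The sketch below assumes you already hold valid login tickets on every server; `check_replicas` and the example ports are hypothetical. A successful exit only shows that the command ran on that server, so the sequence numbers still need to be compared.

```bash
# Hypothetical helper: run `p4 pull -lj` against each given server port and
# report which ones answer. Assumes valid login tickets for every server.
# Usage: check_replicas PORT [PORT...]
check_replicas() {
    failed=0
    for port in "$@"; do
        if p4 -p "$port" pull -lj > /dev/null 2>&1; then
            echo "OK   $port"
        else
            echo "FAIL $port"
            failed=1
        fi
    done
    return "$failed"
}
```

For example, `check_replicas edge1:1666 replica2:1666` prints one line per server and returns non-zero if any of them failed.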

## 2.8 Change the Original Master to Become the Replica

1. On the original master, edit `/p4/1/root/server.id`, change the name to the REPLICA's server id, and save the file.

2. Edit `/p4/<INSTANCE>/root/sdp_server_type.txt` and change the type to `p4d_standby`. (If you are failing over an edge server, the server type would be `p4d_edgerep`.)

3. Move the journal file:
   ```bash
   mv /p4/1/logs/journal /p4/1/logs/journal.orig.master
   ```

4. Delete the following files if they exist:
   ```bash
   rm -f /p4/1/root/state
   rm -f /p4/1/root/statejcopy
   rm -f /p4/1/root/rdb.lbr
   ```

5. Log in to the master server as the service user so that replication can run:
   ```bash
   # Only if your site is using SSL
   /p4/common/bin/p4master_run 1 p4 -p ssl:master:port -u service trust

   # Log in
   /p4/common/bin/p4master_run 1 p4 -p ssl:master:port login < /p4/common/config/.p4passwd.p4_1.admin
   /p4/common/bin/p4master_run 1 p4 -p ssl:master:port login service
   ```

6. Start the server as a replica:
   ```bash
   /p4/1/bin/p4d_1_init start
   ```

7. Check replication:
   ```bash
   p4 login < /p4/common/config/.p4passwd.p4_1.admin
   p4 pull -lj
   ```

   You should see two matching journal sequence numbers. The `p4 pull -lj` command succeeds only when run on a replica pulling from a master, which proves that the roles have been switched.

8. Run the `sync_replica.sh` script to verify that it works properly:
   ```bash
   /p4/common/bin/p4master_run 1 /p4/common/bin/sync_replica.sh
   ```

---

# Important Notes

## Note 1: Master Server Down for More Than 7 Days

If the original master server has been down for more than 7 days, you will have to reset the replica before starting it. The reason for the 7-day value is that the number of old checkpoints and journals to keep is set to 7 in the `p4_vars` file. The replication uses the old journals to catch up from where it last stopped, so if the old ones have rotated off, you have to reset the replica with a new checkpoint and updated versioned files.
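
The 7-day rule is a simple comparison of outage length against journal retention. The sketch below assumes one journal rotation per day, so that the number of journals kept equals the number of days covered; `outage_exceeds_retention` is a hypothetical name, and the retention value should be taken from your own `p4_vars` file.

```bash
# Hypothetical helper: succeed if an outage is longer than the journal retention.
# Usage: outage_exceeds_retention DOWN_SINCE_EPOCH NOW_EPOCH KEEP_DAYS
# Assumes one journal rotation per day.
outage_exceeds_retention() {
    down_since=$1
    now=$2
    keep_days=$3
    outage_seconds=$((now - down_since))
    [ "$outage_seconds" -gt $((keep_days * 86400)) ]
}
```

An 8-day outage is 691200 seconds, which is more than the 604800 seconds covered by 7 daily journals, so `outage_exceeds_retention 0 691200 7` succeeds and the replica must be reset.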

### To Reset the Replica

If your servers are set up to rsync without a password, you can reset the replica by running:

```bash
/p4/common/bin/p4master_run 1 /p4/common/bin/recreate_db_sync_replica.sh
```

The `recreate_db_sync_replica.sh` script performs the following steps:

```bash
# Sync checkpoints from master
rsync -avz --delete perforce@master_server:/p4/1/checkpoints/ /p4/1/checkpoints

# Clean up old database files
rm -rf /p4/1/root/db.*
rm -rf /p4/1/offline_db/db.*
rm -f /p4/1/logs/journal

# Check for the highest numbered checkpoint
ls -lah /p4/1/checkpoints
# Note the highest numbered p4_1.ckp.#.gz file (assume 10 for this example)

# Recover the checkpoint
/p4/1/bin/p4d_1 -r /p4/1/root -jr -z /p4/1/checkpoints/p4_1.ckp.10.gz

# Log in and start
/p4/1/bin/p4_1 -p master_server:port -u service login
/p4/1/bin/p4d_1_init start

# Recover to offline_db
/p4/1/bin/p4d_1 -r /p4/1/offline_db -jr -z /p4/1/checkpoints/p4_1.ckp.10.gz
```

Optionally, you may want to sync depot files:

```bash
rsync -avz --delete perforce@master_server:/p4/1/depots/ /p4/1/depots
```

Or verify the depot files are up to date:

```bash
/p4/common/bin/p4master_run 1 /p4/common/bin/p4verify.sh
```

## Note 2: Master Server is Inoperable

If the master server is inoperable, follow these steps:

1. On the replica server, run:
   ```bash
   /p4/1/bin/p4d_1_init stop
   ```

2. Continue with the steps in [Section 2.4: Stop the Replica](#24-stop-the-replica) above.

---

Change history: #1, change 32388, Russell C. Jackson (Rusty): Updates using Claude.ai to clean up the code, reduce duplication, enhance security, and use current standards.