# SDP Feature-Port Scope: russell_jackson fork vs perforce_software 2025.2

> ## OUTCOME (completed port)
> All phases below were implemented. Files touched (under `Server/Unix/p4/common/`):
>
> **New scripts copied/ported from upstream:** `bin/ccheck.sh` (+ `config/configurables.cfg`),
> `bin/check_dir_ownership.sh`, `bin/depot_verify_chunks.py`, `bin/edge_vars`,
> `bin/journal_watch.sh`, `bin/keep_offline_db_current.sh`, `bin/load_checkpoint.sh`,
> `bin/opt_perforce_sdp_backup.sh`, `bin/p4sanity_check.sh`, `bin/proxy_rotate.sh`,
> `bin/refresh_P4ROOT_from_offline_db.sh`, `bin/request_replica_checkpoint.sh`,
> `bin/run_if_broker.sh`, `bin/run_if_proxy.sh`, `bin/sdp_health_check.sh`, `bin/verify_sdp.sh`.
>
> **Patched existing fork scripts:** `bin/backup_functions.sh` (copy_jd_table, remove_jd_tables,
> get_target_config_value, rsync_with_preflight, copy_readonly_clients_dir, request_replica_checkpoint,
> get_latest_checkpoint_with_md5; copy_readonly_clients_dir wired into switch_db_files;
> get_target_config_value and get_latest_checkpoint_with_md5 were later removed, see POST-PORT
> REFINEMENTS), `bin/upgrade.sh`
> (p4 storage -w + p4 upgrades polling + 2nd journal rotation), `bin/edge_dump.sh` (partitioned-storage
> tables), `bin/replica_status.sh` (archive-replication check), `bin/replica_cleanup.sh`,
> `bin/broker_rotate.sh` (bug fix), `bin/gen_default_broker_cfg.sh` (net.autotune), `bin/p4dstate.sh`
> (lslocks -J), `bin/p4login` (full modern replacement), `bin/p4_vars` (p4login support vars),
> `bin/p4d_base` (ownership preflight gate, start-only).
>
> **MUST be validated on a non-prod instance before relying on them** (high blast radius / untestable here):
> `p4login`, `p4d_base` (start gate), `verify_sdp.sh` (tune its `-skip` list for this fork's older
> layout), `load_checkpoint.sh` / `refresh_P4ROOT_from_offline_db.sh` (DB restore), `upgrade.sh`.
>
> **Known limitations:** `load_checkpoint.sh` edge-server path requires `edge_vars` (now shipped);
> `keep_offline_db_current.sh` replays from local CHECKPOINTS/JOURNALS (NFS-shared target checkpoints
> must be reachable locally) and its `replay_journals_to_offline_db` was intentionally NOT given the
> upstream `useTargetJournalPrefix` arg to avoid changing the fork's custom two-pass replay. Not ported:
> wholesale p4_vars/instance_vars env contract, systemd enforcement, mkrep.sh, edge_shelf_replicate.sh,
> p4ftpd_base, p4review2.py, p4brokerstate.sh/p4pstate.sh (ship as broken admin-edit templates upstream).

> ## POST-PORT REFINEMENTS
> - **Config migration to the config-file model:** the 15 fork-specific configurables now
>   live in `config/configurables.cfg` and are applied via `bin/ccheck.sh -fix`;
>   `setup/configure_new_server.sh` was slimmed to setup-only (it calls `ccheck.sh` for
>   configurables). `journalPrefix` in the cfg points at `journals.rotated` (fork layout).
> - **Rename:** `helix_binaries/get_helix_binaries.sh` → `p4_binaries/get_p4_binaries.sh`
>   (Perforce's P4 rebrand); FTP URL and product names left unchanged.
> - **Removed as unused:** the `backup_functions.sh` helpers `get_target_config_value` and
>   `get_latest_checkpoint_with_md5` were deleted (no dynamic journalPrefix logic remains;
>   `copy_jd_table`/`remove_jd_tables` are retained and still used).
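
Because `ccheck.sh -fix` bails on malformed entries, a field-count lint over `config/configurables.cfg` is a cheap pre-check. The sketch below assumes the file is pipe-delimited with 7 fields per entry (the count given in the review fixes in this document) and that `#` starts a comment line; `lint_cfg_fields` is a hypothetical helper, not part of the SDP.

```bash
# Hypothetical helper: report entries whose pipe-delimited field count differs
# from the expected count (7 by default). Blank and '#' comment lines are skipped.
lint_cfg_fields() {
    local cfg="$1" want="${2:-7}"
    awk -F'|' -v want="$want" '
        /^[ \t]*#/ { next }    # comment line (assumed comment syntax)
        /^[ \t]*$/ { next }    # blank line
        NF != want {
            printf "%s:%d: %d fields (expected %d)\n", FILENAME, NR, NF, want
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$cfg"
}
```

Running `lint_cfg_fields config/configurables.cfg` before `ccheck.sh -fix` prints each offending line number and returns non-zero if any entry is malformed.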

> ## MULTI-AGENT REVIEW FIXES (submitted as Perforce change 32803)
> A multi-agent review of the full changeset confirmed: no dangling references to the
> removed functions, no leftover `helix` references, and clean `bash -n` across all scripts.
> It found 9 real issues — **all fixed**. Several were pre-existing upstream defects
> inherited by verbatim copy, so the fork is now cleaner than upstream on those.
>
> 1. **(High)** `bin/backup_functions.sh` `rsync_with_preflight`: rsync `--stats` size is in
>    BYTES (not KB) and stock GNU rsync adds comma separators, so the disk-space safety check
>    was off by ~1024× and could crash bash arithmetic. Now strips commas, converts bytes→KB,
>    defaults to 0.
> 2. **(High)** `config/configurables.cfg` `hcc|filesys.P4JOURNAL.min` and `hcc|filesys.P4LOG.min`:
>    a stray `Exact` field (8 fields instead of 7) made `ccheck -fix` bail under the `hcc`
>    profile. Removed.
> 3. **(High)** `bin/upgrade.sh`: p4d version thresholds were 2-digit (`"18.2"`/`"19.1"`) but
>    compared against 4-digit `p4d -V` output. Fixed to `"2018.2"`/`"2019.1"`.
> 4. **(Med)** `bin/upgrade.sh`: `start_p4d` was nested inside the version gate. Made
>    unconditional so p4d is never left stopped after a DB upgrade.
> 5. **(Low)** `bin/p4login`: `$JDTmpDir` was referenced unguarded under `set -u` when
>    db.config is unreadable. Guarded with a `[[ -r "$P4ROOT/db.config" ]]` check.
> 6. **(Low)** `bin/load_checkpoint.sh`: a compressed numbered journal was stored without its
>    `.gz` suffix, aborting replay. Suffix added.
> 7. **(Low)** `bin/proxy_rotate.sh`: dead `check_dirs 2` argument (fork's `check_dirs` takes
>    none) removed.
> 8. **(Low)** `p4_binaries/get_p4_binaries.sh`: long-form `YYYY.N` year whitelist stopped at
>    2024, rejecting `2025.2` (the default is `r25.2`). Added 2025/2026.
> 9. **(Low)** `config/configurables.cfg` `always|dm.user.resetpassword`: only 6 fields
>    (missing `ServerIDType`). Fixed to 7.
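
Fix 1 is easy to get wrong again, so here is a minimal sketch of the parsing it describes. It assumes the `rsync --stats` output contains a line of the form `Total transferred file size: 1,234,567 bytes`; the function names are hypothetical and are not the ones used in `rsync_with_preflight`.

```bash
# Hypothetical helper: convert the "Total transferred file size" line of
# `rsync --stats` output (passed as one string) to whole KB.
rsync_stats_to_kb() {
    local bytes
    bytes=$(printf '%s\n' "$1" |
        awk -F': *' '/^Total transferred file size:/ { print $2; exit }')
    bytes=${bytes%% *}                         # drop the trailing " bytes"
    bytes=$(printf '%s' "$bytes" | tr -d ',')  # strip thousands separators
    case "$bytes" in
        ''|*[!0-9]*) bytes=0 ;;                # default to 0 if not a plain integer
    esac
    echo $(( (bytes + 1023) / 1024 ))          # bytes -> KB, rounded up
}

# Hypothetical helper: true if the filesystem holding <dir> has at least
# <need_kb> KB available, using the POSIX `df -Pk` output format.
has_free_kb() {
    local avail
    avail=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "${avail:-0}" -ge "$2" ]
}
```

A caller would capture a dry run (`rsync --dry-run --stats ...`), pass the text to `rsync_stats_to_kb`, and then gate the real copy on `has_free_kb "$target_dir" "$need_kb"`.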


This document scopes `Server/Unix/p4/common/bin`, comparing the fork against upstream
Rev. SDP/MultiArch/2025.2/32234. **Already ported:** partitioned/readonly client
directory handling (`rsync_with_preflight` + `copy_readonly_clients_dir`, wired
into `switch_db_files`).

## Guiding constraints (why this is field-level, not file-level)
- The fork is **not uniformly older**. It is independently ahead of upstream in
  several places: `printf %q` safe re-exec (`p4d_base`), array arg passing
  (`p4master_run`), `p4d -xu` pre-start upgrade, `local`-scoped `ps_functions.sh`,
  pid-protecting `kill_idle.sh`, modernized `update_limits.py`, and a
  `SERVER_TYPE`-based `run_if_*` design. A blind overwrite would REGRESS these.
- The **env contract diverges fundamentally**. The fork's `p4_vars` uses a
  hostname-qualified `P4PORT`, `SERVER_TYPE` (from `sdp_server_type.txt`),
  `JOURNALS=journals.rotated`, `RSYNCUSER`, and a `serverid.vars` model. Upstream
  uses a `db.config`-driven `instance_vars` model. **Do NOT replace `p4_vars` /
  `instance_vars` wholesale** — add individual variables only.
- The fork is intentionally **init-based, not systemd**. Do not re-enable the
  systemd-enforcement blocks.

---

## TIER 1 — Modern p4d feature support (highest value, your stated goal)

| # | Item | Upstream source | Modern p4d feature addressed | Risk | Effort |
|---|------|-----------------|-------------------------|------|--------|
| 1 | **`p4 storage -w` + `p4 upgrades` polling** in upgrade flow | `upgrade.sh` | Waits for async db.storage upgrade (2019.1) and background upgrade tasks (2020.2+) before rotating/replaying journals. The fork's `upgrade.sh` does journaled `-xu` but does NOT wait — this is the one genuinely **dangerous** modern-p4d gap. | LOW–MED | SMALL |
| 2 | **Post-upgrade second journal rotation** | `upgrade.sh` | Rotates journal after a major upgrade so upgrade DB changes flow into offline_db. Master/commit only. | MED | SMALL |
| 3 | **`edge_vars` partitioned-storage table lists** | `edge_vars`, `edge_dump.sh`, `recover_edge.sh` | Excludes/seeds modern partitioned-storage db.* tables (`db.storagesh/sx`, `db.haveg`, `db.workingg`, `db.locksg`, `db.resolveg`) per p4d version, instead of the fork's hardcoded inline table list. | MED | MED |
| 4 | **`request_replica_checkpoint` (`p4 admin checkpoint -Z`)** | `request_replica_checkpoint.sh` + `backup_functions.sh` | journalcopy/standby checkpoint-at-next-rotation workflow, parallel `-p -m -N`. | LOW | SMALL |
| 5 | **Archive-replication health check** | `replica_status.sh` (lbr.replication) | Detects librarian/archive transfer failures (`pull -ls`, "Transfer of librarian file failed"), version-gated `pull -ljv`/`-lj`. Fork checks only metadata journal lag — archive failures go silent today. | MED | MED |
| 6 | **`keep_offline_db_current.sh`** | upstream-only script | Keeps a standby's offline_db current by replaying NFS-shared checkpoints and journals, so the standby needs no full checkpoints of its own. | MED | MED |
| 7 | **Modern `p4login`** | `p4login` (v4.6.2) | `auth.id`-aware login, service/automation users, edge `ExternalAddress`, P4AUTH/P4BROKERPORT login, SSL `p4 trust`, encrypted password file. Fork's `p4login` is a 2-line stub. Signature-compatible with callers. | MED | LARGE |
| 8 | **Proxy/broker modernization** | `p4p_base`, `gen_default_broker_cfg.sh`, `p4broker_base` | `net.autotune`/`-v track` proxy flags, SSL `p4 trust` of target/listen, multi-config broker (`*.broker.<host>.cfg`). | MED | MED |
| 9 | **`check_dir_ownership.sh` preflight** | `p4d_base` + script | Fast (`-maxdepth 1`) P4ROOT ownership check before start; explicitly designed around large partitioned-client / server.locks dirs. | LOW–MED | MED |
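
Item 1 comes down to a version gate plus a bounded poll. The sketch below shows one way to write both. It assumes GNU `sort -V` is available and that every non-blank line of the polled command's output contains the word `completed` once the work is done; that output format is an assumption here and should be checked against the real `p4 upgrades` output. Both function names are hypothetical.

```bash
# Hypothetical helper: true if dotted version $1 >= $2, e.g. version_ge 2020.2 2019.1.
# Both sides must use the same form (4-digit year), which is the point of review fix 3.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: run <command...> up to <max_tries> times, <delay> seconds
# apart, until it succeeds and every non-blank output line contains "completed".
# Usage: wait_for_upgrades <max_tries> <delay> <command...>
wait_for_upgrades() {
    local tries="$1" delay="$2" out
    shift 2
    while [ "$tries" -gt 0 ]; do
        if out=$("$@" 2>&1) &&
           ! printf '%s\n' "$out" | grep -v '^[[:space:]]*$' | grep -qv 'completed'; then
            return 0
        fi
        tries=$((tries - 1))
        sleep "$delay"
    done
    return 1   # still pending (or failing) after the last try
}
```

For example, `version_ge "$P4DVersion" 2020.2 && wait_for_upgrades 120 5 p4 upgrades` would block journal rotation until background upgrades finish or the poll gives up; `P4DVersion` is a placeholder for however the caller obtains the p4d version.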

**Foundational helpers** (port first — they unlock #4/#5/#6 and improve safety):
`copy_jd_table()` / `remove_jd_tables()` (read db.config/db.counters from a temp
copy instead of the live DB), `get_target_config_value()` (P4TARGET→`configure show`
discovery), `get_latest_checkpoint_with_md5()`. All LOW risk / SMALL effort.
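
The idea behind `copy_jd_table()` (dump a table from a temp copy, not from the live DB) can be sketched as follows. This is an illustration under assumptions, not the fork's implementation: `jd_table_from_copy` is a hypothetical name, `P4ROOT` and `P4DBIN` are assumed to come from the environment, and the sketch relies only on `p4d -r <root> -jd - <table>` writing the dump to stdout.

```bash
# Hypothetical helper: dump one db.* table to stdout from a private copy,
# so the dump never runs against the live P4ROOT.
jd_table_from_copy() {
    local table="$1" tmp rc=0
    tmp=$(mktemp -d) || return 1
    if cp -p "${P4ROOT:?P4ROOT is not set}/$table" "$tmp/"; then
        # Dump from the temp root; "-" sends the journal-format dump to stdout.
        "${P4DBIN:-p4d}" -r "$tmp" -jd - "$table" || rc=$?
    else
        rc=1
    fi
    rm -rf "$tmp"   # cleanup, the counterpart of remove_jd_tables
    return "$rc"
}
```

A caller could then read a setting with something like `jd_table_from_copy db.config | grep journalPrefix`. Copying a table file while p4d is running is only reasonable for small, rarely written tables such as `db.config`.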

---

## TIER 2 — Robustness / quality (cheap, safe, mostly standalone)

- **`p4sanity_check.sh`** — service smoke test. Standalone (sources only `p4_vars`). LOW/SMALL.
- **`sdp_health_check.sh`** — version-agnostic health report; low entanglement by design. **Better target than `verify_sdp.sh`.** LOW–MED/MED.
- **`check_dir_ownership.sh`** (standalone) — wrong-owner detector after bad restore/rsync. LOW/SMALL.
- **`journal_watch.sh`** — journal-partition free-space watcher + auto-rotate/mail. LOW–MED/SMALL.
- **`depot_verify_chunks.py`** — chunk huge depots for parallel `p4 verify`. Complements fork's `cron_verify.sh`/`p4verify.py`. Needs P4Python. LOW/SMALL–MED.
- **`broker_rotate.sh` bug fix** — fork hardcodes `P4PORT=1666` and calls a meaningless `get_journalnum` on broker hosts. Real defect; fix regardless. LOW/SMALL.
- **`replica_cleanup.sh`** — add `-service` login, disk-space check, mail. LOW/SMALL.
- **`run_if_broker.sh` / `run_if_proxy.sh`** — fork lacks these gates. LOW/SMALL.
- **`proxy_rotate.sh`, `p4brokerstate.sh`, `p4pstate.sh`** — proxy/broker log-rotation + state-capture diagnostics. LOW/SMALL.
- **`p4dstate.sh`** — add `lslocks -J` (JSON, machine-parseable) capture. LOW/SMALL.
- **New additive instance vars** — `SDP_MAX_START/STOP_DELAY_*`, `SDP_ALWAYS_LOGIN`, `SDP_AUTOMATION_USERS`, `SDP_VERSION`, `SDP_ADMIN_PASSWORD_FILE` (only wire up alongside their consumers). LOW/SMALL.
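
"Additive" here means a variable gets a default only when the fork's environment has not already set it, so nothing in the existing `p4_vars` contract is overridden. A minimal sketch of that pattern follows; the function name, the default values, and the password-file path are all assumptions for illustration, not values taken from upstream or the fork.

```bash
# Hypothetical sketch: give new variables a default only if they are unset or
# empty, leaving any value the fork's environment already provides untouched.
sdp_additive_defaults() {
    : "${SDP_ALWAYS_LOGIN:=0}"                                        # assumed default
    : "${SDP_AUTOMATION_USERS:=}"                                     # assumed default (none)
    : "${SDP_ADMIN_PASSWORD_FILE:=/p4/common/config/.p4passwd.admin}" # assumed path
    export SDP_ALWAYS_LOGIN SDP_AUTOMATION_USERS SDP_ADMIN_PASSWORD_FILE
}
```

The `: "${VAR:=default}"` form is safe under `set -u` and never clobbers an existing non-empty value.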

---

## TIER 3 — High value but heavy / entangled (decide case-by-case)

- **`verify_sdp.sh`** (87KB) — full SDP self-verification harness; biggest single capability gap but tightly bound to the 2025.2 layout/configurables. HIGH/LARGE. Prefer `sdp_health_check.sh` first.
- **Full `upgrade.sh` orchestrator** (1576 vs 169 lines) — 5-phase preflight-driven multi-binary upgrade. Pulls in `verify_sdp.sh` + `get_helix_binaries.sh`. HIGH/LARGE. (But the cheap subset — Tier 1 #1/#2 — captures most of the safety benefit.)
- **`refresh_P4ROOT_from_offline_db.sh` / `load_checkpoint.sh`** — modern swap/restore tools (parallel + compressed checkpoint aware). Overlap with fork's `recreate_db_*` lineage. HIGH/LARGE.
- **`ccheck.sh` + `configurables.cfg`** — config-drift / security audit. Needs the baseline data file ported too. MED/MED.
- **`opt_perforce_sdp_backup.sh`** — DR backup of the SDP tooling layer itself. MED/MED–LARGE.

---

## DO NOT port
- Wholesale `p4_vars` / `instance_vars` (breaks fork's P4PORT/SERVER_TYPE/JOURNALS contract).
- systemd enforcement blocks (fork is intentionally init-based).
- Wholesale `db.config`-driven replica resolution.
- Whole-file overwrite of `p4d_base`, `p4master_run`, `ps_functions.sh`, `kill_idle.sh`, `update_limits.py` (fork is ahead in places — merge fields only).
- `mkrep.sh` (fork's `mkstandby*`/`mkedge*` cover it), `edge_shelf_replicate.sh` (obsolete on modern p4d), `p4ftpd_base` (obsolete), `p4review2.py` (Swarm supersedes).

---

## Recommended phased order
1. **Phase 0 (foundational, LOW/SMALL):** `copy_jd_table`/`remove_jd_tables`,
   `get_target_config_value`, `get_latest_checkpoint_with_md5` into `backup_functions.sh`.
2. **Phase 1 (critical correctness):** Tier 1 #1 + #2 (upgrade.sh storage/upgrades polling).
3. **Phase 2 (cheap robustness):** Tier 2 standalone scripts + `broker_rotate.sh` fix.
4. **Phase 3 (replica/standby features):** Tier 1 #4, #5, #6, #3.
5. **Phase 4 (larger):** Tier 1 #7 (p4login), #8 (proxy/broker), #9; then Tier 3 as desired.

---

## SNAPSHOT-BASED LIVE CHECKPOINT (later enhancement; submitted as change 32806)

Added to `bin/live_checkpoint.sh` + `bin/backup_functions.sh` (config in `bin/p4_vars`). Goal:
shrink the live-server outage of a live checkpoint from the duration of the whole checkpoint dump
(the `p4d -jc` lock, minutes to hours) to the few seconds needed to take a consistent snapshot of
P4ROOT, then build the checkpoint from the snapshot offline.

**Flow (`snapshot_checkpoint`, master-only, parallel-aware):**
1. Build and validate the provider create command.
2. Rotate the journal (non-fatal).
3. Run `p4d -r $P4ROOT -c "<create>"`. p4d locks the tables, runs the command, then unlocks, so
   the snapshot is consistent and the lock is brief.
4. Expose the snapshot as a readable root.
5. Run `dump_checkpoint_from_root` against it (offline).
6. Tear down the snapshot (guaranteed).

On any failure it returns non-zero and `live_checkpoint.sh` falls back to the in-place `checkpoint()`.
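
The fallback contract can be sketched as below. `run_live_checkpoint` is a hypothetical wrapper name; `snapshot_checkpoint` and `checkpoint` are the functions named above and are assumed to be defined by `backup_functions.sh`. The sketch uses `command -v` as a portable existence check where the real script uses a `declare -F` guard.

```bash
# Hypothetical wrapper: try the snapshot path first; on any non-zero return,
# fall back to the in-place checkpoint.
run_live_checkpoint() {
    if [ "${SNAPSHOT_METHOD:-auto}" != "off" ] &&
       command -v snapshot_checkpoint >/dev/null 2>&1; then
        snapshot_checkpoint && return 0
        echo "snapshot checkpoint failed; falling back to in-place checkpoint" >&2
    fi
    checkpoint
}
```

The wrapper's exit status is that of whichever path ran last, so a failed fallback is still reported to the caller.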

**Methods** (`detect_snapshot_method`, precedence, override `SNAPSHOT_METHOD=auto|reflink|aws|azure|gcp|off`):
1. **reflink** — local copy-on-write clone of `db.*` (XFS reflink=1 / btrfs). Fully local, the
   primary/testable path; no config.
2. **aws / azure / gcp** — create a volume snapshot, materialize a temp volume from it, attach, mount,
   checkpoint, then detach/delete. Config-driven (`SNAPSHOT_*` vars in `p4_vars`); needs the provider
   CLI + an instance role with snapshot/volume permissions, plus root/sudo for mount.
3. fall back to the in-place `checkpoint()`.
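
A reduced sketch of the method selection, for illustration only. It validates the `SNAPSHOT_METHOD` override and, in `auto` mode, probes for reflink support by attempting a real `cp --reflink=always` clone (GNU coreutils) in the target directory. Cloud detection is deliberately left out because it depends on the `SNAPSHOT_*` settings, which are not shown here. `pick_snapshot_method` is a hypothetical name, not the real `detect_snapshot_method`.

```bash
# Hypothetical sketch: print the snapshot method to use for the db.* files in <dir>.
# Prints one of reflink|aws|azure|gcp|off; returns 2 on an invalid override.
pick_snapshot_method() {
    local dir="$1" method="${SNAPSHOT_METHOD:-auto}" probe result=off
    case "$method" in
        reflink|aws|azure|gcp|off) echo "$method"; return 0 ;;  # explicit override
        auto) ;;                                                # probe below
        *) echo "invalid SNAPSHOT_METHOD: $method" >&2; return 2 ;;
    esac
    # A reflink clone only succeeds on a copy-on-write capable filesystem.
    probe=$(mktemp "$dir/.reflink_probe.XXXXXX") || { echo off; return 0; }
    echo probe > "$probe"
    if cp --reflink=always "$probe" "$probe.clone" 2>/dev/null; then
        result=reflink
    fi
    rm -f "$probe" "$probe.clone"
    echo "$result"
}
```

Probing with an actual clone avoids parsing filesystem-specific tool output, at the cost of needing write access to the directory.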

**Supporting changes:** refactored `dump_checkpoint` → `dump_checkpoint_from_root <root>` (reused by the
snapshot path; preserves the `SNAPSHOT_SCRIPT` hook). New helpers: `detect_snapshot_method`,
`snapshot_{create_script,expose,destroy}_<method>`, `snapshot_rotate_journal` (non-fatal rotate),
`snapshot_wait_for_device`, `snapshot_build_create_script`, `snapshot_expose`, `snapshot_cleanup`,
`snapshot_checkpoint`, `_snapshot_checkpoint_run`.

**Multi-agent review fixes (in the same change):** mode-aware checkpoint existence-skip (parallel `-jdpm`
vs serial `.gz`); half-written-checkpoint guard restored; `SNAPSHOT_METHOD` validation + `declare -F`
guard; cloud teardown subshell-leak fixed (expose sets a global `SnapRoot`, runs in-shell so
volume/mount state persists for cleanup); non-fatal journal rotation so failures fall back instead of
aborting; build/validate before rotating; bounded device-wait poll instead of a fixed sleep.
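
The bounded device-wait replaces a fixed sleep with a poll that has an upper limit. A minimal sketch, with a hypothetical function name and defaults; the optional third argument selects the `test` operator (default `-b`, block device) so the helper can be exercised against an ordinary file.

```bash
# Hypothetical sketch: poll once per second until <path> satisfies the test
# operator (default -b, block device), giving up after <timeout> seconds.
# Usage: wait_for_device <path> [timeout_secs] [test_operator]
wait_for_device() {
    local dev="$1" timeout="${2:-60}" op="${3:--b}" waited=0
    until test "$op" "$dev"; do
        [ "$waited" -ge "$timeout" ] && return 1   # bounded: report failure
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

A non-zero return lets the snapshot path fail cleanly and hand control to the in-place checkpoint fallback, where a fixed sleep would either waste time or proceed before the device exists.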

**MUST validate on a non-prod instance:** the journal-boundary consistency (rebuild offline_db from a
snapshot checkpoint, replay journals, `p4 verify` vs. a control checkpoint) and the cloud paths'
environment-specific config (volume IDs, device naming incl. AWS Nitro nvme, mount privilege). The
rotate→`p4d -c`-lock transaction gap is a documented, accepted residual risk — run during a quiet window.