# SDP Feature-Port Scope: russell_jackson fork vs perforce_software 2025.2

> ## OUTCOME (completed port)
> All phases below were implemented. Files touched (under `Server/Unix/p4/common/`):
>
> **New scripts copied/ported from upstream:** `bin/ccheck.sh` (+ `config/configurables.cfg`),
> `bin/check_dir_ownership.sh`, `bin/depot_verify_chunks.py`, `bin/edge_vars`,
> `bin/journal_watch.sh`, `bin/keep_offline_db_current.sh`, `bin/load_checkpoint.sh`,
> `bin/opt_perforce_sdp_backup.sh`, `bin/p4sanity_check.sh`, `bin/proxy_rotate.sh`,
> `bin/refresh_P4ROOT_from_offline_db.sh`, `bin/request_replica_checkpoint.sh`,
> `bin/run_if_broker.sh`, `bin/run_if_proxy.sh`, `bin/sdp_health_check.sh`, `bin/verify_sdp.sh`.
>
> **Patched existing fork scripts:** `bin/backup_functions.sh` (copy_jd_table, remove_jd_tables,
> get_target_config_value, rsync_with_preflight, copy_readonly_clients_dir, request_replica_checkpoint,
> get_latest_checkpoint_with_md5; copy_readonly wired into switch_db_files), `bin/upgrade.sh`
> (p4 storage -w + p4 upgrades polling + 2nd journal rotation), `bin/edge_dump.sh` (partitioned-storage
> tables), `bin/replica_status.sh` (archive-replication check), `bin/replica_cleanup.sh`,
> `bin/broker_rotate.sh` (bug fix), `bin/gen_default_broker_cfg.sh` (net.autotune), `bin/p4dstate.sh`
> (lslocks -J), `bin/p4login` (full modern replacement), `bin/p4_vars` (p4login support vars),
> `bin/p4d_base` (ownership preflight gate, start-only).
>
> **MUST be validated on a non-prod instance before relying on them** (high blast radius / untestable here):
> `p4login`, `p4d_base` (start gate), `verify_sdp.sh` (tune its `-skip` list for this fork's older
> layout), `load_checkpoint.sh` / `refresh_P4ROOT_from_offline_db.sh` (DB restore), `upgrade.sh`.
>
> **Known limitations:** `load_checkpoint.sh` edge-server path requires `edge_vars` (now shipped);
> `keep_offline_db_current.sh` replays from local CHECKPOINTS/JOURNALS (NFS-shared target checkpoints
> must be reachable locally) and its `replay_journals_to_offline_db` was intentionally NOT given the
> upstream `useTargetJournalPrefix` arg to avoid changing the fork's custom two-pass replay. Not ported:
> wholesale p4_vars/instance_vars env contract, systemd enforcement, mkrep.sh, edge_shelf_replicate.sh,
> p4ftpd_base, p4review2.py, p4brokerstate.sh/p4pstate.sh (ship as broken admin-edit templates upstream).

> ## POST-PORT REFINEMENTS
> - **Config migration to the config-file model:** the 15 fork-specific configurables now
>   live in `config/configurables.cfg` and are applied via `bin/ccheck.sh -fix`;
>   `setup/configure_new_server.sh` was slimmed to setup-only (it calls `ccheck.sh` for
>   configurables). `journalPrefix` in the cfg points at `journals.rotated` (fork layout).
> - **Rename:** `helix_binaries/get_helix_binaries.sh` → `p4_binaries/get_p4_binaries.sh`
>   (Perforce's P4 rebrand); FTP URL and product names left unchanged.
> - **Removed as unused:** the `backup_functions.sh` helpers `get_target_config_value` and
>   `get_latest_checkpoint_with_md5` were deleted (no dynamic journalPrefix logic remains;
>   `copy_jd_table`/`remove_jd_tables` are retained and still used).

> ## MULTI-AGENT REVIEW FIXES (submitted as Perforce change 32803)
> A multi-agent review of the full changeset confirmed: no dangling references to the
> removed functions, no leftover `helix` references, and clean `bash -n` across all scripts.
> It found 9 real issues, **all fixed**. Several were pre-existing upstream defects
> inherited by verbatim copy, so the fork is now cleaner than upstream on those.
>
> 1. **(High)** `bin/backup_functions.sh` `rsync_with_preflight`: rsync `--stats` size is in
>    BYTES (not KB) and stock GNU rsync adds comma separators, so the disk-space safety check
>    was off by ~1024× and could crash bash arithmetic. Now strips commas, converts bytes→KB,
>    defaults to 0.
> 2. **(High)** `config/configurables.cfg` `hcc|filesys.P4JOURNAL.min` and `hcc|filesys.P4LOG.min`:
>    a stray `Exact` field (8 fields instead of 7) made `ccheck -fix` bail under the `hcc`
>    profile. Removed.
> 3. **(High)** `bin/upgrade.sh`: p4d version thresholds were 2-digit (`"18.2"`/`"19.1"`) but
>    compared against 4-digit `p4d -V` output. Fixed to `"2018.2"`/`"2019.1"`.
> 4. **(Med)** `bin/upgrade.sh`: `start_p4d` was nested inside the version gate. Made
>    unconditional so p4d is never left stopped after a DB upgrade.
> 5. **(Low)** `bin/p4login`: `$JDTmpDir` was referenced unguarded under `set -u` when
>    db.config is unreadable. Guarded with a `[[ -r "$P4ROOT/db.config" ]]` check.
> 6. **(Low)** `bin/load_checkpoint.sh`: a compressed numbered journal was stored without its
>    `.gz` suffix, aborting replay. Suffix added.
> 7. **(Low)** `bin/proxy_rotate.sh`: dead `check_dirs 2` argument (fork's `check_dirs` takes
>    none) removed.
> 8. **(Low)** `p4_binaries/get_p4_binaries.sh`: long-form `YYYY.N` year whitelist stopped at
>    2024, rejecting `2025.2` (the default is `r25.2`). Added 2025/2026.
> 9. **(Low)** `config/configurables.cfg` `always|dm.user.resetpassword`: only 6 fields
>    (missing `ServerIDType`). Fixed to 7.

Scope of `Server/Unix/p4/common/bin`. Compares the fork against upstream
Rev. SDP/MultiArch/2025.2/32234.

**Already ported:** partitioned/readonly client directory handling (`rsync_with_preflight` +
`copy_readonly_clients_dir`, wired into `switch_db_files`).

## Guiding constraints (why this is field-level, not file-level)

- The fork is **not uniformly older**.
  It is independently ahead of upstream in several places:
  `printf %q` safe re-exec (`p4d_base`), array arg passing (`p4master_run`), `p4d -xu`
  pre-start upgrade, `local`-scoped `ps_functions.sh`, pid-protecting `kill_idle.sh`,
  modernized `update_limits.py`, and a `SERVER_TYPE`-based `run_if_*` design. A blind
  overwrite would REGRESS these.
- The **env contract diverges fundamentally**. The fork's `p4_vars` uses a hostname-qualified
  `P4PORT`, `SERVER_TYPE` (from `sdp_server_type.txt`), `JOURNALS=journals.rotated`, `RSYNCUSER`,
  and a `serverid.vars` model. Upstream uses a `db.config`-driven `instance_vars` model.
  **Do NOT replace `p4_vars` / `instance_vars` wholesale**; add individual variables only.
- The fork is intentionally **init-based, not systemd**. Do not re-enable the
  systemd-enforcement blocks.

---

## TIER 1: Modern p4d feature support (highest value, your stated goal)

| # | Item | Upstream source | Modern p4d feature addressed | Risk | Effort |
|---|------|-----------------|------------------------------|------|--------|
| 1 | **`p4 storage -w` + `p4 upgrades` polling** in upgrade flow | `upgrade.sh` | Waits for async db.storage upgrade (2019.1) and background upgrade tasks (2020.2+) before rotating/replaying journals. The fork's `upgrade.sh` does journaled `-xu` but does NOT wait; this is the one genuinely **dangerous** modern-p4d gap. | LOW–MED | SMALL |
| 2 | **Post-upgrade second journal rotation** | `upgrade.sh` | Rotates journal after a major upgrade so upgrade DB changes flow into offline_db. Master/commit only. | MED | SMALL |
| 3 | **`edge_vars` partitioned-storage table lists** | `edge_vars`, `edge_dump.sh`, `recover_edge.sh` | Excludes/seeds modern partitioned-storage db.* tables (`db.storagesh/sx`, `db.haveg`, `db.workingg`, `db.locksg`, `db.resolveg`) per p4d version, instead of the fork's hardcoded inline table list. | MED | MED |
| 4 | **`request_replica_checkpoint` (`p4 admin checkpoint -Z`)** | `request_replica_checkpoint.sh` + `backup_functions.sh` | journalcopy/standby checkpoint-at-next-rotation workflow, parallel `-p -m -N`. | LOW | SMALL |
| 5 | **Archive-replication health check** | `replica_status.sh` (lbr.replication) | Detects librarian/archive transfer failures (`pull -ls`, "Transfer of librarian file failed"), version-gated `pull -ljv`/`-lj`. Fork checks only metadata journal lag; archive failures go silent today. | MED | MED |
| 6 | **`keep_offline_db_current.sh`** | upstream-only script | Keeps a standby's offline_db current on NFS-shared checkpoints without full checkpoints. | MED | MED |
| 7 | **Modern `p4login`** | `p4login` (v4.6.2) | `auth.id`-aware login, service/automation users, edge `ExternalAddress`, P4AUTH/P4BROKERPORT login, SSL `p4 trust`, encrypted password file. Fork's `p4login` is a 2-line stub. Signature-compatible with callers. | MED | LARGE |
| 8 | **Proxy/broker modernization** | `p4p_base`, `gen_default_broker_cfg.sh`, `p4broker_base` | `net.autotune`/`-v track` proxy flags, SSL `p4 trust` of target/listen, multi-config broker (`*.broker..cfg`). | MED | MED |
| 9 | **`check_dir_ownership.sh` preflight** | `p4d_base` + script | Fast (`-maxdepth 1`) P4ROOT ownership check before start; explicitly designed around large partitioned-client / server.locks dirs. | LOW–MED | MED |

**Foundational helpers** (port first; they unlock #4/#5/#6 and improve safety):
`copy_jd_table()` / `remove_jd_tables()` (read db.config/db.counters from a temp copy instead of
the live DB), `get_target_config_value()` (P4TARGET→`configure show` discovery),
`get_latest_checkpoint_with_md5()`. All LOW risk / SMALL effort.

---

## TIER 2: Robustness / quality (cheap, safe, mostly standalone)

- **`p4sanity_check.sh`**: service smoke test. Standalone (sources only `p4_vars`). LOW/SMALL.
- **`sdp_health_check.sh`**: version-agnostic health report; low entanglement by design.
  **Better target than `verify_sdp.sh`.** LOW–MED/MED.
- **`check_dir_ownership.sh`** (standalone): wrong-owner detector after bad restore/rsync. LOW/SMALL.
- **`journal_watch.sh`**: journal-partition free-space watcher + auto-rotate/mail. LOW–MED/SMALL.
- **`depot_verify_chunks.py`**: chunk huge depots for parallel `p4 verify`. Complements fork's
  `cron_verify.sh`/`p4verify.py`. Needs P4Python. LOW/SMALL–MED.
- **`broker_rotate.sh` bug fix**: fork hardcodes `P4PORT=1666` and calls a meaningless
  `get_journalnum` on broker hosts. Real defect; fix regardless. LOW/SMALL.
- **`replica_cleanup.sh`**: add `-service` login, disk-space check, mail. LOW/SMALL.
- **`run_if_broker.sh` / `run_if_proxy.sh`**: fork lacks these gates. LOW/SMALL.
- **`proxy_rotate.sh`, `p4brokerstate.sh`, `p4pstate.sh`**: proxy/broker log-rotation +
  state-capture diagnostics. LOW/SMALL.
- **`p4dstate.sh`**: add `lslocks -J` (JSON, machine-parseable) capture. LOW/SMALL.
- **New additive instance vars**: `SDP_MAX_START/STOP_DELAY_*`, `SDP_ALWAYS_LOGIN`,
  `SDP_AUTOMATION_USERS`, `SDP_VERSION`, `SDP_ADMIN_PASSWORD_FILE` (only wire up alongside
  their consumers). LOW/SMALL.

---

## TIER 3: High value but heavy / entangled (decide case-by-case)

- **`verify_sdp.sh`** (87KB): full SDP self-verification harness; biggest single capability gap
  but tightly bound to the 2025.2 layout/configurables. HIGH/LARGE. Prefer `sdp_health_check.sh` first.
- **Full `upgrade.sh` orchestrator** (1576 vs 169 lines): 5-phase preflight-driven multi-binary
  upgrade. Pulls in `verify_sdp.sh` + `get_helix_binaries.sh`. HIGH/LARGE. (But the cheap
  subset, Tier 1 #1/#2, captures most of the safety benefit.)
- **`refresh_P4ROOT_from_offline_db.sh` / `load_checkpoint.sh`**: modern swap/restore tools
  (parallel + compressed checkpoint aware). Overlap with fork's `recreate_db_*` lineage. HIGH/LARGE.
- **`ccheck.sh` + `configurables.cfg`**: config-drift / security audit. Needs the baseline
  data file ported too. MED/MED.
- **`opt_perforce_sdp_backup.sh`**: DR backup of the SDP tooling layer itself. MED/MED–LARGE.

---

## DO NOT port

- Wholesale `p4_vars` / `instance_vars` (breaks fork's P4PORT/SERVER_TYPE/JOURNALS contract).
- systemd enforcement blocks (fork is intentionally init-based).
- Wholesale `db.config`-driven replica resolution.
- Whole-file overwrite of `p4d_base`, `p4master_run`, `ps_functions.sh`, `kill_idle.sh`,
  `update_limits.py` (fork is ahead in places; merge fields only).
- `mkrep.sh` (fork's `mkstandby*`/`mkedge*` cover it), `edge_shelf_replicate.sh` (obsolete on
  modern p4d), `p4ftpd_base` (obsolete), `p4review2.py` (Swarm supersedes).

---

## Recommended phased order

1. **Phase 0 (foundational, LOW/SMALL):** `copy_jd_table`/`remove_jd_tables`,
   `get_target_config_value`, `get_latest_checkpoint_with_md5` into `backup_functions.sh`.
2. **Phase 1 (critical correctness):** Tier 1 #1 + #2 (upgrade.sh storage/upgrades polling).
3. **Phase 2 (cheap robustness):** Tier 2 standalone scripts + `broker_rotate.sh` fix.
4. **Phase 3 (replica/standby features):** Tier 1 #4, #5, #6, #3.
5. **Phase 4 (larger):** Tier 1 #7 (p4login), #8 (proxy/broker), #9; then Tier 3 as desired.

---

## SNAPSHOT-BASED LIVE CHECKPOINT (later enhancement; submitted as change 32806)

Added to `bin/live_checkpoint.sh` + `bin/backup_functions.sh` (config in `bin/p4_vars`).
Goal: shrink the live-server outage of a live checkpoint from the WHOLE checkpoint dump
(the `p4d -jc` lock, minutes–hours) to the few seconds needed to take a consistent snapshot
of P4ROOT, then build the checkpoint FROM the snapshot offline.

**Flow (`snapshot_checkpoint`, master-only, parallel-aware):** build+validate provider create
command → rotate journal (non-fatal) → `p4d -r $P4ROOT -c ""` (p4d "lock tables, run command,
unlock" = consistent snapshot, brief lock) → expose snapshot as a readable root →
`dump_checkpoint_from_root` (offline) → guaranteed teardown. On any failure it returns non-zero
and `live_checkpoint.sh` falls back to the in-place `checkpoint()`.

**Methods** (`detect_snapshot_method`, precedence, override
`SNAPSHOT_METHOD=auto|reflink|aws|azure|gcp|off`):

1. **reflink**: local copy-on-write clone of `db.*` (XFS reflink=1 / btrfs). Fully local, the
   primary/testable path; no config.
2. **aws / azure / gcp**: create a volume snapshot, materialize a temp volume from it, attach,
   mount, checkpoint, then detach/delete. Config-driven (`SNAPSHOT_*` vars in `p4_vars`); needs
   the provider CLI + an instance role with snapshot/volume permissions, plus root/sudo for mount.
3. fall back to the in-place `checkpoint()`.

**Supporting changes:** refactored `dump_checkpoint` → `dump_checkpoint_from_root ` (reused by
the snapshot path; preserves the `SNAPSHOT_SCRIPT` hook). New helpers: `detect_snapshot_method`,
`snapshot_{create_script,expose,destroy}_`, `snapshot_rotate_journal` (non-fatal rotate),
`snapshot_wait_for_device`, `snapshot_build_create_script`, `snapshot_expose`, `snapshot_cleanup`,
`snapshot_checkpoint`, `_snapshot_checkpoint_run`.

**Multi-agent review fixes (in the same change):** mode-aware checkpoint existence-skip
(parallel `-jdpm` vs serial `.gz`); half-written-checkpoint guard restored; `SNAPSHOT_METHOD`
validation + `declare -F` guard; cloud teardown subshell-leak fixed (expose sets a global
`SnapRoot`, runs in-shell so volume/mount state persists for cleanup); non-fatal journal
rotation so failures fall back instead of aborting; build/validate before rotating; bounded
device-wait poll instead of a fixed sleep.

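
As a rough illustration of the `SNAPSHOT_METHOD` validation, the reflink-first precedence, and the `declare -F` guard described above, the selection logic might look like the sketch below. This is not the fork's `detect_snapshot_method`: every function name ending in `_sketch`, the `cp --reflink=always` probe, and the third `provider` argument (standing in for the `SNAPSHOT_*` configuration) are assumptions made for the example.

```bash
#!/usr/bin/env bash
# Illustrative sketch only. Every name ending in _sketch is hypothetical and
# is not taken from the fork's backup_functions.sh.

# Accept only the documented SNAPSHOT_METHOD values.
validate_snapshot_method_sketch() {
    case "$1" in
        auto|reflink|aws|azure|gcp|off) return 0 ;;
        *) echo "invalid SNAPSHOT_METHOD: '$1'" >&2; return 1 ;;
    esac
}

# Assumption: reflink support is probed by attempting a copy-on-write clone of
# a scratch file; GNU cp --reflink=always fails if the filesystem cannot clone.
reflink_supported_sketch() {
    local dir="$1" probe rc
    probe=$(mktemp "$dir/.reflink_probe.XXXXXX" 2>/dev/null) || return 1
    cp --reflink=always "$probe" "$probe.clone" 2>/dev/null
    rc=$?
    rm -f "$probe" "$probe.clone"
    return "$rc"
}

# Print the method to use.
#   $1 requested method, $2 directory to probe,
#   $3 configured cloud provider (hypothetical stand-in for SNAPSHOT_* config)
choose_snapshot_method_sketch() {
    local requested="${1:-auto}" dir="${2:-.}" provider="${3:-}" method=off

    # Reject unknown values up front and fall back to the in-place checkpoint.
    validate_snapshot_method_sketch "$requested" || { echo off; return 1; }

    if [[ "$requested" == auto ]]; then
        # Precedence: local reflink first, then a configured cloud provider.
        if reflink_supported_sketch "$dir"; then
            method=reflink
        else
            case "$provider" in
                aws|azure|gcp) method=$provider ;;
            esac
        fi
    else
        method=$requested
    fi

    # declare -F guard: never return a method whose helper is not defined.
    if [[ "$method" != off ]] && ! declare -F "expose_${method}_sketch" >/dev/null; then
        echo "no helper for '$method'; using in-place checkpoint" >&2
        method=off
    fi

    echo "$method"
}
```

With helpers defined for both methods, `choose_snapshot_method_sketch auto "$P4ROOT" aws` prints `reflink` where the probe succeeds and `aws` where it does not. With no matching helper defined it prints `off`, which corresponds to falling back to the in-place `checkpoint()`.
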
**MUST validate on a non-prod instance:** the journal-boundary consistency (rebuild offline_db from a snapshot checkpoint, replay journals, `p4 verify` vs. a control checkpoint) and the cloud paths' environment-specific config (volume IDs, device naming incl. AWS Nitro nvme, mount privilege). The rotate→`p4d -c`-lock transaction gap is a documented, accepted residual risk — run during a quiet window.
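
The bounded device-wait poll mentioned in the review fixes matters most on the cloud paths, where the attached volume can take a variable time to appear and its device name depends on the environment. A minimal sketch of such a poll follows. It is not the fork's `snapshot_wait_for_device`; the name `wait_for_path_sketch`, its argument order, and its defaults are assumptions chosen for illustration.

```bash
#!/usr/bin/env bash
# Illustrative sketch only; wait_for_path_sketch is a hypothetical name and is
# not the fork's snapshot_wait_for_device.

# Poll until `test <flag> <path>` succeeds, giving up after max_tries checks.
#   $1 path to wait for
#   $2 maximum number of checks (default 30)
#   $3 seconds between checks (default 2)
#   $4 test(1) flag (default -b, i.e. the path is a block device)
wait_for_path_sketch() {
    local path="$1" max_tries="${2:-30}" interval="${3:-2}" flag="${4:--b}"
    local try

    for (( try = 1; try <= max_tries; try++ )); do
        if test "$flag" "$path"; then
            return 0
        fi
        # Do not sleep after the final failed check.
        if (( try < max_tries )); then
            sleep "$interval"
        fi
    done

    echo "timed out waiting for $path after $max_tries checks" >&2
    return 1
}
```

Unlike a fixed sleep, this returns as soon as the device is visible and fails with a non-zero status after a known upper bound, so the caller can fall back instead of hanging. On an AWS Nitro instance the caller would pass the NVMe device node rather than the requested device name, for example `wait_for_path_sketch /dev/nvme1n1 30 2`; that path is an example only, not a value from this document.
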