Removed logic that uses 'p4d -cset' to force the value for
P4JOURNAL, and also automatic journal rotation on server
startup.
Added related logic to verify_sdp.sh to ensure there is
one source of truth for the P4JOURNAL definition.
=== On Journal Rotation at Server Startup ===
The goal with journal rotation on server stratup is noble, to
make it so any potential journal corruption *always* appears
at the end of a numbered journal file, rather than being in
the middle of the active journal. This can make it easier and
faster to recover from journal corruption caused by sudden power
loss, kernel panic, a p4d bug/crash, etc.
However, the implementation causes problems (noted below).
=== On Forcing P4JOURNAL ===
The goal of forcing the value of P4JOURNAL via db.config
is also noble, in that having a value anything other than
the SDP standard can really wreak havoc with things. This
is generally not an issue in a 'fresh' SDP install, but can
be an issue (wreak havoc!) in cases where 'p4 configure' was
used to set a value for P4JOURNAL that conflicts with the
value defined by the SDP environment mechanism, which is in
turn passed to 'p4d' on the command line. Even if the value
defined differently, it should be set in to exactly one value,
and exactly one mechanism.
The current implementation causes problems (noted below).
== Problems with setting P4JOURNAL in db.config ==
1. Things Break
The forced P4JOURNAL set via 'p4d -cset' causes a mild form of
journal corruption that breaks 'standby' replicas using
journalcopy, as this type of replica is extremely sensitive to
the contents of every byte in the journal file, and doesn't
allow for use of 'p4d -cset' to modify the P4JOURNAL file.
While it does not cause any actual loss of data, it does require
manual reset to fix things. In the case of a site-wide topology
with a mandatory standby replica, it causes global replication to
stall.
2. Not our Place (not the place of SDP scripts)
Based on the above and taking a step back, I think this
script behavior of forcing a back-door journal rotation is simply
too intrusive for what SDP scritps should be allowed to do.
They live to have some understanding of p4d workings, but shoulnd't
pretend to have too much insight into the inner workings of p4d.
== Problem with Always-On Journal Rotation on Start ==
1. What the wah?
This confuses admins by incrementing the journal counter
unexpectedly. In Battle School training classes, for example,
students (Perforce admins) are confused by seemingly random
journal incrementing. While this could be documented and trained
for, it violates the principal of least surprise, and is not
typical 'p4d' beavhior.
2. Always vs. Rare
It rotates the journal even when there is no corruption,
which of course 99.99999% or more of the time at any given site.
Anyone who has been through a corruption scenario is happy to
have the corruption at the end rather than in the middle of a
journal file -- as noted, the intent here is noble. But before
we do any journal rotations, we should detect whether there is
corruption. Turns out we have a means to detect journal corruption
at the end of the current/active journal file, and should employ
such detection and handle it in some approrpaite manner, e.g.
by expanding the 'force_start' logic in this p4d_base init
script.
Journal corrption detection and preliminary handling may be added
in a future SDP release. When the journal is truly corrupted,
global replication will stall in any case, so measure like journal
file rotation may be called for in that scenario.
3. Accelerated Deletion of Backups
Increased journal counter rotations result in unexpectedly
fast removal of backups. Admins are used to thinking that
roughly, "one journal rotation is roughly one day." Settings
like KEEPLOGS, KEEPCKPS, and KEEPJNLS trigger off the number
of journal rotatations, not the number of actual calendar days.
Now, I think it's OK that journal rotations and days don't
match precisely. In a typical "big deal" maintenance window,
for example, there might be an additional 1-3 journal rotations
induced by extra checkpoints or journals being created over the
course of a maintenance activity. But there may be a dozen more
'p4d' restarts during sanity testing and playing around with things.
With the current logic, each restart causes another journal rotation.
By the end fo the weekend, your next call to daily_checkpoint might
remove more of your recent backups than you'd like or expect. (A
long standing safety feature always preserves the last few, but still
we don't want to delete more than desired.)
=== Foor for Thougt: KEEP* = numbrer of days? ===
Making it so KEEPLOGS/KEEPJNLS/KEEPCKPS mean literally number
of days rather than journal rotations is worthy of consideration.
That's beyond the scope of this change though.