SDP-353

tom_tyler (C. Thomas Tyler)
C. Thomas Tyler created this job , modified by C. Thomas Tyler
Closed
Avoid doing 'p4d -cset' for P4JOURNAL, but preserve safety feature.

Change @18686 introduced logic in p4d startup to always forcibly
set P4JOURNAL to the correct SDP-defined value, /p4/N/logs/journal,
by using 'p4d -cset' at the start of processing.  This matches
the value set in the SDP shell environment.

This was done to address a situation where, for whatever reason,
P4JOURNAL was somehow set to an incorrect value, with bad
consequences.  Apparently it was set with a 'p4 configure'
command resulting in it being stored in db.config, which
overrides the shell environment setting.

Using 'p4d -cset' generates strange and atypical entries in the
journal file, and so I want to stop doing that.  Also, I want
to avoid defining things in more than one way.

So, we want to change the logic so that we detect the
P4JOURNAL value as set in db.config first.  If it is
set, we should instead to a 'p4d -cunset' to make sure
the value for P4JOURNAL defined in the shell environment
does not get overriden.

While we are at it, we may want to also ensure that P4LOG,
and other key setting are not defined in db.config, so that they
defer to the SDP standard shell environment values.
25206Removed logic that uses 'p4d -cset' to force the value for
P4JOURNAL, and also automatic journal rotation on server
startup.

Added related logic to verify_sdp.sh to ensure there is
one source of truth for the P4JOURNAL definition.

=== On Journal Rotation at Server Startup ===

The goal with journal rotation on server stratup is noble, to
make it so any potential journal corruption *always* appears
at the end of a numbered journal file, rather than being in
the middle of the active journal.  This can make it easier and
faster to recover from journal corruption caused by sudden power
loss, kernel panic, a p4d bug/crash, etc.

However, the implementation causes problems (noted below).

=== On Forcing P4JOURNAL ===

The goal of forcing the value of P4JOURNAL via db.config
is also noble, in that having a value anything other than
the SDP standard can really wreak havoc with things.  This
is generally not an issue in a 'fresh' SDP install, but can
be an issue (wreak havoc!) in cases where 'p4 configure' was
used to set a value for P4JOURNAL that conflicts with the
value defined by the SDP environment mechanism, which is in
turn passed to 'p4d' on the command line.  Even if the value
defined differently, it should be set in to exactly one value,
and exactly one mechanism.

The current implementation causes problems (noted below).

== Problems with setting P4JOURNAL in db.config ==

1. Things  Break

The forced P4JOURNAL set via 'p4d -cset' causes a mild form of
journal corruption that breaks 'standby' replicas using
journalcopy, as this type of replica is extremely sensitive to
the contents of every byte in the journal file, and doesn't
allow for use of 'p4d -cset' to modify the P4JOURNAL file.

While it does not cause any actual loss of data, it does require
manual reset to fix things.  In the case of a site-wide topology
with a mandatory standby replica, it causes global replication to
stall.

2. Not our Place (not the place of SDP scripts)

Based on the above and taking a step back, I think this
script behavior of forcing a back-door journal rotation is simply
too intrusive for what SDP scritps should be allowed to do.
They live to have some understanding of p4d workings, but shoulnd't
pretend to have too much insight into the inner workings of p4d.

== Problem with Always-On Journal Rotation on Start ==

1. What the wah?

This confuses admins by incrementing the journal counter
unexpectedly.  In Battle School training classes, for example,
students (Perforce admins) are confused by seemingly random
journal incrementing.  While this could be documented and trained
for, it violates the principal of least surprise, and is not
typical 'p4d' beavhior.

2. Always vs. Rare

It rotates the journal even when there is no corruption,
which of course 99.99999% or more of the time at any given site.
Anyone who has been through a corruption scenario is happy to
have the corruption at the end rather than in the middle of a
journal file -- as noted, the intent here is noble.  But before
we do any journal rotations, we should detect whether there is
corruption.  Turns out we have a means to detect journal corruption
at the end of the current/active journal file, and should employ
such detection and handle it in some approrpaite manner, e.g.
by expanding the 'force_start' logic in this p4d_base init
script.

Journal corrption detection and preliminary handling may be added
in a future SDP release.  When the journal is truly corrupted,
global replication will stall in any case, so measure like journal
file rotation may be called for in that scenario.

3. Accelerated Deletion of Backups

Increased journal counter rotations result in unexpectedly
fast removal of backups.  Admins are used to thinking that
roughly, "one journal rotation is roughly one day."  Settings
like KEEPLOGS, KEEPCKPS, and KEEPJNLS trigger off the number
of journal rotatations, not the number of actual calendar days.

Now, I think it's OK that journal rotations and days don't
match precisely.  In a typical "big deal" maintenance window,
for example, there might be an additional 1-3 journal rotations
induced by extra checkpoints or journals being created over the
course of a maintenance activity.  But there may be a dozen more
'p4d' restarts during sanity testing and playing around with things.

With the current logic, each restart causes another journal rotation.
By the end fo the weekend, your next call to daily_checkpoint might
remove more of your recent backups than you'd like or expect. (A
long standing safety feature always preserves the last few, but still
we don't want to delete more than desired.)

=== Foor for Thougt: KEEP* = numbrer of days? ===

Making it so KEEPLOGS/KEEPJNLS/KEEPCKPS mean literally number
of days rather than journal rotations is worthy of consideration.
That's beyond the scope of this change though.

#review @robert_cowham @josh
  • Details
  • Comments -
Status
Closed
Project
perforce-software-sdp
Severity
C
Reported By
C. Thomas Tyler
Reported Date
Modified By
C. Thomas Tyler
Modified Date
Owned By
tom_tyler
Component
core-unix
Type
Feature