25206 | Removed logic that uses 'p4d -cset' to force the value for P4JOURNAL, and also automatic journal rotation on server startup. Added related logic to verify_sdp.sh to ensure there is one source of truth for the P4JOURNAL definition. === On Journal Rotation at Server Startup === The goal with journal rotation on server stratup is noble, to make it so any potential journal corruption *always* appears at the end of a numbered journal file, rather than being in the middle of the active journal. This can make it easier and faster to recover from journal corruption caused by sudden power loss, kernel panic, a p4d bug/crash, etc. However, the implementation causes problems (noted below). === On Forcing P4JOURNAL === The goal of forcing the value of P4JOURNAL via db.config is also noble, in that having a value anything other than the SDP standard can really wreak havoc with things. This is generally not an issue in a 'fresh' SDP install, but can be an issue (wreak havoc!) in cases where 'p4 configure' was used to set a value for P4JOURNAL that conflicts with the value defined by the SDP environment mechanism, which is in turn passed to 'p4d' on the command line. Even if the value defined differently, it should be set in to exactly one value, and exactly one mechanism. The current implementation causes problems (noted below). == Problems with setting P4JOURNAL in db.config == 1. Things Break The forced P4JOURNAL set via 'p4d -cset' causes a mild form of journal corruption that breaks 'standby' replicas using journalcopy, as this type of replica is extremely sensitive to the contents of every byte in the journal file, and doesn't allow for use of 'p4d -cset' to modify the P4JOURNAL file. While it does not cause any actual loss of data, it does require manual reset to fix things. In the case of a site-wide topology with a mandatory standby replica, it causes global replication to stall. 2. Not our Place (not the place of SDP scripts) Based on the above and taking a step back, I think this script behavior of forcing a back-door journal rotation is simply too intrusive for what SDP scritps should be allowed to do. They live to have some understanding of p4d workings, but shoulnd't pretend to have too much insight into the inner workings of p4d. == Problem with Always-On Journal Rotation on Start == 1. What the wah? This confuses admins by incrementing the journal counter unexpectedly. In Battle School training classes, for example, students (Perforce admins) are confused by seemingly random journal incrementing. While this could be documented and trained for, it violates the principal of least surprise, and is not typical 'p4d' beavhior. 2. Always vs. Rare It rotates the journal even when there is no corruption, which of course 99.99999% or more of the time at any given site. Anyone who has been through a corruption scenario is happy to have the corruption at the end rather than in the middle of a journal file -- as noted, the intent here is noble. But before we do any journal rotations, we should detect whether there is corruption. Turns out we have a means to detect journal corruption at the end of the current/active journal file, and should employ such detection and handle it in some approrpaite manner, e.g. by expanding the 'force_start' logic in this p4d_base init script. Journal corrption detection and preliminary handling may be added in a future SDP release. When the journal is truly corrupted, global replication will stall in any case, so measure like journal file rotation may be called for in that scenario. 3. Accelerated Deletion of Backups Increased journal counter rotations result in unexpectedly fast removal of backups. Admins are used to thinking that roughly, "one journal rotation is roughly one day." Settings like KEEPLOGS, KEEPCKPS, and KEEPJNLS trigger off the number of journal rotatations, not the number of actual calendar days. Now, I think it's OK that journal rotations and days don't match precisely. In a typical "big deal" maintenance window, for example, there might be an additional 1-3 journal rotations induced by extra checkpoints or journals being created over the course of a maintenance activity. But there may be a dozen more 'p4d' restarts during sanity testing and playing around with things. With the current logic, each restart causes another journal rotation. By the end fo the weekend, your next call to daily_checkpoint might remove more of your recent backups than you'd like or expect. (A long standing safety feature always preserves the last few, but still we don't want to delete more than desired.) === Foor for Thougt: KEEP* = numbrer of days? === Making it so KEEPLOGS/KEEPJNLS/KEEPCKPS mean literally number of days rather than journal rotations is worthy of consideration. That's beyond the scope of this change though. #review @robert_cowham @josh |