#!@@Perl@@ -w

# Copyright (c) 2002 Anders Johnson <anders@ieee.org>. All rights
# reserved. This program is free software; you can redistribute it
# and/or modify it under the same terms as Perl itself.

=head1 NAME

@@Name@@ - routine maintenance for the p4(1) database

=head1 SYNOPSIS

B<@@Name@@> [B<-a> F<days>] [B<-k> F<keep>] [B<-L> F<log>] [B<-p> F<port>] [B<-q>|B<-Q>] [B<-s> F<percent>] [B<-u> F<user>] [F<p4-root>]

B<@@Name@@> B<-v>

=head1 DESCRIPTION

Performs routine periodic maintenance of the p4(1) database in a manner that
is suitable for invocation from a crontab(1) command.

F<p4-root> denotes the root directory of the p4 database.
If it is unspecified, then the value of the B<P4ROOT> environment variable
is used, or an error is raised if it is undefined.

The following tasks are performed in order:

=over 3

=item 1.

The database integrity is verified via B<p4 verify -qu //...>.

=item 2.

Stale journal files are reported, and the time of the last checkpoint is
determined from the modification time of the most recent checkpoint file.

=item 3.

The database is checkpointed via B<p4d -jc>.

=item 4.

The disk space usage on the p4 database volume is checked via B<df>,
and a warning is reported if it exceeds a threshold.

=item 5.

Any messages in the p4 error log more recent than the last checkpoint are
reported to STDOUT.

=item 6.

Any checkpoint files that exceed both an age threshold and a
number-of-checkpoints-ago threshold are removed, along with any journal
files that would otherwise become stale as a result.

=back
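
As a rough illustration, the external commands behind steps 1, 3 and 4 are
approximately the following.
F</path/to/p4-root> is a placeholder, and the handling of the B<-p> and
B<-u> options is omitted:

	p4 verify -qu //...
	p4d -r/path/to/p4-root -jc
	df /path/to/p4-root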

If a potentially serious problem is discovered at any stage, then messages
are sent to STDERR and processing terminates.
That way, you can fix any detected corruption before it propagates.

If there are no problems to report, then F<@@Name@@> does not produce
any output.
This is usually what you want, because then it won't generate annoying
e-mail when you run it as a crontab(1) command.
If you want to confirm that maintenance is actually occurring, then you
can look at the modification time of checkpoint files in the p4 database.
If you really want confirmation every time that it runs, then you can
make your crontab(5) entry look like this:

	0 2 * * 1-5 @@Name@@ ; echo "@@Name@@ done"

=head1 OPTIONS

=over 4

=item B<-a> F<days>

Set the minimum age of removed checkpoint files to F<days> days.
Default is 7.

=item B<-k> F<keep>

Set the minimum number of saved checkpoint files to F<keep>.
Default is 3.

=item B<-L> F<log>

Look in file F<log> for server errors. Default is "F<p4-root>/errlog".
Set it to "" to disable examining the log.

=item B<-p> F<port>

Override the B<P4PORT> environment variable with F<port>.

=item B<-q>

Quiet mode. Any errors in the p4 error log containing the message "Connection
with partner closed unexpectedly" are ignored, because they are generally
caused by exceptional but innocuous events, such as user-interrupt.

=item B<-Q>

Really quiet mode.
Ignores p4 error log messages containing "write: socket: Broken pipe",
because they are generally caused by a P4 Client API callback throwing an
exception while the server is trying to talk to the client.
Unfortunately, such a message might instead indicate real trouble, but
if the P4 Client API is in use at your site, you might get too many of
these errors to investigate them.
Implies B<-q>.

=item B<-s> F<percent>

Set the maximum disk usage of the volume containing the p4 database to
F<percent> percent.
A warning will be generated if this threshold is exceeded.
Default is 50%.

B<CAUTION:> It is recommended that you avoid increasing F<percent> well
beyond 50%.
While it is not normal for the size of the p4 database to double overnight,
it is known to happen occasionally when naive users do naive things, such
as making their own copy of everything.
You'll probably have to undo this anyway when it happens, but it's nice if
other people still get to do useful work in the meantime.
Similarly, it is a bad idea to use the volume on which the p4 database
resides for general purpose storage as well, because it's really undesirable
to lose access to the source control system just because somebody's program
went nuts and filled the disk.

=item B<-u> F<user>

Run B<p4d> and remove checkpoints as user F<user>.
This is useful when the user that runs the B<p4d> server is not a licensed
p4 user.
The user who invokes F<@@Name@@> is still required to be a licensed p4
user.
F</etc/sudoers> must allow the user invoking F<@@Name@@> to run
"B</bin/su> F<user> B<-c> ...".
This option isn't suitable for running from within a crontab(1) command
unless F</etc/sudoers> specifies NOPASSWD for this operation, because there is
no tty through which to authenticate the user.

=item B<-v>

Print the version number and exit.

=back
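
As an illustration of the F</etc/sudoers> requirement for B<-u>, an entry
along the following lines might be used.
The names F<cronuser> (the invoking user) and F<perforce> (the B<p4d> user)
are hypothetical placeholders, and the rule is only a sketch that should be
checked against your site's sudo policy before use:

	cronuser ALL = (root) NOPASSWD: /bin/su perforce -c *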

=head1 BACKUPS

You'll get the greatest benefit from performing p4 maintenance if it is
coordinated with the periodic system backups.

Ideally, the system backups should occur immediately after p4 maintenance,
and the current F<journal> file should be backed up I<before> any versioned
data files.
If this is the case, then you'll always be able to restore to the point of
the previous system backup of the F<journal> file.
Otherwise, you might have to go back as far as the last checkpoint file
that was completed before the most recent system backup started.

As a rule, you should set I<both> the B<-a> and B<-k> options to guarantee
that every checkpoint winds up on at least 2 system backups, assuming that
the system backup policy is adhered to.
This provides reasonable protection from errors in the backup media, as
well as from occasional violations of the backup policy.
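
For example, assume (purely for illustration) that maintenance runs on
weeknights and that a full system backup runs once a week.
A checkpoint then has to survive for up to two weeks to be captured twice,
which is also about ten maintenance runs, so a crontab(5) entry such as the
following leaves a small margin on both thresholds:

	0 2 * * 1-5 @@Name@@ -a 15 -k 11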

=head1 CAVEATS

=over 2

=item *

There isn't a way to specify a checkpoint/journal prefix.
Since journal and checkpoint files therefore reside on the same volume as
the database itself, they are vulnerable to a single-point failure.
This isn't really an issue, because the versioned data files are always
vulnerable to a single-point failure anyway, and the journal and checkpoint
files are useless without them.
The point of all this is that checkpointing and journaling provide a means
of recovery that is amenable to backups performed while the database is being
modified, but they are not really useful for hiding failures completely.
You need to invest in a RAID file system if that is a requirement.

=item *

The F<journal.n> files are basically useless, but they might come in handy
in the unlikely event that the p4 database becomes corrupted for no apparent
reason (for example, due to a p4d(1) bug).
Therefore, we keep them around as long as their subsequent checkpoint file.

=back

=head1 AUTHOR

Anders Johnson <F<anders@ieee.org>>

=cut

# I know that Perl 5.0 or later is required, because references are used.
# It has been tested with 5.006 (aka 5.6.0).
require 5.0;

use strict;

use Getopt::Long;
use IO::Dir;
use POSIX;

sub usage {
	print STDERR "usage: @@Name@@ [<options>] [<p4-root>]\n";
	exit(2);
}

sub check_status {
	my $status=shift;
	if($status>>8) {
		# Presumably, the child process generated a message already.
		exit($status>>8);
	}
	elsif($status) {
		my $sig=$status & 0x7f;
		die("Signal $sig\n");
	}
}

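sub mysys {
	# system() returns -1 only when the command could not be started;
	# $! is meaningful only in that case.
	my $status=system(@_);
	die("Failed to exec @_ because $!\n") if $status==-1;
	check_status($status);
}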

my $days=7;
my $keep=3;
my $space=50;
my $log;
my $quiet;
my $Quiet;
my $su;
my $p4port;
Getopt::Long::Configure("bundling");
GetOptions(
  'a|age=i' => \$days,
  'k|keep=i' => \$keep,
  'L|log=s' => \$log,
  'p|port=s' => \$p4port,
  'q|quiet' => \$quiet,
  'Q|really-quiet' => \$Quiet,
  's|space=i' => \$space,
  'u|user=s' => \$su,
  'v|version' => sub {
  	print "@@Name@@ @@Version@@\n";
	exit(0);
  }
) || usage();
$quiet=1 if $Quiet;
my $age=86400 * $days;
if($p4port) {
	$p4port="-p $p4port ";
}
else {
	$p4port="";
}

if($keep<1) {
	print STDERR "-k value must be at least 1\n";
	usage();
}

if($space<0 || $space>99) {
	print STDERR "-s value must be between 0 and 99\n";
	usage();
}

my $p4root;
if(@ARGV) {
	$p4root=shift;
}
else {
	$p4root=$ENV{P4ROOT} ||
	  die("Environment variable P4ROOT is undefined.\n");
}
if(@ARGV) {
	usage();
}
$log="$p4root/errlog" unless defined $log;

# Check the database integrity, and quit if there is a problem.
open(P4, "@@P4@@ ${p4port}verify -qu //... 2>&1 |") ||
  die("Failed to exec p4 verify because $!");
while(<P4>) {
	unless($_ eq "//... - file(s) already have digests.\n") {
		print STDERR $_;
	}
}
$!=0;
close(P4);
die("Failed to close p4 verify because $!") if $!;
check_status($?);

# Find out when we last checkpointed.
my $jtime=0;
my @checkpoints;
{
	my $d=new IO::Dir "$p4root";
	my %young;
	my @journals;
	my $max_mtime=time()-$age;
	die("Couldn't list $p4root because $!") unless defined($d);
	while(defined($_=$d->read)) {
		if(/checkpoint\.(\d+)/) {
			my $n=$1;
			my @stat=stat("$p4root/$_")
			  or die("stat $p4root/$_ failed because $!");
			my $mtime=$stat[9];
			if($mtime > $jtime) {
				$jtime = $mtime;
			}
			if($mtime > $max_mtime) {
				$young{$n}=1;
			}
			push(@checkpoints, $n);
		}
		if(/journal\.(\d+)/) {
			push(@journals, $1);
		}
	}
	my %checkpoints;
	@checkpoints{@checkpoints}=0..$#checkpoints;
	for my $j (@journals) {
		unless(exists $checkpoints{$j+1}) {
			warn("Stale journal: journal.$j\n");
		}
	}
	my @tmp=sort { $a <=> $b } @checkpoints;
	for(2..$keep) { pop(@tmp); }
	@checkpoints=map { exists($young{$_}) ? () : ($_) } @tmp;
}

# How to prefix commands to run as the Perforce pseudo-user:
my $sudo=defined($su) ?
  sub {
	return "sudo /bin/su $su -c \"@_\"";
  }
:
  sub {
	return "@_";
  };

# Journal the database.
open(P4D, &$sudo("@@P4d@@ -r$p4root -jc")." |") ||
  die("Failed to run p4d because $!");
my @steps;
while(<P4D>) {
	if(/^Checkpointing to checkpoint\.\d+\.\.\.$/) {
		$steps[0]=1;
	}
	elsif(/^Saving journal to journal\.\d+\.\.\.$/) {
		$steps[1]=1;
	}
	elsif(/^Truncating journal\.\.\.$/) {
		$steps[2]=1;
	}
	else {
		print;
	}
}
$!=0;
close(P4D);
die("Failed to close p4d because $!") if $!;
check_status($?);
for my $i (0..2) {
	warn("p4d did not confirm step $i") unless $steps[$i];
}

# Check the remaining disk space.
open(DF, "df $p4root|") || die("Failed to run df because $!");
while(<DF>) {
	if(/\s(\d+)\%/) {
		if($1 > $space) {
			warn("Partition containing $p4root is $1% full.\n");
		}
	}
}
$!=0;
close(DF);
die("Failed to close df because $!") if $!;
check_status($?);

# Print any messages entered into the error log since the last checkpoint.
if($log) {
	open(LOG, "<$log") || die("Couldn't read $log because $!");
	my $msg="";
	my $go;
	my $nogo;
	my $pend="";
	while(<LOG>) {
		if(/^\S/) {
			$msg="" if $nogo;
			print $pend.$msg if $go;
			if(defined $go) {
				$pend="";
			}
			else {
				$pend.=$msg;
			}
			$msg="";
			undef $go;
			undef $nogo;
			$nogo=1 if /^Perforce server info:$/;
		}
		$msg.=$_;
		if(m|Date (\d+)/(\d+)/(\d+)\s+(\d+):(\d+):(\d+)|) {
			$go=0 unless defined $go;
			# Always assume standard time. This will result in
			# double reporting of errors posted up to an hour
			# before checkpointing during DST, but otherwise
			# errors might get lost if checkpointing occurs
			# near the fall switch-over.
			my $t=POSIX::mktime($6, $5, $4, $3, $2-1, $1-1900,
			  0, 0, 0
			);
			$go=1 if $t >= $jtime;
		}
		elsif($quiet &&
		  /^\s+Connection with partner closed unexpectedly\.$/
		) {
			$nogo=1;
		}
		elsif($Quiet &&
		  /^\s+write: socket: Broken pipe$/
		) {
			$nogo=1;
		}
	}
	$msg="" if $nogo;
	print $pend.$msg if $go;
	unless(defined $go) {
		my @stat=stat LOG;
		my $t=$stat[9];
		print $pend.$msg if $t >= $jtime;
	}
	close(LOG) || die("Failed to close $log because $!");
}

# Remove the old checkpoints and journals
for my $c (@checkpoints) {
	mysys(&$sudo("rm -f $p4root/checkpoint.$c"));
	my $f="$p4root/journal.".($c-1);
	if(-e $f) {
		mysys(&$sudo("rm -f $f"));
	}
}
#	Change	User	Description
#6	2072	anders_johnson	Released p4checkpoint-0.06
#5	2024	anders_johnson	Release 0.05 of p4checkpoint.
#4	1876	anders_johnson	Release version 0.04 of p4checkpoint.
#3	1780	anders_johnson	p4checkpoint-0.03, p4tkd-0.01, p4tkmerge-0.02
#2	1628	anders_johnson	Check-in of p4checkpoint version 0.02.
#1	1458	anders_johnson	p4checkpoint-0.01