This project contains example bash scripts that show how Helix components could be monitored from a Nagios server.
The return states are configurable, so they can be customised for other monitoring solutions as needed:
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
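As an illustration of how these states are typically consumed, here is a minimal sketch of a helper that prints a status line and returns the matching code. The name `report_state` is hypothetical and is not taken from the shipped script; the value 3 for an unrecognised state follows the usual Nagios UNKNOWN convention.

```shell
#!/bin/bash
# Return states as listed above.
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2

# report_state is a hypothetical helper, not part of the shipped script.
# It prints a one-line status message and returns the matching state code.
report_state() {
    local state="$1" message="$2"
    case "$state" in
        "$STATE_OK")       echo "OK: $message" ;;
        "$STATE_WARNING")  echo "WARNING: $message" ;;
        "$STATE_CRITICAL") echo "CRITICAL: $message" ;;
        # Anything else is reported as UNKNOWN (3 by Nagios convention).
        *)                 echo "UNKNOWN: $message"; return 3 ;;
    esac
    return "$state"
}
```

A check would normally end with something like `report_state "$STATE_CRITICAL" "P4D server not responding!"; exit $?`, so the monitoring system sees the exit code.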
check_helix_p4d_health
Description: Helix P4D health checker - Example Nagios monitoring script
Usage:
check_helix_p4d_health -p <p4port> [options] [tests]
options: [-u <p4-user>] [-T <p4-ticket-file-name>] [-P <p4-password>]
[--notips] [--comment] [--version] [--help]
tests:
[--all]
[--remoteonly]
[--licensecheck] [--licthreshold <expire-days-threshold>]
[--pidcheck] [--pidthreshold <running-p4-pid-threshold>]
[--p4monitorcheck] [--monthreshold <p4-processes-threshold>]
[--p4diskcheck] [--diskthreshold <disk-used-threshold>]
[--p4repcheck]
If no tests are specified then only the P4D server online test is run.
Multiple test arguments can be supplied. Thresholds can be supplied
with specific tests or when using '--all' or '--remoteonly'.
The '--all' flag runs all tests.
The '--remoteonly' flag runs only remote tests using the p4 client. These
tests can be run from any machine with network access to the P4D server
(including the Nagios server).
The '--message' flag allows you to prepend an alert with a custom message.
The '-u' flag specifies the P4D user name. This user must be an 'operator'
user or must have 'super' access to the P4D server.
The '-p' flag specifies the P4D hostname and port in the
format 'hostname:port'.
The '-T' flag specifies the location of the tickets file.
The '-P' flag explicitly sets the password to use. Note that this may not be
secure; using 'p4 login' and a tickets file ('-T') on the server
provides better security.
The '-t' flag specifies the location of the P4TICKETS file for SSL
connections.
The '--licensecheck' flag tests if the license file is nearing its
expiry date. By default it checks for expiry within '30' days but this
can be overridden with the '--licthreshold' flag.
The '--p4diskcheck' flag checks for free disk space on the P4D drives using
the Perforce command 'p4 diskspace' and warns if the disks are over 95% used.
This value can be overridden by specifying a value between 0 and 99 with
the '--diskthreshold' flag.
The '--pidcheck' flag counts the number of connected p4d processes using
'netstat' and warns if there are over 500 processes running. This value can
be overridden with the '--pidthreshold' flag. This test must be run on the
P4D server machine.
The '--p4monitorcheck' flag counts the number of commands in the
'p4 monitor' table and warns if there are over 500 running. This value can be
overridden with the '--monthreshold' flag.
The '--p4repcheck' flag (REPLICA ONLY) checks that replication is running.
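The threshold tests described above all follow the same pattern: obtain a number, compare it with a default or user-supplied limit, and warn when the limit is exceeded. The sketch below shows that pattern for the disk check only. The function names are hypothetical, and the `(NN% full)` line format assumed for 'p4 diskspace' output should be verified against your own server before relying on the pattern.

```shell
#!/bin/bash
# Hypothetical helpers for illustration; the names are not taken from the script.

# Extract the "NN% full" figure from one line of 'p4 diskspace' output.
# The "(NN% full)" line format is an assumption; adjust the pattern if needed.
disk_percent_used() {
    printf '%s\n' "$1" | sed -n 's/.*(\([0-9][0-9]*\)% full).*/\1/p'
}

# Compare a used percentage with a threshold (default 95, as described above).
# Returns 1 (WARNING) when usage is over the threshold, otherwise 0 (OK).
check_disk_used() {
    local used="$1" threshold="${2:-95}"
    if [ "$used" -gt "$threshold" ]; then
        echo "WARNING: disk is ${used}% used (threshold ${threshold}%)"
        return 1
    fi
    echo "OK: disk is ${used}% used (threshold ${threshold}%)"
    return 0
}
```

The process and monitor checks could reuse the same comparison with their own default of 500.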
Example Output:
CRITICAL: P4D server not responding!
Perforce client error:
Connect to server failed; check $P4PORT.
TCP connect to localhost:1666 failed.
connect: 127.0.0.1:1666: Connection refused
TIP:
Check if the 'p4d' process is running on the box.
Check the P4D log file for errors if it unexpectedly
stopped.
Examples:
Run all checks against server on localhost:1666
check_helix_p4d_health -p 1666 --all
Check if license will expire in next 45 days
check_helix_p4d_health -p 1666 --licensecheck --licthreshold 45
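To make the license threshold concrete, the following sketch shows the arithmetic behind an "expires within N days" test. Both function names are hypothetical, and how the script actually obtains the expiry time is not shown here; the sketch simply assumes it is available as epoch seconds. Treating a value equal to the threshold as a warning is also an assumption.

```shell
#!/bin/bash
# Hypothetical helpers for illustration; not taken from the shipped script.

# Whole days between "now" and the expiry time, both given as epoch seconds.
# The second argument defaults to the current time. Bash integer division
# truncates toward zero.
days_until_expiry() {
    local expiry="$1" now="${2:-$(date +%s)}"
    echo $(( (expiry - now) / 86400 ))
}

# Warn when the license expires within the threshold (default 30 days).
check_license_days() {
    local days="$1" threshold="${2:-30}"
    if [ "$days" -le "$threshold" ]; then
        echo "WARNING: license expires in ${days} days (threshold ${threshold})"
        return 1
    fi
    echo "OK: license expires in ${days} days (threshold ${threshold})"
    return 0
}
```

With a license 40 days from expiry, `check_license_days 40 45` returns 1 (warning), while `check_license_days 40` with the default threshold of 30 returns 0.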
If the P4D server is running using SSL then the connection must first be trusted. For example, to trust the connection 'ssl:localhost:1666':
export P4TRUST=/etc/nagios/.p4trust
touch $P4TRUST
chown nagios:nagios $P4TRUST
chmod 600 $P4TRUST
p4 -p ssl:localhost:1666 trust
The Nagios command then may look similar to:
check_helix_p4d_health -p 1666 --p4diskcheck -u operator -t /etc/nagios/.p4tickets
Many of the Perforce tests require the user to be logged in. The plugin allows you to specify a password with the '-P' flag, but it's better practice to use a long-lived ticket owned by the Nagios user on the monitored system. For example, to create a ticket under '/etc/nagios/.p4tickets' for a Helix P4D user called 'operator':
export P4TICKETS=/etc/nagios/.p4tickets
touch $P4TICKETS
chown nagios:nagios $P4TICKETS
chmod 600 $P4TICKETS
p4 -p 1666 -u operator login
The Nagios command then may look similar to:
check_helix_p4d_health -p 1666 --p4diskcheck -u operator -T /etc/nagios/.p4tickets
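Because a ticket can expire, it may be worth confirming that it is still valid before relying on it. The sketch below wraps 'p4 login -s', which reports the status of the current ticket. The function name `ticket_is_valid` and the `P4BIN` override are hypothetical, and the sketch assumes that 'p4 login -s' exits non-zero when no valid ticket is found.

```shell
#!/bin/bash
# ticket_is_valid is a hypothetical helper, not part of the shipped script.
# P4BIN is an assumed override so a different p4 binary can be used.
# Succeeds (returns 0) when 'p4 login -s' succeeds for the given user.
ticket_is_valid() {
    local port="$1" user="$2" tickets="$3"
    P4TICKETS="$tickets" "${P4BIN:-p4}" -p "$port" -u "$user" login -s >/dev/null 2>&1
}

# Example use (commented out so that loading this file runs nothing):
# if ! ticket_is_valid 1666 operator /etc/nagios/.p4tickets; then
#     echo "CRITICAL: no valid ticket for user 'operator'"
#     exit 2
# fi
```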
It's also good practice to make this user an 'operator' type user, which limits the privileges it has. If you have any queries about Helix P4D users and security, discuss them with your security team or 'support@perforce.com'.
The script can be run using SSH or NRPE. In the example below I use the NRPE plugin. Note that this requires the NRPE service to accept arguments, which can be a security risk. If you have any doubts, use SSH or hard-code the parameters on the monitored server.
Below are the command definitions for the NRPE service and scripts on the Nagios server.
define command{
command_name check_nrpe
command_line /usr/lib/nagios/plugins/check_nrpe -H $HOSTADDRESS$ -c $ARG1$ -a $ARG2$
}
define command{
command_name check_helix_p4d_health
command_line $USER1$/check_helix_p4d_health $ARG1$
}
Below is a service definition that runs the script 'check_helix_p4d_health' using 'check_nrpe' and provides the parameters '-p localhost:1666 --licensecheck'. This can be used to check the Helix P4D license status on port 1666 on server 'master-helix-server'.
define service {
host_name master-helix-server
service_description Helix P4D 1666 License Check
check_command check_nrpe!check_helix_p4d_health!'-p localhost:1666 --licensecheck'
max_check_attempts 2
check_interval 2
retry_interval 2
check_period 24x7
check_freshness 1
contact_groups admins
notification_interval 2
notification_period 24x7
notifications_enabled 1
register 1
}
On the monitored server the command is configured within NRPE as:
command[check_helix_p4d_health]=/usr/lib/nagios/plugins/check_helix_p4d_health $ARG1$
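Passing '$ARG1$' through NRPE only works when the NRPE daemon was built with command-argument support and has 'dont_blame_nrpe=1' set in its configuration. The sketch below checks for that setting. The function name is hypothetical, and the default path '/etc/nagios/nrpe.cfg' is an assumption that varies between distributions.

```shell
#!/bin/bash
# nrpe_args_enabled is a hypothetical helper for illustration.
# Succeeds when the given NRPE config file enables command arguments.
# The default config path is an assumption; pass your own as the first argument.
nrpe_args_enabled() {
    local cfg="${1:-/etc/nagios/nrpe.cfg}"
    grep -Eq '^[[:space:]]*dont_blame_nrpe[[:space:]]*=[[:space:]]*1[[:space:]]*$' "$cfg"
}
```

For example, `nrpe_args_enabled /etc/nagios/nrpe.cfg || echo "NRPE will not pass arguments"` gives a quick hint when a check returns no data.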
| # | Change | User | Description |
|---|---|---|---|
| #7 | 18844 | Karl Wirth | Minor bug fix and more doc updates: - Test 27 (Replication Fail - journal missing) intermittently failing due to timing issues with replication. Change test to use 'p4d -cset' instead of 'p4 configure set'. - Test harness section added to README.md. |
| #6 | 18842 | Karl Wirth | Updates to documentation and minor bug fixes. |
| #5 | 18835 | Karl Wirth | Further tidying up and addition of test suite. - refactored p4 commands - extra error trapping - fix numerous bugs - added perl based test suite and reference checkpoint - created new banner - updated README to match current params |
| #4 | 18801 | Karl Wirth | Refactoring, tidying up and results of further testing: - Improvements to replication checking (uses p4 pull -lj) - Long opts for all test related arguments - Security enhancements (P4TRUST and P4TICKETS files) - Variable standardisation - Tabs replaced with spaces - Make Nagios happy when no problems found - Extra Perforce user and P4 command feedback checking - New flags: --message to provide custom messages, --remote to allow script to be run remotely (from Nagios server), --notip to hide tips if they get too annoying. TBD - Check for replication slow down. Documentation updated to match code. |
| #3 | 18711 | Karl Wirth | Adding an update to the alpha version that: - checks disk space. - checks p4 monitor output. - checks replication status. - checks the P4 binary exists and is executable. - standardises on parameter names. Documentation updated with latest changes. To Do: Tidy up, adherence to coding standards and further testing. |
| #2 | 18302 | Karl Wirth | Fixing formatting problems with the README.md |
| #1 | 18294 | Karl Wirth | Alpha version of an example Nagios plugin for the Helix P4D server and related documentation. The intention is to complete this by adding additional checks and error handling then add scripts for Git Fusion, Swarm and GitSwarm. |