SDP-302

akwan (akwan)
akwan created this job, modified by C. Thomas Tyler
Closed
Parallelized checkpoint processing to reduce duration.

Enable parallel checkpoints, and include test suite coverage for
same.

Excerpt of email from Alan Kwan:
---
 I've framed out a pseudo code implementation of how it could behave
as backup_functions in SDP:

dump_checkpoint_parallel()

- get list of db files (this can be optimized to sort by largest or
  smallest to keep work queues as saturated as possible)
- get p4_var variable set to # of worker threads, else use logic to
  determine a just in time value:
    - figure out cpu core count
    - check system active load value
    - define # of threads = to core count minus active load, minus 1 (if
      result is 0 or less than 1, set to 1 - not parallel)
- define work queue (ls -1 /p4/1/offline_db/)
- insert code to execute against work queue based on (
  http://hackthology.com/a-job-queue-in-bash.html ), and while limit,
  keep working until end - checkpoint files are named
  /p4/1/checkpoints/p4_1.ckp.db.have.number.gz (along with their MD5)
    - rewrite the offline_
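
The thread-count rule and bounded work queue outlined above can be
sketched in shell as follows. This is a minimal illustration, not the
SDP implementation: calc_threads and run_queue are hypothetical names,
the load value is truncated to a whole number, and the queue waits on
the oldest running job rather than on whichever job finishes first.

```shell
# Hypothetical helper: derive a worker count from core count and load.
# Usage: calc_threads <cores> <load>   (load may be fractional, e.g. 2.73)
calc_threads() {
    local cores=$1 load=$2 threads
    # Drop the fractional part of the load ("2.73" -> "2", ".50" -> "").
    load=${load%%.*}
    threads=$(( cores - ${load:-0} - 1 ))
    # Zero or negative means: fall back to one worker (not parallel).
    if [ "$threads" -lt 1 ]; then
        threads=1
    fi
    echo "$threads"
}

# Hypothetical helper: run "<worker> <item>" for each item, keeping at
# most <max_jobs> running at once. Returns non-zero if any job failed.
# Usage: run_queue <max_jobs> <worker> <item>...
run_queue() {
    local max_jobs=$1 worker=$2
    shift 2
    local pids="" running=0 failed=0 item oldest pid
    for item in "$@"; do
        if [ "$running" -ge "$max_jobs" ]; then
            # At the limit: wait for the oldest job before starting another.
            oldest=${pids%% *}
            pids=${pids#* }
            wait "$oldest" || failed=1
            running=$(( running - 1 ))
        fi
        "$worker" "$item" &
        pids="${pids}$! "
        running=$(( running + 1 ))
    done
    # Drain the remaining jobs, checking every exit status.
    for pid in $pids; do
        wait "$pid" || failed=1
    done
    return "$failed"
}
```

On Linux the inputs could come from nproc and the first field of
/proc/loadavg (an assumption; other platforms differ), for example:
calc_threads "$(nproc)" "$(cut -d' ' -f1 /proc/loadavg)"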

restore parallel would implement something similar - get the list of
compressed checkpoint files, throw in a work queue, and jr -z on each
one into the same offline_db folder until they're all done.

augment remove_old_checkpoints_and_journals to incorporate these sort
of checkpoints

Excerpt of email from Robert Cowham:
---

An alternative step along the way is also to use pigz or similar for
parallel compression, which is where a lot of time is spent.
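
As a rough sketch of the pigz suggestion: the hypothetical helper below
(compress_ckp is not an SDP function) compresses one file with pigz
when it is installed and falls back to gzip otherwise. Both produce
gzip-format output and replace <file> with <file>.gz. The default of 4
threads is an arbitrary assumption.

```shell
# Hypothetical helper: gzip-compress one checkpoint file, using
# multiple threads when pigz is available.
# Usage: compress_ckp <file> [threads]
compress_ckp() {
    local file=$1 threads=${2:-4}
    if command -v pigz >/dev/null 2>&1; then
        # pigz -p sets the number of compression threads.
        pigz -p "$threads" "$file"
    else
        # Fallback: single-threaded gzip, same .gz output format.
        gzip "$file"
    fi
}
```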

Typically the focus should be on the 3-7 or so files which comprise
the vast majority of the data (db.have/db.rev and friends/db.integed/
db.label depending)

I would also be tempted to tar the result into one file after
zipping (and untar it before unzipping) for ease of management.
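
A minimal sketch of the single-file idea, assuming the per-table files
are already compressed and sit together in one directory. The helper
names bundle_ckp and unbundle_ckp are hypothetical. The tar is written
without further compression, since its members are already gzipped.

```shell
# Hypothetical helper: pack a directory of already-compressed
# checkpoint files into a single tar file. Write the tar file
# outside <dir> so that it does not try to include itself.
# Usage: bundle_ckp <dir> <tarfile>
bundle_ckp() {
    local dir=$1 out=$2
    tar -cf "$out" -C "$dir" .
}

# Hypothetical helper: unpack the tar file into <dir> so that each
# member can then be decompressed or replayed individually.
# Usage: unbundle_ckp <tarfile> <dir>
unbundle_ckp() {
    local tarfile=$1 dir=$2
    mkdir -p "$dir" && tar -xf "$tarfile" -C "$dir"
}
```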
24769: New script for performing a parallel checkpoint.
Run as follows:
     parallel_ckp.sh <instance> -P <threads>

New script to restore a parallel checkpoint file to the offline database
in case a recovery is needed.
Run as follows:   
    parallel_ckp_restore.sh <instance> -f <parallel_ckp_file.tgz> -P <threads>
25374: New script for performing a parallel checkpoint.
Run as follows:
     parallel_ckp.sh <instance> -P <threads>

New script to restore a parallel checkpoint file to the offline database
in case a recovery is needed.
Run as follows:   
    parallel_ckp_restore.sh <instance> -f <parallel_ckp_file.tgz> -P <threads>
24768: New script for performing a parallel checkpoint.
Run as follows:
     parallel_ckp.sh <instance> -P <threads>

New script to restore a parallel checkpoint file to the offline database
in case a recovery is needed.
Run as follows:   
    parallel_ckp_restore.sh <instance> -f <parallel_ckp_file.tgz> -P <threads>

#review-24769
Status: Closed
Project: perforce-software-sdp
Severity: C
Reported By: akwan
Reported Date:
Modified By: C. Thomas Tyler
Modified Date:
Owned By: tom_tyler
Dev Notes
[2023/04/13 tom_tyler]: This job is now closed.  Parallel checkpoints
are now fully supported. The needed p4d features to ensure reliable
processing have been released, and the SDP now takes advantage of
them.  See notes about DO_PARALLEL_CHECKPOINTS in the Instance Vars
file (e.g. /p4/common/config/p4_1.vars) for more info.

[2021/07/06 tom_tyler]: This job has been suspended.  Turns out some
needed p4d support (a command to get a list of checkpointed tables)
isn't available. Also, there is hope that a future release of p4d will
provide this capability without the need for scripting.

While there are implementations of the parallel checkpoint mechanism
that have been made to work (by checkpointing all tables whether they
need it or not), this is the sort of thing that must never fail.  We
decided this feature, while it would be valuable, is best done as a
p4d feature rather than an SDP feature.  When the needed functionality
is added to p4d, this job will be re-opened.

[2020/08/18 tom_tyler]: Re-opening this job to re-add this feature,
with full test suite coverage.

Older Notes:

This can be done reliably, but the implementation will be complex.  We may want
to add an optional new setting in instance_vars.template, e.g.
PARALLEL_CHECKPOINTS with a default value of 0.

Then either dump_checkpoint() or dump_checkpoint_parallel() would be
called depending on whether that new var is set to 1 or not.  So
by default it would still do single-threaded checkpoints, and
would do parallel checkpoints if explicitly enabled.
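
The opt-in dispatch described in these older notes could look roughly
like the sketch below. It uses the PARALLEL_CHECKPOINTS name proposed
here (not the DO_PARALLEL_CHECKPOINTS setting mentioned in the 2023
note). The two dump functions are stubs standing in for the real ones,
and run_checkpoint is a hypothetical wrapper name.

```shell
# Stubs standing in for the real checkpoint functions.
dump_checkpoint() { echo "single-threaded checkpoint"; }
dump_checkpoint_parallel() { echo "parallel checkpoint"; }

# Hypothetical wrapper: parallel only when explicitly enabled.
# An unset or empty PARALLEL_CHECKPOINTS is treated as 0.
run_checkpoint() {
    if [ "${PARALLEL_CHECKPOINTS:-0}" = "1" ]; then
        dump_checkpoint_parallel
    else
        dump_checkpoint
    fi
}
```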
Component: core-unix
Type: Feature