cvs2p4: A toolset for importing CVS into Perforce

			    Richard Geiger
			  rmg@perfortify.com
			     July 24, 2006


==== INTRODUCTION    

This small, free set of tools provides a means for importing CVS
modules into Perforce.

It was originally developed for use at Network Appliance in the spring
of 1997, to convert our product source code revision history from CVS
into Perforce.

At the time, as an afterthought, I put together a public distribution,
hoping that the work might benefit others.

Since that time there has been a steady flow of users, and an
unsteady flow of improvements aimed at ease of use, performance,
accuracy, capacity, and flexibility.

I have since used the tool to perform two other "real" migrations at
other Perforce customers, and am gearing up for a third, which will be
the largest and most intricate I have yet attempted.

I have also tried to support anybody interested in using cvs2p4, and
to provide prompt bug fixes when bugs have been reported.

cvs2p4 was inspired by, and is patterned at a high level after, the
PVCS to Perforce converter available on the Perforce web site. A
conversion consists of the following phases, each of which is
performed by a separate perl script:

  - bin/genmetadata

      Scans the CVS repository, parsing every RCS archive file to
      generate a single metadata file which holds all of the
      information needed by the subsequent phases. Several
      other files produced by this phase provide further information,
      and are used by later steps to convert CVS release tags into
      Perforce labels.

  - bin/genchanges

      Scans the metadata file produced by the previous phase, to
      identify groups of RCS revisions that comprise Perforce atomic
      changes, and writes a file describing them for use by the next
      phase.

  - bin/dochanges

      Based on the data produced by the previous phases, generates
      Perforce metadata in Perforce journal format. This metadata
      refers directly to revisions in the original RCS archives in
      CVS.

      *** No new ,v files are generated by the conversion! ***
      Rather, you run the Perforce server against a copy of (or
      through a link to) the original CVS repository.

  - bin/dolabels

      Using the data created by previous phases, creates the Perforce
      metadata required to represent CVS tags as Perforce labels.

Essentially, cvs2p4 tries to make the resultant Perforce depot look (as
much as possible) as if the work in CVS had been going on in
Perforce.

In particular, it attempts to model changes corresponding to branch
creations as if they had been done with:

  p4 integrate //depot/branchA/... //depot/branchB/...

This is in contrast to the rcstoperf.sh script, which scatters the
"integrates" corresponding to the creation of files on new branches
into many changes (basically, according to when the file was actually
first changed in the new branch).

cvs2p4 also allows you to import only selected branches, and/or to map
some branch other than the CVS trunk to become the new "main"
branch in Perforce. See the notes in the template config file
("test/config") for more information on these features.

Note: A CVS tagged revision will make it into a Perforce label ONLY
when the revision is in fact present in the converted depot, subject
to the branches selected for import. (See the notes for the
"$WANTLINES" variable in the config file).


==== MANIFEST

*** Note: You should unpack the archive on the OS you intend to run
the conversion under. I.e., do not expect to be able to unpack the
archive on a Windows machine, and have it run properly on a *nix host.

After unpacking the distribution archive, use the MANIFEST script to
verify that you have all of the pieces.  The output should look
something like this:

    $ MANIFEST
    MANIFEST
    README
    NEWS
    config.tmpl
    bin/cvs2p4
    bin/genmetadata
    bin/genchanges
    bin/dochanges
    bin/dolabels
    bin/revmap
    bin/srcdiff
    bin/cvs2p4
    lib/util.pl
    src/rcs-5.7/src/rlog.c.patch
    test/file,v
    test/phone.gif,v
    test/dollar$file,v
    test/space file,v
    test/pound#file,v
    test/percent_%file,v
    test/at@file,v
    test/star*file,v
    test/datefile_readd,v
    test/Attic/datefile,v
    test/config.test
    test/runtest
    test/norm
    test/metadata.good
    test/lines.good
    test/changes.good
    test/p4_changes_-l.good
    test/p4_describe.good
    test/p4_describe-new.good
    test/p4_filesat.good
    test/p4_labels.good
  
    All ok
    $

==== REQUIREMENTS    

This stuff should work on any Unix host that supports:

  - Perl 5.x, with working dbm support (i.e., dbmopen()/dbmclose()
    work). The scripts assume that perl will be found via $PATH. It
    must be a perl5! Some people have reported problems that seem to
    be related to dbm limitations with some perls when converting very
    large repositories. I like implementations based on Berkeley-DB.
    As supplied, you'll need to have the perl DB_File package installed.

  - Compiler suite capable of building RCS 5.7 (to make a special
    slightly tweaked version of rlog from a patch supplied with this
    package).

  - Perforce server (p4d) release 2002.1 or later.

    cvs2p4 has been used successfully with Perforce releases up to
    2006.1.

    Later Perforce releases may work, but since this script generates
    journal-format metadata directly, it may need to be changed in
    order to work correctly with other Perforce releases. Please see
    the "PERFORCE METADATA DEPENDENCIES" section (below) for further
    details.
    

==== WHAT IT DOES

This converter will import a CVS module (or a group of them at once)
into Perforce, preserving the branching structure seen in the RCS ,v
files in the CVS repository, and translating them into Perforce
branches within the depot. It will only import RCS branches up through
the highest numbered revisions on branches that have branch tags
referring to them; thus, it will not necessarily bring *every*
revision in the CVS module into Perforce, but *will* bring in every
revision leading up to the current revision for every branch it
imports. I think this is what most people will want; if not, hack
away.

Like the "rcstoperf.sh" converter available on the Perforce web site,
it applies heuristics to try and identify multiple changes in CVS that
are highly likely to comprise what would be seen as a single change in
Perforce, and makes them appear as a single Perforce change. (The
heuristics are: checked in by the same user, proximal in time, and
bearing an identical log message).
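
As a rough illustration of that heuristic (this is not the actual
bin/genchanges code), here is a minimal shell/awk sketch. It assumes a
hypothetical, time-sorted input with one revision per line, in the
form "epoch-seconds<TAB>user<TAB>log message<TAB>file and revision".
The function name group_revisions and the 300-second default window
are likewise assumptions made for the example.

```shell
# group_revisions [WINDOW_SECONDS] < revisions.tsv
# Hypothetical input, sorted by time, tab-separated:
#   epoch_seconds  user  log_message  file_and_revision
# Output: group_number<TAB>file_and_revision
group_revisions() {
    window=${1:-300}
    awk -F '\t' -v window="$window" '
    {
        # Revisions are candidates for the same change only when the
        # user and the log message are identical.
        key = $2 SUBSEP $3
        # Start a new group when this user/message pair is new, or when
        # too much time has passed since its previous revision.
        if (!(key in last) || $1 - last[key] > window) {
            group[key] = ++ngroups
        }
        last[key] = $1
        print group[key] "\t" $4
    }'
}
```

Revisions that come out with the same group number would be presented
as a single Perforce change. A real implementation has more to worry
about (for instance, the same file appearing twice in one group).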

It deals correctly with files that are dead on the CVS trunk (i.e.,
where the RCS ,v files are in the "Attic/").

The converter attempts to leave converted files in Perforce with a
sensible Perforce file type (See `p4 help filetypes` for a description
of file types in Perforce) after the conversion. However, due to
limitations in RCS's notion of "file type" (the -k options,
controlling keyword expansion), cvs2p4 must currently decide to import
all "text" files as Perforce type "text" (text with no keyword
expansion) or "ktext" (text with keyword expansion). This is
controlled by the "$KTEXT" configuration option, which is on by
default.

Also note that binary files will be converted to Perforce type
"binary+D"; the (unusual) "+D" is there because the converter works by
using the existing RCS archive files directly; normally in Perforce,
filetype "binary" implies storage of complete revisions, rather than
as RCS archives. Rest assured that "binary+D" is correct.

The "UI" for the converter is not very slick, but for most people it's
a one-time kind of tool anyway. Feel free to improve it if you are so
inclined.

Please understand that this tool is *not* officially supported by
Perforce. It is supplied in hopes that somebody will find it useful
(Or perhaps only entertaining :-).


==== ***** Caveat!: *****

As of release 3.0, cvs2p4 now employs a modified rlog (see below) in
order to gain better performance. However, in the current implementation,
if any CVS log messages contain lines satisfying either:

  $_ eq "=============================================================================\n"
or
  $_ =~ /^revision\s+([^\s]+)\s*next\s*([^\s]+)?$/

the conversion will fail in unpredictable ways.

This limitation will be lifted in a future release.
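
Until then, it may be worth screening log messages for such lines
before converting. The sketch below is one way to do that, assuming
you have already extracted the raw log message text by some means of
your own (the function name has_problem_lines is made up for this
example). Do not feed it plain rlog output, since rlog's own
separator and "revision" lines would match.

```shell
# Separator line copied from the caveat above.
RLOG_SEPARATOR='============================================================================='

# has_problem_lines: reads log message text on stdin and exits 0 if
# any line matches either of the two conditions in the caveat.
has_problem_lines() {
    grep -Eq \
        -e "^${RLOG_SEPARATOR}\$" \
        -e '^revision[[:space:]]+[^[:space:]]+[[:space:]]*next[[:space:]]*([^[:space:]]+)?$'
}
```

For example, "has_problem_lines < some-log-message.txt && echo
trouble" would flag a message that needs attention.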


==== src/rcs-5.7/src/rlog.c.patch

This release of cvs2p4 relies on a patched version of the RCS "rlog"
command.

To use this release, you'll need to

  - Get RCS 5.7 sources (from ftp://ftp.cs.purdue.edu/pub/RCS/);

  - Unpack them, and apply the patch in src/rcs-5.7/src/rlog.c.patch
    to the rlog.c supplied in the RCS distribution;

  - Build them to produce an "rlog" command;

  - Copy the built rlog binary into the .../cvs2p4/bin directory,
    where "bin/genmetadata" will look for it.
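
The steps above can be sketched as a small shell function. This is
only an outline under some assumptions: the function name and its two
arguments are invented for the example, and it assumes the usual
"./configure && make" build for RCS 5.7 works on your platform (check
the installation notes in the RCS distribution if it does not).

```shell
# build_patched_rlog RCS_SRC_DIR CVS2P4_DIR
#   RCS_SRC_DIR: top of an unpacked RCS 5.7 source tree
#   CVS2P4_DIR:  top of the unpacked cvs2p4 distribution
build_patched_rlog() {
    if [ $# -ne 2 ]; then
        echo "usage: build_patched_rlog RCS_SRC_DIR CVS2P4_DIR" >&2
        return 2
    fi
    rcs_src=$1
    cvs2p4_dir=$2

    # Apply the supplied patch to the stock rlog.c.
    patch "$rcs_src/src/rlog.c" \
        < "$cvs2p4_dir/src/rcs-5.7/src/rlog.c.patch" || return 1

    # Build RCS, which produces src/rlog among other commands.
    ( cd "$rcs_src" && ./configure && make ) || return 1

    # Put the patched rlog where bin/genmetadata looks for it.
    cp "$rcs_src/src/rlog" "$cvs2p4_dir/bin/rlog"
}
```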

I hope to provide pre-built binaries for at least a few popular
platforms in upcoming releases.


==== TESTING

I have included a *very* rudimentary automated test "suite", in the
test/ directory. You can use this to verify that it seems to work in
your environment.

To run it:

  1. Edit test/config, and change the lines

       # p4 command location (If other than "/usr/local/bin/p4")
       #
       $P4             = "/usr/local/bin/p4";
       
       # p4 command location (If other than "/usr/local/bin/p4d")
       #
       $P4D            = "/usr/local/bin/p4d";
       
     to reflect the actual location of your "p4" and "p4d" commands,
     and the server port that you want to be used during the conversion.
     (It must be localhost: with an unused port number; this is used only
     while running the conversion. You can, of course, run your production
     server with the result of the conversion using any port you desire.)
     
     Also, verify that the port number in $P4PORT is currently unused
     on the host where you will run the conversion:

       # Perforce server to use during the conversion. Must be
       # "localhost:" and some unused port number. THIS SHOULD NEVER
       # BE POINTED AT A PRODUCTION PERFORCE SERVER INSTANCE!
       #
       $P4PORT = "localhost:1680";
       
  2. Run the tests with

       test/runtest

     This should run all of the conversion scripts on a test CVS
     module, and then verify a few things by querying the Perforce
     server after the conversion is complete.

     If everything goes well, the end of the output should be

       runtest: ok

In these tests, the converted CVS "module" consists of a very few
files, but it does have a carefully constructed branching structure,
intended to verify that the converter does the right stuff with
respect to branching.


==== USAGE

Once you have got the test running properly, you can turn your
attention to your conversion.

1. Make a directory to hold all the working files for the conversion,
   and create a config file, starting with test/config as a template:

     $ mkdir convdir; cp test/config convdir

   In general, all of the configuration settings and options for a given
   conversion are specified in the config file.

   Edit the convdir/config file to reflect your locale and
   intent. (See the comments in the config file for descriptions of
   the settings and options).

2. Run bin/cvs2p4:

   The script takes a single argument: the name of the directory where
   the "config" file resides. (It will create all intermediate, temp,
   and working files under this directory, which we will refer to as
   the "conversion directory".)

-OR-:

   bin/cvs2p4 executes each of the four stages of a full conversion,
   in turn.  If any stage fails, the conversion will terminate without
   attempting to run remaining stages. If you desire, you can run each
   of the four stages yourself, (i.e., without using bin/cvs2p4). The
   commands used to run them are shown below:

   2a. Run bin/genmetadata:

       As for each of the four phases, the script takes a single argument
       - the name of the directory where the "config" file resides. (It
       will create all intermediate, temp, and working files under this
       directory, which we will refer to as the "conversion directory".)

         $ bin/genmetadata convdir
         genmetadata: rm -rf convdir/logmsgs.dir convdir/logmsgs.pag ...
         .
         . (filenames of each file in the CVS module, as they are scanned)
         .
         ===== Lines referenced:
         chupa
         curly
         ha         <- a list of branch tags encountered in the scan;
         larry         also saved to convdir/lines.
         shemp
         xxx
    
       This reads convdir/config to get its marching orders, then scans the
       CVS module(s) for all ,v and Attic/,v files, creating:

         convdir/metadata      <- the extracted RCS/CVS metadata
         convdir/logmsgs.pag   <- An ndbm database
         convdir/logmsgs.dir   <-   of the log messages
         convdir/lines         <- A list of "codelines" (== branch tags)

       At this point, you may want to look at the list of branch tags
       encountered, (which was written to convdir/lines), edit the config
       file, setting $WANTLINES to 1, and filling in the "<<LINES" here
       file with the names of the branches you want to import to Perforce;
       then, rerun bin/genmetadata to rescan and pick up only those
       revisions you care about.

       Note: the names used in $WANTLINES are the CVS branch
       names. Use the value of $TRUNKLINE in $WANTLINES to specify
       that you want the "main" branch.

       
   2b. Run bin/genchanges:

       Again, this takes a single argument - the name of the "conversion
       directory":

         rmg $ bin/genchanges convdir
         16354                    <- This counter spins as it's running.
                                     It will count up to the number of
                                     lines in the metadata file.

       This reads convdir/config and convdir/metadata, and writes
       convdir/changes.


   2c. Run bin/dochanges:

       This, too, takes a single argument - the name of the "conversion
       directory":

       (You might want to save a copy of the output with "tee".
       The output will look something like:)

         rmg $ bin/dochanges convdir 2>&1 | tee OUT
         dochanges> /bin/rm -f convdir/revmap.db ...
         dochanges> /bin/rm -f convdir/depotmap.db ...
         dochanges> /bin/rm -rf p4root && mkdir -p p4root
         dochanges> /bin/mkdir -p /home/rmg/web/richard_geiger/...
         dochanges> /bin/ln -s /home/rmg/web/richard_geiger/...
         ========== change group 1
         ========== change group 2
         ========== change group 3
          .
          .
          .
         ========== change group 17
         ========== change group 18
         dochanges> cd /home/rmg/web/richard_geiger/...
         Recovering from dbmeta...
         dochanges> cd /home/rmg/web/richard_geiger/...
         Dumping to checkpoint...
     
       When this command finishes, your CVS module has been imported to
       Perforce, in the Perforce server database identified by the $P4ROOT
       configuration variable. The state of the resultant database is
       saved in a checkpoint file named $P4ROOT/checkpoint.

       *** NOTE ***: cvs2p4 does not create new RCS-format archives (,v
       files) under $P4ROOT; rather, it uses the existing RCS archives in
       the CVS tree directly. By default, it does this by making a symbolic
       link named $P4ROOT/depot/IMPORT pointing to the $CVS_MODULE
       tree. If you'd rather have dochanges copy in the CVS module for
       you, set COPYIMPORT in the config file.


   2d. If you want to import CVS tags as Perforce labels, there is an
       additional phase (once again, the single argument is the name of
       the conversion directory where the config file lives):

         $ bin/dolabels convdir
         make label: testlabel
         dolabels> cd p4root && /usr/local/bin/p4d -jr dblbls
         /home/rmg/web/richard_geiger/guest/richard_geiger/utils/cvs2p4_meta/p4root
         Recovering from dblbls...
         dolabels> cd p4root && rm -f checkpoint; /usr/local/bin/p4d -jd checkpoint
         /home/rmg/web/richard_geiger/guest/richard_geiger/utils/cvs2p4_meta/p4root
         Dumping to checkpoint...
    
       This step adds the symbolic tag information from the CVS archive
       (for "plain", non-branch tags) to the Perforce database identified
       by the $P4ROOT configuration variable.  The state of the resultant
       database is saved in a checkpoint file named $P4ROOT/checkpoint.
 
       
3. If you want the RCS revision-to-Perforce change map, run:

     $ bin/revmap convdir

   Or, for the reverse mapping:

     $ bin/revmap -map rrevmap convdir
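
If you run the phases by hand (steps 2a through 2d), the
stop-on-first-failure behavior described above is easy to reproduce.
The following is a minimal sketch, not a copy of what bin/cvs2p4 does
internally; the function name run_phases is made up, and it always
runs bin/dolabels, which you may not want.

```shell
# run_phases BIN_DIR CONVDIR
#   BIN_DIR: the cvs2p4 bin directory
#   CONVDIR: the conversion directory holding the config file
run_phases() {
    bin_dir=$1
    convdir=$2
    for phase in genmetadata genchanges dochanges dolabels; do
        echo "=== $phase"
        # Each phase takes the conversion directory as its only
        # argument; stop as soon as one of them fails.
        "$bin_dir/$phase" "$convdir" || {
            echo "run_phases: $phase failed; stopping" >&2
            return 1
        }
    done
}
```

For example: run_phases bin convdir 2>&1 | tee OUT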


==== PRESCAN MODE

During the course of a conversion, the bin/genmetadata phase can
detect and report unusual conditions which may indicate "corruption"
in the CVS repository. Typically, you'll want to deal with these prior
to performing the live conversion.

In order to make a quicker way of finding these conditions,
bin/genmetadata now supports a "-prescan" flag. When run this way:

  $ bin/genmetadata -prescan convdir

genmetadata will (as usual) parse each RCS archive file in the CVS
repository, and report any conflicts it finds, but it will NOT bother
to parse some additional information needed for an actual conversion,
nor produce all of the output metadata needed by an actual conversion.

This allows you to perform one or more "-prescan"s on your CVS
repository to more quickly resolve any such problems.

bin/cvs2p4 can also be used to run a "prescan", e.g., "bin/cvs2p4 -prescan"
will run bin/genmetadata (only) with the -prescan option. None of the
remaining conversion stages are performed.


==== INCREMENTAL CONVERSIONS

At this time, the recommended procedure for doing "incremental"
conversions - i.e., combining multiple CVS repositories, or doing
subsets of the CVS modules in a repository one at a time - is to do
each as a new conversion (starting with change 1), and then to combine
them as desired using the "perfmerge2.pl" tool.

This is also a useful pattern when you want to combine some new chunk
of CVS (or RCS) repository into an existing Perforce depot.

perfmerge2.pl can be obtained by sending email to support@perforce.com.

In order for this to work, you'll need to ensure that there is no
overlap in the namespaces of files, between your existing Perforce
repository and the newly converted files. See the notes at the top of
the perfmerge2.pl script.

perfmerge2.pl can operate in different modes, with respect to the
ordering of change numbers in the merged repositories. You can elect
either

  - to have it renumber all of the merged changesets, so that the
    time-ordered property of all change numbers (both existing and
    newly-merged) is preserved; or,

  - to leave your existing changes numbered as they are, with
    the newly imported changes numbered from the next available change
    number, even though some of them may have taken place (in CVS)
    interleaved in time with your existing Perforce changes.

Note that perfmerge2.pl only merges server metadata; you'll also need
to manually copy the tree of RCS archive files from your newly
converted $P4ROOT into your existing server's $P4ROOT.


==== PERFORCE METADATA DEPENDENCIES

Since cvs2p4 works by directly generating Perforce metadata in the
Perforce checkpoint/journal format, it is dependent on "knowing" the
right definitions for certain tables within the Perforce database.

As of this writing, cvs2p4 writes metadata for the following Perforce
tables, at the version number shown for each table:

  table name    ver
  ------------  ---
  db.change     0
  db.desc       0
  db.rev        3
  db.revcx      0
  db.integed    0
  db.depot      0
  db.domain     2
  db.counters   0

The tests provided with this package are known to work correctly using
any Perforce server from version 2001.1 to 2005.2. It should work
correctly with any new p4d version that can still read (and upgrade
from) 2002.2 metadata.


==== IMPORTING TO MULTIPLE DEPOTS

If you wish to divide the body of CVS being imported into multiple
depots in Perforce, you can establish mappings in the config file by
adding lines of the form:

    $Depotmap{"<topdir>"} = "<depot>";

      <topdir>... is the name of a directory in $CVSROOT;
      <depot>.... is the name of the Perforce depot to create.

For example, if $CVSROOT points to "/cvsroot", and you want the
files from $CVSROOT/somedir/ to be placed into the Perforce depot
"//somedepot", you would add

    $Depotmap{"somedir"} = "//somedepot";

Note that the slashes _must_ be present in the value as shown above!

Perforce Depot Specifications are created for each depot used.


==== IMPORTING CVS TAGS AS PERFORCE LABELS

Some very basic architectural differences between CVS and Perforce
create challenges when attempting to represent CVS tags as Perforce
labels. This is a side effect of a difference in the way CVS and
Perforce handle branching and the way CVS tags work.

The essential problem is as follows:

In CVS, when a file has been branched, but not yet changed in the
child branch, there is no way, based only on information in that
individual CVS archive, to determine _which_ branch(es) a label was
intended to apply to. This is due to the fact that, in CVS, creating a
new branch does not actually create a new branch revision within the
RCS archive; it merely marks the branch point with a symbolic
name. This is not a problem for CVS, since the act of checking out a
CVS tree using a CVS tag will always retrieve a single CVS revision
for every file bearing the tag. In Perforce, however, Perforce's "lazy
copy" mechanism creates (if only virtually!)  a distinct #1 revision
in the child branch at branching time. Thus, for a given file, the #1
revisions for multiple Perforce branches may share the same parent
branch and revision. In such cases, there's no way for the "dolabels"
command to know specifically _which_ child branch a given CVS tag was
meant to apply to.

Previously, cvs2p4 "punted" the problem, by including the #1 revisions
in all of the child branches that "share" a tagged revision (i.e.,
those that share the common branch point in CVS) in the imported
Perforce label. Unfortunately, for repositories with large numbers of
tags, this approach suffered badly from poor performance and
increasingly large numbers of "noise" revisions being created in the
resulting Perforce labels.

More recently, provisions have been added to bin/dolabels for a
user-supplied mapping function, in order to infuse the converter with
outside knowledge about which tags go with which branches. (And a way
of selecting which tags to actually import as labels!)

Finally, as of cvs2p4 3.0, a heuristic based on looking at the state
of a tag across *all* files bearing the tag has been implemented, and
is enabled by default. 

  *** As of this release, by default, only those labels for which the
  *** label -> branch mapping has been established by the heuristic will be
  *** represented as labels in the resultant Perforce depot! This behavior
  *** may be changed, using the "$DISCARD_UNMAPPED_TAGS" configuration
  *** variable.

Since the heuristic may not work in all circumstances or for all
labels, a user-supplied mapping function, to be applied when the
heuristic fails, can be supplied. Presently, it is coded as the
subroutine "branch_for_tag()" in bin/dolabels, and you should edit the
script if you need to implement such a function.


==== CVS "import"-ED FILES

CVS files which were created by running "cvs import" once or more to
import "vendor branches" have some interesting properties.

In particular, when a cvs import-ed file has not yet had any changes
committed to the trunk (i.e., any local changes to what was originally
imported from the vendor branch), a "cvs checkout" or "cvs update"
with no other arguments will use the latest vendor branch revision
present for the file.

For example, consider an RCS file with the following revisions:

  1.3
  1.2
  1.1
  1.1.1.1
  1.1.1.2
  1.1.1.3
  1.1.1.4

The 1.1, 1.1.1.1, 1.1.1.2, and 1.1.1.3 revisions were created by three
"cvs import" commands before revision 1.2 was committed onto the trunk.
(The 1.1.1.4 revision was created -after- 1.2).

In Perforce, the sequence of revisions on the "main" branch would
be:

  1.1 		main/<file>#1
  1.1.1.2 	main/<file>#2
  1.1.1.3 	main/<file>#3
  1.2 		main/<file>#4
  1.3 		main/<file>#5
  
In such cases, revision tags selecting revisions from the 1.1.1 branch
"spliced" into main (1.1.1.2 and 1.1.1.3, in our example) are placed
into the Perforce labels in *both* the main and import branches.
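
The splicing rule in the example can be sketched as follows. This is
an illustration only, not the converter's code: the function name
main_sequence and its input format (revision numbers on stdin, one
per line, in the order they were created) are assumptions. It skips
1.1.1.1, as the table above does, presumably because the first "cvs
import" creates it together with 1.1 with the same content.

```shell
# main_sequence: read revision numbers on stdin, one per line, in the
# order they were created; print the revisions that land on "main".
main_sequence() {
    awk '
    {
        n = split($1, part, ".")
        if (n == 2) {
            # Trunk revision: always on main. Anything after 1.1 means
            # the trunk now has local changes.
            if ($1 != "1.1") trunk_changed = 1
            printf "%s\tmain/<file>#%d\n", $1, ++seq
        } else if (n == 4 && index($1, "1.1.1.") == 1 &&
                   part[4] + 0 > 1 && !trunk_changed) {
            # Vendor branch revision imported before any local trunk
            # change: spliced into main.
            printf "%s\tmain/<file>#%d\n", $1, ++seq
        }
    }'
}
```

Fed the seven revisions of the example in creation order (1.1,
1.1.1.1, 1.1.1.2, 1.1.1.3, 1.2, 1.1.1.4, 1.3), it prints the same
five-line main sequence shown in the table.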


==== HOW TO PACKAGE MODIFICATIONS

  1. In a //guest workspace, make your modifications to files in the
     package. Before submitting:

  2. Edit the NEWS file to document the change(s). (Please follow the
     established format. Of course.)

  3. Do the submit.

  4. Update the checksums in the MANIFEST file:

       $ p4 edit MANIFEST
       <if your mods add files, edit MANIFEST to reflect this>
       $ MANIFEST -gen
       
  5. As root, generate the release tarball:

       # MANIFEST -tar <vers>

     (This creates cvs2p4-<vers>.tar)

  6.  And add it to the default changelist:

       $ p4 add cvs2p4-<vers>.tar

  7.  Finally, update the "cvs2p4-latest" symbolic link:

       $ p4 edit cvs2p4-latest.tar
       $ rm cvs2p4-latest.tar
       $ ln -s cvs2p4-<vers>.tar cvs2p4-latest.tar

  8. And then submit. In this scenario, the change will include
     the following files:

        MANIFEST
        cvs2p4-<vers>.tar 	(added)
        cvs2p4-latest.tar

To complete the act of "publishing" the new release, you must have
Perforce write access to //public/perforce/utils/cvs2p4/...
"Publishing" the new release is simply a matter of integrating your
change(s) into //public/... and submitting.


==== SUPPORT

I try to maintain this tool as a contribution to the community at
large. If you have questions or problems, please feel free to email me
( rmg at perfortify dot com ).

I originally wrote and contributed this tool while working for Network
Appliance in 1997.

I worked on it as a Perforce employee from August 2002 through
December 2003.

I have since also worked at Data Domain, of Palo Alto, California,
where further improvements were made.

Presently, I work at IronPort Systems in San Bruno, California,
and, as you might guess, cvs2p4 is evolving again.

I would like to gratefully acknowledge the support of all of these
employers, who have allowed me to maintain and improve cvs2p4 both for
their ends, as well as for others who might find it useful.

I'd enjoy hearing from anyone who uses this (or tries to!), whether
you have problems and questions, or not. Drop me a line!

Thanks,   

  - Richard Geiger  rmg at perfortify dot com

    revised July 2006, release 3.0