changesets.pm #13

package VCP::Filter::changesets;

=head1 NAME

VCP::Filter::changesets - Group revs in to changesets

=head1 SYNOPSIS

  ## From the command line:
   vcp <source> changesets: ...options... -- <dest>

  ## In a .vcp file:

    ChangeSets:
       time                     <=60   ## seconds
       user_id                  equal  ## case-sensitive equality
       comment                  equal  ## case-sensitive equality
       branched_rev_branch_id   equal  ## change only one branch at a time

=head1 DESCRIPTION

This filter is automatically loaded when there is no sort filter loaded
(both this and L<VCP::Filter::sort|VCP::Filter::sort> count as sort
filters).

=head2 Sorting by change_id, etc.

When all revs from the source have change numbers, this filter sorts by
change_id, branch_id, and name, regardless of the rules set.  The name
sort is case sensitive, though it should not be for Win32.  This sort by
change_id is necessary for sources that supply change_id because the
order of scanning the revisions is not usually (ever, so far :) in
change set order.

=head2 Aggregating changes

If one or more revisions arrives from the source with an empty
change_id, the rules for this filter establish the conditions that
determine what revisions may be grouped in to each change.

In this case, this filter rewrites all change_id fields so that the
(eventual) destination can use the change_id field to break the
revisions in to changes.  This is sometimes used by non-changeset
oriented destinations to aggregate "changes" as though a user were
performing them and to reduce the number of individual operations the
destination driver must perform (for instance: VCP::Dest::cvs prefers
to not call cvs commit all the time; cvs commit is slow).

If you don't specify any conditions, a set of default conditions are
used.  If you specify any conditions at all, none of the default
conditions are used.

The conditions are specified as a set of rules.  Each rule is a pair of
words: a the name of the criterion (usually a data field of the revision)
and the compatability specification.

There are two compatability specifications: "equal" and "<=N"
(where N is a number; note that no spaces are allowed before the number
unless the spec is quoted somehow).  "equal" states that all revisions
in the same change must have identical values for the indicated field.
So:

    user_id                  equal

states that all revisions in a change must be submitted by the same
user.  The "equal" specification is available for all criterial.

The "<=N" specification is only available for the "time" field.  It
specifices that no gaps larger than N seconds may exist in a change.

The default conditions are:

    time                     <=60   ## seconds
    user_id                  equal  ## case-sensitive equality
    comment                  equal  ## case-sensitive equality
    branched_rev_branch_id   equal  ## change only one branch at a time

The C<time <=60> condition sets a maximum allowable difference between two
revisions; revisions that are more than this number of seconds apart are
considered to be in different changes.

The C<user_id equal> and C<comment equal> conditions assert that two
revisions must be by the same user and have the same comment in order to
be in the same change.

The C<branched_rev_branch_id equal> condition is a special case to handle
repositories like CVS which don't record branch creation times.  This
condition kicks in when a user creates several branches before changing
any files on any of them then all of the branches get created at the
same time.  This condition also kicks in when multiple CVS branches
exist with no changes on them.  In this case, VCP::Source::cvs groups
all of the branch creations after the last "real" edit.

The C<branched_rev_branch_id> condition only applies to revisions
branching from one branch in to another.

For all fields but C<time> and C<mod_time>, you may (for now) only use
the C<equal> condition, which is case sensitive equality.

An implicit condition is that a parent and a child may not both be
altered in the same change.

=head1 NEW ALGORITHM

As revs are received by handle_rev(), they are store on disk.  An
in-memory data structure (for now) is built that notes each revision's
child revisions.

In handle_footer, a clustering algorithm begins.  Starting with the set
of parentless revisions,

=over

=item 1

each revision is placed in a change cluster.  A change cluster is the
set of all revisions in the same change.

=item 2

the most mature change cluster is identified and taken as a change.
"mature" means the cluster containing the oldest revision across all the
change clusters, as ordered by the set of keys (change_id, time, 
revision number, name, comment), where empty fields are ignored.

=item 3

All revs in the cluster are given a new change_id, starting with "1" and
sent downstream.

=item 4

The (possibly empty) set of children of the revisions from the newly
minted change is taken as the new set of revs, and the process loops
back to step 1 unless there are no more clusters.

=back

=begin developer_docs

A few thoughts on clustering
============================

This algorithm is a simplified k-clustering algorithm: it divides the
set of revisions in to clusters and emits each cluster as a changeset by
labelling each cluster with an integer (1..N) change number.  The
labelling is done by setting the change_id for all revs in a cluster to
this change number.  Changes are sent downstream from in order from
(1..N) as they are identified.

We never want a cluster to contain...
    - two revisions for the same filebranch.
        - This means that a cluster should never contain two
          files with $r->name and $r->branch_id both equal.
    - two revisions where one is a descendant of another.

Additionally, by default we never want a cluster to contain...
    - two revisions from different repositories
    - two revisions from different users
    - gaps in time greater than 60 seconds

NOTE: we don't take source_repo_id in to account because we don't
need to handle that case now and it complicates the logic when all revs
do have change_ids.  We'll let the user figure out how to differentiate
that case or add a different filter.  At that time, we should also add
a "repo_id" field to VCP::Rev.

We never want a revision to be in the same cluster (or change) as any of its
ancestors or descendants.  Because the revisions are all linked to their
parents when they arrive, this filter can identify the "root" revisions (those
with no parents) and cluster them first.  It may then identify a single cluster
as a change and remove it from consideration--this becomes change 1, and is
emmited immediately--then gather the child revisions of that change and cluster
those with the existing clusters, then select and emit change 2.

So, instead of accepting an unstructured soup of revisions and needing
to perform a generalized k-clustering algorithm, this algorithm accepts
an multirooted acyclic digraph and incrementally clusters it by starting at
the roots and proceeding in a wavefront through the entire data set.

Furthermore, instead of using a real number as the cost, as most
generalized clustering algorithms do, this algorithm can use a simple,
transitive test function.  So to determine if an unclustered "new"
revision should be added to an existing cluster it suffices to consider
each revision in the cluster and if the test function returns true for
any of the revisions in the cluster, the new revision is a member of the
cluster.

Likewise, if the new revision is a member of more than one cluster,
those clusters may be merged in to a single cluster because the
test is transitive.


The test function
-----------------

This filter uses a boolean function to determine whether two revs are in
the same change or not.  If they are, it returns TRUE, if they are not,
it returns FALSE.

The test function does not need to test for parent/child relationships,
the algorithm will never consider two revisions that have an ancestry
relationship between them.  No child may have more than one parent
(children may have only one previous_id), and children of already
clustered revisions are not compared with the revisions of the parent
cluster.

As far as transitivity goes, there are two types of tests applied in the
comparison function: equality and value range.  Equality is transitive,
of course.  The value range is only used for the time field, and we
specifically want this value to be transitive: if rev A is close enough
in time to rev B, and rev B is close enough in time to rev C, then we
want rev A and rev C to be considered to be part of the same change.
This is so that a set of files submitted over a period of time (as may
happen in systems like VSS or CVS) can still result in a change
relationship.

The test function is compiled in to an anonymous sub and placed in
$self->{CAN_ADD_SUB}.

=end developer_docs

=cut

$VERSION = 1 ;

use strict ;
use VCP::Logger qw( lg pr BUG );
use VCP::Debug qw( :debug );
use VCP::Utils qw( empty );
use VCP::Filter;
use VCP::Rev;
use base qw( VCP::Filter );

use fields (
   'CAN_ADD_SUB',     ## Return TRUE if a rev can be added to a sub.
   'ADD_REV_SUB',     ## Adds a rev to a cluster
   'COMMENT_TIMES',   ## The average time of all instances of a comment
   'HAS_CHANGE_IDS',  ## set if all revs have nonempty change_ids
   'CLUSTERS',        ## an ARRAY of clusters.
   'CHILDREN',        ## A HASH keyed on a rev's id of ARRAYs of the
                      ## rev's children's ids.
   'REV_COUNT',       ## How many revs we received
   'ALL_HAVE_CHANGE_IDS', ## Set if all incoming revs have change_ids
   'REVS_BY_CHANGE_ID',   ## HASH of change_id => \@rev_ids
);

sub _eq {
    defined $_[0] 
       ? defined $_[1]
           ? $_[0] eq $_[1]
           : 0
       : ! defined $_[1];
}


sub _compile_can_add_sub {
   my VCP::Filter::changesets $self = shift;
   my ( $rules ) = @_;

   my @cmps = join "\n      && ", map {
      my ( $field, $cond ) = map lc, @$_;
      my $cmp;

      if (
         ( $field eq "time" || $field eq "mod_time" )
         && $cond =~ /\A<=\s*(\d+)\z/
      ) {
         my $limit = $1;
         $cmp = <<CMP;
(
         empty( \$r->time ) 
         || (
               ( empty( \$cluster->{min_time} )
                  || \$r->time >= \$cluster->{min_time} - $limit
               )
            && ( empty( \$cluster->{max_time} )
               || \$r->time <= \$cluster->{max_time} + $limit
            )
         )
         || ( debugging
            and debug "    times: ",
               \$cluster->{min_time}, " - $limit <= ",
               \$r->time, " <= ",
               \$cluster->{max_time}, " + $limit"
            and 0
         )
      )
CMP
      }
      elsif ( $field eq "branched_rev_branch_id"
         && $cond eq "equal"
      ) {
         $cmp = <<'CMP';
(
         (
            $r->is_placeholder_rev 
            || do {
                  my $filebranch_id = join "\001", $r->name, _d $r->branch_id;
                  !exists $cluster->{FileBranches}->{$filebranch_id}
               }
         )
         || ( debugging
            and debug "    !is_placeholder_rev or filebranch_id seen"
            and 0
         )
      )
CMP
         chomp $cmp;
      }
      elsif ( $cond eq "equal" ) {
         $cmp = <<CMP;
(
         _eq( \$cluster->{$field}, \$r->$field )
         || (
            debugging
            and debug "    $field differs from '", \$cluster->{$field}, "'"
            and 0
         )
      )
CMP
      }
      else {
         die "unknown ChangeSets condition in rule: $field $cond\n";
      }

      $cmp;
   } @$rules;

   my @code = ( <<'PREAMBLE', "   ", @cmps, <<'POSTAMBLE' );
#line 1 VCP::Filter::changesets::can_add()
sub {
   my $cluster = shift;
   my ( $r ) = @_;
   ## All the empty()s are for VSS (at least) which has very little
   ## metadata available for delete actions, like user_id and time.
   debug $r->as_string if debugging;

   my $can_add = !@{$cluster->{Revs}} || (
PREAMBLE
) ;
   debug "    can add" if debugging && $can_add;
   return $can_add;
}
POSTAMBLE

   debug @code if debugging;

   unless ( $self->{CAN_ADD_SUB} = eval join "", @code ) {
      my $x =$@;
      lg "$x:\n", @code;
      die $x;
   }
}


sub _compile_add_rev_sub {
   my VCP::Filter::changesets $self = shift;
   my ( $rules ) = @_;

   my @code;
   my @first_rev_code;
   for ( @$rules ) {
      my ( $field, $cond ) = map lc, @$_;

      if ( $cond =~ /\A<=\s*(\d+)\z/ ) {
         push @code, <<CODE;
   {
      my \$v = \$r->$field;
      if ( !empty \$v ) {
          \$cluster->{min_$field} = \$v
             if empty( \$cluster->{min_$field} )
                 || \$v < \$cluster->{min_$field};
          \$cluster->{max_$field} = \$v
             if empty( \$cluster->{max_$field} )
                 || \$v > \$cluster->{max_$field};
      }
   }
CODE
      }
      elsif ( $field eq "branched_rev_branch_id" ) {
         push @code, <<CODE;
   {
      my \$filebranch_id = join "\\001", \$r->name, _d \$r->branch_id;
      die "Can't add two revs for same file: \$filebranch_id"
         if exists \$cluster->{FileBranches}->{\$filebranch_id};
      ++\$cluster->{FileBranches}->{\$filebranch_id};
   }
CODE
      }
      elsif ( $cond eq "equal" ) {
         push @first_rev_code, <<CODE;
      \$cluster->{$field} = \$r->$field;
CODE
      }
      else {
         die "unknown ChangeSets condition in rule: $field $cond\n";
      }
   }

   push @code, ( <<'PRE', @first_rev_code, <<'POST' ) if @first_rev_code;
   if ( !@{$cluster->{Revs}} ) {
PRE
   }
POST

   @code = ( <<'PREAMBLE', @code, <<'POSTAMBLE' );
#line 1 VCP::Filter::changesets::add_rev()
sub {
   my $cluster = shift;
   my ( $r ) = @_;

   debug "adding ", $r->id, " to cluster" if debugging;

PREAMBLE
   push @{$cluster->{Revs}}, $r;
}
POSTAMBLE

   debug @code if debugging;

   unless ( $self->{ADD_REV_SUB} = eval join "", @code ) {
      my $x =$@;
      lg "$x:\n", @code;
      die $x;
   }
}


sub new {
   my $class = ref $_[0] ? ref shift : shift;
   my $self = $class->SUPER::new( @_ ) ;

   ## Parse the options
   my ( $spec, $options ) = @_ ;

   $options ||= [];

   my @rules = $self->parse_rules_list(
      $options, "Field", "Condition",
      [ ## default rules
         [qw( time                   <=60  )],
         [qw( user_id                equal )],
         [qw( comment                equal )],
         [qw( branched_rev_branch_id equal )],
      ]
   );

   $self->_compile_can_add_sub( @rules );
   $self->_compile_add_rev_sub( @rules );

   return $self ;
}

sub filter_name { return "ChangeSets" }


sub is_sort_filter { 1 }


sub handle_header {
   my VCP::Filter::changesets $self = shift;
   $self->revs->set;  ## clear the list
   $self->{REV_COUNT} = 0;
   $self->{CLUSTERS} = [];
   $self->{ALL_HAVE_CHANGE_IDS} = 1;
   $self->{REVS_BY_CHANGE_ID} = {};

   $self->SUPER::handle_header( @_ );
}


sub handle_rev {
   my VCP::Filter::changesets $self = shift;
   my ( $r ) = @_;

   if ( $self->{ALL_HAVE_CHANGE_IDS} ) {
      if ( empty $r->change_id ) {
         if ( $self->{REV_COUNT} ) {
            pr "only first ", $self->{REV_COUNT}, " revisions had change_ids";
         }
         $self->{ALL_HAVE_CHANGE_IDS} = 0;
         %{$self->{REVS_BY_CHANGE_ID}} = ();
      }
      else {
         push @{$self->{REVS_BY_CHANGE_ID}->{$r->change_id}}, $r->id;
      }
   }

   ++$self->{REV_COUNT};
   if ( empty $r->previous_id ) {
      ## It's a root node
      $self->cluster_rev( $r );
   }
   else {
      ## It's a descendant node, note its parentage and stow it for later
      push @{$self->{CHILDREN}->{$r->previous_id}}, $r->id;
      $self->revs->add( $r );  ## TODO: make this a disk store.
   }
}


sub _d($) { defined $_[0] ? $_[0] : "" }

## time is ignored if not present on both because not all revs have times.
## name is required
## user_id should be present on all revs and if it is on one and not on
## the other, we treat the missing one as "" to gather them all.  Not
## sure if that's valid.
sub new_clusters_sort {
   return (
      ( _d $a->user_id cmp _d $b->user_id )
      || ( !empty( $a->time ) && !empty( $b->time ) && ( $a->time <=> $b->time ) )
      || ( $a->name cmp $b->name )
   );
}



sub new_clusters {
   ## Splits apart a cluster that contains illegal combinations of
   ## revisions.  This is done by sorting the revisions, then
   ## walking though them looking for illegal conditions.
   my VCP::Filter::changesets $self = shift;

   ## TODO: optimize this.

   my $can_add = $self->{CAN_ADD_SUB};
   my $add_rev = $self->{ADD_REV_SUB};

   my @clusters;
   my $cur_cluster = VCP::Filter::changesets::cluster->new;
   $add_rev->( $cur_cluster, shift );

   for my $r ( sort new_clusters_sort @_ ) {
      if ( $can_add->( $cur_cluster, $r ) ) {
         $add_rev->( $cur_cluster, $r );
      }
      else {
         debug "bumping ", $r->id, " in to next cluster\n"
            if debugging;
         push @clusters, $cur_cluster;
         $cur_cluster = VCP::Filter::changesets::cluster->new( $r );
      }
   }

   return ( @clusters, $cur_cluster );
}


sub cluster_rev {
   ## Places a rev in to a cluster, merging clusters as appropriate.
   my VCP::Filter::changesets $self = shift;
   my ( $r ) = @_;
   
   my $can_add = $self->{CAN_ADD_SUB};

   my @cs;                   ## clusters $r is not a member of
   my @new_cluster_revs = ( $r ); ## the new cluster containing $r and
                             ## any other clusters $r is a member of
CLUSTER:
   for my $cluster ( @{$self->{CLUSTERS}} ) {
      if ( $can_add->( $cluster, $r ) ) {
         ## Fold all revs in this cluster in to the new cluster
         ## This is provisional: can_add is not transitive, so
         ## we gather them all, then create one or more clusters
         ## below
         push @new_cluster_revs, $cluster->revs;
         next CLUSTER;
      }

      push @cs, $cluster;
   }

   ## Don't make a new anonymous array, aggregate_and_emit_changes
   ## counts on keeping a reference to the same array.
   @{$self->{CLUSTERS}} = sort {
      ( $a->{min_time} || 0 ) <=> ( $b->{min_time} || 0 )
   } ( @cs, $self->new_clusters( @new_cluster_revs ) );
}


sub aggregate_and_emit_changes {
   my VCP::Filter::changesets $self = shift;

   ## Use the ->previous_id references to find the roots and then
   ## reorder the revs by growing up from the roots.

   my %rev_kids;

   my @result;

   my $change_number = 0;

   ## Some shortcuts
   my $clusters = $self->{CLUSTERS};
   my $children = $self->{CHILDREN};

   pr "aggregating changes";

   while ( @$clusters ) {
      debug "clusters:\n",
         map "$_\n",
            map join( "",
               map "     " . $_->id. "\n",
                  sort { $a->name cmp $b->name } $_->revs
            ), @$clusters
         if debugging;

      my ( $change ) = shift @$clusters;

      ++$change_number;
      for my $r ( $change->revs ) {
         $r->change_id( $change_number );
         next unless exists $children->{$r->id};
         for my $id ( @{$children->{$r->id}} ) {
            $self->cluster_rev( $self->revs->get( $id ) );
         }
      }

      lg "change $change_number: " . $change->revs . " revs";

      $self->dest->handle_rev( $_ ) for $change->revs;
   }
}


sub sort_by_change_id_and_send {
   my VCP::Filter::changesets $self = shift;

   pr "sorting by change_id";

   $self->revs->add( $_ )
      for map $_->revs, splice @{$self->{CLUSTERS}};

   ## TODO: handle multiple source repositories.  One day.

   for my $change_id (
      sort {
         VCP::Rev->cmp_id( $a, $b )
      } keys %{$self->{REVS_BY_CHANGE_ID}}
   ) {
      my $rev_ids = delete $self->{REVS_BY_CHANGE_ID}->{$change_id};
      lg "change $change_id" . @$rev_ids . " revs";
      debug "change $change_id:\n", map "    $_\n", @$rev_ids;
      my @revs;
      for ( @$rev_ids ) {
         my $r = $self->revs->get( $_ );
         BUG "vcp: $_ not found" unless $r;
         push @revs, $r;
      }

      for my $r ( sort {
            _d $a->branch_id cmp _d $b->branch_id
                        ||
               $a->name cmp $b->name
         } @revs
      ) {
         $self->dest->handle_rev( $r );
      }
   }

   $self->revs->set;  ## empty the revs database
}


sub handle_footer {
   my VCP::Filter::changesets $self = shift;

   $self->SUPER::rev_count( $self->{REV_COUNT} );

   $self->{ALL_HAVE_CHANGE_IDS}
      ? $self->sort_by_change_id_and_send
      : $self->aggregate_and_emit_changes;

   $self->SUPER::handle_footer( @_ );
}

package VCP::Filter::changesets::cluster;

use VCP::Utils qw( empty );

sub new {
   my $class = ref $_[0] ? ref shift : shift;
   my $self = bless {
      FileBranches => {},
      Revs         => [],
   }, $class;

   return $self;
}


sub _d($) { defined $_[0] ? $_[0] : "" }


sub revs {
   my $self = shift;
   return @{$self->{Revs}};
}

=head1 LIMITATIONS

This filter does not take the source_repo_id in to account: if somehow
you are merging multiple repositories in to one and want to interleave
the commits/submits "properly", ask for advice.

=head1 AUTHOR

Barrie Slaymaker <barries@slaysys.com>

=head1 COPYRIGHT

Copyright (c) 2000, 2001, 2002 Perforce Software, Inc.
All rights reserved.

See L<VCP::License|VCP::License> (C<vcp help license>) for the terms of use.

=cut

1

#	Change	User	Description
#31	5403	Barrie Slaymaker	- Misc logging, maintainability & debugging improvements
#30	4516	Barrie Slaymaker	- VCP::Filter::changesets supports VCP::Rev::release_ids
#29	4507	Barrie Slaymaker	- RevML: - added <action>, removed <delete>, <placeholder> and <move> - added <from_id> for clones (and eventually merge actions) - Simplified DTD (can't branch DTD based on which action any more) - VCP::Source::cvs, VCP::Filter::changesets and VCP::Dest::p4 support from_id in <action>clone</action> records - VCP::Dest::perl_data added - VCP::Rev::action() "branch" added, no more undefined action strings - "placeholder" action removed
#28	4475	Barrie Slaymaker	- my ... if code replaced (thanks, clkao) - exists added (thanks, clkao)
#27	4079	Barrie Slaymaker	- ChangeSets filter is now 40% or so more memory efficient on large repositories
#26	4021	Barrie Slaymaker	- Remove all phashes and all base & fields pragmas - Work around SWASHGET error
#25	4012	Barrie Slaymaker	- Remove dependance on pseudohashes (deprecated Perl feature)
#24	3996	Barrie Slaymaker	- Make unknown rev times be ASAP
#23	3970	Barrie Slaymaker	- VCP::Source handles rev queing, uses disk to reduce RAM - Lots of other fixes
#22	3942	Barrie Slaymaker	- ChangeSets now passes tests
#21	3930	Barrie Slaymaker	- VCP::Source::cvs and VCP::Dest::p4 handle cloning deletes - "placeholder" actions and is_placeholder_rev() deprecated in favor of is_branch_rev() and is_clone_rev(). - Misc cleanups and minor bugfixes
#20	3916	Barrie Slaymaker	- Reduce memory consumption
#19	3911	Barrie Slaymaker	- ChangeSets no longer deletes a required ARRAY ref
#18	3906	Barrie Slaymaker	- ChangeSets filter now copes with cloned revs
#17	3903	Barrie Slaymaker	- Minor debug logging impovements
#16	3902	Barrie Slaymaker	- VCP::Filter::changesets does not aggregate creation of multiple branches
#15	3855	Barrie Slaymaker	- vcp scan, filter, transfer basically functional - Need more work in re: storage format, etc, but functional
#14	3850	Barrie Slaymaker	- No longer stores all revs in memory
#13	3830	Barrie Slaymaker	- VCP::Filter::changesets now ready for disk storage
#12	3807	Barrie Slaymaker	- changesets filter no longer needs VCP::Rev::previous()
#11	3769	Barrie Slaymaker	- avg_comment_time sort key removed
#10	3762	Barrie Slaymaker	- Changeset aggregation now works even when not debugging
#9	3737	Barrie Slaymaker	- Missing / empty data fields other than branch_id no longer affect changeset aggregation - Needed for VSS, which lacks time information on deletes - If all revs have change_ids, they are now sorted by change_id and name (as opposed to change_id, rev_id) - VCP::Rev::sort_time() removed - VCP::Filter::changesets has better debugging - TestUtils now dumps large diffs
#8	3706	Barrie Slaymaker	- VCP gives some indication of output progress (need more)
#7	3702	Barrie Slaymaker	- Sort and ChangeSets filters now allow unneeded VCP::Rev objects to be garbage collected.
#6	3496	Barrie Slaymaker	- VSS branching
#5	3465	Barrie Slaymaker	- VCP::Filters can now dump they're sections.
#4	3160	Barrie Slaymaker	Emit "sorting by change_id" to STDERR, too.
#3	3155	Barrie Slaymaker	Convert to logging using VCP::Logger to reduce stdout/err spew. Simplify & speed up debugging quite a bit. Provide more verbose information in logs. Print to STDERR progress reports to keep users from wondering what's going on. Breaks test; halfway through upgrading run3() to an inline function for speed and for VCP specific features.
#2	3127	Barrie Slaymaker	Minor cleanups.
#1	3120	Barrie Slaymaker	Move changeset aggregation in to its own filter.