package VCP::Filter::changesets; =head1 NAME VCP::Filter::changesets - Group revs in to changesets =head1 SYNOPSIS ## From the command line: vcp <source> changesets: ...options... -- <dest> ## In a .vcp file: ChangeSets: time <=60 ## seconds user_id equal ## case-sensitive equality comment equal ## case-sensitive equality branched_rev_branch_id equal ## change only one branch at a time =head1 DESCRIPTION This filter is automatically loaded when there is no sort filter loaded (both this and L<VCP::Filter::sort|VCP::Filter::sort> count as sort filters). =head2 Sorting by change_id, etc. When all revs from the source have change numbers, this filter sorts by change_id, branch_id, and name, regardless of the rules set. The name sort is case sensitive, though it should not be for Win32. This sort by change_id is necessary for sources that supply change_id because the order of scanning the revisions is not usually (ever, so far :) in change set order. =head2 Aggregating changes If one or more revisions arrives from the source with an empty change_id, the rules for this filter establish the conditions that determine what revisions may be grouped in to each change. In this case, this filter rewrites all change_id fields so that the (eventual) destination can use the change_id field to break the revisions in to changes. This is sometimes used by non-changeset oriented destinations to aggregate "changes" as though a user were performing them and to reduce the number of individual operations the destination driver must perform (for instance: VCP::Dest::cvs prefers to not call cvs commit all the time; cvs commit is slow). If you don't specify any conditions, a set of default conditions are used. If you specify any conditions at all, none of the default conditions are used. The conditions are specified as a set of rules. Each rule is a pair of words: a the name of the criterion (usually a data field of the revision) and the compatability specification. There are two compatability specifications: "equal" and "<=N" (where N is a number; note that no spaces are allowed before the number unless the spec is quoted somehow). "equal" states that all revisions in the same change must have identical values for the indicated field. So: user_id equal states that all revisions in a change must be submitted by the same user. The "equal" specification is available for all criterial. The "<=N" specification is only available for the "time" field. It specifices that no gaps larger than N seconds may exist in a change. The default conditions are: time <=60 ## seconds user_id equal ## case-sensitive equality comment equal ## case-sensitive equality branched_rev_branch_id equal ## change only one branch at a time The C<time <=60> condition sets a maximum allowable difference between two revisions; revisions that are more than this number of seconds apart are considered to be in different changes. The C<user_id equal> and C<comment equal> conditions assert that two revisions must be by the same user and have the same comment in order to be in the same change. The C<branched_rev_branch_id equal> condition is a special case to handle repositories like CVS which don't record branch creation times. This condition kicks in when a user creates several branches before changing any files on any of them then all of the branches get created at the same time. This condition also kicks in when multiple CVS branches exist with no changes on them. In this case, VCP::Source::cvs groups all of the branch creations after the last "real" edit. The C<branched_rev_branch_id> condition only applies to revisions branching from one branch in to another. For all fields but C<time> and C<mod_time>, you may (for now) only use the C<equal> condition, which is case sensitive equality. An implicit condition is that a parent and a child may not both be altered in the same change. =head1 NEW ALGORITHM As revs are received by handle_rev(), they are store on disk. An in-memory data structure (for now) is built that notes each revision's child revisions. In handle_footer, a clustering algorithm begins. Starting with the set of parentless revisions, =over =item 1 each revision is placed in a change cluster. A change cluster is the set of all revisions in the same change. =item 2 the most mature change cluster is identified and taken as a change. "mature" means the cluster containing the oldest revision across all the change clusters, as ordered by the set of keys (change_id, time, revision number, name, comment), where empty fields are ignored. =item 3 All revs in the cluster are given a new change_id, starting with "1" and sent downstream. =item 4 The (possibly empty) set of children of the revisions from the newly minted change is taken as the new set of revs, and the process loops back to step 1 unless there are no more clusters. =back =begin developer_docs A few thoughts on clustering ============================ This algorithm is a simplified k-clustering algorithm: it divides the set of revisions in to clusters and emits each cluster as a changeset by labelling each cluster with an integer (1..N) change number. The labelling is done by setting the change_id for all revs in a cluster to this change number. Changes are sent downstream from in order from (1..N) as they are identified. We never want a cluster to contain... - two revisions for the same filebranch. - This means that a cluster should never contain two files with $r->name and $r->branch_id both equal. - two revisions where one is a descendant of another. Additionally, by default we never want a cluster to contain... - two revisions from different repositories - two revisions from different users - gaps in time greater than 60 seconds NOTE: we don't take source_repo_id in to account because we don't need to handle that case now and it complicates the logic when all revs do have change_ids. We'll let the user figure out how to differentiate that case or add a different filter. At that time, we should also add a "repo_id" field to VCP::Rev. We never want a revision to be in the same cluster (or change) as any of its ancestors or descendants. Because the revisions are all linked to their parents when they arrive, this filter can identify the "root" revisions (those with no parents) and cluster them first. It may then identify a single cluster as a change and remove it from consideration--this becomes change 1, and is emmited immediately--then gather the child revisions of that change and cluster those with the existing clusters, then select and emit change 2. So, instead of accepting an unstructured soup of revisions and needing to perform a generalized k-clustering algorithm, this algorithm accepts an multirooted acyclic digraph and incrementally clusters it by starting at the roots and proceeding in a wavefront through the entire data set. Furthermore, instead of using a real number as the cost, as most generalized clustering algorithms do, this algorithm can use a simple, transitive test function. So to determine if an unclustered "new" revision should be added to an existing cluster it suffices to consider each revision in the cluster and if the test function returns true for any of the revisions in the cluster, the new revision is a member of the cluster. Likewise, if the new revision is a member of more than one cluster, those clusters may be merged in to a single cluster because the test is transitive. The test function ----------------- This filter uses a boolean function to determine whether two revs are in the same change or not. If they are, it returns TRUE, if they are not, it returns FALSE. The test function does not need to test for parent/child relationships, the algorithm will never consider two revisions that have an ancestry relationship between them. No child may have more than one parent (children may have only one previous_id), and children of already clustered revisions are not compared with the revisions of the parent cluster. As far as transitivity goes, there are two types of tests applied in the comparison function: equality and value range. Equality is transitive, of course. The value range is only used for the time field, and we specifically want this value to be transitive: if rev A is close enough in time to rev B, and rev B is close enough in time to rev C, then we want rev A and rev C to be considered to be part of the same change. This is so that a set of files submitted over a period of time (as may happen in systems like VSS or CVS) can still result in a change relationship. The test function is compiled in to an anonymous sub and placed in $self->{CAN_ADD_SUB}. =end developer_docs =cut $VERSION = 1 ; use strict ; use VCP::Logger qw( lg pr BUG ); use VCP::Debug qw( :debug ); use VCP::Utils qw( empty ); use VCP::Filter; use VCP::Rev; use base qw( VCP::Filter ); use fields ( 'CAN_ADD_SUB', ## Return TRUE if a rev can be added to a sub. 'ADD_REV_SUB', ## Adds a rev to a cluster 'COMMENT_TIMES', ## The average time of all instances of a comment 'HAS_CHANGE_IDS', ## set if all revs have nonempty change_ids 'CLUSTERS', ## an ARRAY of clusters. 'CHILDREN', ## A HASH keyed on a rev's id of ARRAYs of the ## rev's children's ids. 'REV_COUNT', ## How many revs we received 'ALL_HAVE_CHANGE_IDS', ## Set if all incoming revs have change_ids 'REVS_BY_CHANGE_ID', ## HASH of change_id => \@rev_ids ); sub _eq { defined $_[0] ? defined $_[1] ? $_[0] eq $_[1] : 0 : ! defined $_[1]; } sub _compile_can_add_sub { my VCP::Filter::changesets $self = shift; my ( $rules ) = @_; my @cmps = join "\n && ", map { my ( $field, $cond ) = map lc, @$_; my $cmp; if ( ( $field eq "time" || $field eq "mod_time" ) && $cond =~ /\A<=\s*(\d+)\z/ ) { my $limit = $1; $cmp = <<CMP; ( empty( \$r->time ) || ( ( empty( \$cluster->{min_time} ) || \$r->time >= \$cluster->{min_time} - $limit ) && ( empty( \$cluster->{max_time} ) || \$r->time <= \$cluster->{max_time} + $limit ) ) || ( debugging and debug " times: ", \$cluster->{min_time}, " - $limit <= ", \$r->time, " <= ", \$cluster->{max_time}, " + $limit" and 0 ) ) CMP } elsif ( $field eq "branched_rev_branch_id" && $cond eq "equal" ) { $cmp = <<'CMP'; ( ( $r->is_placeholder_rev || do { my $filebranch_id = join "\001", $r->name, _d $r->branch_id; !exists $cluster->{FileBranches}->{$filebranch_id} } ) || ( debugging and debug " !is_placeholder_rev or filebranch_id seen" and 0 ) ) CMP chomp $cmp; } elsif ( $cond eq "equal" ) { $cmp = <<CMP; ( _eq( \$cluster->{$field}, \$r->$field ) || ( debugging and debug " $field differs from '", \$cluster->{$field}, "'" and 0 ) ) CMP } else { die "unknown ChangeSets condition in rule: $field $cond\n"; } $cmp; } @$rules; my @code = ( <<'PREAMBLE', " ", @cmps, <<'POSTAMBLE' ); #line 1 VCP::Filter::changesets::can_add() sub { my $cluster = shift; my ( $r ) = @_; ## All the empty()s are for VSS (at least) which has very little ## metadata available for delete actions, like user_id and time. debug $r->as_string if debugging; my $can_add = !@{$cluster->{Revs}} || ( PREAMBLE ) ; debug " can add" if debugging && $can_add; return $can_add; } POSTAMBLE debug @code if debugging; unless ( $self->{CAN_ADD_SUB} = eval join "", @code ) { my $x =$@; lg "$x:\n", @code; die $x; } } sub _compile_add_rev_sub { my VCP::Filter::changesets $self = shift; my ( $rules ) = @_; my @code; my @first_rev_code; for ( @$rules ) { my ( $field, $cond ) = map lc, @$_; if ( $cond =~ /\A<=\s*(\d+)\z/ ) { push @code, <<CODE; { my \$v = \$r->$field; if ( !empty \$v ) { \$cluster->{min_$field} = \$v if empty( \$cluster->{min_$field} ) || \$v < \$cluster->{min_$field}; \$cluster->{max_$field} = \$v if empty( \$cluster->{max_$field} ) || \$v > \$cluster->{max_$field}; } } CODE } elsif ( $field eq "branched_rev_branch_id" ) { push @code, <<CODE; { my \$filebranch_id = join "\\001", \$r->name, _d \$r->branch_id; die "Can't add two revs for same file: \$filebranch_id" if exists \$cluster->{FileBranches}->{\$filebranch_id}; ++\$cluster->{FileBranches}->{\$filebranch_id}; } CODE } elsif ( $cond eq "equal" ) { push @first_rev_code, <<CODE; \$cluster->{$field} = \$r->$field; CODE } else { die "unknown ChangeSets condition in rule: $field $cond\n"; } } push @code, ( <<'PRE', @first_rev_code, <<'POST' ) if @first_rev_code; if ( !@{$cluster->{Revs}} ) { PRE } POST @code = ( <<'PREAMBLE', @code, <<'POSTAMBLE' ); #line 1 VCP::Filter::changesets::add_rev() sub { my $cluster = shift; my ( $r ) = @_; debug "adding ", $r->id, " to cluster" if debugging; PREAMBLE push @{$cluster->{Revs}}, $r; } POSTAMBLE debug @code if debugging; unless ( $self->{ADD_REV_SUB} = eval join "", @code ) { my $x =$@; lg "$x:\n", @code; die $x; } } sub new { my $class = ref $_[0] ? ref shift : shift; my $self = $class->SUPER::new( @_ ) ; ## Parse the options my ( $spec, $options ) = @_ ; $options ||= []; my @rules = $self->parse_rules_list( $options, "Field", "Condition", [ ## default rules [qw( time <=60 )], [qw( user_id equal )], [qw( comment equal )], [qw( branched_rev_branch_id equal )], ] ); $self->_compile_can_add_sub( @rules ); $self->_compile_add_rev_sub( @rules ); return $self ; } sub filter_name { return "ChangeSets" } sub is_sort_filter { 1 } sub handle_header { my VCP::Filter::changesets $self = shift; $self->revs->set; ## clear the list $self->{REV_COUNT} = 0; $self->{CLUSTERS} = []; $self->{ALL_HAVE_CHANGE_IDS} = 1; $self->{REVS_BY_CHANGE_ID} = {}; $self->SUPER::handle_header( @_ ); } sub handle_rev { my VCP::Filter::changesets $self = shift; my ( $r ) = @_; if ( $self->{ALL_HAVE_CHANGE_IDS} ) { if ( empty $r->change_id ) { if ( $self->{REV_COUNT} ) { pr "only first ", $self->{REV_COUNT}, " revisions had change_ids"; } $self->{ALL_HAVE_CHANGE_IDS} = 0; %{$self->{REVS_BY_CHANGE_ID}} = (); } else { push @{$self->{REVS_BY_CHANGE_ID}->{$r->change_id}}, $r->id; } } ++$self->{REV_COUNT}; if ( empty $r->previous_id ) { ## It's a root node $self->cluster_rev( $r ); } else { ## It's a descendant node, note its parentage and stow it for later push @{$self->{CHILDREN}->{$r->previous_id}}, $r->id; $self->revs->add( $r ); ## TODO: make this a disk store. } } sub _d($) { defined $_[0] ? $_[0] : "" } ## time is ignored if not present on both because not all revs have times. ## name is required ## user_id should be present on all revs and if it is on one and not on ## the other, we treat the missing one as "" to gather them all. Not ## sure if that's valid. sub new_clusters_sort { return ( ( _d $a->user_id cmp _d $b->user_id ) || ( !empty( $a->time ) && !empty( $b->time ) && ( $a->time <=> $b->time ) ) || ( $a->name cmp $b->name ) ); } sub new_clusters { ## Splits apart a cluster that contains illegal combinations of ## revisions. This is done by sorting the revisions, then ## walking though them looking for illegal conditions. my VCP::Filter::changesets $self = shift; ## TODO: optimize this. my $can_add = $self->{CAN_ADD_SUB}; my $add_rev = $self->{ADD_REV_SUB}; my @clusters; my $cur_cluster = VCP::Filter::changesets::cluster->new; $add_rev->( $cur_cluster, shift ); for my $r ( sort new_clusters_sort @_ ) { if ( $can_add->( $cur_cluster, $r ) ) { $add_rev->( $cur_cluster, $r ); } else { debug "bumping ", $r->id, " in to next cluster\n" if debugging; push @clusters, $cur_cluster; $cur_cluster = VCP::Filter::changesets::cluster->new( $r ); } } return ( @clusters, $cur_cluster ); } sub cluster_rev { ## Places a rev in to a cluster, merging clusters as appropriate. my VCP::Filter::changesets $self = shift; my ( $r ) = @_; my $can_add = $self->{CAN_ADD_SUB}; my @cs; ## clusters $r is not a member of my @new_cluster_revs = ( $r ); ## the new cluster containing $r and ## any other clusters $r is a member of CLUSTER: for my $cluster ( @{$self->{CLUSTERS}} ) { if ( $can_add->( $cluster, $r ) ) { ## Fold all revs in this cluster in to the new cluster ## This is provisional: can_add is not transitive, so ## we gather them all, then create one or more clusters ## below push @new_cluster_revs, $cluster->revs; next CLUSTER; } push @cs, $cluster; } ## Don't make a new anonymous array, aggregate_and_emit_changes ## counts on keeping a reference to the same array. @{$self->{CLUSTERS}} = sort { ( $a->{min_time} || 0 ) <=> ( $b->{min_time} || 0 ) } ( @cs, $self->new_clusters( @new_cluster_revs ) ); } sub aggregate_and_emit_changes { my VCP::Filter::changesets $self = shift; ## Use the ->previous_id references to find the roots and then ## reorder the revs by growing up from the roots. my %rev_kids; my @result; my $change_number = 0; ## Some shortcuts my $clusters = $self->{CLUSTERS}; my $children = $self->{CHILDREN}; pr "aggregating changes"; while ( @$clusters ) { debug "clusters:\n", map "$_\n", map join( "", map " " . $_->id. "\n", sort { $a->name cmp $b->name } $_->revs ), @$clusters if debugging; my ( $change ) = shift @$clusters; ++$change_number; for my $r ( $change->revs ) { $r->change_id( $change_number ); next unless exists $children->{$r->id}; for my $id ( @{$children->{$r->id}} ) { $self->cluster_rev( $self->revs->get( $id ) ); } } lg "change $change_number: " . $change->revs . " revs"; $self->dest->handle_rev( $_ ) for $change->revs; } } sub sort_by_change_id_and_send { my VCP::Filter::changesets $self = shift; pr "sorting by change_id"; $self->revs->add( $_ ) for map $_->revs, splice @{$self->{CLUSTERS}}; ## TODO: handle multiple source repositories. One day. for my $change_id ( sort { VCP::Rev->cmp_id( $a, $b ) } keys %{$self->{REVS_BY_CHANGE_ID}} ) { my $rev_ids = delete $self->{REVS_BY_CHANGE_ID}->{$change_id}; lg "change $change_id" . @$rev_ids . " revs"; debug "change $change_id:\n", map " $_\n", @$rev_ids; my @revs; for ( @$rev_ids ) { my $r = $self->revs->get( $_ ); BUG "vcp: $_ not found" unless $r; push @revs, $r; } for my $r ( sort { _d $a->branch_id cmp _d $b->branch_id || $a->name cmp $b->name } @revs ) { $self->dest->handle_rev( $r ); } } $self->revs->set; ## empty the revs database } sub handle_footer { my VCP::Filter::changesets $self = shift; $self->SUPER::rev_count( $self->{REV_COUNT} ); $self->{ALL_HAVE_CHANGE_IDS} ? $self->sort_by_change_id_and_send : $self->aggregate_and_emit_changes; $self->SUPER::handle_footer( @_ ); } package VCP::Filter::changesets::cluster; use VCP::Utils qw( empty ); sub new { my $class = ref $_[0] ? ref shift : shift; my $self = bless { FileBranches => {}, Revs => [], }, $class; return $self; } sub _d($) { defined $_[0] ? $_[0] : "" } sub revs { my $self = shift; return @{$self->{Revs}}; } =head1 LIMITATIONS This filter does not take the source_repo_id in to account: if somehow you are merging multiple repositories in to one and want to interleave the commits/submits "properly", ask for advice. =head1 AUTHOR Barrie Slaymaker <barries@slaysys.com> =head1 COPYRIGHT Copyright (c) 2000, 2001, 2002 Perforce Software, Inc. All rights reserved. See L<VCP::License|VCP::License> (C<vcp help license>) for the terms of use. =cut 1
# | Change | User | Description | Committed | |
---|---|---|---|---|---|
#31 | 5403 | Barrie Slaymaker | - Misc logging, maintainability & debugging improvements | ||
#30 | 4516 | Barrie Slaymaker | - VCP::Filter::changesets supports VCP::Rev::release_ids | ||
#29 | 4507 | Barrie Slaymaker |
- RevML: - added <action>, removed <delete>, <placeholder> and <move> - added <from_id> for clones (and eventually merge actions) - Simplified DTD (can't branch DTD based on which action any more) - VCP::Source::cvs, VCP::Filter::changesets and VCP::Dest::p4 support from_id in <action>clone</action> records - VCP::Dest::perl_data added - VCP::Rev::action() "branch" added, no more undefined action strings - "placeholder" action removed |
||
#28 | 4475 | Barrie Slaymaker |
- my ... if code replaced (thanks, clkao) - exists added (thanks, clkao) |
||
#27 | 4079 | Barrie Slaymaker |
- ChangeSets filter is now 40% or so more memory efficient on large repositories |
||
#26 | 4021 | Barrie Slaymaker |
- Remove all phashes and all base & fields pragmas - Work around SWASHGET error |
||
#25 | 4012 | Barrie Slaymaker | - Remove dependance on pseudohashes (deprecated Perl feature) | ||
#24 | 3996 | Barrie Slaymaker | - Make unknown rev times be ASAP | ||
#23 | 3970 | Barrie Slaymaker |
- VCP::Source handles rev queing, uses disk to reduce RAM - Lots of other fixes |
||
#22 | 3942 | Barrie Slaymaker | - ChangeSets now passes tests | ||
#21 | 3930 | Barrie Slaymaker |
- VCP::Source::cvs and VCP::Dest::p4 handle cloning deletes - "placeholder" actions and is_placeholder_rev() deprecated in favor of is_branch_rev() and is_clone_rev(). - Misc cleanups and minor bugfixes |
||
#20 | 3916 | Barrie Slaymaker | - Reduce memory consumption | ||
#19 | 3911 | Barrie Slaymaker | - ChangeSets no longer deletes a required ARRAY ref | ||
#18 | 3906 | Barrie Slaymaker | - ChangeSets filter now copes with cloned revs | ||
#17 | 3903 | Barrie Slaymaker | - Minor debug logging impovements | ||
#16 | 3902 | Barrie Slaymaker |
- VCP::Filter::changesets does not aggregate creation of multiple branches |
||
#15 | 3855 | Barrie Slaymaker |
- vcp scan, filter, transfer basically functional - Need more work in re: storage format, etc, but functional |
||
#14 | 3850 | Barrie Slaymaker | - No longer stores all revs in memory | ||
#13 | 3830 | Barrie Slaymaker | - VCP::Filter::changesets now ready for disk storage | ||
#12 | 3807 | Barrie Slaymaker | - changesets filter no longer needs VCP::Rev::previous() | ||
#11 | 3769 | Barrie Slaymaker | - avg_comment_time sort key removed | ||
#10 | 3762 | Barrie Slaymaker | - Changeset aggregation now works even when not debugging | ||
#9 | 3737 | Barrie Slaymaker |
- Missing / empty data fields other than branch_id no longer affect changeset aggregation - Needed for VSS, which lacks time information on deletes - If all revs have change_ids, they are now sorted by change_id and name (as opposed to change_id, rev_id) - VCP::Rev::sort_time() removed - VCP::Filter::changesets has better debugging - TestUtils now dumps large diffs |
||
#8 | 3706 | Barrie Slaymaker | - VCP gives some indication of output progress (need more) | ||
#7 | 3702 | Barrie Slaymaker |
- Sort and ChangeSets filters now allow unneeded VCP::Rev objects to be garbage collected. |
||
#6 | 3496 | Barrie Slaymaker | - VSS branching | ||
#5 | 3465 | Barrie Slaymaker | - VCP::Filters can now dump they're sections. | ||
#4 | 3160 | Barrie Slaymaker | Emit "sorting by change_id" to STDERR, too. | ||
#3 | 3155 | Barrie Slaymaker |
Convert to logging using VCP::Logger to reduce stdout/err spew. Simplify & speed up debugging quite a bit. Provide more verbose information in logs. Print to STDERR progress reports to keep users from wondering what's going on. Breaks test; halfway through upgrading run3() to an inline function for speed and for VCP specific features. |
||
#2 | 3127 | Barrie Slaymaker | Minor cleanups. | ||
#1 | 3120 | Barrie Slaymaker | Move changeset aggregation in to its own filter. |