Wednesday, May 18, 2011

Subversion - splitting apart a very large repository

Back when we started using SVN in 2006, we went for ease-of-use and easy administration by putting all of our projects into a single repository.  At the time it was a few gigabytes in size and not a big deal.  Fast-forward 4 years and we're starting to wish we had split the repository up by client / project boundaries.  The tree looked like:

/A/ClientA/ProjectA1
/A/ClientA/ProjectA2
/A/ClientA2/ProjectA2A
/B/ClientB/ProjectB1
...

So my current project is to take the 18GB repository with about 13,000 revisions and split it out and re-base the paths so that the project directories are the top level of the repository.  Unfortunately, over the years, files have been copied / moved, folders have vanished / moved / been renamed, etc., so there's the potential for interesting fun.  This is made even trickier since we're doing the re-base a few levels down.

Warning: When you do a split, the default result is that the new split repositories will have the same SVN repository UUID (unique ID) as the original repository.  That is why the last step in this process is "svnadmin setuuid /path/to/new/repo".  You can see the UUID of an existing repository by using "svnlook uuid /path/to/repo".

Step 1: Raw Dump

First off, I suggest making a raw dump of the original repository, piped through 'gzip' which will make the next few steps faster.  Naturally, if anyone commits things to the old repository after this point those changes won't be migrated.  So you will want to address that issue by limiting access to the original repository, or work on the repository in sections and periodically update your raw dump to capture new changes before you start on the next section.  For our purposes, we simply said "these particular projects are off-limits until Thursday" and worked on a set of projects each week.

Note: all of the following is a single command.

# svnadmin dump --quiet /path/to/svn-repo |
gzip > /path/to/svn-raw-dump-svn-repo.may2011.dump.gz

That will create a .gz file that is about 30-50% larger than the old repository.  Our gzip'd dump file ended up at 44% larger (26GB vs 18GB).  Without gzip, the uncompressed dump file would have been a lot larger (between 5x and 6x larger than the gzip'd file).  The main benefits are that it gives you a static source to work with, shortens up the later command lines slightly, and it's easier to see how all this works if you do it bit by bit.  You'll probably want to also copy that .gz file off to permanent archival storage after this is all done.

(bzip2 would have created a 15-20% smaller file, but it also would have taken 2x-3x longer to create the file.  As it is, the CPU was the bottleneck for creating this initial dump file and is the bottleneck in some other steps as well.)
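
Since every later step reads from this one file, it may be worth confirming that the archive is intact before going further.  The following is only a sketch and was not part of our original procedure: "check_dump_gz" is a made-up helper name, and it assumes that a dump stream starts with the usual "SVN-fs-dump-format-version" header line.

```shell
# check_dump_gz FILE
# Hypothetical helper: tests the gzip stream, then checks that the first
# line of the uncompressed data looks like a Subversion dump header.
check_dump_gz() {
    gzip -t "$1" || { echo "corrupt gzip stream: $1" >&2; return 1; }
    first_line=$(gunzip -c "$1" | head -n 1)
    case "$first_line" in
        "SVN-fs-dump-format-version: "*) echo "ok: $1" ;;
        *) echo "not an svn dump: $1" >&2; return 1 ;;
    esac
}

# Example (placeholder path):
# check_dump_gz /path/to/svn-raw-dump-svn-repo.may2011.dump.gz
```

Note that "gzip -t" has to read the whole file, so on a dump of this size expect it to take a while.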

Step 2: Filtering the dump file

This process breaks out the single project directory that we want and puts it in its own dump file.  We will repeat this command once for every project that we're breaking out to a separate repository.  We drop any empty revisions and renumber those that remain during this step.  It will renumber the revisions starting at 1 and the new file will end up with a much lower revision count.  We're not adjusting the paths within the repository during this step.


# gunzip -c /path/to/svn-raw-dump-svn-repo.may2011.dump.gz |
svndumpfilter include --quiet --drop-empty-revs --renumber-revs \
A/ClientA/ProjectA1 > /var/svn/svn-raw-ClientA-ProjectA1.dump

Notes:
- Leave the leading '/' off of the path that you want to include.
- Leave the trailing '/' off of the path that you want to include.
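
Because the filter command gets repeated for every project, a small wrapper can save some typing and apply the two path rules above automatically.  This is a sketch under assumptions: "normalize_path" and "filter_project" are names I made up for illustration, and the paths in the example loop are placeholders.

```shell
# normalize_path PATH
# Strips one leading and one trailing '/' so the path is in the form
# that svndumpfilter expects.
normalize_path() {
    p=${1#/}
    p=${p%/}
    printf '%s\n' "$p"
}

# filter_project DUMP_GZ REPO_PATH OUT_FILE
# Hypothetical wrapper around the same pipeline, for one project directory.
filter_project() {
    gunzip -c "$1" |
        svndumpfilter include --quiet --drop-empty-revs --renumber-revs \
            "$(normalize_path "$2")" > "$3"
}

# Example loop (placeholder paths):
# for proj in /A/ClientA/ProjectA1 /A/ClientA/ProjectA2; do
#     filter_project /path/to/svn-raw-dump-svn-repo.may2011.dump.gz "$proj" \
#         "/var/svn/svn-raw-$(basename "$proj").dump"
# done
```

Each pass re-reads and decompresses the whole raw dump, so the CPU cost of Step 1's gzip choice is paid once per project.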

Running a search on the new dump file reveals the new revision numbers.

# grep 'Revision-number' /var/svn/svn-raw-ClientA-ProjectA1.dump
...
Revision-number: 60
Revision-number: 61
Revision-number: 62
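
If you only want the final revision number rather than the full list, something like the following should do.  It is a rough sketch: "last_revision" is a hypothetical helper, and it assumes no versioned file happens to contain a line that starts with "Revision-number: ", which would throw the result off.

```shell
# last_revision DUMP_FILE
# Prints the last revision number recorded in the dump file.
# -a makes grep treat the dump as text even though it holds binary data.
last_revision() {
    grep -a '^Revision-number: ' "$1" | tail -n 1 | cut -d' ' -f2
}

# Example (placeholder path):
# last_revision /var/svn/svn-raw-ClientA-ProjectA1.dump
```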

Note that if you attempt to load the project-specific dump file into a new repository at this point it will fail, because the parent directories (A/ClientA in this example) do not exist in the repository that you are loading into.  Once you create those parent folders, the dump file will load.  I suggest creating a new scratch repository with "svnadmin create /var/svn/ProjectA-Test1", creating the necessary parent folders, then running "svnadmin load /path/to/repo < /path/to/dump" to verify that you understand this step.
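
The scratch-repository test can be scripted roughly as follows.  Treat it as a sketch: "load_into_scratch" is a hypothetical helper, the repository path must be absolute so that the file:// URL comes out right, and it assumes an "svn" client new enough to support "mkdir --parents".

```shell
# load_into_scratch SCRATCH_REPO PARENT_PATH DUMP_FILE
# Hypothetical helper: creates a throwaway repository, creates the
# parent folders, then loads the filtered (not yet re-based) dump.
load_into_scratch() {
    svnadmin create "$1" || return 1
    # --parents creates every missing intermediate directory in one commit.
    svn mkdir --parents -m "Create parent folders for test load" \
        "file://$1/$2" || return 1
    svnadmin load --quiet "$1" < "$3"
}

# Example (paths from this article):
# load_into_scratch /var/svn/ProjectA-Test1 A/ClientA \
#     /var/svn/svn-raw-ClientA-ProjectA1.dump
```

The "svn mkdir" commit takes up revision 1 in the scratch repository, so the loaded revisions will land one number higher than they appear in the dump file.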


Step 3: Re-basing the project

Note: Depending on how many folder renames are in your original repository, you may have lots of trouble with the following, in which case you should skip this step and just load the dump file into the new repository without re-basing the paths.  Don't forget to change the UUID on the new repositories after loading.

The next step is to move A/ClientA/ProjectA1 back to the root of the repository during the import process.  We will do this by editing the dump file with 'sed' before loading it back in.  In the dump file, there are two types of lines that contain path information.  One starts with 'Node-path:' and the other starts with 'Node-copyfrom-path:'.  This is how 'svnadmin load' keeps track of what goes where in the repository tree.

# grep '^Node-path:' /var/svn/svn-raw-ClientA-ProjectA1.dump
Node-path: A/ClientA/ProjectA1
Node-path: A/ClientA/ProjectA1/Data
Node-path: A/ClientA/ProjectA1/Doc
Node-path: A/ClientA/ProjectA1/Trunk
...
# grep '^Node-copyfrom-path:' /var/svn/svn-raw-ClientA-ProjectA1.dump


Notes:
- There is never a leading slash ('/') and never a trailing slash ('/').
- The Node-path: argument cannot be empty.
- The parent directory must already exist in the SVN repository in order for a load to succeed.  So in order to load the above node paths, you would have to manually create the "A/ClientA" directory tree first.
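
Before rewriting anything, it can help to list any paths that do not sit under the prefix you plan to strip, since a prefix-based rewrite would leave those untouched.  A minimal sketch, assuming a "sed" that accepts -E and that your paths contain no backslashes (awk would interpret them); "paths_outside_prefix" is a made-up name.

```shell
# paths_outside_prefix DUMP_FILE PREFIX
# Hypothetical helper: prints each distinct Node-path / Node-copyfrom-path
# value that does NOT start with "PREFIX/".  No output means every path
# is under the prefix.
paths_outside_prefix() {
    grep -a -E '^Node-(copyfrom-)?path: ' "$1" |
        sed -E 's/^Node-(copyfrom-)?path: //' |
        awk -v p="$2" 'index($0, p "/") != 1' |
        sort -u
}

# Example (paths from this article):
# paths_outside_prefix /var/svn/svn-raw-ClientA-ProjectA1.dump A/ClientA
```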


As stated, we can use 'sed' to transform these path names on the fly.  And the following set of lines is all a single command.

# cat /var/svn/svn-raw-ClientA-ProjectA1.dump |
sed 's/^Node-path: A\/ClientA\//Node-path: /' |
sed 's/^Node-copyfrom-path: A\/ClientA\//Node-copyfrom-path: /' \
> /var/svn/svn-newbase-ClientA-ProjectA1.dump

So if a line reads "Node-path: A/ClientA/ProjectA1" in the input, it will look like "Node-path: ProjectA1" in the output.
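
To see the rewrite in isolation, you can run it against a few sample header lines instead of a real dump file.  This is an illustrative sketch: "rebase_paths" is a hypothetical wrapper, it anchors each pattern with '^' so that only lines starting with the header name are touched, and it uses '#' as the sed delimiter so the slashes don't need escaping.

```shell
# rebase_paths: reads dump text on stdin, writes re-based text to stdout.
rebase_paths() {
    sed -e 's#^Node-path: A/ClientA/#Node-path: #' \
        -e 's#^Node-copyfrom-path: A/ClientA/#Node-copyfrom-path: #'
}

# Feed in three sample lines; only the two path headers should change.
printf '%s\n' \
    'Node-path: A/ClientA/ProjectA1/Trunk' \
    'Node-copyfrom-path: A/ClientA/ProjectA1/Doc' \
    'Node-kind: dir' | rebase_paths
# Output:
# Node-path: ProjectA1/Trunk
# Node-copyfrom-path: ProjectA1/Doc
# Node-kind: dir
```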

Now you can load this into the new repository.

# svnadmin load /path/to/new/repo < /var/svn/svn-newbase-ClientA-ProjectA1.dump


Step 4: Changing the UUID of the new repository

As I mentioned before, when you do a split like this, the repository UUID will end up as the UUID of the original repository after the "svnadmin load" step.  You can verify this behavior using the "svnlook uuid /path/to/repo" command.  You can change the UUID manually, or just have a new one assigned automatically with the "svnadmin setuuid" command.

# svnadmin setuuid /path/to/new/repo
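
A quick way to confirm that the change took effect is to compare the two UUIDs.  A small sketch, with "uuids_differ" as a hypothetical helper name and placeholder paths in the example:

```shell
# uuids_differ OLD_REPO NEW_REPO
# Hypothetical helper: succeeds (exit 0) only when the two repositories
# report different UUIDs; returns 2 if svnlook itself fails.
uuids_differ() {
    old_uuid=$(svnlook uuid "$1") || return 2
    new_uuid=$(svnlook uuid "$2") || return 2
    [ "$old_uuid" != "$new_uuid" ]
}

# Example (placeholder paths):
# uuids_differ /path/to/svn-repo /path/to/new/repo || echo "UUID still shared"
```

As far as I know, "svnadmin load" also accepts an "--ignore-uuid" option that keeps the UUID the new repository was created with, which may let you skip this step.  We did not use it for this migration.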

Step 5: Verify the new repository, make backups

After you load the new repository, take an hour and verify that all of the project folders made it intact and that the version history is intact.  Then make a backup of the new repository.
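
The mechanical part of this step can be scripted, though it is no substitute for actually browsing the history.  A minimal sketch, assuming "svnadmin verify" and "svnadmin hotcopy" are acceptable as your integrity check and backup method; "verify_and_backup" is a made-up name.

```shell
# verify_and_backup REPO BACKUP_DIR
# Hypothetical helper: runs an integrity check first and only takes the
# backup copy if the check passes.
verify_and_backup() {
    svnadmin verify --quiet "$1" || { echo "verify failed: $1" >&2; return 1; }
    svnadmin hotcopy "$1" "$2"
}

# Example (placeholder paths):
# verify_and_backup /path/to/new/repo /path/to/backups/new-repo
```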

3 comments:

EJ said...

What types of performance improvements have you seen after the split?

Thomas said...

The primary reason that we split it was to make the backups more reliable / easier. Some of our repositories were getting up over 10-15GB in size. Smaller repositories are easier to backup, verify, clean, dump/load, etc.

The bigger performance improvement was from the dump/reload, going from the older 1.4 or 1.5 repository format to the newer 1.6 or 1.7 formats. In 1.6, they introduced "packing" for revs. In 1.7 they are still working on allowing the revprops files to be packed.

Ptica said...

I have tried your procedure. I have big svn repo where I have PROJ1, PROJ2, PROJn folders. I have successfully filtered only PROJ1. Then I have tried to remove PROJ1 from tree structure. It went well, the only thing is that I have also empty dir PROJ1. Am I doing something wrong? My command was: cat proj1_dump | sed 's/Node-path: proj1\//Node-path: /' | sed 's/Node-copyfrom-path: proj1\//Node-copyfrom-path: /' > proj1_reroot_dump