Troy Hunt: The black art of splitting a Subversion repository

Here’s the scenario: you have a Subversion repository that has been doing some multitasking. For whatever reason (convenience, laziness, ignorance), the one repository was used to store multiple projects and, having now seen the light, you want to split it out into separate repositories. This is entirely possible but it takes a bit of work and, in many cases, quite a bit of troubleshooting (the kind you won’t normally find in the SVN books). I’ve done this a number of times recently and learnt a lot in the process, so I thought I’d capture this info so that hopefully others can avoid some of the pain I’ve been through!

Background

We’ll assume we have a single repository with a folder in the root called “Websites”. This folder then contains multiple projects each in their own folder named after the particular project. For the purposes of this post we’ll assume the path is “ProjectPath”. During the lifecycle of the project it may have been moved around between different folders or even had content from other projects moved into it.

Process

There’s a three-step process involved in splitting a repo:

Dump the source repository
Filter the dump to extract the project
Restore the filtered dump to a new repo

In a perfect world we’d simply do all of this in one go but unfortunately it’s a bit more involved than that. It needs to be done in the three steps and it all needs to happen from the command line using svnadmin and svndumpfilter.

Dumping

Creating a dump of a repo involves svnadmin enumerating through the revisions in the repo and extracting them into a dump file. An important point to make clear right now: Subversion is very efficient in terms of the repo compression algorithm it uses, whereas a dump file is pure uncompressed plain text (except of course for any binaries it contains). The bottom line is to ensure you have plenty of free space available on the drive you plan on using for this process.

To do the dump we’re going to use the dump subcommand. The syntax for this looks like the following:

svnadmin dump REPOS_PATH [-r LOWER[:UPPER]] [--incremental]

The lower and upper bounds are quite handy if you only want to take a particular range of revisions from the repo. This can make the process quite a bit faster and consume a lot less space, so if your project only occupies a small window of time in the repo history then try to use them. The particular project I’m going to use has revisions spread out over pretty much the entire revision history so I’m going to leave these out.

Depending on the size of your repo, you could be in for a long wait, but at the end of it you’ll have a dump file (mine is called Websites.bak) containing your entire repository history.
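
As a rough sketch of the dump step, assuming a POSIX shell and a repository at a hypothetical /srv/svn/Websites, the command can be wrapped in a small function. Only the Websites.bak file name comes from this post; the function name and the repository path are illustrative, so substitute your own.

```shell
# Dump every revision of a repository to a plain-text dump file.
# dump_repo is a hypothetical helper name, not part of Subversion.
dump_repo() {
    repo_path=$1    # e.g. /srv/svn/Websites (assumed location)
    dump_file=$2    # e.g. Websites.bak
    # svnadmin writes the dump to stdout and progress messages to stderr,
    # so redirecting stdout captures only the dump itself.
    svnadmin dump "$repo_path" > "$dump_file"
}

# Usage: dump_repo /srv/svn/Websites Websites.bak
# To limit the range, add -r LOWER:UPPER to the svnadmin dump call.
```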

Filtering the dump

This is the first step where things can start to go wrong. What we need to do is extract just a portion of the dump to create a brand new dump, and we can do this by either taking the whitelist approach and explicitly including paths or the blacklist approach, which means excluding paths. Either way, we’re going to use the svndumpfilter command to specify the filter approach, the dump we just created, the dump we want to create from the filter and one more parameter I’ll explain shortly.

When the dump is filtered, there are going to be revisions where “ProjectPath” doesn’t have any changes because the commit related to another project in the same repo. The “--drop-empty-revs” switch ensures that if there were no changes to “ProjectPath” in a revision then there will be no record of that revision in the filtered dump. It sounds like this will create gaps in the revision history, and this is true for the dump file, but the gaps will be filled when we load it back in later.
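
A minimal sketch of the whitelist approach with the drop switch, again assuming a POSIX shell. The function name and the ProjectPath.bak output name are hypothetical; the include subcommand and the --drop-empty-revs option are the standard svndumpfilter ones.

```shell
# Keep only the given path prefixes from a dump, dropping revisions
# that end up empty. filter_dump is a hypothetical helper name.
filter_dump() {
    in_dump=$1      # full dump, e.g. Websites.bak
    out_dump=$2     # filtered dump, e.g. ProjectPath.bak (assumed name)
    shift 2         # remaining arguments are the path prefixes to include
    # svndumpfilter reads the dump on stdin and writes the result to stdout.
    svndumpfilter include --drop-empty-revs "$@" < "$in_dump" > "$out_dump"
}

# Usage: filter_dump Websites.bak ProjectPath.bak Websites/ProjectPath
# Extra prefixes can simply be appended as further arguments.
```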

Here’s where it gets tricky: 9 times out of 10 this command will run fine. Where it goes wrong is if your project has moved around a bit. The path named “ProjectPath” might be valid now but what if it was renamed at some time? Or moved to another root path in the repo? Or files from another path were moved into it? Remember the command above only included revisions in the “ProjectPath” folder. Here’s what happens:

Getting the filter right

The skipped revisions are there because we added the drop switch to the command. Further up this screen grab the filter was happily extracting revisions found in the included path until it came across a reference in revision 359 to a path that wasn’t included. This is why we see the “Invalid copy source path” error. What we need to do is add this path so it will be included in the filter.

You may need to go through this process multiple times. I had to do it half a dozen times on a project recently and unless you know ahead of time what all the referenced paths are (unlikely with a large project spread out over a number of years), it’s just simple trial and error to discover them.
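
One way to cut down on the trial and error is to list the copy sources recorded in the full dump before filtering. This is only a sketch and not something I relied on above: it assumes a grep that supports the -a flag (GNU and BSD grep do), and it lists every copy source in the repository, so you still have to pick out the ones that relate to your project. The function name is hypothetical.

```shell
# List every distinct copy source recorded in a dump file.
# list_copy_sources is a hypothetical helper name.
list_copy_sources() {
    dump_file=$1
    # -a makes grep treat the dump as text even if it holds binary content.
    # Copies and moves are recorded in "Node-copyfrom-path:" header lines.
    LC_ALL=C grep -a '^Node-copyfrom-path: ' "$dump_file" \
        | sed 's/^Node-copyfrom-path: //' \
        | sort -u
}

# Usage: list_copy_sources Websites.bak
```

Any path in that list which sits outside the prefixes you are including, but was copied into the project, is a candidate for an extra include path.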

Loading the dump

So now we have a dump which has successfully filtered every revision within the “Websites\ProjectPath” folder (and of course, the other paths the project may have occupied before that). The next step is to load this into a brand new repository which in my case I’ve called “ProjectRepo”.

Back to the svnadmin command again but this time with the “load” subcommand. The first thing we’re going to do is to use the --ignore-uuid option. This is important because the new “ProjectRepo” has its own unique identifier, different to that of the original repo, and this option stops the load from replacing it with the one recorded in the dump. Next we’ll just specify the repo we want to load into and the path of the dump file.
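
Sketched as a shell function, assuming a POSIX shell and a hypothetical /srv/svn/ProjectRepo location for the new repository, the load step might look like this. The function name is illustrative; svnadmin create and svnadmin load --ignore-uuid are the standard subcommands.

```shell
# Create an empty repository (if needed) and load a filtered dump into it.
# load_dump is a hypothetical helper name.
load_dump() {
    new_repo=$1     # e.g. /srv/svn/ProjectRepo (assumed location)
    dump_file=$2    # the filtered dump produced earlier
    # Create the target repository only if it does not exist yet.
    [ -d "$new_repo" ] || svnadmin create "$new_repo"
    # --ignore-uuid keeps the new repository's own UUID instead of
    # adopting the one recorded in the dump.
    svnadmin load --ignore-uuid "$new_repo" < "$dump_file"
}

# Usage: load_dump /srv/svn/ProjectRepo ProjectPath.bak
```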

What this command is going to do is step through the dump, revision by revision, and restore it into the new repo. You can actually watch this as it goes; open up your favourite repo browsing tool and whilst the load is running you’ll see the project being gradually reconstructed in chronological order with sequential revision numbers (no empty revisions). Until you hit another error…

Fixing the dump

This is the second point where things can start to go awry. Most times the load will be a smooth process but as with the missing paths error above, a repo that has moved around a lot may well end up with errors.

Let’s look a bit closer at this error: “File not found: transaction ‘0-1’, path ‘Websites/ProjectPath’”. To understand what’s going on here we need to get our hands dirty and start looking inside the dump file.

Because the dump file is uncompressed and plain text (except for any binary content, of course), it’s easy to open it up in a text editor (I usually use Notepad++). When you do you’ll notice the file is broken up into a series of commits, each with its own revision number (line 15 in the image on the right), commit message (line 22), author (line 26) and then a series of paths and actions.

The problem lies in the very first revision. What’s happening is that when line 33 is being executed, the path “Websites” does not already exist and this is why we’re getting the “File not found” error above. This transaction worked just fine in the original repository because the path had already been created in a previous revision, but because the filter ran at a level lower than this, the transaction which created the folder wasn’t included in the dump.

The fix is easy; we just need to rewrite history! This is actually a good thing because it will allow us to get the repo structure right. Ideally, the work should be in a folder called “trunk” at the root of the repository. What we’re going to do is change every occurrence of “Websites/ProjectPath” to just “trunk” with a quick find and replace. This way when the revisions are imported to the repo it will appear as if the structure was correct from day 1.
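
If you’d rather script the replacement than do it in an editor, here is a narrower variant of the find and replace, offered as a sketch rather than the exact steps I used. It only rewrites the “Node-path:” and “Node-copyfrom-path:” header lines, on the assumption that no file content in the dump happens to start a line with those headers and that the old path contains no “|” or regex metacharacters. The function name is hypothetical.

```shell
# Rewrite a path prefix in the node headers of a dump file.
# rewrite_dump_paths is a hypothetical helper name.
rewrite_dump_paths() {
    in_dump=$1
    out_dump=$2
    old_path=$3     # e.g. Websites/ProjectPath
    new_path=$4     # e.g. trunk
    # Match the old path only when it is the whole path or is followed
    # by "/", so a sibling such as Websites/ProjectPath2 is left alone.
    # File contents are not touched, so their recorded lengths stay valid.
    LC_ALL=C sed \
        "s|^\(Node-\(copyfrom-\)\{0,1\}path: \)${old_path}\(/.*\)\{0,1\}\$|\1${new_path}\3|" \
        "$in_dump" > "$out_dump"
}

# Usage: rewrite_dump_paths ProjectPath.bak ProjectPath-trunk.bak \
#            Websites/ProjectPath trunk
```

Writing to a new file rather than editing in place means the filtered dump is still there if the rewrite needs another attempt.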

Rinse, lather, repeat

The missing paths problem is one which may occur multiple times in the repo. Unfortunately you don’t know how often or where until you run through the load process again and it fails. Each time you need to go through and fix the dump file as per the process above, then run the load again, until everything runs without error.

Summary

This all ended up being a lot more difficult than I originally expected, largely because of the unexpected errors which continued to crop up, the volume of troubleshooting required and the long durations to dump, filter and load a very large repository. Lesson of the day: think very carefully about your repository structure early on because it can be a serious headache to reorder later on.
