I used CVS for three years. It was awful. I’ve used Subversion for the last two. It’s less awful, but I’m really starting to feel its warts. Some issues:

In this talk at Google, Linus Torvalds amusingly says that Subversion is a stupid, doomed project because it claims to be “CVS done right” – and there is no way to do that.

I have to agree. CVS is a kludge around RCS (see rcsintro(1)), which versions each file independently. CVS tries hard to treat groups of files as single entities – which is what you want – but because each file has its own revision number, and because of the way branching works (inherited from RCS’s insanity-provoking scheme of encoding branches in the revision numbers), it becomes a nightmare. Add to this that CVS does not version directories (because RCS didn’t – sigh) and things become intolerable. Try moving a subdirectory of files in CVS. It’s something you never want to do.
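A sketch of what those RCS revision numbers look like for one hypothetical file:

    trunk:               1.1 -> 1.2 -> 1.3
    branch off 1.2:      1.2.2.1 -> 1.2.2.2
    branch off 1.2.2.1:  1.2.2.1.2.1 -> ...

Meanwhile the file in the next directory over is at 1.47, with its own, completely unrelated branch numbers. Multiply by a few hundred files and “nightmare” is the right word.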

Subversion got some of this right. The revision numbering scheme applies to the whole repository. You check in one file – or two hundred – and the whole repo clicks up one revision. Of course each file still knows in which revision it was last modified, but there is one sane numbering scheme.
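For example (filename and revision number invented), committing one file moves the whole repository forward:

    $ svn commit -m "Fix typo in README"
    Sending        README
    Transmitting file data .
    Committed revision 1731.

Every file in the repository is now “at” revision 1731, even the ones that didn’t change.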

Also, directories (and other useful metadata) are under version control in Subversion. And moving things around is relatively easy – though it can make finding files in the log harder, and can wreak havoc on svndumpfilter. Here is what the book says:

Also, copied paths can give you some trouble. Subversion supports copy operations in the repository, where a new path is created by copying some already existing path. It is possible that at some point in the lifetime of your repository, you might have copied a file or directory from some location that svndumpfilter is excluding, to a location that it is including. In order to make the dump data self-sufficient, svndumpfilter needs to still show the addition of the new path – including the contents of any files created by the copy – and not represent that addition as a copy from a source that won’t exist in your filtered dump data stream. But because the Subversion repository dump format only shows what was changed in each revision, the contents of the copy source might not be readily available. If you suspect that you have any copies of this sort in your repository, you might want to rethink your set of included/excluded paths.

Translation: if you ever moved any files or directories, and ever want to use svndumpfilter, your life is going to be terrible. You’ll have to find all the places where things got moved and manually add special “include” and “exclude” lines.

I briefly looked at using svndumpfilter – to split my repo, which contains several projects, into a bunch of repos, each containing one project. After reading about svndumpfilter I gave up on this idea.
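The mechanics themselves would have been simple enough – something like this, for a project living at the path projectA (all paths here are invented):

    $ svnadmin dump /var/svn/repos > full.dump
    $ svndumpfilter include projectA < full.dump > projectA.dump
    $ svnadmin create /var/svn/projectA
    $ svnadmin load /var/svn/projectA < projectA.dump

It’s the copied and moved paths that poison this.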

And this is a modest repo with a bunch of small projects. Woe betide anyone with large active projects facing this kind of decision.

Version control is supposed to help you develop – code or poetry or whatever. Often you want to do something speculative – to try out a crazy idea. To do this without disturbing the “main” line of development (which you might be sharing with others) you want to do this on a branch, and (possibly) later merge it back into the main line. This should be easy – and yet, arguably Git is the first system to get this right.

In CVS branching is hard and merging is harder. In Subversion branching is easy and efficient, but merging is still hard. The user still has to keep track manually of which merges from which pieces of which branches have already been done (the authors, admitting the brokenness of their tool, suggest doing this in the log message)... I’ve done very little of this, and only on modest projects, and it’s still scary. I couldn’t imagine trying to merge changes to Linux’s SCSI subsystem this way.

Again, here is what the Subversion book says:

Ideally, your version control system should prevent the double-application of changes to a branch. It should automatically remember which changes a branch has already received, and be able to list them for you. It should use this information to help automate merges as much as possible.

Unfortunately, Subversion is not such a system; it does not yet record any information about merge operations. When you commit local modifications, the repository has no idea whether those changes came from running svn merge, or from just hand-editing the files.

What does this mean to you, the user? It means that until the day Subversion grows this feature, you’ll have to track merge information yourself. The best place to do this is in the commit log-message. As demonstrated in the earlier example, it’s recommended that your log-message mention a specific revision number (or range of revisions) that are being merged into your branch. Later on, you can run svn log to review which changes your branch already contains. This will allow you to carefully construct a subsequent svn merge command that won’t be redundant with previously ported changes.
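Concretely, the ritual looks something like this (URL and revision numbers invented):

    $ svn merge -r 343:344 http://svn.example.com/repos/trunk
    U    integer.c
    $ svn commit -m "Merged revisions 343:344 from trunk."

...and before the next merge you get to grep the log, reconstruct which revisions have already been ported, and hope nobody ever forgot to write one down.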

This is what makes Linus laugh out loud. In his Google talk he says something like “yeah, they made branching easy, but merging is still hard, and they’ve had five years and they still haven’t fixed it!”

One last carp about Subversion – and CVS, and practically every other system out there: it has no notion of data integrity. Files are stored as files – really as a file plus a pile of reverse deltas to earlier versions – but there are no hashes, CRCs, or checksums to verify that the file I checked in two years ago and the one I checked out today are actually the same file. Memory corruption happens; so does disk corruption. It’s not a question of if but of when.

A version control system is also an archival medium. It’s a place to put things that you don’t want to lose. So it would be heartening to know that your VCS is doing its best to keep your data safe, right?

Subversion isn’t.

I was excited about Subversion when I first started using it. “CVS done right” didn’t sound so bad to me. Now I agree with Linus. The abstractions are wrong. The model is wrong. (I should have been worried by the fact that the Subversion developers seem to actually like Apache.)

And I’m not talking about Subversion’s central repo nature, but about its data model: history as a versioned tree of mutable files, rather than an immutable, content-addressed store of objects – more on why that matters below.

I’m giving up on Subversion. Let’s talk about Git instead.

Getting Git

Git is Linus Torvalds’s new (as of spring 2005) version control system, written when he lost the license to use BitKeeper for free.

A Red Hat engineer talks about why Git matters to him, and in another post he highlights the functional-filesystem aspects of Git: two repos on the same machine can share objects, since objects are immutable.

http://marc.info/?l=git&m=116129092117475
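A quick sketch of that sharing (paths invented): cloning with --shared makes the new repository borrow objects from the old one through .git/objects/info/alternates instead of copying them:

    $ git clone --shared /home/me/project /home/me/experiment
    $ cat /home/me/experiment/.git/objects/info/alternates
    /home/me/project/.git/objects

This is only safe because objects never change once written.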

blame is slow in Git, but Linus doesn’t use it – it doesn’t give him the big picture. I agree. You see lines in a file, and who changed them, but you don’t see the larger context of the change, and you certainly don’t see the lines that aren’t in the file anymore.

http://marc.info/?l=linux-kernel&m=111314792424707

Also nice that Linus uses epoch time (the count of seconds since 1970-01-01 UTC) as Git’s timestamp format – zero parsing! Why don’t more people do this? HTTP, e.g.
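For instance, this is all a commit object stores for its dates – the raw epoch value plus a UTC offset (hashes and names below are placeholders):

    $ date +%s
    1176304980
    $ git cat-file commit HEAD
    tree <sha1>
    parent <sha1>
    author A. Hacker <hacker@example.com> 1176304980 -0700
    committer A. Hacker <hacker@example.com> 1176304980 -0700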

From http://lwn.net/Articles/165202/:

I’d say that the huge advantage of Git over everything is that it’s dead simple to interface with. So you can say, “I wish I had a revision control system that worked like this,” and it’s a couple of shell scripts to do, and the results are compatible with everybody else. (For example, I’ve got a 237-line Python CGI that lets you edit files in a Git repository on the web, and then commit sets of changes. This required a dozen-line patch to an obscure part of Git, and that only because I was the first person to try to work with no working tree and no temporary files. Try doing that with cvs, or practically anything else.) More generally, it means you can really integrate revision control into your processes however you want, rather than just following the VCS’s idea of how you interact with it.

This matters to me because I (briefly) considered trying to interface with Subversion to get my wiki pages out of it (by talking to the API)... until I read (in the svn book) about how that part of the system is a bit flaky and not as well-tested as the core...
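With Git the same job is one plumbing command, no API required (the path is invented):

    $ git cat-file blob HEAD:wiki/FrontPage

That prints the page’s contents as of HEAD, straight from the object store – no working tree, no temporary files.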

There are lots of interesting links – including the mailing list links above – in the Wikipedia article Git (software).

From the Google talk: Git’s cryptographic hash (SHA-1) guarantees the integrity of your data – even years later, after it has moved from hard disk, to DVD, to some other archival medium.
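That hash isn’t decoration bolted on afterward: an object’s name is the SHA-1 of its contents, so you can verify it without Git’s help at all. Here’s a quick check of the six-byte blob “hello\n”:

    $ echo hello | git hash-object --stdin
    ce013625030ba8dba906f756967f9e9ca394464a
    $ printf 'blob 6\0hello\n' | sha1sum
    ce013625030ba8dba906f756967f9e9ca394464a  -

And git fsck re-hashes the whole object store on demand, so corruption announces itself instead of silently propagating.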