Apps

Why developers are abuzz about distributed version control systems

Learn what sets distributed version control systems apart from traditional version control systems, and why more and more developers prefer distributed version control.

Version control is something that most developers see as a fact of life. You need it so you don't make a mess, but it's a bit of a hassle, and it rarely does anything on its own to make your life easier.

When I hear developers talk about version control, it is usually to complain, especially about some of the older systems that are (fortunately) discontinued, unsupported, and quickly going away. So when developers I know and respect started talking about how much they like their distributed version control systems (DVCSs), I paid attention. Based on my research, here's a look at what makes these systems different from traditional version control systems, and why many developers prefer them.

What sets DVCS apart

The primary difference between DVCS and traditional version control is where the repository lies.

In traditional version control systems, there is a centralized repository (perhaps mirrored to a few nodes for performance and redundancy) that contains the master copy of every file. Developers have a non-authoritative copy on their local drives, and when a user wants to make their copy the master copy, it needs to be checked back in. Users can lock files in the repository to signal to others that the file is being changed, and when a file is checked into the repository, any differences between what is on the server and what is being checked in need to be resolved if the files were changed separately by different users. If you want to experiment or create a different version, you branch the code tree, which makes an entirely different copy. Eventually, you can merge the two trees back together again, and you have to resolve the differences between them at that time.

A DVCS turns this around. Instead of a centralized repository on a central server, each developer maintains their own repository (this is what makes it "distributed"), much like a peer-to-peer network. This creates far more redundancy than a traditional version control system provides, but in and of itself, that is nothing special. The big difference is the approach to check ins. A DVCS works in change sets and creates one on each check in, whereas traditional systems just replace the file and give it a new version number. If two developers make changes based on the same parent in the tree, this creates two parallel branches automatically. Merging can then happen more easily, because the system knows what is actually different between the two branches, as opposed to just seeing a different version number and forcing a comparison. The Mercurial wiki has an excellent tutorial that explains this in depth.
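As a conceptual sketch only (not how any real DVCS stores its data -- the changeset ids and field names here are invented), you can picture each repository as a set of changesets that each record their parent. A pull just copies over missing changesets, and two check ins made against the same parent automatically leave two heads, which is to say two parallel branches:

```python
# Toy model of a DVCS repository: a dict of changesets, each recording
# its parent. Purely illustrative -- not Git's or Mercurial's real format.

def heads(repo):
    """Changesets that nothing else lists as a parent -- the branch tips."""
    parents = {cs["parent"] for cs in repo.values() if cs["parent"]}
    return {cid for cid in repo if cid not in parents}

def pull(local, remote):
    """Pulling copies over whichever changesets the local repo lacks."""
    merged = dict(local)
    for cid, cs in remote.items():
        merged.setdefault(cid, cs)
    return merged

# Alice and Bob both start from changeset "a1" and commit independently.
common = {"a1": {"parent": None, "msg": "initial import"}}
alice = {**common, "b2": {"parent": "a1", "msg": "alice's fix"}}
bob = {**common, "c3": {"parent": "a1", "msg": "bob's feature"}}

# After a pull, two changesets share the parent "a1", so the combined
# history has two heads -- two parallel branches, created automatically.
combined = pull(alice, bob)
print(sorted(heads(combined)))  # ['b2', 'c3']
```

Because every changeset names its parent, the tool can always find the common ancestor of any two branches, which is what makes the merging story below possible.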

Where DVCS shines

There are two big advantages cited with DVCS: the distributed end of things and the workflow. The distributed aspect is great for geographically dispersed teams, loosely organized teams, highly mobile teams, and other scenarios where not everyone is always connected to a central repository. But, as Joel Spolsky said, it's the branching/merging that is much more interesting, and from what I've read, I agree.

Because of the nature of change sets, it is much easier to merge someone else's changes with your own, because you aren't trying to reconcile your changes to a central branch and their changes to a central branch -- you can just get the changes they made instead. This makes it much easier to actually experiment, create various proof of concept versions, cut production or QA versions, and so on. And when these tasks become easier, they happen more often.
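The key is that the common ancestor of both sets of changes is known. A minimal sketch of three-way merging at line granularity (a deliberate oversimplification -- it assumes line-for-line edits with no insertions or deletions, which real merge algorithms of course handle):

```python
# Toy three-way merge: with a known common ancestor, a change made by
# only one side can be taken automatically; only overlapping edits
# become real conflicts.
def three_way_merge(base, mine, theirs):
    merged = []
    for b, m, t in zip(base, mine, theirs):
        if m == t:        # both sides agree (or neither changed the line)
            merged.append(m)
        elif m == b:      # only "theirs" changed this line
            merged.append(t)
        elif t == b:      # only "mine" changed this line
            merged.append(m)
        else:             # both changed the same line: a genuine conflict
            merged.append(f"<<CONFLICT: {m!r} vs {t!r}>>")
    return merged

base   = ["a", "b", "c"]
mine   = ["a", "B", "c"]   # I changed line 2
theirs = ["a", "b", "C"]   # they changed line 3
print(three_way_merge(base, mine, theirs))  # ['a', 'B', 'C']
```

Without the ancestor (which is all a version number gives you), the tool cannot tell which side changed a differing line, so every difference has to be resolved by hand.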

As someone who has been using Team Foundation Server (TFS) for a couple of years, I hate the merging story. It's so bad that I've learned to not branch, and we had a lot of growing pains when our developers started working together on the same project. I can definitely see how this is a huge advantage when working with large teams with a lot of moving parts because each sub-team can work on their own part of the system and easily merge back into a project repository when their changes are ready to be seen by others.

Plans to experiment with DVCS

I'm frustrated with TFS. Is it a bad tool? Not at all. TFS has a lot of built-in functionality, such as the ability to tie work items to check ins, decent reporting capabilities, and build management. But it is miserable to use for version control. At least once a week, someone on our team has a TFS question, and it is not user error; it is a problem with the sheer complexity of TFS and its unintuitive revision model.

I am ready for a change. I will probably experiment with either Git or Mercurial on personal projects in the near future, and if that pans out, I will look into bringing it back into work. Based on the results and comments in my recent TechRepublic poll about version control, it seems that I am not the only one moving in this direction.

J.Ja

Disclosure of Justin's industry affiliations: Justin James has a contract with Spiceworks to write product buying guides; he has a contract with OpenAmplify, which is owned by Hapax, to write a series of blogs, tutorials, and articles; and he has a contract with OutSystems to write articles, sample code, etc.

About

Justin James is the Lead Architect for Conigent.

17 comments
HeadScratcher7

I recently started trying to use Bazaar for my personal projects. I liked the fact that it puts just one .bzr folder in the main directory -- unlike svn, which spreads .svn folders everywhere. I really wanted to like this software, but for me it seems to lack polish, and doing even simple things like checkout/commit from a Subversion repo could be trying (it has an svn plugin). For instance, it defaults to leaving all files unchecked for a commit. I had to walk through the list and check the modified files to include -- which bit me badly one day when I missed one on a check in from home and then had to drive back to the office to retrieve the file. Subversion just works -- but perhaps a big part of that is that I'm very familiar with its standard workflow.

Some of the problems I had: conflicts could cause an update to fail. Retrying the update would cause just-updated files to be updated again and the previously updated versions renamed to file.moved. Then I hit a bug in the conflict resolution panel that kept me from doing check ins. I finally ended up uninstalling Bazaar entirely, but now I realize some of the problems I was having with making it co-exist with Subversion may not have been Bazaar's fault. First, I decided to install Subversion and use it for work-related projects instead of using Bazaar's svn plugin. But I couldn't get TortoiseSVN to show up in my context menus with TortoiseBzr already there. What I now know is that on my Win7 x64 system, the 32-bit version of TortoiseSVN won't install into the context menu. (I'd previously blamed this on bzr.)

So I may give Bazaar another try. Perhaps I'll remove its svn plugin and see if I can get both to coexist in the context menu. Has anyone compared Bazaar against Mercurial? Or has anyone had success getting two VCSs that both use Tortoise to coexist in the context menu? I know one of the big selling points for Bazaar is that it's supposed to treat folders as first-class objects.

ChazConsult

I have been using Mercurial at work for years and like it a lot. It is so much easier for merging changes from multiple developers than any other non-DVCS I have used -- and I have used a bunch. Note: I have not used Git, so I cannot comment on it.

a_vagga

If a DVCS creates a changeset, so does TFS upon each check in... so what's the fuss about DVCS?

thelalj

Having recently moved from SVN to Mercurial, I've got to say how much easier it is to use. One of the great things about Hg is the ease of branching and merging. It makes it almost easy to work on different versions of the same project and to merge bug fixes/upgrades from an old branch into a newer version of the same project. Joel Spolsky's excellent introductory tutorial at hginit.com also gets one up to speed very quickly. Also, as apotheon says: the command line tool, hg, is much easier to grasp, which never hurts take-up.

Jaqui

1) If there are 75 MB of files in the project, a Git repo is 150 MB in size. The "change sets" are literally the same size on the drive and in data transfer as the source code files, so you use double the drive space and double the data transfer.

2) It does NOT lend itself to easily making a tarball of the stable trunk of the project. This means you have to fight to get an archive of the sources for people to use on a system after building from source as a consumer of the product rather than a developer. If you are only using the project as an end user, the "change set" info in a repo checkout is wasted space, and the extra data transfer for the change set files is wasted. Git makes no effort to support this use of the sources.

Sterling chip Camden

You got it right when you said that making such operations easier makes them happen more frequently. Designed for agile instead of waterfall.

apotheon

I've looked into it, but haven't really done side-by-side comparisons in use. I've read a lot about Bazaar and touched it a couple of times, though. It seems to be slower, more prone to issues, and more prone to regular changes in the interfaces for, and implementations of, basic functionality. It also does not seem to add anything that you don't get with Mercurial. Furthermore, I'm not as impressed with its code hosting options as I am with the options for Mercurial -- which include BitBucket, ShareSource, and Google Code, among others.

While I'm at it, I like Mercurial's C and Python codebase more than Bazaar's C, Pyrex, and Python codebase (or Git's Bourne shell, C, and Perl codebase, for that matter). While I'm a fan of using the right tool for the job, and thus have no problem with a codebase that's split between a high-performance language and a higher-level dynamic language (say, C and Python), I'm not a fan of unnecessarily diversifying the implementation languages (why use Bourne shell and Perl? why use Python and Pyrex?). I know that Pyrex is basically an extension to Python intended to ease the process of writing parts of modules in C, but it still seems like an unnecessary layer of abstraction to me.

I am not yet sold on the idea that we actually need folders to be treated as first-class objects, either. I guess it feels more "natural" that way, but it's just not really necessary, I think.

As for TortoiseSVN -- I used it a little bit years ago and did not really have problems with it, but I wasn't trying to use it with another VCS at the time, and did not use it heavily. Basically, my major VCS use has always been in a Unix environment.

Justin James

TortoiseSVN never worked for me, period, even by itself. I've discussed with others who also had bad experiences with it. I haven't tried the rest of your combination, but for me, I wasn't impressed with TortoiseSVN. J.Ja

apotheon

Git, Bazaar, and Darcs are three of the four most popular open source DVCSes, and none of them uses changesets for its history model. Two of them use snapshots, and one (Darcs) uses patches. Of the four big names in that space, only Mercurial actually uses changesets per se. Meanwhile, the venerable (and obsolete) CVS uses changesets -- and you really don't want to have to merge with CVS.

What really matters is that DVCSes are designed with branches in mind, because they're intended to be branched every time anyone does any work on the source. It is assumed from the beginning that two different developers will be working at the same time. Each DVCS takes a somewhat different approach to managing that, but the end result is that they tend to be very good at managing the merging process, because merging is such a big, important part of their assumed workflow.

Centralized VCSes assume a central repository, where all commits happen in one place, and as such the merging process is essentially tacked on as an afterthought. This isn't a problem when a strictly linear workflow must be enforced, because that kind of workflow assumes no branching, merging, or parallel development will ever happen, usually ensured by mechanisms such as locking -- but I think an actual need for that kind of workflow is much rarer than most people using CVCSes realize. On the other hand, there are people using DVCSes who think it's never needed, which is also not the case.

apotheon

On single-user projects under certain types of controlled circumstances, Subversion still beats the popular DVCSes. For pretty much everything else, though, Mercurial is my VCS of choice.

The three Big Names in open source DVCSes are Bazaar, Git, and Mercurial. Bazaar is, from what I've seen, in many respects like a Mercurial knock-off with poorer performance, somewhat worse design, and a totally unstable design philosophy (in that they keep changing it). I'm also less than encouraged, as a developer, by the fact that it uses Pyrex for a significant percentage of its source, which would mean I would not only have to deal with Python, but also learn a new Python sub-language, if I wanted to start hacking the source -- but that probably doesn't matter to most people. Git, meanwhile, is a remarkably big codebase compared to the other two (more than twice as big as Bazaar, and about nine times as big as Mercurial) that includes a lot of shell scripts, only offers a tacked-on hack for file rename support, and provides no internationalization support.

It's also worth noting that Mercurial's history model is better suited to saving space than the other two, because it stores changesets rather than snapshots -- and it is still remarkably fast (faster than Bazaar in my experience). I think Git is the fastest of the bunch, but only by a matter of degrees that in practice would likely be difficult to detect without some micro-benchmarking. I also find the interface for hg (the command line tool for Mercurial) friendlier and better suited to the Principle of Least Surprise than the interface for git.

My preference, as I said, is for Mercurial. Now you know some of the reasons why. Your mileage may vary, of course.
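To make the space point concrete, here's a back-of-the-envelope sketch comparing the two storage models. This is a toy comparison only -- real systems compress and pack their storage, which narrows the gap considerably -- but it shows why a delta history wins when each revision changes only a little:

```python
import difflib

# Three versions of a 100-line file, each a one-line edit to the last.
v0 = [f"line {i}\n" for i in range(100)]
v1 = list(v0)
v1[10] = "line ten, edited\n"
v2 = list(v1)
v2[50] = "line fifty, edited\n"
versions = ["".join(v) for v in (v0, v1, v2)]

# Snapshot model: store every version in full.
snapshot_bytes = sum(len(v) for v in versions)

# Changeset (delta) model: store the first version in full,
# then only a unified diff per revision.
delta_bytes = len(versions[0])
for old, new in zip(versions, versions[1:]):
    diff = "".join(difflib.unified_diff(old.splitlines(True),
                                        new.splitlines(True), n=0))
    delta_bytes += len(diff)

print(snapshot_bytes, delta_bytes)  # the delta history is far smaller here
```

The flip side, as noted elsewhere in this thread, is that tiny files with per-diff overhead can go the other way, which is part of why real tools combine deltas with compression and periodic full snapshots.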

Justin James

... what's important is how they are handled. TFS's handling of changesets stinks. A changeset is treated as a group of files checked in at the same time, but as far as TFS is concerned, they have nothing to do with each other beyond sharing a changeset number and a check in comment. A DVCS treats a changeset as a discrete unit to be worked with, a branch in and of itself. For example, with a tool like Mercurial, you can apply one changeset to another branch quite easily. Try that in TFS... saying, "take Bob's experimental branch and apply his changes to the main trunk" isn't easy. Possible? Yes. But enough effort that it really discourages working like that. J.Ja
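As a footnote to the above: a toy sketch of what "apply one changeset to another branch" means mechanically. Everything here (the line-replacement changeset format, the function names) is invented for illustration; real tools such as Git's cherry-pick or Mercurial's transplant/graft machinery do this far more robustly, working from the recorded parent rather than raw line matching:

```python
def make_changeset(old_lines, new_lines):
    """Record which lines changed (toy: same-length, edit-only diffs)."""
    return {i: (o, n)
            for i, (o, n) in enumerate(zip(old_lines, new_lines))
            if o != n}

def graft(changeset, target_lines):
    """Replay a changeset onto another branch, if its old lines still match."""
    result = list(target_lines)
    for i, (old, new) in changeset.items():
        if result[i] != old:
            raise ValueError(f"conflict at line {i}")
        result[i] = new
    return result

trunk = ["setup()", "run()", "teardown()"]
# Bob's experimental branch fixed run() -- captured as one discrete changeset.
bobs_fix = make_changeset(trunk, ["setup()", "run(retries=3)", "teardown()"])
# Apply just that changeset to the trunk -- no full-branch merge needed.
print(graft(bobs_fix, trunk))  # ['setup()', 'run(retries=3)', 'teardown()']
```

Because the changeset is a self-contained unit, it can be replayed anywhere its context still matches -- exactly the "take Bob's changes and apply them to the trunk" operation described above.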

Sterling chip Camden

... of moving FreeBSD sources to either git or mercurial. A project of that size might make performance more of a consideration.

apotheon

One thing I really like about Mercurial, as opposed to many CVCSes, is that it makes collaboration a lot better. You can commit changes locally with frequency, making sure you have very small steps to retrace in case of a problem. It means you can check in code that breaks the build locally, as part of a larger overall change that will not break the build. Only when you have everything in working order do you need to commit somewhere that other people might get the code. In a CVCS, by contrast, committing to the central repository is the only way to check in your code at all. That means you can't make small, incremental, frequent commits that break the build without breaking it for everyone, and that's a terrible idea. That reason alone should be enough to consider using a DVCS instead.

Sterling chip Camden

... and more about bandwidth. Distributed repos do not require staying connected to the central repo in order to do commits. When the central repo is on a VPN somewhere across the country, the time required for a push or pull can discourage that activity. You want frequent commits in order to properly track changes, so if those can be done easily on a local repo then you improve the project.

a_vagga

Thank you, Apotheon/J.Ja. As far as my experience with TFS is concerned, I've found applying changesets to other branches quite easy and convenient. Not that I am a big advocate of TFS, but I am really trying to understand what it is about DVCS that is driving its adoption in a big way, as mentioned in this article. I am just keen to understand this new paradigm. From what I have gathered from you all, I am not really convinced to move from something stable to something new...

apotheon

I'm not very familiar with TFS in practice. I just know some stuff about how it works in a general sense, for the most part -- so all I could really talk about is the difference between a DVCS like Mercurial and a CVCS in general, with a little bit of my abstract knowledge of TFS for color. Your direct experience with it helps nail down the details, of course.

apotheon

On the other hand, big projects that differ significantly from the Linux kernel project provide the perfect opportunity to find unexpected bottlenecks in a tool like Git. It's really anyone's game until the tools are tested in the new environment. I wonder if the snapshot vs. changeset history models will become an issue on the storage side of things.