
Don't fear the fork: How DVCS aids open source development

Forking is not the danger to a good open source project that people think it is. In fact, forking can -- and should -- be a good thing.

Every once in a while, some extremely popular open source project faces what is generally regarded as one of the most painful, frightening experiences for such a project and its user community: the fork. An argument can be made that divergent evolution for purposes of specialization — such as when Knoppix burst onto the scene, based on Debian but customized for use as a LiveCD — is not a "true" fork. A fork, one might argue, is only what happens when the codebase is copied and taken in a slightly different direction because it is intended to replace (or at least compete with) the original project due to disputes between people who have different visions for it, rather than being intended to complement it by filling an otherwise empty niche.

LibreOffice is a recent example of such an acrimonious fork, though the overtones of LibreOffice's guiding Foundation are consciously friendly. This fork is a direct result of Oracle acquiring OpenOffice.org as part of its Sun buyout. In the words of Glyn Moody's article, "The Deeper Significance of LibreOffice 3.3," in ComputerWorld UK - Open Enterprise:

Real forks are relatively few and far between precisely because of the differences between forking and fragmentation. The latter may or may not be inconvenient, but it's rarely painful in the way that a fork can be. Forks typically tear apart coding communities, demanding that programmers take sides.

As he points out, this kind of hostile forking (the open source equivalent of the corporate world's hostile takeover) is hard on the developer community. In fact, it is generally harder on that community than it is on the user community. Clearly, Mr. Moody is one of those people who subscribe to the above description of the difference between a "true" fork and specialized divergent evolution, the latter of which he calls "fragmentation".

This sort of thing typically arises in the wake of the people in charge of an open source project acting contrary to the spirit of open source development. Developers and users both rebel against this, feeling that their own personal stakes and investments in the software are being squandered by control freaks. Right or wrong, this attitude is effectively the norm amongst community-participant developers and users who are outside the reigning inner circle of basically any popular open source project. If you are part of that inner circle, you ignore or denigrate that attitude and take a more strictly proprietary attitude toward the project at your own risk.

This is only one type of fork, though, and should be viewed as an aberration rather than the strict definition of "fork". One problem with this definition of "fork" is that it does not follow from the etymology of the term. When people refer to "forking" an open source project, they are making an analogy to a term used by programmers for other purposes: the creation of a copy of a running process. MS Windows does not provide a native fork() function, but just about any other modern general purpose operating system does basically the same thing that our open source Unix-like OSs do. A similarly appropriate term might be "clone": initially identical, but diverging over time.
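For readers who have not met the programming sense of the term, here is a minimal sketch in Python of what fork() does on a Unix-like system. The function name fork_and_diverge and the variable names are made up for this illustration, and the example will not run on MS Windows, where os.fork is unavailable.

```python
import os


def fork_and_diverge():
    """Fork the current process, let the child change its own copy of a
    variable, and return (parent_value, child_value) as seen by the parent."""
    value = "original"
    read_fd, write_fd = os.pipe()  # lets the child report back to the parent

    pid = os.fork()  # Unix-like systems only
    if pid == 0:
        # Child process: starts as a copy of the parent, then diverges.
        os.close(read_fd)
        value = "changed in child"
        os.write(write_fd, value.encode())
        os.close(write_fd)
        os._exit(0)  # leave immediately without running the parent's code

    # Parent process: its own copy of `value` is untouched by the child.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as reader:
        child_value = reader.read().decode()
    os.waitpid(pid, 0)  # reap the child
    return value, child_value


if __name__ == "__main__":
    parent_value, child_value = fork_and_diverge()
    print("parent sees:", parent_value)
    print("child saw:  ", child_value)
```

Both processes begin with the same state, and a change made in one is invisible to the other unless it is explicitly communicated. That is the behavior the open source sense of "fork" borrows.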

Developers are increasingly using a type of software that often encourages the use of the term "clone": distributed version control systems, also known as DVCSs. For those who are not aware of them, they are distributed approaches to the same thing that has been done with version control systems for years. They track changes that have been made in a codebase so that those changes can be undone if needed, freeing developers from the fear that every change to the code might break something. This can also provide smoother processes for merging changes from more than one developer at a time, or maintaining updates for an older version while working on a new release version.

Traditional, centralized VCSs — or CVCSs, as some are now calling them — store the canonical codebase on a central server. Developers "check out" the current state of the code, then make changes to that. When they have a working update that solves the problem they set out to solve, they check for changes that others might have "checked in" to the main repository, merge if they need to, and check in their own changes so that everybody else on the project can get them. Distributed VCSs differ in that there is no central server from the point of view of the VCS software.

In practice, many open source projects effectively maintain a central, canonical server repository that is used in a manner similar to how CVCSs work, even when they use a DVCS. For such projects, you might ask why they do not use a CVCS, but while a centralized system might actually be better in certain projects, open source projects benefit from the use of a DVCS even when using a centralized project management model. The more ad-hoc approach taken to task assignment, along with the consequences of that approach, is aided by the ability to check in changes locally in a broken state. By contrast, checking in code to a central server when it is not in a working state is a good way to anger or annoy almost everybody working on the project.

That is the secret to the value of the DVCS, really: every developer's copy of the codebase is its own version control repository. This gives the developer all the benefits of working solo, committing changes and backing them out at whim in whatever workflow best serves his or her aims, without inflicting the results of such behavior on other developers. Because DVCSs are designed to work this way, though, they are also designed to allow people to merge changes between any two arbitrary "copies" of the same codebase. This means that two developers who want to collaborate on a particular codebase can track their changes in their own individual source code repositories, and share that code easily between them when they reach a point where they deem it a good idea to synchronize their codebases. When they have something complete and fully working, they can then push those changes to the canonical codebase maintained by the core developers.
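The workflow just described can be sketched with a deliberately simplified toy model in Python. This is not how Git or Mercurial store history (real DVCSs track a graph of commits identified by hashes, and merging can involve conflicts); the Repo class, its methods, and the commit labels are all invented for illustration.

```python
class Repo:
    """A toy repository: just a name and an ordered list of commit labels."""

    def __init__(self, name, commits=()):
        self.name = name
        self.commits = list(commits)

    def clone(self, name):
        """Create a new, independent repository with the same history."""
        return Repo(name, self.commits)

    def commit(self, message):
        """Record a change locally; no other repository is affected."""
        label = f"{self.name}#{len(self.commits) + 1}: {message}"
        self.commits.append(label)
        return label

    def undo_last(self):
        """Back out the most recent local commit."""
        return self.commits.pop()

    def pull(self, other):
        """Bring in any commits from another repository that are missing here."""
        incoming = [c for c in other.commits if c not in self.commits]
        self.commits.extend(incoming)
        return incoming


if __name__ == "__main__":
    main = Repo("main")
    main.commit("initial import")

    alice = main.clone("alice")
    bob = main.clone("bob")

    # Alice commits broken work locally, backs it out, and tries again.
    alice.commit("half-finished parser")
    alice.undo_last()
    alice.commit("working parser")
    bob.commit("add tests")

    # Alice and Bob synchronize directly, without touching the main repository.
    alice.pull(bob)
    bob.pull(alice)
    print("main before:", main.commits)

    # Only finished work reaches the canonical repository (a push from Alice,
    # modeled here as the main repository pulling from her).
    main.pull(alice)
    print("main after: ", main.commits)
```

Nothing in the model makes main special. It is "canonical" only because the people involved agree to treat it that way.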

The reason for the scare quotes around "copies" should become clear here. The various codebases stored in numerous individual, distributed repositories are not really copies, even if they started that way. They are individual repositories that just happen to store code with the same genetic ancestry. It is only by convention and agreement that any one repository is considered the "main" repository. This is why, rather than "checking out" a "copy" as one would with a CVCS, users of a DVCS "clone" a codebase. The two leading open source DVCSs, Git and Mercurial, use the command clone to refer to the process of creating a new repository from another. If a developer has commit access to the parent repository, that developer can then "push" changes from his or her clone to update that central-by-convention codebase.

The concept of cloning brings us full circle. It is, in fact, a fork. Some might consider it a temporary fork when they clone a repository, but there is no technical limitation that necessarily makes it temporary. It is temporary only by convention. Look at the biggest code hosting sites for Git and Mercurial users (GitHub and BitBucket, respectively) for an example of this fact in practice.

Both sites feature a "fork" button right there at the top of the main page for a project repository. Anyone who has a (free) account at either of these sites can click that button to clone the repository within his or her own allotted account space. A clone is a fork, but using a DVCS makes sharing code between forks so easy as to be effectively natural. Rather than being an acrimonious feud between mutually hostile programmer ideologues, forking emerges as the way things are done when people want to help other people out with their software development projects. If you do not have commit access to a given project repository, but you have an idea for how to contribute, you can fork the project, make changes in your own clone, then — using the built-in features of a site like GitHub or BitBucket — send a "pull request" to the admin of the original project site. That administrator can then see to it that the changes are evaluated and tested and, if they are determined to be of sufficient quality and value, they can be merged with the core project.
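The fork-and-pull-request cycle can be modeled in a few lines of Python. This is only a sketch of the idea, not the API of GitHub or BitBucket: the HostedRepo and PullRequest classes, the single-owner permission rule, and the account names are all assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class HostedRepo:
    """A toy hosted repository that only its owner may commit to."""
    owner: str
    commits: list = field(default_factory=list)

    def fork(self, new_owner):
        """Clone this repository into another user's account."""
        return HostedRepo(new_owner, list(self.commits))

    def commit(self, author, message):
        if author != self.owner:
            raise PermissionError(
                f"{author} has no commit access to {self.owner}'s repository")
        self.commits.append(f"{author}: {message}")


@dataclass
class PullRequest:
    """A request that the target's owner merge changes from a fork."""
    source: HostedRepo
    target: HostedRepo

    def new_commits(self):
        """Commits present in the fork but not yet in the target."""
        return [c for c in self.source.commits if c not in self.target.commits]

    def merge(self, approved_by):
        """Merge the fork's new commits once the target's owner approves."""
        if approved_by != self.target.owner:
            raise PermissionError("only the target's owner can merge")
        merged = self.new_commits()
        self.target.commits.extend(merged)
        return merged


if __name__ == "__main__":
    upstream = HostedRepo("maintainer")
    upstream.commit("maintainer", "initial release")

    # A contributor without commit access forks, then works in the fork.
    mine = upstream.fork("contributor")
    mine.commit("contributor", "fix typo in docs")

    request = PullRequest(source=mine, target=upstream)
    print("proposed:", request.new_commits())
    request.merge(approved_by="maintainer")
    print("upstream:", upstream.commits)
```

The contributor never needs write access to the original repository. The fork is where the work happens, and the pull request is simply an invitation to merge it.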

This is, in a nutshell, what it is to be an open source project. Source code is passed around, people hack on it, some of those changes make it back into the main project, and everybody is happy. There is nothing hostile or acrimonious about it. If someone's project clone goes in a direction that the original developers did not envision, and that turns out to be the "right" way to do things, the original developers may even just move their efforts from supporting the formerly central project to supporting the new project. The point, here, is not to feed the ego by maintaining a stranglehold on the project's source code repository, but to help the codebase become the best it can be.

Even if two clones of a project become major projects in their own right, they need not participate in a death match to determine the proper successor. Their mutual successes would suggest that they each serve a needed purpose, and if those needed purposes are incompatible, they can both flourish. Without a fork, that could never happen. Even better, code contributed to one that is equally suitable to the development of the other can be easily shared. FreeBSD, NetBSD, and OpenBSD are all forks following from a common ancestral codebase that originated at UC Berkeley, but those forks did not make those projects into eternal enemies. This is not the Highlander movie franchise; many may exist in harmony, and even help each other improve over time.

DVCSs make that easier, and DVCS-based code hosting sites like BitBucket and GitHub actively encourage that behavior even more directly and obviously, in part by removing the undeserved stigma the term "fork" has acquired in the open source community thanks to unpleasant splits like those prompted or accelerated by Oracle's acquisition of Sun.

In short, the truth of the matter is that open source communities and developers should not fear the fork. Of course, sometimes egos get in the way, but in some respects the number of active forks at any given time should be regarded as a measure of that community's health.

The thing an open source project in the true spirit of open source software development has to fear is not forking, but clean-room reimplementation. When that happens, you know something is wrong.


Chad Perrin is an IT consultant, developer, and freelance professional writer. He holds both Microsoft and CompTIA certifications and is a graduate of two IT industry trade schools.
