Tag Archives: version control

Monolithic repositories versus project repositories

Recently, Hacker News got ahold of Gregory Szorc’s article on monolithic repositories, and even Wired is weighing in on how big codebases should be organized. While the discussion is interesting, it seems to focus on two extremes: on the one hand, putting all of a company’s code into a single monolithic repository, and, on the other hand, breaking a big company’s code code up into many, many small repositories. Both extremes are too simplistic. The better approach is to align repository boundaries with the software’s build, deployment, and versioning boundaries. What does this mean exactly? Perhaps it is best illustrated with some examples:

  • You have a small app that knows how to deploy itself, and you rarely, if ever, need to deploy anything other than the current version of it. Keep the deployment scripts in the same repository as the app.
  • You have a few existing APIs written in the same language and built using similar libraries. You’re about to build a replacement for one of those APIs in a different, high-performance language, which will use a different storage backend. Not only will the build and deployment steps for this new code be different, but you will want to deploy new versions of both the new and old app independently. Here you have build, deployment and versioning boundaries, so it’s natural to use a different repository for the new code. Since it’s a different language, the new code won’t be relying on any existing code, so there’s no temptation to copy and paste code into the new project.
  • You need to support a complex set of deployment environments, clusters of virtual machines running different services, connecting to different backends, testing, staging, and production environments running different versions of services, and so on. Here we have clear versioning and deployment boundaries: your deployment process needs to support deploying different versions of the code. Your deployment process needs to understand and know about different versions of what it is deploying, since you might need to rollback and deploy an earlier version. It also needs to have the ability to deploy, for example, the current stable version of the site backed by MySQL database to the production environment, and also deploy the head of an experimental branch backed by PostgreSQL to the staging environment. You won’t be able to encapsulate the switch from MySQL to PostgreSQL in a single merge of a feature branch anyway, so it’s just a headache to maintain deployment scripts which know about how to deploy to MySQL in the same branch where you have just removed all your MySQL dependent code. So the deployment scripts are better kept in their own repository.
  • You have an API for internal use only, and a single consumer of that API, both running inside your private network. Here you have total control over versioning: you can change the API and its consumer in a single commit, ensure that both are updated at deploy time, and there’s no versioning boundary. If the API and its consumer are also written in the same language, the there should be little to no reason to keep them in separate repositories.
  • You have an API accessible over the public internet, and Android and iOS apps which talk to that API. Here you already have significant versioning boundaries. You can’t be sure that every app out there is up-to-date, so you have to keep the old APIs up and running for a while, or make sure the latest APIs are backwards compatible. And you have significant deployment boundaries: you wouldn’t want to wait two weeks to deploy an update to the API because you are waiting for the new API support in the iOS app to make it to the App Store. Since any version of your API must support several different versions of your mobile apps, and any version of your mobile app must be able to talk to several different versions of your API, there is again no hope of rolling any change into a single (merge) commit, and there’s no added cleanliness or simplicity to gain from having these different pieces of the software in the same repository. So this code can be split into multiple repositories.

There are a few more things to remember when setting up repositories.

Splitting vs. Merging

It’s far simpler to split out part of a monolithic repository than is to merge two independent repositories:

  • If you’re splitting, just create a new repository, push the old code there, remove everything but what you want to split out, and commit. You can then re-organize it if you like, but you don’t have to (and if you’re using git, you can even filter the history). All the files keep their history, and you don’t have to worry about file name collisions.
  • If you’re merging, you most likely have to do some top level reorganization first, then you can merge one repository into the other, but now looking at history before this merge point will show a jumbled mess of commits from both repositories.

Increasing Fragmentation

If you feel like you’re being forced into creating a third repository to store code that is needed by two other repositories, then that’s probably a sign that those two repositories should be a single one. This is a common trap that projects get into; once they have split their repositories too finely, then the only solution seems to be more splitting. When considering this option ask yourself: do these two repositories really have different build, deployment, and versioning boundaries? If not, bite the bullet and merge the repositories, rather than creating a third one for the shared code.

Ease of Access

Ease of access to code is often presented as an advantage to the monolithic repositories model. But this argument is unconvincing. A programmer can still have access to all the company’s code, even if that requires cloning multiple repositories. And the vast majority of cases, a programmer is going to work on two, maybe three, different projects at the same time. It’s not as if Google is filled with programmers who work on WebM encoding for YouTube on Monday, map-reduce for Google search on Tuesday, CSS and JavaScript for Gmail on Wednesday, Java for Android on Thursday, and Chrome on Windows on Friday. Programmers like that are extremely rare, and you shouldn’t optimize your repository structure to make them happy, especially not if it means forcing everybody else to download and track changes to large amounts of code they will never touch.

In Conclusion

To sum up, both the one monolithic repository dogma, and the many small project-based repositories dogma are oversimplified to the point of being harmful. Instead, focus on splitting your code into repositories along its natural versioning, build, and deployment boundaries.

Introducing Fragments

Today I am announcing Fragments, a project I’ve been working on for a few months. It’s on GitHub and PyPI.

Fragments uses concepts from version control to replace many uses of templating languages. Instead of a templating language, it provides diff-based templating; instead of revision control, it provides “fragmentation control”.

Fragments enables a programmer to violate the DRY (Don’t Repeat Yourself) principle; it is a Multiple Source of Truth engine.

What is diff-based templating?

Generating HTML with templating languages is difficult because templating languages often have two semi-incompatible purposes. The first purpose is managing common HTML elements & structure: headers, sidebars, and footers; across multiple templates. This is sometimes called page “inheritance”. The second purpose is to perform idiosyncratic display logic on data coming from another source. When these two purposes can be separated, templates can be much simpler.

Fragments manages this first purpose, common HTML elements and structure, with diff and merge algorithms. The actual display logic is left to your application, or to a templating language whose templates are themselves managed by Fragments.

What is fragmentation control?

The machinery to manage common and different code fragments across multiple versions of a single file already exists in modern version control systems. Fragments adapts these tools to manage common and different versions of several different files.

Each file is in effect its own “branch”, and whenever you modify a file (“branch”) you can apply (“merge”) that change into whichever other files (“branches”) you choose. In this sense Fragments is a different kind of “source control”–rather than controlling versions/revisions over time, it controls fragments across many files that all exist simultaneously. Hence the term “fragmentation control”.

As I am a linguist, I have to point out that the distinction between Synchronic and Diachronic Linguistics gave me this idea in the first place.

How does it work?

The merge algorithm is a version of Precise Codeville Merge modified to support cherry-picking. Precise Codeville Merge was chosen because it supports accidental clean merges and convergence. That is, if two files are independently modified in the same way, they merge together cleanly. This makes adding new files easy; use Fragment’s fork command to create a new file based on other files (or just cp one of your files), change it as desired, and commit it. Subsequent changes to any un-modified, common sections, in that file or in its siblings, will be applicable across the rest of the repository.

Like version control, you run Fragments on the command line each time you make a change to your HTML, not before each page render.

What is it good for?

Fragments was designed with the task of simplifying large collections of HTML or HTML templates. It could replace simpler CMS-managed websites with pure static HTML. It could also handle several different translations of an HTML website, ensuring that the same HTML structure was wrapped around each translation of the content.

But Fragments is also not HTML specific. If it’s got newlines, Fragments can manage it. That means XML, CSS, JSON, YAML, or source code from any programming language where newlines are common (sorry, Perl). cFragments is even smart enough to know not to merge totally different files together. You could use it to manage a large set of configuration files for different servers and deployment configurations, for example. Or you could use it to manage bug fixes to that mess of duplicated source files on that legacy project you wish you didn’t have to maintain.

In short, Fragments can be used anyplace where you have thought to yourself “this group of files really is violating DRY”.

Use it

Fragments is released under the BSD License. You can read more about it and get the code on GitHub and PyPI. And you can find me on Twitter as @glyphobet.

Special thanks to Ross Cohen (@carnieross) for his thoughts on the idea, and for preparing Precise Codeville Merge for use in Fragments.

Push to Github using Mercurial

It’s simple to push Mercurial source code repositories to GitHub (or another git server). This is useful if you can’t, or don’t want to, switch off of Mercurial to git. You can take advantage of GitHub’s superior community and features (sorry, BitBucket) without having to puzzle over ____ (anti-git screed left as an exercise to the reader).

To do this, you’ll need the hg-git plugin for Mercurial, by Scott Chacon and others. The technique is outlined on hg-git’s read-me, but it still took me some effort to get working right, so I thought this blog post might be helpful.

I assume you already have Mercurial installed, a project in a Mercurial repository you want to push, and a GitHub account. Here’s how you do it.

Install hggit

If you’re using MacPorts and Python 2.6, you should be able to do this with:

sudo port install py26-hggit

At the end of hggit’s install process, hggit prints out a line to add to your ~/.hgrc. For MacPorts & Python 2.6, this will look like:

hggit=/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/hggit

Note: Installing hggit in a virtualenv is probably not a good idea unless you also are using a version of Mercurial installed in the same virtualenv. You could probably make it work, but it’s likely more work than it’s worth.

Set up and connect to a GitHub repository

Create an empty repository on GitHub to store your Mercurial project. Copy the git URL from the new repository. Append git+ssh:// to the beginning, and change the : to a /. For example, this:

git@github.com:mercurialpoweruser/mercurialrocks.git

becomes this:

git+ssh://git@github.com/mercurialpoweruser/mercurialrocks.git

This URL then goes in the [paths] section of your Mercurial repository’s .hg/hgrc. If you want to push and pull from GitHub by default, it will look like this:

[paths]
default = git+ssh://git@github.com/mercurialpoweruser/mercurialrocks.git

Or, if you already have a remote Mercurial repository that you want to keep using, you can add the GitHub repository with a different name:

[paths]
default = ssh://hg@bitbucket.org/mercurialpoweruser/mercurialrocks
github = git+ssh://git@github.com/mercurialpoweruser/mercurialrocks.git

Next, push your repository to GitHub for the first time, with hg push (if GitHub is your default) or hg push github (if GitHub isn’t the default).

Now, go to GitHub and reload the page, and revel in the feeling that you are living in a utopian future where cars fly, cold fusion supplies our electricity, and different DVCSs interoperate with each other.

Gotchas

If you find that Mercurial mysteriously stops pushing new changes to GitHub, even though there are new changes to push and it’s pushing them to Mercurial servers fine, you might want to take a look your bookmarks. Do this with hg bookmarks. Make sure that the changeset ID for master is the changeset ID of tip: hg tip. If they’re not the same, update the master bookmark with hg bookmark -f master. This happened to me when I used an older version of Mercurial (1.3.1) without the bookmarks extension enabled. Don’t do that.

Your mileage may vary

I haven’t used this technique on repositories that have lots of branches, file renames, merge conflicts, or other “fancy” distributed version control features. I wouldn’t advise switching your entire development team and mission-critical repositories over to this technique without some more extensive testing.

Is Git really better than X?

Most of Scott Chacon’s points in Why Git is Better than X are spot-on. The page could even be renamed “Why distributed version control systems are better than Subversion and Perforce,” since those two are the clear losers. (And yes, Bazaar is so slow I think it deserves to be listed twice in the speed section.)

I’ve used each of Git, Mercurial, and Bazaar for several months in medium sized teams; I’ve done quite a bit of branching and merging in Mercurial and Bazaar, and a fair amount in Git. Based on that experience, I am compelled to disagree here with a few of the points, specifically with regards to Mercurial.

Cheap local branching

Chacon argues that only Git offers cheap local branching, but Mercurial allows exactly the same work-flows that he outlines, and they are just as easy. Chacon claims that Git’s branching model is different:

Git will allow you to have multiple local branches that can be entirely independent of each other…

This isn’t quite correct. The real difference between Git’s branching model and the others is that Git does not assume a one-to-one relationship between a logical branch and a directory on the file-system. In Git, you can have a single directory on the file-system and switch between different branches with the git checkout command without changing directories.

Mercurial (and Bazaar) enforce default to1 a one-to-one relationship between a logical branch and directory on the file-system. You can, however, use the hg clone command to make a new, “entirely independent,” branch. You can clone a local directory, a remote directory over SSH, or a remote Mercurial repository. Cloning a repository that’s on the local file-system is the way to create cheap local branches in Mercurial. Mercurial will even create hard links to save disk space and time.

For example, in Git, you make a cheap local branch with:

git branch featurebranch
git checkout featurebranch

And in Mercurial, you make a cheap local branch with:

cd ..
hg clone master featurebranch
cd featurebranch

In Git, you switch between branches with git checkout branchname. In Mercurial, since branches are in a one-to-one correspondence with directories, you switch between branches with cd ../branchname.

In Git, you pull changes from a local branch with git merge branchname. In Mercurial, you pull changes into from a local branch with hg pull -u ../branchname.

Chacon’s four example work-flows using cheap local branches are not only completely possible in Mercurial, they are just as trivial as they are in Git. (The first three are even equally trivial in Bazaar; I’m not sure about the last one in Bazaar, though.)

Chacon says:

You can find ways to do some of this with other systems, but the work involved is much more difficult and error-prone.

Reasonable people might also disagree about the intuitiveness of the specific commands, but creating, updating, merging, and deleting them in Mercurial are one or two commands in the shell, just like in Git. None of these tasks are more difficult or more error-prone in Mercurial.

Chacon also says:

when you push to a remote repository, you do not have to push all of your branches. You can only share one of your branches and not all of them.

This is a little misleading. In Mercurial, since there’s usually a one-to-one correspondence between branches and local directories, and since you can’t be in two directories at once, you’re usually only pushing a single branch. In Mercurial, you generally don’t have to think about which branch you’re pushing, or whether you’re pushing other branches that shouldn’t be pushed, because you generally only ever push one branch at a time. I would actually turn this argument around, and claim that Git’s default of keeping multiple logical branches in a single directory forces you to worry about which branches you’re pushing, and actually makes cheap local branches harder and more error-prone in Git than in Mercurial.

Reasonable people might disagree about the intuitiveness of a branching model which does or does not assume default to a one-to-one relationship between logical branches and local directories, but that’s not why cheap local branches are so powerful in Git. Rather, it’s the ability to clone and merge between repositories locally that allows cheap local branches, and it’s just as easy in Mercurial as in Git, and nearly as easy in Bazaar.

GitHub

GitHub is unarguably the biggest community around any distributed version control system. But Chacon goes too far when he says:

This type of community is simply not available with any of the other SCMs.

BitBucket may not be as big as GitHub, but it’s a “socially-targeted” community where Mercurial users can “fork and contribute” to other Mercurial projects. It might not have as many projects or contributors as GitHub, but a sheer difference in size doesn’t translate to “simply not available.”

Bazaar’s not listed as inferior to Git on this point, I assume because of Launchpad. If Launchpad counts, why not BitBucket?

I’d also be interested to see usage numbers comparing GitHub and everyone’s favorite open-source community brontosaurus, SourceForge. Has GitHub surpassed them in projects or users or commits or downloads?

Minor haggling points

I also disagree with Chacon about the benefits of the staging area or index, but that’s purely a matter of personal preference. I never wanted anything like the staging area before I started using Git, and I don’t miss it now when I’m using Mercurial. If anything, the extra concept of a staging area makes Git slightly harder to learn; newbies coming from any other version control system have to be taught about the -a option to git commit right away, or else they wonder why their changes aren’t getting committed.

In the “Easy to Learn” section, Chacon highlights the add commands in both Mercurial and Git, indicating that they are the same; but they are pretty different. In Mercurial, add schedules previously un-tracked files to be tracked. In Git, add adds changes in a current file to the staging area, or index, including, but not limited to, previously un-tracked files. That section is also missing Mercurial’s mv command. Other potentially confusing differences include Git’s revert, which corresponds to Mercurial’s backout; and Mercurial’s revert, which can be duplicated using Git’s checkout.

Last, I’d like to see the ballyhooed Fossil included in this breakdown.

  1. Update: Thanks to David Wolever, who points out that hg branch and hg bookmark can be used to have more than one branch per directory in Mercurial. I didn’t even know about these commands when I first wrote this post. But I think my central point still holds; creating cheap local branches in Mercurial is just as easy as in Git. []

What a DVCS gets you (definitely)

Bill de hÓra’s What a DVCS gets you (maybe) has one of the clearest explanations of why DVCS is superior to centralized VCS:

But once you accept this model of having your sandbox under version control, a lot of the pain (and fear) of dealing with branches evaporates. Passing around changesets and patches becomes normal and logical.

After using a DVCS for years (Codeville), I don’t think there’s any need for “maybe” in this post’s title, either. DVCS is superior, plain and simple.