Tag Archives: hacking

Monolithic repositories versus project repositories

Recently, Hacker News got ahold of Gregory Szorc’s article on monolithic repositories, and even Wired is weighing in on how big codebases should be organized. While the discussion is interesting, it seems to focus on two extremes: on the one hand, putting all of a company’s code into a single monolithic repository, and, on the other hand, breaking a big company’s code up into many, many small repositories. Both extremes are too simplistic. The better approach is to align repository boundaries with the software’s build, deployment, and versioning boundaries. What does this mean exactly? Perhaps it is best illustrated with some examples:

  • You have a small app that knows how to deploy itself, and you rarely, if ever, need to deploy anything other than the current version of it. Keep the deployment scripts in the same repository as the app.
  • You have a few existing APIs written in the same language and built using similar libraries. You’re about to build a replacement for one of those APIs in a different, high-performance language, which will use a different storage backend. Not only will the build and deployment steps for this new code be different, but you will want to deploy new versions of the new and the old app independently. Here you have build, deployment, and versioning boundaries, so it’s natural to use a different repository for the new code. Since it’s a different language, the new code won’t be relying on any existing code, so there’s no temptation to copy and paste code into the new project.
  • You need to support a complex set of deployment environments: clusters of virtual machines running different services, connecting to different backends; testing, staging, and production environments running different versions of services; and so on. Here we have clear versioning and deployment boundaries: your deployment process needs to support deploying different versions of the code, and it needs to know about the different versions of what it is deploying, since you might need to roll back to an earlier version. It also needs to be able to deploy, for example, the current stable version of the site backed by a MySQL database to the production environment, while deploying the head of an experimental branch backed by PostgreSQL to the staging environment. You won’t be able to encapsulate the switch from MySQL to PostgreSQL in a single merge of a feature branch anyway, so it’s just a headache to maintain deployment scripts that know how to deploy to MySQL in the same branch where you have just removed all your MySQL-dependent code. So the deployment scripts are better kept in their own repository.
  • You have an API for internal use only, and a single consumer of that API, both running inside your private network. Here you have total control over versioning: you can change the API and its consumer in a single commit and ensure that both are updated at deploy time, so there’s no versioning boundary. If the API and its consumer are also written in the same language, then there should be little to no reason to keep them in separate repositories.
  • You have an API accessible over the public internet, and Android and iOS apps which talk to that API. Here you already have significant versioning boundaries. You can’t be sure that every app out there is up-to-date, so you have to keep the old APIs up and running for a while, or make sure the latest APIs are backwards compatible. And you have significant deployment boundaries: you wouldn’t want to wait two weeks to deploy an update to the API because you are waiting for the new API support in the iOS app to make it to the App Store. Since any version of your API must support several different versions of your mobile apps, and any version of your mobile app must be able to talk to several different versions of your API, there is again no hope of rolling any change into a single (merge) commit, and there’s no added cleanliness or simplicity to gain from having these different pieces of the software in the same repository. So this code can be split into multiple repositories.

There are a few more things to remember when setting up repositories.

Splitting vs. Merging

It’s far simpler to split out part of a monolithic repository than it is to merge two independent repositories:

  • If you’re splitting, just create a new repository, push the old code there, remove everything but what you want to split out, and commit. You can then re-organize it if you like, but you don’t have to (and if you’re using git, you can even filter the history). All the files keep their history, and you don’t have to worry about file name collisions.
  • If you’re merging, you most likely have to do some top-level reorganization first; then you can merge one repository into the other. But afterwards, looking at the history before the merge point will show a jumbled mess of commits from both repositories.

Increasing Fragmentation

If you feel like you’re being forced into creating a third repository to store code that is needed by two other repositories, that’s probably a sign that those two repositories should be a single one. This is a common trap that projects get into; once they have split their repositories too finely, the only solution seems to be more splitting. When considering this option, ask yourself: do these two repositories really have different build, deployment, and versioning boundaries? If not, bite the bullet and merge the repositories, rather than creating a third one for the shared code.

Ease of Access

Ease of access to code is often presented as an advantage of the monolithic repository model. But this argument is unconvincing. A programmer can still have access to all the company’s code, even if that requires cloning multiple repositories. And in the vast majority of cases, a programmer is going to work on two, maybe three, different projects at the same time. It’s not as if Google is filled with programmers who work on WebM encoding for YouTube on Monday, map-reduce for Google search on Tuesday, CSS and JavaScript for Gmail on Wednesday, Java for Android on Thursday, and Chrome on Windows on Friday. Programmers like that are extremely rare, and you shouldn’t optimize your repository structure to make them happy, especially not if it means forcing everybody else to download and track changes to large amounts of code they will never touch.

In Conclusion

To sum up, both the monolithic-repository dogma and the many-small-project-repositories dogma are oversimplified to the point of being harmful. Instead, focus on splitting your code into repositories along its natural versioning, build, and deployment boundaries.

Let’s code about bike locks some more

I just got sucked into Peter Norvig’s Let’s Code About Bike Locks. If you haven’t read that yet, read it first, at least the beginning. Otherwise this will make no sense.

The strategy of starting from the first tumbler and then calculating the subsequent ones seemed like it could be improved on. What if we looked at all the four-letter English words, chose the most common letter on any tumbler, fixed that letter on its tumbler, then continued on to the second most common letter on any tumbler, and so on? How well could we do?

So I coded it up, and here’s the result, a lock that can make 1,410 words from Norvig’s list at http://norvig.com/ngrams/words4.txt, 170 more words than his best:

Lock: ABCDGLMPRST AEHILNORUY ACEILMNORST ADEKLNORSTY

I believe Norvig’s strategy of improving a lock with random permutations would also be less likely to improve on this lock. Changing any letter would, by definition, mean choosing a letter that occurs in that position in fewer words. However, it’s still possible to improve; there might be letter choices that, while poorer overall, work better with the specific letters already chosen for the other tumblers.

Update 15 Jun 2015: Someone was wrong on the internet and this time it was me! Astute readers will notice that a tiny off-by-one bug in my implementation (see the fifth revision) led it to generate a lock with three tumblers with eleven letters each, and one tumbler with ten letters.

The new best lock from this implementation only generates 1,161 words, leaving Norvig’s solution still the best:

Lock: ABCDLMPRST AEHILNORUY AEILMNORST ADEKLNOSTY
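
For the curious, here’s a minimal sketch of that greedy strategy with the off-by-one fixed. This isn’t my original implementation, and the exact lock you get depends on how ties between equally common letters are broken:

    from collections import Counter
    from urllib.request import urlopen

    # Norvig's list of four-letter words.
    words = urlopen('http://norvig.com/ngrams/words4.txt').read().decode().split()

    # Count how often each letter occurs in each of the four positions.
    counts = [Counter(w[i] for w in words) for i in range(4)]

    # Rank every (position, letter) pair by frequency, most common first,
    # then greedily fix letters until each tumbler has ten.
    pairs = sorted(((n, i, c) for i, ctr in enumerate(counts)
                    for c, n in ctr.items()), reverse=True)
    tumblers = [set(), set(), set(), set()]
    for n, i, c in pairs:
        if len(tumblers[i]) < 10:
            tumblers[i].add(c)

    makeable = [w for w in words if all(w[i] in tumblers[i] for i in range(4))]
    print(' '.join(''.join(sorted(t)) for t in tumblers))
    print(len(makeable), 'words')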

Why you shouldn’t use git merge --rebase

There is a common belief that git merge --rebase is somehow preferable to normal merging. The general assertion seems to be that a linear history is somehow “cleaner”, “easier to understand”, and that normal merging introduces “extra commits” and “merge bubbles”, the latter presumably being only slightly less objectionable than economic bubbles. Some organizations even go so far as to mandate always merging with --rebase. But ask someone to give a real, technical justification—just one—for this belief, and they mumble some aesthetic vapidities and then start talking about the weather.

Let’s put aside for a moment the ridiculous assertion that a directed acyclic graph is somehow more difficult for programmers—programmers!—to understand than a linear history. I want to show you how normal merging is in fact preferable to using --rebase all the time.

Continue reading

Two new projects: German Grammar and Möbius

I’ve been hacking on two new projects in my spare time.

The German Grammar Explorer (mainly the German Declension Explorer) is helping me wrap my head around some of the more complex patterns in the German language. It’s also an experiment in deliberate synæsthesia; it uses a palette of eight colors plus white to color-code similar patterns and related morphosyntax. The idea is to give a feeling for when the general patterns of the language are broken.

Möbius is a totally useless experiment in binding scroll events and doing funny stuff with them, and in some newer features of HTML5 and CSS3.

Hackers, it is time to rethink, redesign, or replace GNU Gettext

GNU Gettext may be the de facto solution for internationalizing software, but every time I work with it, I find myself asking the same questions:

  • Why, in this age of virtual machines and dynamic, interpreted languages, do I still have to compile .po files to .mo files before I can use my translations?
  • I can reconfigure my web application, modify its database, and clear its caches whenever I want, so why do I have to do a code push and restart the entire runtime just so that “Login” can be correctly translated to “Anmelden”? Try explaining that to a business guy.
  • To get new messages in my application translated, I have to run a series of arcane commands before the changed text is even available to the translators. Specifically, the process involves generating .pot files, then updating .po files from them. Why isn’t this automatic?
  • Why is it still possible for bad translations to cause a crash? Translators do the weirdest things when presented with formatting directives in their translations… I’ve seen %s translated as $s and as %S, %(domain)s translated as %(domaine)s, and ${0} translated as #0, but the most common is to just remove the weird formatting directives entirely. And they all cause string formatting code to crash.
  • Why isn’t there a better option for translating HTML? Translators shouldn’t be expected to understand that in Click <a href="..." class="link">Here!</a>, “Click” and “Here!” should be translated, but “class” and “link” should not be. And they certainly can’t be expected to understand that if their language swaps the order of “Click” and “Here”, the <a> tag should move along with “Here”.
  • Why isn’t there something better than the convention to assign the gettext function to the identifier _, and then wrap all your strings in _()? Not only is this phenomenally ugly, but one misplaced parenthesis breaks it: _("That username %s is already taken" % username) formats the string before gettext ever sees it, so the catalog lookup silently fails (see the sketch after this list).
  • Why is support for languages that have more than two plural forms still an awful, confusing, fragile hack? Plural support was clearly designed by someone who thought that all languages were like English in having merely singular and plural. I’ve seen too many .po files for singular/dual/plural languages, where the translator obviously did not understand that msgstr[0] is the singular, msgstr[1] the dual, and msgstr[2] the plural.
  • Why, in this age of distributed version control, experimental merge algorithms, and eventually consistent noSQL databases, is the task of merging several half-translated .po files from several different sources still a nightmarish manual process?
  • Why, if I decide I need an Oxford comma or a capital letter in my English message, do I risk breaking all of the translations for that message?
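
To make that misplaced-parenthesis pitfall concrete, here is a minimal sketch (the username value is made up):

    from gettext import gettext as _

    username = 'glyphobet'

    # Wrong: the %-formatting runs first, so gettext looks up the already-
    # formatted string "That username glyphobet is already taken", finds
    # no such message in the catalog, and returns it untranslated.
    print(_("That username %s is already taken" % username))

    # Right: look up the unformatted template, then format the translation.
    print(_("That username %s is already taken") % username)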

There are libraries that allow you to use .po files directly, and I’m sure you can hack up some dynamic translation reloading code. Eliminating the ugliness of _() in code, and avoiding incorrectly placed parentheses after it, could be done with a library that inspects the parse tree and monkeypatches the native string class. Checking for consistency of formatting directives is not that hard. A better HTML translation technique would take some work, but it’s not impossible. The confusion around plural forms is just a user-interface issue. Merging translated messages may not be fully automatable, but at least it could be made a lot more user-friendly. And the last point can be avoided by using message keys, but that hack shouldn’t be necessary.
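
As an example of how little work some of these fixes would be, here’s a rough consistency check for Python’s %-style directives. It’s only a sketch: the regular expression is simplified, and directives_match is my own invention, not part of any library:

    import re

    # Simplified pattern for %-style directives: %s, %d, %(name)s, and so on.
    DIRECTIVE = re.compile(r'%(?:\([^)]+\))?[#0\- +]?\d*(?:\.\d+)?[diouxXeEfgGcrs%]')

    def directives_match(msgid, msgstr):
        """True if the translation preserves the original's formatting directives."""
        return sorted(DIRECTIVE.findall(msgid)) == sorted(DIRECTIVE.findall(msgstr))

    assert directives_match('%(domain)s is taken', '%(domain)s ist vergeben')
    assert not directives_match('%(domain)s is taken', '%(domaine)s ist vergeben')
    assert not directives_match('Hello, %s!', 'Hallo!')  # directive dropped

Run something like this over every msgid/msgstr pair when the catalogs are updated, and crashing translations get caught before they ever reach the string formatter.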

Gettext is behind the times. Or is it? Half of me expects someone to tell me that all these projects I’ve worked on are just ignoring features of Gettext that would solve these problems. And the other half of me expects someone to tell me I should be using some next-generation Gettext replacement that doesn’t have enough Google juice for me to find. (Let me know on Twitter: @glyphobet.)

GNU Gettext is based on Sun’s Gettext, whose API was designed in the early ’90s. Hackers, it’s 2012. Technology has moved forward and left Gettext behind. It is time to rethink, redesign, or replace it.


A two-step refactoring tactic

In the middle of a major, hairy refactoring recently, I codified a tactic for refactoring that I’d been using, unconsciously, for years. The tactic is to break down refactoring into two modes, and deliberately switch between them:

  • Clean-up mode: Clean up and reorganize the code, without changing the code’s external functionality.
  • Change mode: Make the smallest, simplest self-contained change, making sure the code is functional again when it’s completed.

You start in clean-up mode, and switch back and forth as needed. This tactic is probably unsurprising to experienced programmers, but for those of you not yet as hard-core as Donald Knuth, let me explain why it’s a good idea.

Separating clean-up and changes into discrete modes of working gives you less to think about at any one time. A refactor can be very invasive—code that was closely tied together is now completely decoupled, or vice-versa; functions, objects and modules become obsolete or nonsensical, or split apart or merge in unexpected ways; identifiers become ambiguous or collide. If you’re trying to reorganize everything while juggling old and new code in your head, it’s easy (for anyone) to get lost in the maze.

When you’re consciously in clean-up mode, you can focus just on tasks like making sure that variables have unambiguous, correct names, that object / module / function / &c. organization and boundaries are sensible, and so on. You’re not changing any functionality, so your tests (you do have tests, right?) still pass, and the application still behaves as it always has (you did test it manually, right?). And you can be liberal in your clean-up; if you end up improving code that isn’t ultimately affected by the refactor, there’s no harm in that.

Once you feel like the code is clean and ready for the changes in functionality, commit your clean-up (you are using version control, right?). I usually use a commit message like “clean-up in preparation for….”

Now switch to change mode, and start making the changes. Quite often, you’ll discover something else that needs to be cleaned up. But you’re in change mode, not clean-up mode. Since you’re only making the smallest, simplest self-contained change, this new clean-up can probably wait until later. But if it can’t wait until later, then roll back your changes (or save a patch, or store your work in the attic, or the stash, or shelve it, or whatever the kids are calling it these days). Complete the new clean-up, commit it, and then go back to change mode and your half-completed changes.

This tactic has some additional benefits that might not be immediately obvious:

  • Clean-up mode forces you to read over all of the code that’s going to be changed and ask, “How is this code going to be affected by the changes I’m planning?” Sometimes you discover unanticipated edge-cases or bugs in your planned changes. Sometimes you realize the whole plan is flawed, and have to go back to the drawing board. And sometimes, after a good, hearty clean-up session, you realize that you can make the required changes in a less invasive, simpler way. I find that most refactors are more clean-up than anything else.
  • If the codebase has a “master” or “main” branch, and you use a “feature” or “working” branch for your refactor, you can put the clean-up commits into the master branch (they don’t change any functionality, right?) and only commit the functional changes to the feature branch. What’s the point of that? Well, everyone else gets to benefit from your code clean-up right away, and when you do end up merging your functionality changes into the master branch, because it’s a smaller delta, there’s a smaller chance of conflicts.

The next time you’re staring down the barrel of a nasty refactor, and cursing the person who didn’t fully think through the business requirements, try this. It won’t make every refactor a piece of cake, nor will it become the hot new programming methodology acronym down in the valley (2SR, anyone? 2SRT?), but I guarantee you’ll be glad you tried it.

Tracer T

I gotta get me a copy of this “Tracer T” program:

I’ve heard you can also use a program called “Pin G” to view people’s PIN numbers.

(note: if you’re not a hacker this will make zero sense to you, but trust me, it’s hilarious.)