Tag Archives: HTML templating

Introducing Fragments

Today I am announcing Fragments, a project I’ve been working on for a few months. It’s on GitHub and PyPI.

Fragments uses concepts from version control to replace many uses of templating languages. Instead of a templating language, it provides diff-based templating; instead of revision control, it provides “fragmentation control”.

Fragments enables a programmer to violate the DRY (Don’t Repeat Yourself) principle; it is a Multiple Source of Truth engine.

What is diff-based templating?

Generating HTML with templating languages is difficult because templating languages often have two semi-incompatible purposes. The first purpose is managing common HTML elements & structure: headers, sidebars, and footers; across multiple templates. This is sometimes called page “inheritance”. The second purpose is to perform idiosyncratic display logic on data coming from another source. When these two purposes can be separated, templates can be much simpler.

Fragments manages this first purpose, common HTML elements and structure, with diff and merge algorithms. The actual display logic is left to your application, or to a templating language whose templates are themselves managed by Fragments.

What is fragmentation control?

The machinery to manage common and different code fragments across multiple versions of a single file already exists in modern version control systems. Fragments adapts these tools to manage common and different versions of several different files.

Each file is in effect its own “branch”, and whenever you modify a file (“branch”) you can apply (“merge”) that change into whichever other files (“branches”) you choose. In this sense Fragments is a different kind of “source control”–rather than controlling versions/revisions over time, it controls fragments across many files that all exist simultaneously. Hence the term “fragmentation control”.

As I am a linguist, I have to point out that the distinction between Synchronic and Diachronic Linguistics gave me this idea in the first place.

How does it work?

The merge algorithm is a version of Precise Codeville Merge modified to support cherry-picking. Precise Codeville Merge was chosen because it supports accidental clean merges and convergence. That is, if two files are independently modified in the same way, they merge together cleanly. This makes adding new files easy; use Fragment’s fork command to create a new file based on other files (or just cp one of your files), change it as desired, and commit it. Subsequent changes to any un-modified, common sections, in that file or in its siblings, will be applicable across the rest of the repository.

Like version control, you run Fragments on the command line each time you make a change to your HTML, not before each page render.

What is it good for?

Fragments was designed with the task of simplifying large collections of HTML or HTML templates. It could replace simpler CMS-managed websites with pure static HTML. It could also handle several different translations of an HTML website, ensuring that the same HTML structure was wrapped around each translation of the content.

But Fragments is also not HTML specific. If it’s got newlines, Fragments can manage it. That means XML, CSS, JSON, YAML, or source code from any programming language where newlines are common (sorry, Perl). cFragments is even smart enough to know not to merge totally different files together. You could use it to manage a large set of configuration files for different servers and deployment configurations, for example. Or you could use it to manage bug fixes to that mess of duplicated source files on that legacy project you wish you didn’t have to maintain.

In short, Fragments can be used anyplace where you have thought to yourself “this group of files really is violating DRY”.

Use it

Fragments is released under the BSD License. You can read more about it and get the code on GitHub and PyPI. And you can find me on Twitter as @glyphobet.

Special thanks to Ross Cohen (@carnieross) for his thoughts on the idea, and for preparing Precise Codeville Merge for use in Fragments.

Hackers, it is time to rethink, redesign, or replace GNU Gettext

GNU Gettext may be the de facto solution for internationalizing software, but every time I work with it, I find myself asking the same questions:

Why, in this age of virtual machines and dynamic, interpreted languages, do I still have to compile .po files to .mo files before I can use my translations?
I can reconfigure my web application, modify its database, and clear its caches whenever I want, so why do I have to do a code push and restart the entire runtime just so that “Login” can be correctly translated to “Anmelden”? Try explaining that to a business guy.
To translate new messages in my application, I have to run a series of arcane commands before the changed text is available to be translated. Specifically, the process involves generating .pot files, then updating .po files from them. Why isn’t this automatic?
Why is it still possible for bad translations to cause a crash? Translators do the weirdest things when presented with formatting directives in their translations… I’ve seen %s translated as $s and as %S, %(domain)s translated as %(domaine)s, and ${0} translated as #0, but the most common is to just remove the weird formatting directives entirely. And they all cause string formatting code to crash.
Why isn’t there a better option for translating HTML? Translators shouldn’t be expected to understand that in Click <a href="..." class="link">Here!</a>, “Click” and “Here!” should be translated, but “class” and “link” should not be. And they certainly can’t be expected to understand that if their language swaps the order of “Click” and “Here”, the <a> tag should move along with “Here”.
Why isn’t there something better than the convention to assign the gettext function to the identifier _, and then wrap all your strings in _()? Not only is this phenomenally ugly, but one misplaced parenthesis breaks it: _("That username %s is already taken" % username)
Why is support for languages that have more than two plural forms still an awful, confusing, fragile hack? Plural support was clearly designed by someone who thought that all languages were like English in having merely singular and plural. I’ve seen too many .po files for singular/dual/plural languages, where the translator obviously did not understand that msgstr[0] is the singular, msgstr[1] the dual, and msgstr[2] the plural.
Why, in this age of distributed version control, experimental merge algorithms, and eventually consistent noSQL databases, is the task of merging several half-translated .po files from several different sources still a nightmarish manual process?
Why, if I decide I need an Oxford comma or a capital letter in my English message, do I risk breaking all of the translations for that message?

There are libraries that allow you to use .po files directly, and I’m sure you can hack up some dynamic translation reloading code. Eliminating the ugliness of _() in code, and avoiding incorrectly placed parentheses after it, could be done with a library that inspects the parse tree and monkeypatches the native string class. Checking for consistency of formatting directives is not that hard. A better HTML translation technique would take some work, but it’s not impossible. The confusion around plural forms is just a user-interface issue. Merging translated messages may not be fully automatable, but at least it could be made a lot more user-friendly. And the last point can be avoided by using message keys, but that hack shouldn’t be necessary.

Gettext is behind the times. Or is it? Half of me expects someone to tell me that all these projects I’ve worked on are just ignoring features of Gettext that would solve these problems. And the other half of me expects someone to tell me I should be using some next-generation Gettext replacement that doesn’t have enough Google juice for me to find. (Let me know on Twitter: @glyphobet.)

GNU Gettext is is based on Sun’s Gettext, whose API was designed in the early ’90s. Hackers, it’s 2012. Technology has moved forward and left Gettext behind. It is time to rethink, redesign, or replace it.

The next big thing, part 2: Taking the web out of web applications

Part of an ongoing series.

A web application is just a stateless¹ application that responds to various requests by performing actions and providing resources. There’s no fundamental reason an application must only communicate over HTTP. Web applications are going to start adding alternative methods of interaction, and I think the first common one will be email.

Perhaps an example will best illustrate this:

Like many web forums, posts to Mosuki‘s discussion forums get mailed out in email. But, unlike any other web forums I know of, they also behave like mailing lists. All the emails have a reply-to header with an email address that identifies the message, the recipient, and the action to be taken if that email address is used. In this case, the contents of a reply email are posted to the forum exactly as if a reply had been posted via the website.

In other words, the action “post a message” can be accessed via a web page and a browser or via a reply-to header and your mail client.

There are other examples of this separation between input/output channels and the application logic. The most obvious is Twitter, which of course can be interacted with via HTTP or SMS². And the Son of Sam project intends to let you “use modern concepts like handlers, requests, responses, state machines” to interact with email.

Confirm a Facebook friend request, RSVP to an Evite, revert a Wikipedia edit, or reassign a bug report, just by replying to an email or sending an SMS.

There are a number of technical issues inherent a system like this. An application’s framework has to handle multiple input channels, and massage email bodies, HTTP requests, and other input into a least common denominator “request.” Authenticating a user via email, an intrinsically forgeable medium, and protecting against spam, are non-trivial challenges. And a suite of templates suddenly gets a lot more complex when it has to provide views for multiple types of interfaces³ .

This blurring of the line between email, HTTP, SMS, and other communications is not new, strictly speaking. But I think it will become commonplace and even expected. Rather than writing a modern (MVC, stateless, REST-ful, &c.) web application, people will be writing modern (MVC, stateless, REST-ful, blah, blah, blah) applications that have web interfaces, email interfaces, and whatever other interfaces they need.

Stay tuned for the next installment of The next big thing: Taking the relational out of relational databases.

More or less stateless, that is, authentication tokens like cookies notwithstanding. [↩]
As well as more standalone apps than you can shake a stick at. [↩]
Generating text and HTML responses for email that look good and work well in the top 75% of desktop and web email clients is a lot harder than testing a site’s HTML in Firefox, IE, Safari and Opera. [↩]

I wish XML were becoming obsolete faster

XML sucks at (almost) everything it’s used for. Google has open-sourced Protocol Buffer, a typed, backwards compatible, compact, binary data-interchange format. Combined with YAML for configuration and data-persistence files that need to be human readable, there’s even less reason to use XML for any data-serialization.

XML, in the form of (x)HTML, seems ok for markup, and the XML-based (x)HTML templating in Genshi is the best templating language I’ve ever used (and I’ve used XSLT, Mako, Mighty, PHP, and a few others). I wonder if the reason that HTML (and XML) templating is so difficult, and templating language code is often so ugly, is because XML is actually a poor solution for markup too. HTML is obviously here to stay, but it would be an interesting thought experiment to design a successor markup language that is not strictly hierarchical, more human-readable, and designed with templating in mind.

The two ugly faces of HTML generation

There are two quite different reasons for implementing HTML generation on a website. The first reason is to insert dynamic content, content that comes from a database or is algorithmically generated, into pages. The second reason is templating; to ensure that standard, site-wide parts of the HTML, such as headers and footers, are pulled from a single source. The goal of the first is to have a dynamic, database-driven site. The goal of the second is to avoid having to edit tens, or hundreds, of HTML files when the site design changes, and to avoid copy-and-paste coding.

Continue reading →

The next big thing, part 1: Resolving the conflict between Model-View-Controller and AJAX design patterns

or, how I learned to stop worrying and love the XMLHTTPRequest…

This is the first part of what will become an ongoing series.

If you’ve built a website in the last few years, most likely you’ve adopted an architecture similar to Model-View-Controller, or MVC. If not, well, either your website is terribly simple, you haven’t had to modify it yet, or your code is spaghetti and you should be fired. Just kidding. (Or maybe you’ve come up with an even better architecture, in which case you should share your insights with us mere mortals.)

In MVC architecture, the model reads and writes data to and from a back-end data-store, and organizes the relational data in a nice, hierarchical fashion to be used by the controller. The view accepts input from the controller and generates output HTML, XML, RSS, JavaScript, SVG, PDF, or whatever you want to send to the user’s browser. And the controller accepts browser input, figures out what to query the model for, and picks which view to use and what data to send it.

figure 1: The traditional MVC architecture.