
Internationalization in Node.js

A primer on i18n in Node.js, written by Martin Heidegger on the night between Dec. 24 and Dec. 25, 2015

Many roads lead to the metaphorical Rome of achieving i18n, and by extension l10n, in Node.js. Several packages such as i18next, i18n or node-gettext provide widely accepted implementations for internationalization, and Yahoo has also pushed the game forward by publishing formatjs. But even with those packages at hand, i18n can be tricky! Target platform, workflow, human resources, user experience, revisioning - all of that plays an important role when choosing how to set up internationalization for a system. To give a deeper insight into the matter I collected my understanding of i18n in this article; may it help you to do it better.

Note: I might have overdone the annotations a little.

Different purposes, different tools

or: Not everything in life is about express and jQuery

When you search for i18n on npm, the first result is i18n from mashpie. It almost requires an Express request object to determine the given locale, which makes it instantly unattractive for CLI tools, desktop applications, static site generators or web applications that don't store user information in a cookie.

CLInternationalization

Many CLI applications simply don't offer translations. There is, however, support for translation in yargs. i18n-core, on the other hand, is a library that I created to implement workshopper. It acts on a significantly lower level, yet is still a bit more complex than y18n - the library used by yargs. The difference is subtle yet interesting: besides gettext-style placeholders (the same kind used by yargs), i18n-core also implements handlebars-style placeholders, like the other CLI translation system I have heard of, ember-i18n. The difference between gettext and handlebars becomes relevant during escaping.

Example:

__("Hello {{x}}", {x: 'You & Me'})__

becomes Hello You &amp; Me due to the HTML escaping of &, which is something very undesired in CLI tools.
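
To see the difference for yourself: handlebars escapes HTML by default, while a raw gettext-style replacement leaves the text alone. A minimal sketch using the handlebars package (the raw replacement helper is my own, for illustration only):

var Handlebars = require('handlebars')

// Handlebars escapes HTML entities by default - fine for the web,
// wrong for the terminal.
console.log(Handlebars.compile('Hello {{x}}')({x: 'You & Me'}))
// → Hello You &amp; Me

// A naive raw placeholder replacement keeps the text untouched.
function raw (str, params) {
  return str.replace(/\{\{(\w+)\}\}/g, function (_, key) {
    return params[key]
  })
}
console.log(raw('Hello {{x}}', {x: 'You & Me'}))
// → Hello You & Me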

Another topic of internationalization in the TTY is that terminals render some East Asian characters as double-width characters. wcwidth takes care of calculating the size of a string while respecting their double-width-ness. Since the operation takes quite some time when done often, I implemented wcstring, which has a few more methods up its sleeve.
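
A quick sketch of what wcwidth reports (the strings are just examples):

var wcwidth = require('wcwidth')

console.log('テスト'.length)   // → 3 characters …
console.log(wcwidth('テスト')) // → … but 6 terminal columns
console.log(wcwidth('test'))   // → 4 columns, same as its length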

Static internationalization

With static site generators, a very common approach is to simply load the translated data using Ajax - i18n-properties for jQuery is probably the simplest. Dynamic content is indexed by Google, but - and maybe I am a little old-fashioned that way - I prefer a static HTML site generated for every language, such as I used on the NodeSchool homepage and as it is done on the official Node.js homepage. Both of those solutions are handmade. It would be kind of cool if there was a good package for that 😄.

International standalone apps

From researching the Atom issues I found that it just utilizes grit, which is Google Chrome's native translation support and builds strongly on XML. The only other approach to translation in a standalone Node.js app that I am aware of is in git-it-electron, which unfortunately is rather crude.

Data format JSON

or: Why you should not use JSON for i18n

One of the most common problems of JavaScript-based libraries is their heavy reliance on JSON. JSON is native to JavaScript and as such a comfortable format, but it can become very hindering in the process of translating anything even a little complex. This is due to four major shortcomings:

  1. The last comma: In valid JSON the last element in an object or array may not trail with a comma. That is not a problem if you and all of your colleagues are in the habit of putting the comma at the beginning of a line. But if you - or one of your translation staff - put the comma at the end of the line, every addition will always touch two lines in the commit rather than one, and this will result in merge conflicts in Git.
  2. Multiline & escaping: Translation is usually not limited to short words like "hello World" but can contain whole sentences, even paragraphs. Writing paragraphs in valid JSON is painful: since you cannot use regular line breaks you have to add escaped line breaks \n to your file, and the same goes for the character " (which is very common in text). Escaped multiline text is hard to read, particularly since it is good practice to keep multiline text formatted in Git (it makes for more understandable and readable diffs - see the example after this list).
  3. Comments: You definitely want to have comments in translations: either notes from the translator about the importance of phrases, links that explain the word usage, or links to the app/website that uses this string (to see it immediately in the context of the app).
  4. Lack of editors for lusers: It might seem an obvious problem, but lusers usually do not know how to edit simple text files properly - even more so JSON. You can get a good editor installed on all of their systems, but even then they need to use it instead of their beloved Word or Excel. It might sound like a trivial issue, but in reality a WYSIWYG editor will reduce the friction between the developer and the translator, who can then experiment with the result.
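
To illustrate point 2: this is what a short three-line paragraph looks like in valid JSON - one long line, with every line break and every quote escaped (the key and the text are made up):

{
  "about.terms": "These are the \"terms\" of this app.\nThey span several lines.\nAnd every quote needs escaping."
}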

Alternatives to JSON

or: Options! Options! Oh, so many options!

Lucky for us, there are a lot of alternative ways to provide data for the respective library. Let's look at some of them:

PO/MO Files

In the PHP community (WordPress/Drupal/etc.), .po and .mo files can be used with gettext. PO files have a specialized editor called Poedit, and the files can be used in Node.js via node-gettext. This file format is old, and many technical translators surely have come in touch with it. It supports comments, and there are plenty of online tools to process and edit these files. However: the raw file format is not necessarily easy to use with git.
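
A minimal sketch of loading a PO file in Node.js - this assumes the node-gettext v2 API together with the gettext-parser package, and a made-up ./locales/ja.po file:

var fs = require('fs')
var Gettext = require('node-gettext')
var po = require('gettext-parser').po

// Parse the PO file into the JSON structure node-gettext expects.
var translations = po.parse(fs.readFileSync('./locales/ja.po'))

var gt = new Gettext()
gt.addTranslations('ja', 'messages', translations)
gt.setTextDomain('messages')
gt.setLocale('ja')

console.log(gt.gettext('Hello World')) // → the Japanese msgstr, if present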

Database

<insert database here> can store translations, like Drupal does. A database like redis or leveldb can even be incredibly fast. Using a database opens the possibility to add nice editing and collaboration tools on top. The problem with it lies in the distance between database and user: to bring the data to the user, you inevitably have to transform the data somehow. If you decide to transform the data to YAML, JSON or another format, the question arises: why store it in the DB in the first place? A Mini-Mongo-like approach to syncing could improve on that. I wonder if someone will develop in this direction.
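
As a small sketch of the database approach, here is leveldb through the level package, assuming its classic callback API - the locale-prefixed key scheme is an assumption of mine:

var level = require('level')
var db = level('./translations')

// Store one translation under a locale-prefixed key … (made-up key scheme)
db.put('ja:landing_page.title', 'テスト アプリ', function (err) {
  if (err) throw err
  // … and read it back when rendering for that locale.
  db.get('ja:landing_page.title', function (err, value) {
    if (err) throw err
    console.log(value) // → テスト アプリ
  })
})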

YAML & Message Properties

One of the most obvious alternatives is YAML, since it deals with the first three issues. It is easier to write, easier to maintain, and if you are okay with the friction between you and the translator, it is a good format to choose. That is probably the reason for many i18n libraries to implement YAML support, like I published with i18n-yaml .oO(Still waiting on the PR to go through...). It is in many ways similar to the properties files which are common in the Java world (surprisingly supported by ember-i18n). YAML is my format of choice when I have to maintain the translation myself, because it versions well with Git.
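
Loading such a file is a one-liner with the js-yaml package - a minimal sketch; the en.yml path and its structure are assumptions:

var fs = require('fs')
var yaml = require('js-yaml')

// en.yml:
//   landing_page:
//     title: Test App
var en = yaml.load(fs.readFileSync('./locales/en.yml', 'utf8'))

console.log(en.landing_page.title) // → Test App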

Excel (CSV)

Obvious other alternatives would be CSV or Excel files. A direct Excel -> Node library like node-xls makes the communication between the translator and you easier, and you don't need to convert from Excel to JSON with other tools. That process will still leave a few problems open: Excel files in Git increase the repository size significantly, and diffs are possible but not pretty. CSV files work better, but you still have all translations of a string in one line, and as a result merge conflicts are the norm rather than the exception. I have never tried it, but you could mitigate this using daff.
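
To make the one-line problem concrete, here is a made-up CSV row holding every language of one key - two translators editing en and ja both touch the same line:

key,en,ja
landing_page.title,Test App,テスト アプリ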

Google Docs

You could go with Google Spreadsheets for translation instead. Google Spreadsheets have the nice ability to both have a relatively good, understandable user interface while also providing a direct API. It is not a big challenge to implement an importer on top of that API.

Partitioning

or: This translation file is waaaaay too big.

If you have more than 50 strings that need translating, it is already difficult to keep up. Splitting up the file (partitioning) is a good productivity booster.

Language-first partitions

Usually people partition their translation strings starting with the locales: 'en.json', 'ja.json', .... This kind of partitioning is good for using Git: you can clearly see which translations have changed. Language-first partitioning is also a good fit when each translator only ever touches the file of his or her own language.
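
A made-up example of the resulting layout - one file per locale, with identical keys in each:

// en.json
{ "landing_page": { "title": "Test App" } }

// ja.json
{ "landing_page": { "title": "テスト アプリ" } }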

Language-last partitions

A language-last partition, on the other hand, could look like this:

landing_page:
  title:
    en: Test App
    ja: テスト アプリ

The advantage of this kind of partitioning shows in the editing process: you can easily see when a new entry is added, and by checking the Git diff you can easily figure out which languages still need to be fixed. It is also a little kinder towards Git PRs. The obvious disadvantage is that you have to load the complete file, with all translations, in advance.

Subpartitions

Some libraries support arbitrary partitioning of structure and files. This means you could add an en.json for English and an en.landing.json for strings with an en.landing prefix. This way you are able to store the relevant data in sections as you like. This can be a very useful feature to keep related strings together (see the sketch below).
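
What such a loader could look like - a hypothetical sketch, not the API of any particular library; the file naming scheme follows the en.landing.json example above:

var fs = require('fs')

function loadLocale (dir, locale) {
  // Base file: en.json
  var data = JSON.parse(fs.readFileSync(dir + '/' + locale + '.json', 'utf8'))
  // Subpartitions: en.landing.json → mounted under the "landing" prefix
  fs.readdirSync(dir)
    .filter(function (file) {
      return file.indexOf(locale + '.') === 0 && file !== locale + '.json'
    })
    .forEach(function (file) {
      var prefix = file.slice(locale.length + 1, -'.json'.length)
      data[prefix] = JSON.parse(fs.readFileSync(dir + '/' + file, 'utf8'))
    })
  return data
}

console.log(loadLocale('./locales', 'en').landing.title)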

Async VS. Sync

or: The age old question of letting the user wait.

In Node.js, eventually everything is usually better async. However, most if not all libraries discussed here are synchronous. One can argue that this is good that way, because the language definition doesn't change very often and we just need to make sure that everything is loaded ahead of time (fs.readFileSync helps. Right?). In many cases you don't run into any issues with this strategy. In two cases you do:

  1. Loading of language data on start of a website: Unless you partition your data well, the loading of the data can take some time on the client. Time during which the user waits...
  2. Combination of async data: __('Hello {{key}}', {key: fs.createReadStream('./name')}) - who said that the values need to be synchronous data? Interacting with streams and other async data as your data source is really tough.

Streams and async data become a real issue if you have a lot of translation going on. There is no translation method I know of that supports async translations, so I am giving here a shot at an API for a possible i18n-core-async:

__(console.log, 'some.key', {arg: 'x'}, function () {
  // called once the translated string has been written
})

This way we can pass the output of the key to console.log and then move on.

Template Strings

or: This ES6 feature could revolutionize translation.

Usually, strings have to be parsed for possible placeholders, often many times over. ES6 adds a new way to write strings, called template strings. Those strings basically fulfill the job of a parser & compiler. However: on the client we still depend on transpilers like Babel, and we cannot expect the feature to be available everywhere. Still, I am looking forward to the first library that ditches the custom compiler and goes with 100% template strings.
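
A sketch of how a tagged template string could replace the placeholder parser - the i18n tag and its tiny translation table are entirely made up:

var translations = {
  'Hello {0}!': 'こんにちは {0}!'
}

function i18n (strings, ...values) {
  // Rebuild the lookup key with numbered placeholders …
  var key = strings.reduce(function (acc, str, i) {
    return acc + (i > 0 ? '{' + (i - 1) + '}' : '') + str
  }, '')
  var translated = translations[key] || key
  // … then substitute the runtime values back in.
  return translated.replace(/\{(\d+)\}/g, function (_, i) {
    return values[i]
  })
}

var name = 'Martin'
console.log(i18n`Hello ${name}!`) // → こんにちは Martin!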

Global referencing

or: Every time I have to pass in the same variable.

WIP

Contexts

or: I really don't want to know why you need contexts for translations.

WIP

Markdown

WIP

Intl

or: How the W3C isn't able to submit a MessageFormat

WIP

i18n Streams

or: Translate everything, every last, tiny bit of it!

WIP
