python – stream of bytes

Since the beginning of time, AMO has been using a unique localization format codenamed “monopo”. It is based on gettext, but instead of using the source-language string as an ID:

msgid "What's your name?"
msgstr "¿Cómo te llamas?"

it uses a unique identifier as an ID and lists places where the entity is used in a comment above it:

#: views/addons/policy.thtml:78 views/addons/policy.thtml:79
msgid "test_whats_name"
msgstr "¿Cómo te llamas?"

This customization makes the PO file work very differently. You cannot localize using your PO file alone: you don’t know the source string you’re translating, so you need both the en-US file and your own file, and you have to compare them. But the greater issue is the incompatibility of this format with most gettext tools and libraries.
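To make that comparison concrete, here is a minimal Python sketch of the side-by-side lookup a monopo localizer had to do. The parser, the function names and the sample en-US entry are illustrative assumptions, not AMO code, and only single-line msgid/msgstr pairs are handled.

```python
import re

# Matches one single-line msgid/msgstr pair; multi-line strings and
# escape sequences are deliberately ignored in this sketch.
ENTRY_RE = re.compile(r'^msgid "(.*)"\nmsgstr "(.*)"$', re.MULTILINE)


def parse_pairs(po_text):
    """Return {msgid: msgstr} for a simplified PO/monopo text."""
    return dict(ENTRY_RE.findall(po_text))


def side_by_side(en_us_text, locale_text):
    """Pair each monopo ID with its en-US source and its translation."""
    source = parse_pairs(en_us_text)
    target = parse_pairs(locale_text)
    return {key: (source[key], target.get(key, "")) for key in source}


# Hypothetical sample data in monopo style.
EN_US = '''#: views/addons/policy.thtml:78
msgid "test_whats_name"
msgstr "What's your name?"
'''

ES_ES = '''#: views/addons/policy.thtml:78
msgid "test_whats_name"
msgstr "¿Cómo te llamas?"
'''

pairs = side_by_side(EN_US, ES_ES)
print(pairs["test_whats_name"])
```

With standard gettext the source string is the msgid itself, so this extra join disappears.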

Gettext is probably the most popular localization format in the open source world. As a result, over the years a huge number of localization tools, libraries and APIs for various languages have been created to simplify and support localization efforts. Because of the po->monopo switch, AMO localizers were limited in their choice of tools, and even internally the AMO equivalents of a get_entity function (named ___() and n___() for plurals) were pretty complex.

What once served us well was not a choice we wanted to keep forever, and by the power of Bugzilla I was assigned to challenge the status quo.

At the beginning there was darkness and lightning, all across the Earth… ekhm, not this one.

This hydra had four heads and all of them had to be attacked simultaneously in order to minimize the SNAFU window while in transition:

AMO localization code has been crafted to work with monopo
AMO templates and code have been calling for entities by the unique ID: ___(‘test_whats_name’) instead of ___(‘What’s your name?’)
AMO localization files were written using the monopo syntax
Our toolchain (like export-po.py or Verbatim) was customized to work with monopo

As for the format, we decided to switch to standard gettext and keep context where needed (for example, when there are multiple strings with the same English text, so that their IDs do not collide).

On the 6th of August I had the weapons ready, and Wil Clouser decided to make the switch before the next AMO release, giving us barely 3 days to switch. Brave man he is.

We did the switch, and on the 13th of August the first battle was called. We won on all fronts except the template switch, which, contrary to the plan, removed the context reference from the function calls.

As a result, in a few places where we had kept the context, we did not ask for it. So instead of ___(‘What’s your name?’, ‘test_whats_name’) we had only ___(‘What’s your name?’). This call did not work because the entity in the .po file had both a context and an ID, and such an entity cannot be looked up by ID alone.
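A small Python sketch shows why the context-less call misses. It assumes a catalogue keyed by the (context, ID) pair and a lookup that falls back to the source string on a miss, which is the usual gettext behaviour. The `translate` function is a hypothetical stand-in for ___(), not the AMO implementation.

```python
# Catalogue keyed by (msgctxt, msgid); None would mean "no context".
CATALOG = {
    ("test_whats_name", "What's your name?"): "¿Cómo te llamas?",
}


def translate(msgid, msgctxt=None):
    """Hypothetical stand-in for ___(): return the source string on a miss."""
    return CATALOG.get((msgctxt, msgid), msgid)


with_context = translate("What's your name?", "test_whats_name")
without_context = translate("What's your name?")

print(with_context)
print(without_context)
```

The second call looks for the key `(None, "What's your name?")`, which does not exist, so the user would see the untranslated English string.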

Yesterday I finished crafting a new weapon to fix this. The plan was to take the en_US po file, iterate through all templates and PHP files in the AMO code, and check whether each entity ID matches one from the en_US po file and whether that po file entity has a context. If so, replace the function call with a new one containing the context.
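The rewriting step of such a plan could be sketched in Python roughly as follows. This is not the actual script: the regular expression, the `add_context` name, the sample template line and the `contexts` mapping (one context per ID, the simple case) are all assumptions made for illustration.

```python
import re

# Single-argument calls such as ___('Recover your password'). Calls that
# already pass a second argument do not match, because the pattern requires
# the closing parenthesis right after the first string.
CALL_RE = re.compile(r"""(?<!\w)___\(\s*(['"])((?:\\.|(?!\1).)*)\1\s*\)""")


def add_context(source, contexts):
    """Rewrite ___('id') to ___('id', 'ctx') for every id found in contexts."""
    def fix(match):
        quote, raw = match.group(1), match.group(2)
        # Undo escaped quotes so the key matches the msgid in the PO file.
        msgid = raw.replace("\\" + quote, quote)
        ctx = contexts.get(msgid)
        if ctx is None:
            return match.group(0)  # no context for this entity, keep the call
        return "___({q}{raw}{q}, {q}{ctx}{q})".format(q=quote, raw=raw, ctx=ctx)
    return CALL_RE.sub(fix, source)


# Hypothetical template fragment and context map.
template = "<h1><?=___('Recover your password')?></h1> <?=___('Log in')?>"
contexts = {"Recover your password": "tools_link"}

fixed = add_context(template, contexts)
print(fixed)
```

Calls whose ID has no context in the PO file are left untouched, so the script can safely be run over the whole code base.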

The only remaining issue was that the cases we were trying to fix were exactly those where several entities in our pool shared one ID and were distinguished by context only. Yet we were using the ID to match an entity from the PO file with an entity from the code. How do we know if we matched the right entity?

Consider the following case. We have an entity with the ID “Recover your password”. It may be a window title, a link to a recovery tool or the title of an email. It may be translated differently depending on the context (and it would be translated differently in Polish), so we want to have multiple entities here. Since gettext uses the English string as an ID, we cannot have 3 entities with the same ID, so we need to add what’s called a msgctxt. It looks like this:

msgctxt "email_title"
msgid "Recover your password"
msgstr "Odzyskaj hasło"

msgctxt "tools_link"
msgid "Recover your password"
msgstr "Odzyskiwanie hasła"

Now, in the broken template file we have ___(‘Recover your password’) and we need to match it to one of the entities to replace it with ___(‘Recover your password’, ‘tools_link’). But how do we get the right entity?

Fortunately, we were storing source comments for each entity, which gave us a list of template files in which the entity was used, and we preserved them during the transition. It looked like this:

#: views/addons/policy.thtml:78 views/addons/email.thtml:19
msgctxt "email_title"
msgid "Recover your password"
msgstr "Odzyskaj hasło"

#: views/addons/menu.thtml:28 views/addons/layout.thtml:79
msgctxt "tools_link"
msgid "Recover your password"
msgstr "Odzyskiwanie hasła"

Now we just took the comment, extracted the file names from it and checked whether the template file we were fixing was on that list. If it was, we were using the right entity. If not, we tried the next one.
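The matching described above can be sketched in Python as follows. It is a minimal illustration under simplifying assumptions (single-line strings, entries separated by blank lines), and the `parse_entries` and `find_context` names are hypothetical rather than taken from the real script.

```python
import re

PO_TEXT = '''#: views/addons/policy.thtml:78 views/addons/email.thtml:19
msgctxt "email_title"
msgid "Recover your password"
msgstr "Odzyskaj hasło"

#: views/addons/menu.thtml:28 views/addons/layout.thtml:79
msgctxt "tools_link"
msgid "Recover your password"
msgstr "Odzyskiwanie hasła"
'''


def parse_entries(po_text):
    """Parse a simplified PO text into dicts with files, msgctxt and msgid."""
    entries = []
    for block in po_text.strip().split("\n\n"):
        entry = {"files": set(), "msgctxt": None, "msgid": None}
        for line in block.splitlines():
            if line.startswith("#:"):
                # References look like "path:line"; keep only the path.
                for ref in line[2:].split():
                    entry["files"].add(ref.rsplit(":", 1)[0])
            else:
                match = re.match(r'(msgctxt|msgid) "(.*)"$', line)
                if match:
                    entry[match.group(1)] = match.group(2)
        entries.append(entry)
    return entries


def find_context(entries, msgid, template_path):
    """Return the msgctxt of the entry whose source comment lists template_path."""
    for entry in entries:
        if entry["msgid"] == msgid and template_path in entry["files"]:
            return entry["msgctxt"]
    return None  # no entity with this ID is used in that template


entries = parse_entries(PO_TEXT)
print(find_context(entries, "Recover your password", "views/addons/menu.thtml"))
```

If the same template used the same string in two different contexts, the first match would win, so a real script would need to compare line numbers as well.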

With this script in hand, last night we did the final switch, marked the bug as fixed and are open for last commits before the new AMO release!

The whole thing was possible thanks to the cooperation of the L10n-drivers and Addons teams and Wil Clouser’s decision not to hit the nail on the head. As of today, AMO is a quite canonical gettext project 🙂

What’s next?

We’ll now work on tools (like extract-po.py) to regain the ability to narrow down localization sections. Previously a localizer could use the ID to find out which entities came from the statistics or editor sections and deprioritize them. Now we’ll try to provide that info while extracting entities.

Projects need releases. It’s important. It’s like a birthday for a project: it gets a milestone to mark the progress.
On the other hand we have developers. They need unlimited time and no deadlines. When the two meet, we get an interesting arm-wrestling match, but ultimately one has to obey the Oath of the Bazaar, if you know what I mean.

Release

So, here we are. Silme has been asking for a release for long enough, and I have postponed it over and over, so it’s time to make the cut. Today, I’m proud to announce the very first official release of Silme, a Python l10n library. Silme was announced on mozilla.dev.l10n a long time ago, and since then it has been continuously developed in a small but quite interesting project structure, with support from Adrian Kalla, Stefan Plewako, Ricardo Palomares and Staś Małolepszy, and management guidance from Seth Bindernagel.

It’s very, very hard to explain the Silme concept to those who have never worked on localization development.

Let me try: It’s like a DOM API for localization.

Works? Probably not… Well. Let me try the descriptive way. Silme is a toolset for a developer who wants to work on localization tools. It can read localization files, it can write them, it can modify them, it can search through them, it can process them, merge, split, localize and help you get some statistics out of the localization files. It probably can juggle them, although support for this is rather experimental.

Continue reading “Silme 0.5 released”
