The Case for Multilingual Educational Distributions

We have a tendency to make the same mistakes again and again. Educational distributions are not an exception.

Copyleft © Juan Rafael Fernández García July 2007.

Index

Multilingualization is not a simple process. Five aspects have to be taken into account:

I. Introduction
II. Locales
III. Fonts
IV. Input (keyboard and Input Methods)
V. Voices

What's the state of m17n in free software? Let's see.

I. Introduction

The Reason Why (I).

Normally distributions respond

I. Introduction

The Reason Why (II): Spotting the Problem

The needs of an educational distribution are different:

II. Locales

The Free Software Solution

In the world of free software, the three tasks of internationalization (i18n), localization (l10n) and multilingualization (m17n) all rely on gettext technology.

How does a sysadmin cope with them? By means of locales.

II. Locales: What does "dpkg-reconfigure locales" do?

(An analysis of /var/lib/dpkg/info/locales.postinst)

  1. from the list of valid supported locales (at /usr/share/i18n/SUPPORTED), create /etc/locale.gen with the selected ones
  2. generate the selected locales ("locale-gen invokes localedef for the chosen localisation profiles")
  3. if the admin chose to set a system-wide default environment locale, xx_XX.UTF-8 for example, /usr/sbin/update-locale "LANG=xx_XX.UTF-8" writes "LANG=xx_XX.UTF-8" into /etc/default/locale
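
The first step can be pictured as a simple filter over the supported list. The following Python sketch is only an illustration, not the postinst's actual code: the function name build_locale_gen and the sample data are assumptions, and it takes for granted that each line of the list has the form "<locale> <charset>".

```python
def build_locale_gen(supported_text, selected):
    """Return the lines of the supported list whose locale name was selected.

    Assumes each non-comment line looks like "<locale> <charset>",
    e.g. "es_ES.UTF-8 UTF-8".
    """
    wanted = set(selected)
    kept = []
    for line in supported_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name = line.split()[0]
        if name in wanted:
            kept.append(line)
    return ("\n".join(kept) + "\n") if kept else ""


# Sample excerpt (illustrative, not the real file contents).
SAMPLE_SUPPORTED = """\
de_DE.UTF-8 UTF-8
de_DE ISO-8859-1
es_ES.UTF-8 UTF-8
es_ES@euro ISO-8859-15
it_IT.UTF-8 UTF-8
"""

print(build_locale_gen(SAMPLE_SUPPORTED, ["es_ES.UTF-8", "it_IT.UTF-8"]), end="")
```

The result is the text that would end up in /etc/locale.gen; generating the locales themselves is still the job of locale-gen.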

II. Locales: How Are They Used?

The Configuration Mess

Locales can be set system-wide (at /etc/default/locale, or /etc/environment, or /etc/profile, or /etc/bash.bashrc...) ...

... or at a user's level (at ~/.bash_profile, but also at ~/.xsession, or anywhere in ~/.xsession.d/, or the configuration files of Gnome, KDE, XFCE...) ...

... or at application level: LC_ALL=it_IT.UTF-8 date

The cause: historical. The outcome: high chances of problems.
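
Whichever file sets them, these settings end up as environment variables, and the POSIX lookup order is LC_ALL first, then the category variable (LC_TIME, LC_MESSAGES...), then LANG. The Python sketch below only models that order; the function name effective_locale is made up for the illustration, and falling back to "C" when nothing is set is an assumption.

```python
def effective_locale(env, category):
    """Resolve the locale for one category (e.g. "LC_TIME") from an environment.

    Order: LC_ALL, then the category variable, then LANG.
    Empty values count as unset. Falling back to "C" is an assumption here.
    """
    for var in ("LC_ALL", category, "LANG"):
        value = env.get(var, "")
        if value:
            return value
    return "C"


system_wide = {"LANG": "es_ES.UTF-8"}
one_command = dict(system_wide, LC_ALL="it_IT.UTF-8")

print(effective_locale(system_wide, "LC_TIME"))
print(effective_locale(one_command, "LC_TIME"))
```

This is why a single LC_ALL assignment hidden in any of the files listed above is enough to defeat every other setting.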

II. Locales: Does the System Work?

Sort of, only sort of.

The third question is the easiest one, and all modern distributions got it right: UTF-8.

It's not necessary to stress that UTF-8 locales have solved many of the problems that legacy locales created.

The main advantage is that all characters can be displayed independently of the rules (e.g. dictionary order or currency sign) that apply to a single language and/or country.

So what's the problem?

II. Locales. Problem I

How is a new language added?

In recent Ubuntu releases, privileged users can use a widget (provided by language-selector). There's a whole system of dependencies: language-pack-xx brings language-pack-xx-base and recommends language-support-xx, which itself depends on the mozilla, openoffice.org and dictionary packages.

In Debian there used to be language tasks. Now a whole lot of packages have to be searched for and added manually.

What if the distro is fixed on a CD or administered remotely? Simple - you lose.

II. Locales. Problem II

Usage of pre-UTF-8 locales as defaults

/etc/locale.alias = /usr/share/locale/locale.alias is outdated

	Copyright (C) 1996-2001,2003
	deutsch         de_DE.ISO-8859-1
	spanish         es_ES.ISO-8859-1
	...

	$ LC_ALL=spanish date
		s�b jun 30 22:07:41 CEST 2007
	$ LC_ALL=es_ES.UTF-8 date
		sáb jun 30 22:07:55 CEST 2007
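
The garbled character in the output of the first command above can be reproduced without touching the system locales. This Python sketch assumes a terminal that decodes everything it receives as UTF-8; the variable names are illustrative.

```python
text = "s\u00e1b"  # "sáb", the abbreviated day name printed by date

latin1_bytes = text.encode("iso-8859-1")   # what the legacy alias emits
utf8_bytes = text.encode("utf-8")          # what es_ES.UTF-8 emits

# A UTF-8 terminal tries to decode whatever bytes it receives as UTF-8.
shown_legacy = latin1_bytes.decode("utf-8", errors="replace")
shown_utf8 = utf8_bytes.decode("utf-8", errors="replace")

print(latin1_bytes, repr(shown_legacy))
print(utf8_bytes, repr(shown_utf8))
```

The single byte 0xE1 is not a valid UTF-8 sequence on its own, so it is shown as the replacement character.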
	

II. Locales. Problem II

Usage of pre-UTF-8 locales as defaults (cont.)

localeconf's main.py

	$Progeny: main.py,v 1.73 2002/06/11
	self.language_values = [ "en_US ISO-8859-1",
	                         "en_GB ISO-8859-1",
	                         "en_CA ISO-8859-1",
	                         "fr_FR@euro ISO-8859-15",
	                         "fr_CA ISO-8859-1",
	                         "de_DE@euro ISO-8859-15",
	                         "es_ES@euro ISO-8859-15",
	                         "es_MX ISO-8859-1" ]
	

eurosupport (how to set latin9)
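
Legacy entries such as the ones hardcoded in main.py have straightforward UTF-8 counterparts. The following Python sketch shows one possible mapping; to_utf8_entry is a hypothetical helper, not part of localeconf, and dropping only the "@euro" modifier while keeping any other modifier is an assumption of this sketch.

```python
def to_utf8_entry(legacy_entry):
    """Map e.g. "es_ES@euro ISO-8859-15" to "es_ES.UTF-8 UTF-8".

    Only the "@euro" modifier is dropped (UTF-8 already includes the
    euro sign); other modifiers are kept after the codeset.
    """
    name = legacy_entry.split()[0]
    name, _, modifier = name.partition("@")
    name = name.split(".")[0]          # drop any explicit codeset
    suffix = "" if modifier in ("", "euro") else "@" + modifier
    return f"{name}.UTF-8{suffix} UTF-8"


LEGACY = ["en_US ISO-8859-1", "fr_FR@euro ISO-8859-15", "es_ES@euro ISO-8859-15"]

for entry in LEGACY:
    print(entry, "->", to_utf8_entry(entry))
```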

II. Locales. Problem III

Hardcoded Locales

Hidden Hardcoding

Case One. A user has set a locale in ~/.bashrc and has an ~/.xsession file (or something in ~/.xsession.d) that sources it, so the locale chosen at login is silently overridden:

	
	if [ -f ~/.bashrc ];
	then
		source ~/.bashrc
	fi

(check your /etc/skel)
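
Hidden assignments of this kind can be hunted down mechanically. The Python sketch below scans the text of a startup file for locale variables; the regular expression, the function name find_locale_assignments and the sample file are all assumptions made for the illustration, and the pattern only covers simple one-line assignments.

```python
import re

# Matches e.g. "export LANG=es_ES.UTF-8" or "LC_ALL='it_IT.UTF-8'".
LOCALE_ASSIGNMENT = re.compile(
    r"^\s*(?:export\s+)?(LANG|LANGUAGE|LC_[A-Z_]+)=[\"']?([^\"'\s;#]*)"
)


def find_locale_assignments(script_text):
    """Return (line number, variable, value) for each locale assignment found."""
    hits = []
    for number, line in enumerate(script_text.splitlines(), start=1):
        match = LOCALE_ASSIGNMENT.match(line)
        if match:
            hits.append((number, match.group(1), match.group(2)))
    return hits


SAMPLE_BASHRC = """\
# ~/.bashrc (made-up example)
alias ll='ls -l'
export LANG=es_ES.UTF-8
LC_ALL="es_ES.UTF-8"
"""

for hit in find_locale_assignments(SAMPLE_BASHRC):
    print(hit)
```

Running the same check over the files in /etc/skel shows what every newly created account will inherit.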

II. Locales. Problem III

Hidden Hardcoding (cont.)

Case Two. Gnome vs. KDE locale selection

In gdm, select the KDE session and Italian as the language: it won't work.

Why? Because KDE doesn't heed the locale gdm sends it; it reads its own setting instead:

	~/.kde/share/config/kdeglobals
	[Locale]
	Country=es
	Language=es
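
A conflict of this kind can be detected by comparing the [Locale] section with the locale requested for the session. This is a minimal Python sketch, assuming the file can be read as INI text and that Language may hold a colon-separated list whose first item wins; kde_language_conflict is a hypothetical helper.

```python
import configparser


def kde_language_conflict(kdeglobals_text, session_locale):
    """Return True if the [Locale] Language differs from the session locale's language."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(kdeglobals_text)
    kde_language = parser.get("Locale", "Language", fallback=None)
    if kde_language is None:
        return False
    session_language = session_locale.split("_")[0].split(".")[0]
    return kde_language.split(":")[0] != session_language


KDEGLOBALS = """\
[Locale]
Country=es
Language=es
"""

print(kde_language_conflict(KDEGLOBALS, "it_IT.UTF-8"))
print(kde_language_conflict(KDEGLOBALS, "es_ES.UTF-8"))
```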

II. Locales. Problem IV

Localepurge, Our Most Dangerous Enemy

Localepurge was created to save disk space, and is used heavily.

dpkg-reconfigure localepurge creates /etc/locale.nopurge

Even if we reconfigure the set of languages to keep, how do we recover the files already deleted, other than by reinstalling every package they came from?
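
Before adding a language to a system where localepurge has run, it helps to know which locales were spared. The Python sketch below reads a locale.nopurge-style text under an assumed format: "#" starts a comment, all-uppercase words are option keywords, and every other non-empty line names a locale to keep. The function name and the sample content are illustrative.

```python
def kept_locales(nopurge_text):
    """Return the set of locale names listed in a locale.nopurge-style file.

    Assumed format: '#' starts a comment, all-uppercase words are
    option keywords, every other non-empty line names a locale to keep.
    """
    kept = set()
    for line in nopurge_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line or line.isupper():
            continue
        kept.add(line)
    return kept


SAMPLE_NOPURGE = """\
# illustrative content, not a real configuration
MANDELETE
SHOWFREEDSPACE
en
es
es_ES.UTF-8
"""

kept = kept_locales(SAMPLE_NOPURGE)
for wanted in ("es", "it"):
    print(wanted, "kept" if wanted in kept else "purged")
```

A locale reported as purged here has already lost its files in every installed package, which is exactly the recovery problem raised above.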

III. Fonts

A UTF-8 locale requires

How can we tell? By using pangrams

	LC_ALL=fr_FR.UTF-8 gnome-font-viewer Isabella.ttf

TrueType and OpenType fonts can contain no more than 65,535 glyphs, so no single font can cover the whole UCS, which has over 1.1 million code points. In practice only the first 65,536 of them (Plane 0, the Basic Multilingual Plane or BMP) have entered into common use.
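
The arithmetic, and the pangram idea, can be sketched in a few lines of Python. The coverage set passed to missing_from_font is hypothetical (a font that only covers printable ASCII); reading the real coverage out of a font file would need a font library and is left out of this sketch.

```python
BMP_LAST = 0xFFFF          # last code point of Plane 0
UCS_LAST = 0x10FFFF        # last code point of the whole code space


def outside_bmp(text):
    """Return the characters of text that lie beyond the Basic Multilingual Plane."""
    return [ch for ch in text if ord(ch) > BMP_LAST]


def missing_from_font(text, covered_code_points):
    """Return the non-space characters of text absent from a font's coverage (a set of ints)."""
    return sorted(
        {ch for ch in text if not ch.isspace() and ord(ch) not in covered_code_points}
    )


ASCII_ONLY = set(range(0x20, 0x7F))   # hypothetical font coverage

print(UCS_LAST + 1)                               # size of the code space
print(outside_bmp("a\u00e6\U0001D11E"))           # the G clef lives in Plane 1
print(missing_from_font("c\u0153ur d\u00e9\u00e7u", ASCII_ONLY))
```

A pangram in the target language works as test input precisely because it exercises every letter the language needs.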

III. Fonts - Are Free Fonts Complete?

Again sort of. But are they in your distro? And secondly, did anyone remember to install them?

Free generic Unicode fonts available in Debian Lenny: (see http://en.wikipedia.org/wiki/Unicode_typefaces)

Font                                                      No. characters   No. glyphs
unifont (bitmap)                                          33580            33583
linux-libertine                                           1982             1985
ttf-bitstream-vera
ttf-dejavu                                                3525             3611
ttf-gentium                                               1469             1699
ttf-freefont                                              3914             5257
ttf-georgewilliams (Monospace, Caslon, Caliban, Cupola)   Caslon: 3684     Caslon: 3686
ttf-junicode                                              2235             2256
ttf-sil-charis                                            1958             3084
ttf-sil-doulos                                            1958             3083

ttf-sil-charis: a comprehensive inventory of glyphs needed for almost any Roman- or Cyrillic-based writing system, whether used for phonetic or orthographic needs. In addition, there is provision for other characters and symbols useful to linguists.

IV. Input - the keyboard

Configuring the keyboard

You can exploit /usr/share/X11/xkb/symbols/xx to your advantage

	key  {[          a, A, ae, AE                                 ]};
	key  {[          o, O, oslash, Oslash                         ]}; 

gives us «a A æ Æ o O ø Ø»
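
Each of the two definitions above lists four symbols, one per level. The Python sketch below models how a level is picked, assuming the usual arrangement in which Shift selects level 2 and AltGr selects level 3; the dictionary keys "a" and "o" are only labels for this sketch, not xkb key names.

```python
# One entry per key, one character per level (1 to 4), following the
# two definitions above: plain, Shift, AltGr, AltGr+Shift.
LEVELS = {
    "a": ["a", "A", "\u00e6", "\u00c6"],   # a A ae AE
    "o": ["o", "O", "\u00f8", "\u00d8"],   # o O oslash Oslash
}


def typed_char(key, shift=False, altgr=False):
    """Pick the level: 1 plain, 2 Shift, 3 AltGr, 4 AltGr+Shift."""
    level = 1 + int(shift) + 2 * int(altgr)
    return LEVELS[key][level - 1]


row = [
    typed_char(key, shift, altgr)
    for key in ("a", "o")
    for altgr in (False, True)
    for shift in (False, True)
]
print(" ".join(row))
```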

Or take advantage of the possibility of switching the keyboard layout: Desktop - Preferences - Keyboard - Greek polytonic (Griego politónico)

	ὸ ὰνθροπος
	αᾳᾶὰά

IV. Input - input methods

e.g. XIM: AltGr + keys

	æßðđŋ@ł€¶ŧ«»¢“”nµ

Or hold Ctrl+Shift and type "u" followed by the character's hexadecimal Unicode code point
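
The hexadecimal number is simply the Unicode code point of the character. As a small illustration in Python (char_from_hex is a name made up for this sketch):

```python
def char_from_hex(hex_digits):
    """Return the character whose code point is written in hexadecimal."""
    return chr(int(hex_digits, 16))


for digits in ("e6", "df", "20ac"):
    print(digits, char_from_hex(digits))
```

These three code points correspond to the first, second and eighth characters of the AltGr sample line above.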

e.g. SCIM: Ctrl + Space

阿ㄙㄉㄑㄖㄊ

V. Voices

OK, now we know how reading and writing can be implemented. But are they the only ways we interface with language? Don't we speak, don't we listen to people talking?

How does free software fare with multilingual voice synthesis? (Voice recognition lags way behind.)

Again, it's not far off, but it isn't there yet.

V. Voices

The Tasks Ahead

We need

V. Voices

Some examples

We are going to listen to examples of synthesis created with the versions of festival and espeak available in Debian Testing, using free voices.

Espeak.

	espeak -vde -f voice_tests/aleman_utf8.txt -w voice_tests/aleman.wav; \
	aplay voice_tests/aleman.wav
	espeak -vfr -f voice_tests/frances_utf8.txt -w voice_tests/frances.wav; \
	aplay voice_tests/frances.wav
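
With more than a couple of languages, the same two commands are easier to drive from a loop. This Python sketch builds the same argument lists as the commands above and runs them in order; the helper names are hypothetical, and the file naming scheme is simply the one used in these examples.

```python
import subprocess


def espeak_to_wav(voice, text_file, wav_file):
    """Build the argument list for: espeak -v<voice> -f <text_file> -w <wav_file>."""
    return ["espeak", f"-v{voice}", "-f", text_file, "-w", wav_file]


# (voice, base name) pairs taken from the two commands above.
SAMPLES = [("de", "aleman"), ("fr", "frances")]


def synthesize_all(run=subprocess.run):
    """Synthesize each sample, then play it with aplay; stops at the first failure."""
    for voice, name in SAMPLES:
        wav = f"voice_tests/{name}.wav"
        run(espeak_to_wav(voice, f"voice_tests/{name}_utf8.txt", wav), check=True)
        run(["aplay", wav], check=True)
```

Calling synthesize_all() on a machine with espeak and aplay installed replays the whole demonstration; unlike the ";" in the shell version, check=True stops at the first command that fails.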

V. Voices

Voice Synthesis with Festival

	festival --language italian --tts voice_tests/italiano_latin1.txt
	festival --language english --tts voice_tests/ingles.txt
	festival --language spanish --tts voice_tests/espanol_latin1.txt
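
The file names suggest that the Festival inputs are Latin-1 encoded while the espeak ones are UTF-8. Assuming that is indeed the case, a test text written in UTF-8 would need re-encoding first; the following Python sketch shows the conversion, with utf8_to_latin1 as a hypothetical helper.

```python
def utf8_to_latin1(data):
    """Re-encode UTF-8 bytes as ISO-8859-1.

    Raises UnicodeEncodeError if a character has no Latin-1 equivalent.
    """
    return data.decode("utf-8").encode("iso-8859-1")


sample = "\u00bfqu\u00e9 tal?".encode("utf-8")   # "¿qué tal?"
print(utf8_to_latin1(sample))
```

The error case matters: a text containing the euro sign, for instance, cannot be converted, since ISO-8859-1 has no such character.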

A festival example session

	$ festival
	festival> (voice.list)
	festival> (voice_Indisys_MP_es_pa_diphone)
	festival> (intro-spanish)
	festival> (SayText "hola, amigos")
	festival> (tts "voice_tests/espanol_latin1.txt" nil) 
	festival> (quit)