Digital PDFs

AA-LY26B-TE

June 1990

61 pages

Original

2.1MB

Document:	ULTRIX Guide to Developing International Software
Order Number:	AA-LY26B-TE
Revision:	0
Pages:	61
Original Filename:

OCR Text

ULTRIX

Guide to Developing
International Software

Order Number: AA-L Y26B-TE

ULTRIX

Guide to Developing
International Software
Order Number: AA-LY26B-TE
June 1990

Product Version:

digital equipment corporation
maynard, massachusetts

ULTRIX Version 4.0 or higher

Restricted Rights: Use, duplication, or disclosure by the U.S. Government is subject to restrictions as set forth in
subparagraph (c) (1) (ii) of the Rights in Technical Data and Computer Software clause of DFARS 252.227-7013.

The information in this document is subject to change without notice and should not be construed as a commitment
by Digital Equipment Corporation. Digital Equipment Corporation assumes no responsibility for any errors that may
appear in this document.
The software described in this document is furnished under a license and may be used or copied only in accordance
with the terms of such license.
No responsibility is assumed for the use or reliability of software on equipment that is not supplied by Digital or its
affiliated companies.

The following are trademarks of Digital Equipment Corporation:

IJllmaamD
CDA
DDIF
DDIS
DEC
DECnet
DEC station

DECUS
DECwindows
DTIF
MASSBUS
MicroVAX
Q-bus
ULTRIX
ULTRIX Mail Connection

ULTRIX Worksystem Software
VAX
VAXstation
VMS
VMS/ULTRIX Connection
VT
XUI

UNIX is a registered trademark of AT&T in the USA and other countries.
XlOpen is a trademark of X/OPEN Company Ltd.

Contents

About This Manual
Audience

vii

Organization

vii

Conventions

viii

Internationalization Overview

1.1

The Purpose of Internationalization

1-1

1.2

The ULTRIX Internationalization Solution

1-2
1-2

1.2.1

International Keyboard Support

The Message Catalog System

2.1

Creating a Message Catalog

2.2

String Extraction

2-2

2.3

Format of the Message Text Source File

2-3

2.3.1
2.3.2

Set and Message Numbers ..............................................................
Mnemonics ....................................................................................

2-3
2-5

... .... . . ....... . ..... . . .. . ..... . .. . ... ........ ........... .......... ... . ............. . .... .

2--6

2.4

Using gencat

2.5

Library Routines

2.5.1
2.5.2

.......................................................................

.......................................................................................

Using catopen
Using catgets
..............................................................................................

2-1

2-8
2-8
2-8

2.6

Using trans

2-9

Program Localization

3.1

The Announcement Mechanism

3-2

3.2

Announcement Categories

3-3

................................................................................

3-4
3-4
3-4

...........................................................................................

3-5

3.3

Setting the Program Locale

3.4

Setting a Specific Category

3.5

Setting all Categories

3.6

The C Locale

3.7

Internationalized Program Example

Language Support Databases

4.1

The Codeset Definition

4.2

The Property Table

4.3

The Collation Table

4.4

The String Table

4.5

The Conversion Tables

Database Source Language Syntax Description

A.l

Rules for Building Identifiers

A-I

A.2

Rules for Building Strings

A-I

A.3

Rules for Building Constants

A-I

A.4

Rules for Separating Tokens, Specifying Comments, and Using Directives

A-2

A.5

EBNF Description

A-2

Example Source Language File

Associated Reference Pages

3-5

4-2
4-3

................................................................................. .

4-4

..................................................................................... .

4-7
4-8

...................................................................................

Glossary
Index
Examples

A-I: EBNF Description of the Database Source Language. .....................................

4-1
A-3

B-1: Example of a Language Support Database Source File

B-1

4-1: Structure of the Source File

ivContents

...................................

Figures
2-7

2-1: Creating a Message Catalog

Tables
2-1: Escape Sequences Recognized by gencat

......................................................

2-4

4-1: Properties and Character Classification

............... .... .................... ..................

4-4

4-2: Examples of Primary and Secondary Weighting
4-3: Mandatory Strings in the String Table

............................................

4-7

..........................................................

4-8

Contents v

About This Manual

The ULTRIX Internationalization package provides tools and functions to allow you
to write software that can be used in a number of nations. The program interface
appears to users in each nation as if designed for that nation's users. For example,
messages appear in the native language of the user and the full character set for the
user's language is available.

Audience
This guide is intended for experienced ULTRIX application programmers writing
software intended for multinational or non-English language use. Translators who
translate the messages displayed by international software might also find this guide
useful. Application programmers should read this entire guide in conjunction with
the internationalization package reference pages. Translators should find Chapter 1,
Chapter 2, Appendix B, and the trans(lint) translation editor reference page the
most useful.

Organization
This guide consists of four chapters, three appendixes, and a glossary.
Chapter 1

Internationalization Overview
Introduces the basic concepts and components of the ULTRIX
internationalization package.

Chapter 2

The Message Catalog System
Describes message catalogs, international library routines, and
the associated tools you use to generate and fill them.

Chapter 3

Program Localization
Explains how the language requirements of international
software are announced to the system.

Chapter 4

Language Support Databases
Describes the language support databases that allow programs
to operate in various native languages, and the source language
used to create input files for the i c compiler.

Appendix A

Database Source Language Syntax Description
Gives an Extended Backus-Naur Form (EBNF) notation of the
syntax recognized by the ic compiler.

Appendix B

An Example of a Source Language File
Gives an example source file for a language support database.

Appendix C

A List of Associated Reference Pages
Lists the ULTRIX reference pages associated with the
internationalization package.

Conventions
The following conventions are used in this manual:
%

The default user prompt is your system name followed by a right
angle bracket. In this manual, a percent sign ( % ) is used to
represent this prompt.

user input

This bold typeface is used in interactive examples to indicate
typed user input.

system output This typeface is used in interactive examples to indicate system
output and also in code examples and other screen displays. In
text, this typeface is used to indicate the exact name of a
command, option, partition, pathname, directory, or file.
UPPERCASE
lowercase

The ULTRIX system differentiates between lowercase and
uppercase characters. Literal strings that appear in text,
examples, syntax descriptions, and function definitions must be
typed exactly as shown.

rlogin

In syntax descriptions and function definitions, this typeface is
used to indicate terms that you must type exactly as shown.

filename

In examples, syntax descriptions, and function definitions, italics
are used to indicate variable values; and in text, to give references
to other documents.

[ ]

In syntax descriptions and function definitions, brackets indicate
items that are optional.
In syntax descriptions and function definitions, a horizontal
ellipsis indicates that the preceding item can be repeated one or
more times.

cat(1)

Cross-references to the ULTRIX Reference Pages include the
appropriate section number in parentheses. For example, a
reference to cat(l) indicates that you can find the material on the
cat command in Section 1 of the reference pages.

IRETURNI

This symbol is used in examples to indicate that you must press
the named key on the keyboard.

viii About This Manual

Internationalization Overview

The ULTRIX Internationalization package provides tools and functions to enable the
internationalization of programs and the environment in which they operate.
Internationalization is the process of designing or adapting programs to meet
international requirements, such as those of multiple local languages and the specific
character sets associated with them.

1.1 The Purpose of Internationalization
An internationalized program:
•

Allows users to interact with the program in their own languages

•

Reflects the culture of the users' regions

Conventions for representing cultural data can vary from one country to another and
from region to region within a single country. For example:
•

Number representation
England and France both represent numbers using radix characters and commas,
but these symbols are interchanged (2,345.77 in England and 2.345,77 in
France).

•

Currency symbols
In Italy five thousand Lire is represented by L. 5.000 and in Greece five
thousand drachmae is represented by 5,000 Dr.

•

Date order
October 7, 1986 would be represented as 10/7/1986 in the U.S., 7.10.1986 in
Germany, and 1986/10/7 in Japan.

•

Codeset
In Switzerland, a program might have to run using Italian, German, and French
codesets (modified for local use).

You can meet these internationalization requirements by writing programs that make
no assumptions about language, local customs, or coded character sets. Such a
program is said to be internationalized. Data specific to any particular language,
including cultural data, and the codeset, are held separate from the program logic.
The process of establishing such data is referred to as localization. Run-time
facilities bind a program to the appropriate language for its message text.

1.2 The ULTRIX Internationalization Solution
The ULTRIX Internationalization solution consists of:
•

Message catalogs and associated tools

•

A set of library routines

•

Internationalized interface definitions of standard C library routines

•

An announcement mechanism

•

Language support databases

•

An international compiler for the database

The message catalogs are simple databases that enable the program messages to be
held externally to the program. The tools are used to assist in the extraction and
translation from one language to another of the message text, and to generate
message catalogs.
The set of library routines enables programs to determine cultural and languagespecific data dynamically (for example the format of date and time strings, day and
month names, currency symbols, radix character symbols).
The internationalized interface definitions provide language-dependent character type
classification, conversion from uppercase to lowercase and lowercase to uppercase,
date and time messages, floating point to string conversions, and text collation.
The announcement mechanism identifies the national language, local custom
(territory), and codeset requirements (referred to as "language" in the remainder of
the guide) appropriate to each user for applications at runtime. Language support
databases contain the tables that hold the language-specific data, with one database
for each supported language.
The international compiler (i c) supplied with the internationalization package
compiles the source languages information into the language support databases.

1.2.1

International Keyboard Support
Programmers writing applications that support several languages must take into
account that languages are represented within the system by the characters of one or
more coded character sets. Because of the requirements of different languages, the
coded character sets may vary in both size and representation.
In the international environment, you need to use characters that your coded character
set does not use. You can create characters that do not exist as standard keys on your
keyboard by using compose sequences. A compose sequence is a series of keystrokes
that creates a character. You can create any character from the character set your
terminal or DECterm session (if you are using ULTRIX Worksystem Software) is
currently using.
Depending on your keyboard, you compose characters in either of the following
ways:
•

You use three-stroke sequences for a VT320 keyboard

•

You use two-stroke sequences on all keyboards except the North
American/United Kingdom, the Dutch, and the Norwegian/Danish keyboards,
which all use three-stroke sequences.

1-2 Internationalization Overview

For more information on composing characters, see the hardware manual that came
with your terminal or the DECwindows Desktop Applications Guide if you are using
ULTRIX Worksystem Software.

Internationalization Overview 1-3

The Message Catalog System

The message catalog system allows users to interact with the program in their
language. The program message text is stored in a message catalog separate from the
main body of the program. Message catalogs can be translated into several languages
to meet the language requirements of each user.
This chapter describes:
•

How to create a message catalog

•

How to use the string extraction tools to extract text strings from a program
source file, and to replace the extracted strings with library routines

•

The format of the message text source file produced

•

How to translate the message text strings in the message text source file

•

How to use gencat to produce a message catalog containing the translated
messages

Accessing a message catalog is covered in Section 2.5.1. The access mechanism
retrieves a message catalog at run-time and binds it to a particular program. Each
internationalized program contains a number of library routines. The library routines
retrieve the message text from the message catalog. Library routines are described in
Section 2.5.2.
The routine used for accessing the opened catalogs is catgets. This routine
retrieves messages from a message catalog opened by a call to catopen. The
routine catclose closes an open message catalog.
In the message catalog system, message source files are suffixed by a . IDS f and
message catalog files are suffixed by a . cat.

2.1

Creating a Message Catalog
To create a message catalog:
1.

Write the program, including the program messages.

Use the string extraction tools to extract the message text and put it in a
message text source file.

Translate the message text source file into the required national languages using
trans.

Pass the message text source files through gencat to create the message
catalogs.

All these steps are described in this chapter. Any text editor can be used to create the
program source file.

You can combine Steps 1 and 2 if the source program includes the calls to the
message catalog retrieval functions. In this case, the catgets or catgetrnsg
routines should be included in the source file as appropriate. The message text string
can then be extracted using a stream editor and stored in the message text source file.
Message catalogs can be divided into one or more sets of program messages, each set
containing one or more messages. The library routines allow programs to access
messages within message sets.
The internationalization tools used to create a message catalog are:
•

extract for interactive message string extraction

•

strextract for batch message string extraction

•

strrnerge for batch message source file merging (used in conjunction with
strextract)

•

trans translation tool

•

gencat message catalog generator

Each of these tools is described on the relevant reference page, for example
extract(lint) utility. The information in this section about these tools supplements
that contained on the reference pages.

2.2 String Extraction
You can use the string extraction tools to partially automate the process of
internationalizing a C program. For example, you could use them to change the
following segment from a C program:
printf("hello world\n");

into
printf(catgets(cat, 1, 1, "hello world\n"));

and the corresponding message text source into
$quote "
$set 1
1 "hello world\n"

There are two ways to extract text strings from a particular program source file and to
replace the extracted strings with library routines:
•

Use the interactive extraction tool extract on its own

•

Use the batch extraction tool strextract followed by the batch merging tool
strrnerge

In both cases the extracted message text is stored in a message source file suffixed
.rnsf. The message text can then be translated using the trans translation tool.
The translated messages in the source file are submitted to gencat to generate a
message catalog. At run-time, the library routines in the internationalized program
retrieve the translated text from the message catalog.
The interactive and batch methods of string extraction use the following files:
•

A pattern file

2-2 The Message Catalog System

•

An optional ignore file

•

An internationalized source program file (prefixed n 1 ) that is generated during
the internationalization process
-

•

An intermediate file (suffixed. rnsg) that is created in your directory and that
can be referenced by other utilities

•

A message text source file containing the extracted and translated text strings
(suffixed . rns f) generated during the internationalization process

The pattern file is used to determine which strings are matched for the program being
internationalized. This system-wide file is used by the extraction tools. Pattern files
are described on the patterns(5int) reference page and in the file
/usr/lib/intln/patterns.
The ignore file is used to instruct the string extraction tools to ignore specific strings
in the source file. Each line in the ignore file contains a single string which is
compared against the strings matched by the pattern file.
The format of the message text source file is described in Section 2.3. The use of the
gencat tool is described in Section 2.4 and the gencat(lint) reference page.
The string extraction tools produce these files:
•

An internationalized program source file that has had the text strings removed
and replaced with calls to a message catalog access routine

•

A message text source file, containing the text strings removed from the original
program source file, for use as input to gencat after translation of the text

2.3 Format of the Message Text Source File
This section describes the format of a message text source file. Message text strings
can be specified using either message numbers or mnemonics. Note that the fields of
a message text source line are separated by a single ASCII space or tab character.
Any other ASCII spaces or tabs are considered to be part of the subsequent field.

2.3.1

Set and Message Numbers
Message catalogs can be divided into one or more sets of program messages that are
grouped together by a set number. The set number is a parameter of catgets and
catgetrnsg.
You specify the set number of following messages until the next $set, $delset,
or end-of-file, by using the construct:
$set n comment

The n denotes the set number which must be presented in ascending order within a
single source file but need not be contiguous. Any string following the set number is
treated as a comment. There must be at least one $set directive in a message text
source file before any messages.
If you are using message numbers (numeric format), you delete the entire message set
from an existing message catalog using the construct:
$delset n comment

Any string following the set number is treated as a comment.

The Message Catalog System 2-3

To place comments in the message text source file, type a line beginning with a
dollar symbol ( $) followed by an ASCII space or tab character and then the
comment:
$

comment

To define message numbers, use the construct:
m

message-text

The message-text is stored in the message catalog with message number m and
the set number specified by the last $set directive. If the message-text is
empty, and an ASCII space or tab field separator is present, a null string is stored in
the message catalog.
Note that catgets and catgetmsg do not distinguish between a null message and
an undefined message; in both cases these routines return a pointer to the null string.
Message numbers within a single set need not be contiguous, although they must be
in ascending order. The length of message-text must not exceed the number of
characters specified in the nl_textmax field of the file
/usr/include/limits.h.
You can use an optional quote character c to surround message-text so that
trailing spaces are visible in a message source line. You specify this by:
$quote

By default, or if an empty $quote directive is supplied, no quoting of messagetext is recognized.
If a quote character is defined, all white space between the message number and the
quote is ignored. Empty lines in a message text source file are always ignored.
Text strings can contain the special characters and escape sequences. Escape
sequences recognized by gencat are defined in Table 2-1.

Table 2-1: Escape Sequences Recognized by gencat
Description

Symbol

Sequence

newline
horizontal tab
vertical tab
backspace
carriage return
form feed
backslash
octal value

NL(LF)

\n
\t

\b
\r

CR
FF
\
ddd

\f
\\
\ddd

The escape sequence \ ddd consists of a backslash followed by 1, 2, or 3 octal digits
which specify the value of the desired character. If the character following a
backslash is not one of those specified, the backslash is ignored. You also use a
backslash to continue a string on the following line. Thus, the following two lines
describe a single message string:
1 This line continues \
to the next line

which is equivalent to:

2-4 The Message Catalog System

1 This line continues to the next line

The backslash must be the last character on the line that is to be continued.
Further localization is provided by translating the strings contained in the message
text source file into the required languages, and by using gencat to create the
various language message catalogs.

2.3.2

Mnemonics
Sets and messages can be given mnemonic names as an alternative to set and
message numbers. A mnemonic is defined as any string starting with an alphabetic
character. You cannot use mnemonics together with set and message numbers in the
same source file.
In the following example, the mnemonic SET_I, HELLO and BYE are used instead
of the numbers I, I and 2 respectively:
$set SET_l
HELLO Hello world
BYE Goodbye world

The call
catgets (catd, SET_I, HELLO, "")

would return the message:
Hello world

The -h flag of the gencat tool forces the generation of a header file containing
#define statements. You must include #define statements in the program
source files when you use mnemonics. Using the previous example as a basis, the
following code fragments compare two programs, one using mnemonics and the other
using message numbers:
•

Using mnemonics:
#include "prog.h"

catgets(catd, SET_1, HELLO, "Hello");

•

Using message numbers:

catgets(catd, 1, 1, "Hello");

The Message Catalog System 2-5

The contents of the . ms f message file used by the mnemonic program is of the
form:
$quote "
$set SET_l
HELLO "Hello world"

Note, only the text within the quotes should be translated.
The header file generated using gencat -h contains the following:
#define SET 1 1
#define HELLO 1
#define BYE
2

In all other respects, the use of mnemonics does not change how the
internationalization tools are used.
There are some restrictions on the use of mnemonics:
•

Set and message mnemonics cannot have the same name.

•

Catalogs cannot be merged using gencat. An old catalog is always
overwritten by the new catalog.

•

Mnemonics and set and message numbers cannot be combined in the same
source file.

2.4 Using gencat
The gencat program takes a message text source file and either produces a new
message catalog or merges the new message text into an existing message catalog.

•

If the message catalog has already been created, and set and message numbers
are being used, gencat merges the set and message numbers with the existing
message catalog.

•

If the message catalog does not exist, gencat creates it.

If a message text source file uses mnemonics, gencat does not merge the files. The
new file overwrites the original file.
Set and message numbers are described in Section 2.3.1, and mnemonics are
described in Section 2.3.2.
An example of the use of gencat is:
gencat catfile msgfile

where catfile is the name of the target message catalog and msgfile is the
name of a message text source file. If cat f i 1 e exists, then the messages and sets
defined in InS g f i 1 e are added to cat f i 1 e. If set and message numbers collide,
the new message text given in msgfile replaces the existing message text
contained in catfile. If catfile does not exist, gencat creates it.
The software developer uses the gencat -h to produce the header file defining the
mapping between the mnemonic message identifiers and the numbers required by
catgets and catgetmsg.

2-6 The Message Catalog System

The sequence of operations needed to create an internationalized source file and a
translated message catalog is shown in Figure 2-1.

Figure 2·1: Creating a Message Catalog
Source file
(prog.c)

I ~lIJ ~~

...

--.....
~

Ignore file

strextract

extract

lIJ

- Patterns tile -

source.msg
(prog.msg)

I
Edit.msg tile

I
strmerge

Patterns file

~
~

...

·"I~

I
source.mst
(prog.mst)

nLsource
(nLprog)

translate
(using trans)

I
edit
nLsource

gencat

lIJH~I--__----J1

I
source. cat
(prog.cat)

compiler
(cc)

I
a.out

....

[l] = Internationalization tool use
ZK-0045U-R

The Message Catalog System 2-7

The C program (prog.c) is changed into an internationalized source program
(nl_prog) with the text strings removed and replaced with calls to the message
catalog retrieval routines. This is done by using either the interactive extraction tool
extract, or by using the batch extraction tool strextract followed by the batch
merging tool strrnerge.
The message text source file produced (prog.mst) is translated using the translation
tool trans. A message catalog containing the translated messages (prog.cat) is then
produced using the gencat tool.

2.5 Library Routines
This section describes the library routines used to open and close message catalogs
and to extract information from within an open catalog. The library routines are as
follows:
•
•

cat open
catc10se

•

catgets

To compile a C program, use the -1 i option to include the internationalization
library, as shown in the following example:
cc -0 prog prog.c -Ii

2.5.1

Using catopen
Message catalogs are opened for use by calling the library routine catopen, which
locates the identified message catalog according to the search and naming rules
defined in the environment variable NLSPATH. Refer to environ(5int) for details
of this environment variable. The following shows an example of calling the
catopen routine:
catd = cat open (argv[O], 0);

If successful, catopen returns a catalog-descriptor of type nl_catd which is used
on subsequent calls to catgets and catgetrnsg to identify the prepared message
catalog. Message catalogs are closed by calling the library routine catclose.

2.5.2 Using cat gets
The routine catgets retrieves a numbered message from a numbered message set
in the message catalog identified by the cat d argument. The following shows an
example of calling the ca tget s routine:
char *catgets (catd, set_num, msg_num, s)

In this example, the set _ n urn argument is the number of the message set containing
the message rnsg_nurn, and s is a pointer to the default message string. If
catgets retrieves the message successfully, it returns a pointer to the message text
to the caller. If the call is unsuccessful because the message catalog identified by
catd is unavailable, then catgets returns s. Ifmsg_num is not contained in the
message catalog identified by catd, catgets returns the null string.

2-8 The Message Catalog System

All buffer handling and allocation of storage space (for holding the text of a program
message) is performed internally by catgets. For example, the following C source
program uses catopen and catgets to retrieve messages from the message
catalog identified as prog:
#include <stdio.h>
#include <nl_types.h>
#define NL_SETN 1
main ()
{

nl_catd catd = catopen ("prog", 0);
printf ("%s\n", catgets (catd, NL_SETN, 1, "hello world"»;
catclose (catd);

Default message strings enable the text for one language to be kept with the program
as an aid to readability. Alternatively, they can be used to allow application
programs to continue working predictably when specific localizations of the message
text are unavailable. For example, if the above program were invoked from the c
shell as follows:
$ setenv LANG FRE_FR.8859
$ prog

and assuming that the French message text for prog was undefined on the system,
then the above invocation of prog would cause the default message string to be
displayed:
hello world

2.6 Using trans
The translation tool, trans, assists in the translation of source message catalogs.
The command reads input from file .ms! and writes its output either to a file named
trans .rosf or to a file you name on the command line. The command displays
file .ms! in a multiple window screen that lets you simultaneously see the original
message, the translated text you enter, and any messages from the t ran s command.
A full description of the trans tool and the associated editor is contained on the
trans(lint) reference page.
Message catalogs can also be translated using a standard text editor.

The Message Catalog System 2-9

Program Localization

This chapter discusses the following topics:
•

The announcement mechanism, which announces the language and cultural
requirements of the program to the system

•

The announcement categories

•

How to set the program locale

•

How to set categories to the default defined for the implementation

An internationalized program localizes its run-time behavior for a particular language,
territory, and codeset by establishing the required localization data in the program's
locale. You establish the localization data by calling the set locale library routine,
as shown:
setlocale (category, locale)

The category argument is a constant defined in <locale. h>. The following shows
possible values for category:
LC_ALL

Affects all of the following categories

LC_COLLATE

Affects the behavior of the string collation library routines
strcoll(3) and strxfrm(3)

LC_CTYPE

Affects the behavior of the character-handling library routines
conv(3) and ctype(3)

LC_NUMERIC

Affects the radix and thousands separator character in the
formatted input/output library routines printf(3int) and
scanf(3int). LC_NUMERIC also affects the conversion
library routines atof(3) and ecvt(3)

LC_TIME

Affects the behavior of the time library routine strftime(3)

LC_MONETARY

Affects the currency string in the library routine
nl_langinfo(3int)

The locale argument is a pointer to a character string containing the required setting
of category in the following format:
language[_territory[.codeset]] [@modifier]

You can define language, territory, and codeset for all settings of category, and you
can define an @modifier for all categories except LC_ALL.
The following preset values of locale are defined for all settings of category:
"C"

Specifies the standard environment for the C language. If set locale
is not invoked, the C locale is the default.

Specifies that the setting of the locale is obtained from the corresponding
environment variables. Obtaining the locale setting from environment
variables is fully explained in Section 3.5.

NULL

Directs set locale to query category and return the current setting of
locale. You can use the string set locale returns only as input to
subsequent set locale calls.

To use set locale to obtain the locale for all categories from environment
variables, do the following:
set locale (LC_ALL, flfI)

You can also define a locale setting for a specific category. To define a specific
category, you pass the locale setting directly in the setlocale call, as shown:
set locale (LC_COLLATE, "FRE_FR.MCS")

This example specifies collation appropriate for the Digital Multinational Character
Set (MCS) in France.
If you need to define a category more precisely than is possible using language,
territory, and codeset, you can use @modifier. The following example shows a
category definition that uses @modifier:
set locale (LC_COLLATE, flFRE_FR.8859@CCOLL")

In this example collating is done according to the collation table, CCOLL, defined in
the FRE_FR.8859 database, rather than the default collation table.
Preferably, you can obtain the locale for the LC_COLLATE category from the
corresponding environment variable as follows:
set locale (LC_COLLATE, "")

3.1

The Announcement Mechanism
When an internationalized program is run, the language requirements of the program
must be announced to the system.
You define the environment variable $ {LANG} to identify which language,
territory, codeset, and modifier a program requires. You can define a unique value
of $ {LANG } for each supported language, territory, codeset, and modifier
combination. If you define $ {LANG} settings for different language, territory,
codeset, and modifier settings, each definition might be associated with a different
instance of collating sequence, character conversion, character classification,
langinfo tables, and message catalogs.
The ${LANG} variable contains the required language, territory, codeset, and
modifier names in English as follows:
language[_territory[.codeset] [@modifier]

The length of the entire string should not exceed the value of NL _ LANGMAX located
in /usr / include/ limi ts. h. The set of characters, excluding separators, is
restricted to the ASCII set of alphanumeric characters. Language support databases
and naming conventions are shown in the lang(5int) reference page.

3-2 Program Localization

On its own, language selects the required native language. You can specify
_territory or _territory.codset if you need to be more specific than native language.
The following examples demonstrate defining the LANG variable:
•

Example 1
LANG=FRE

This example selects a database that supports the French native language.
•

Example 2
LANG=FRE FR

This example selects a database that supports the French native language, as it
is spoken in France (rather than Canada).
•

Example 3
LANG=FRE FR.MCS

This example selects a database that supports the French native language, as
spoken in France, and the Digital MCS. You cannot specify the Digital MCS
unless you specify a _territory, in this case "_FR."
If the files FRE and FRE FR are linked to the FRE_FR.MCS database, Example 1,
Example 2, and Example-3 refer to the same database.

For information on creating a language support database, see Chapter 4.

3.2 Announcement Categories
The general announcement mechanism by which users can identify overall
requirements for program localization is provided by the environment variable
${LANG}. This is sufficient when a single localization covers the user's
requirements for text collation, character classification, and message presentation.
Selective modification of the international environment can be achieved by defining
additional environment variables, one for each permitted setting of category, except
LC_ALL. (For more information, see the setlocale(3) reference page.) The
permitted categories are: LC_COLLATE, LC_CTYPE, LC_NUMERIC, LC_TIME
and LC_MONETARY. If any of these are not defined in the current environment,
LANG provides the necessary defaults.
LC_COLLATE, LC_CTYPE, LC_MONETARY, LC_NUMERIC, and LC_TlME are
also defined to accept an additional field, @modi fie r, which enables you to select
a specific instance of localization data within a single category (for example, for
selecting dictionary-ordering of data as opposed to character-ordering of data). For
example, if you want to interact with the system in French, but are required to sort
German text files, you could define LANG and LC_COLLATE as follows:
LANG=Fr_FR
LC_COLLATE=De_DE

You could extend this definition to select, for example, dictionary ordering by using
the @modifier field, as follows:
LC_COLLATE=De_DE@dict

Program Localization 3-3

3.3 Setting the Program Locale
There are three ways to set the program locale using the set locale library routine:
set locale ( category, string)
This usage sets a specific category in the program locale to a specific value of
string, for example;
set locale (LC_ALL, "FRE_FR.MCS");

In this example, all categories of the program locale are set to the locale
corresponding to the string FRE_FR.MCS, or the French language as spoken in
France, using the Digital MCS. The string FRE_FR.MCS is used to locate the
appropriate database. For more information, refer to lang(5int) reference page.
If string does not correspond to a valid setting of locale, set locale returns a
null pointer and the program locale is not changed. Otherwise, set locale
returns the name of the locale.

set locale ( category, "e" )
This usage resets the default environment for the C language.
set locale ( category, " " )
This usage sets category to correspond to the setting of the associated
environment variable and is described in Sections 3.4 and 3.5.
By default, the directory /usr / lib/ intln contains the language support
databases. If you intend to place your language support databases in another
directory, you specify the directory path with the INTLINFO environment variable.

3.4 Setting a Specific Category
This use of setlocale allows one of either LC_COLLATE, LC_CTYPE,
LC_NUMERIC, LC_TIME or LC_MONETARY to be set individually. For example:
set locale (LC_COLLATE, "");

Here, set locale first checks the value of the corresponding environment variable,
$ {LC_COLLATE}. If the value contains the name of a valid locale, setlocale
sets the specified category to that value and returns its name. If the value is invalid,
set locale returns a null pointer and the program locale is not changed.
If the environment variable corresponding to category is not set or is the empty
string, setlocale examines ${LANG}. If ${LANG} is set and contains the name
of a valid locale, that value is used to set category. Otherwise, setlocale returns
a null pointer and the program locale is not changed.
On ULTRIX, the implementation defined default is the C locale.

3.5 Setting all Categories
This use of set locale is similar to that described in Section 3.4, except that here
set locale examines all the environment variables to determine what values to set.
In this case, set locale is called as follows:
set locale (LC_ALL, "")

Here, set locale first checks all the environment variables. If they are valid,
set locale initializes each category to the value of the corresponding environment

3-4 Program Localization

variable. If any environment variable is invalid, set locale returns a null pointer
and the program locale is not changed.
Categories are initialized in the following order, where ${LANG} is used to initialize
category LC_ALL:
1.

LC_ALL

LC_CTYPE

LC_COLLATE

LC_TlME

LC_NUMERIC

LC_MONETARY

Using this scheme, environment variables corresponding to specific categories
override the setting of $ {LANG} .
If a category-specific environment variable is not set, or is set to the empty string,
that category is not overwritten (that is, it assumes the setting of $ {LANG}). If
${LANG} is not set, or is set to the empty string, set locale returns a null pointer

and the program locale is not changed. This is the default.
On ULTRIX, the implementation defined default is the C locale.

3.6 The C Locale
In the C locale, all characters are encoded in 7 bit ASCII. Also, characters are
collated in machine order. The C locale is guaranteed to exist on all X/Open and
POSIX compliant systems. Table 4-3 shows how national language strings are
returned in the C locale.

3.7 Internationalized Program Example
The following is an example of an internationalized C program. This program,
ida t e . c, displays date and time for a specified locale. The associated header and
message files are shown following the source program.
/*

* idate: display date and time in locale specific format

* Sample internationalized application. This program uses the *
* mnemonic format for message catalogs to enhance maintainability *
*/
#include <sys/time.h>
#include <langinfo.h>

#include <locale.h>
#include <nl_types.h>
#include "idate.h"

/* default strings for date/time *
formats, etc.
*/
/* declarations used by set locale */
/* declarations for message catalog system */

/* generated by gencat, contains message *
identifiers */

Program Localization 3-5

struct timeval tp;
struct timezone tpz;
main (argc, argv)
int argc;
char *argv[];
{

char
struct

timestring[50];
tm *tms;

/* open message catalog - look in current directory */
catd = catopen("idate.cat", 0);
/* check command line arguments */
if (argc > 1) {
printf(catgets(catd, IDATE_SET1, USE_MSG, "usage: incorrect\nn));
exit(l);
/* initialize runtime locale */
if (setlocale(LC_TIME, "") == (char *)0) {
printf(catgets(catd, IDATE SET1, LOCALE MSG, nidate: cannot change \
locale - check environment-variables\n"));

/* get time from system clock */
time(&tp.tv_sec);
tms = localtime(&tp.tv_sec)i
/* do 118N conversion */
strftime(timestring, sizeof(timestring), nl_Ianginfo(D_T_FMT), tms);
printf("%s %s\n", catgets(catd, IDATE_SET1, TIME_MSG, \
"Local time: "), timestring);
/* close message catalog */
catclose(catd);

The following is the contents of the header file for idate:
/*
* idate.h: header file created by gencat -h idate.h
* idate.cat idate.msf

*/
#define IDATE SET1
#define USE MSGO
#define LOCALE_MSG
#define TIME MSG:

3-6 Program Localization

0
1
2

/* set name */

The following shows the contents of the message file ida t e . InS f that is used in
conjunction with idate. c:
$ idate.msf
$ This is the sample message file for use with the program
$ idate.c. Note the syntax of each line with a directive.
$ Note also that blank lines are accepted as input
$ When using mnemonic format for messages you are required
$ to use a quote character and to quote each message string.
$ This file can be used as input to the trans utility.
$ trans provides a simple user interface to aid the
$ process of message text translation.
$quote "
$set IDATE_SETl
USE_MSG "usage: idate\n"
LOCALE_MSG "idate: cannot change locale, check environment variables\n"
TIME_MSG: "Local Time: "
$ End of idate.msf

Program Localization 3-7

Language Support Databases

The language support databases are used to hold various language dependent entities,
and to free programs from national language dependencies. There is one language
support database for each national language used on the system. The information in
the language support databases is supplied through database language source files
which enable the national language and codeset characteristics to be defined. The file
comprises definitions for the following:
•

Codeset

•

Property table

•

Collation table

•

String tables

•

Conversion tables

The international compiler converts these tables into an efficient binary representation
suitable for use by run-time functions. The international compiler is described on the
ic (lint) reference page.
The following general considerations apply to the database language source file:
•

The database source should only contain ASCII characters.

•

The source is free format, so "white space" has no significance other than as a
separator for tokens in the input.

•

You can use C-style comments and macro definitions, in particular the
# incl ude and ide fine facilities.

By default, the language support database files are held under /usr / lib/ intln.
The source language and the format of the source files is illustrated in Appendix B.
Example 4-1 shows the basic structure of the source file. All definitions are
terminated with the' 'END." sequence.

Example 4-1: Structure of the Source File
CODE SET ENG GB.MCS
/* -

* codeset definition and default property table
*/

END.
COLLATION :
/*

* default collation table

END.
STRINGTABLE
/*

Example 4-1:

(continued)

* default string table
*/
END.
CONVERSION toupper
/*
* lowercase to uppercase conversion table
*/
END.
CONVERSION tolower
/*
* uppercase to lowercase conversion table
*/
END.

4.1

The Codeset Definition
The codeset defines the valid characters and their properties within the language. For
example, it could specify that "A" is a valid character in the English language,
possessing lowercase and hexadecimal properties.
The definition of the codeset being used starts with the keyword CODESET followed
by the codeset name double letters. For example, e in IS06937 is replaced by the
sequence e'.
Once compilation is successful, the name given to the codeset becomes the name of
the binary file. In most cases, this name is in the following format:
language_[territory[.codeset] [@modifier]]

You can specify the name of the codeset on the i c command line using the -0
option. If you specify a name on the command line, the name you specify supersedes
the name of the codeset in the database source file.
After the keyword assignment, each code is defined by assigning the value of the
code to an identifier. This identifier can be used to reference the code from then on.
This assignment has the form:
Identifier '=' value_list

, . , Properties ] , ; ,

For example:
a = 'a'

: LOWER, HEX;

The value_list is a list of values separated by commas. A value may be given
as a C-style character constant (' '), in octal (Onnn), hexadecimal (Oxnnn), decimal
(nnn), ISO notation (mm/nn), or by giving the name of a previously defined code.
Codes may be either simple or combined. However, several restrictions must be
observed when defining codes in the CODESET section:
•

The list of simple codes must contain all codes from code value OxO up to and
including the code with the highest value defined. The order of definition is not
important, since all code values are sorted into ascending collation order when
the whole codeset definition has been read.

•

The list of simple codes may not contain codes with duplicate code values.

•

There may be up to 2 15 definitions for multi-byte codes. Combined codes need
not have contiguous code values and will be sorted in ascending machine
collation order and construct the "double letter table" in the compiled database.

4-2 Language Support Databases

•

There must be only one definition of a codeset, and that definition must be the
first item in the source file.

The optional properties part of the definition assigns default properties to a code.
If it is not given, the code is assumed to be defined but illegal. This is useful for
languages that do not require all the letters defined in a standard code set. Properties
take the form of a list of keywords separated by commas.
A third kind of statement allowed in the CODESET section is the
(re-)assignment of default properties to an already defined code. This statement takes
the form of
Identifier':' Properties ';'

The use of the # incl ude facility provided in the language is strongly recommended
as most of the codes considered contain common code (for example ASCII or
IS0646) in their lower half. Using a common incl ude file reduces the risk of error
and provides a common name basis for the remainder of the source.

4.2 The Property Table
The property table contains the mapping between characters in the codeset and
classification. Each character code from the coded character set is used to index an
entry in the relevant language property table. Each entry in the property table
contains a series of flags identifying whether a particular language assertion is true or
false. The character may possess any of the following attributes:
•

Undefined

•

Uppercase alphabetic

•

Lowercase alphabetic

•

Punctuation

•

Control

•

Blank

These can be accessed at run time by the ct ype library routines.
There can be more than one property table. Each property table is introduced by the
keyword PROPERTY. The default property table, built along with the code set, has
the predefined name PROP_DFLT. The property table must not be redefined. Names
of property tables must be unique throughout the source.
A statement in the property table takes the form of:
Identifier':' Properties';'

where Identifier designates a defined code and Properties is a list of
properties separated by commas. For example:

c: UPPER, HEX;
Some properties effect the interpretation of characters by various other
internationalization library routines. For example, the property DIPHTONG must be
set for diphthongs to collate correctly as diphthongs, and the property DOUBLE must
be set to recognize correctly the first of a double-letter sequence.

Language Support Databases 4-3

The full1ist of properties is shown in Table 4-1.

Table 4-1: Properties and Character Classification
Property

Character Classification

ARITH
BLANK
CTRL

arithmetic sign
blank character
control character
currency character
diacritical sign
diphthong
double letter
fraction character
illegal character
lowercase letter
miscellaneous symbol
punctuation character
space character
superscript or subscript
uppercase letter

CURENCY
DIACRIT
DIPHTONG
DOUBLE

FRACTION
ILLEGAL
LOWER
MISCEL
PUNCT
SPACE
SUPSUB
UPPER

The corresponding code to the property DOUBLE is constructed from two other
single-byte codes, but it is treated as a single code. This treatment allows the
following:
•

The expansion of 8-bit character sets to allow double letters (for example LI or
11 in Spanish) that collate two-to-one

•

The handling of 8/16 bit codes like IS06937/l, which is the character "e"

The corresponding code to the property DIACRI, for example, is a diacritical sign. If
combined with either UPPER or LOWER, the corresponding code is a diacritical
letter.
The meaning of diphthong in internationalization is somewhat different from the
definition used in the grammar of languages that use diphthongs. Diphthong, for the
purposes of internationalization, is defined as a character for which one-to-two
collation must be used. This implies an interdependence with the collation tables.
The properties of a code can be redefined by the user since oniy the definition in
effect upon reaching the end of the property table will be put in the binary file.
A code with no defined property will be listed as ILLEGAL in the resulting property
table.

4.3 The Collation Table
Collation tables define the collating sequence for each supported language. The
binary values of characters in the associated coded character-set are used as indices
into the table. Individual entries are used to indicate the relative position of that
character in the language collating sequence. The package supports the following:

4-4 Language Support Databases

•

One-to-one character mappings, such that "a" collates before "b," and so on.

•

One-to-two character mappings, where certain characters are treated as two
characters. For example, the German sharp "s," becomes "ss" for collating.

•

Two-to-one character mappings, where certain character sequences are treated as
a single character in the collating sequence. For example, "ch" and "11" in
Spanish are collated after "c" and "1" respectively.

•

No preference characters, where certain characters are ignored by the collating
sequence. For example, if "-" is defined as a no preference character, then the
strings "re-locate" and "relocate" are equal.

These capabilities provide support for collating algorithms which cater for case and
accent priority, where for example, two characters are first compared for equality,
ignoring accents, and if equal are then ordered by accent sequence. Collating
algorithms of this type gives a dictionary ordering of data. The dictionary ordering
of data within the internationalization package is the same as for a normal dictionary
in the language being considered. Telephone book ordering is the same as for a
telephone directory in the supported language. It should be noted that both dictionary
and telephone book ordering may be subject to local variation.
The default collation table is introduced by the keyword COLLATION, and is named
COLL_DFLT. The default table must exist for ic to compile the database. Other
collation tables can be introduced by the keyword COLLATION, followed by the
name of the table. Names of collation tables must be unique throughout the source.
A statement in the collation section may take one of the following forms:
•

PRIMARY ':' Ident_list ';' for example,
PRIMARY: a, A, b, B;

•

PRIMARY ':' Ident '-' Ident ';' for example,
PRIMARY: a-z;

•

PRIMARY':' REST ';' for example,
PRIMARY: REST;

•

EQUAL ':' Ident_list ';'

for example,

EQUAL: a,A;

•

Ident '=' '(' Ident ',' Ident ')' ';' for example,
PRIMARY: ae = (a, e);

•

PROPERTY':' Property_table_name ';' for example,
PROPERTY: newprop;

The order of statements in the collation section is significant. All of the statements
(except the last) open a new class of codes with primary and secondary weights. The
primary weight is set by the position of the PRIMARY or EQUAL statement, with
all the codes named in the statement having the same primary weight. For example,
the sixth PRIMARY statement in a collation section would assign the primary weight
6 to all the codes listed. Primary weights start at 1 and increase by one for each
statement encountered up to a limit of 254. The secondary weight of the codes. is
governed by their ordering within a set, except codes with an EQUAL statement,
Language Support Databases 4-5

which all have the same secondary weight. The limit on secondary weights is 255.
The statement PRIMARY':' Ident_list ';' assigns the named codes ascending
secondary weights from left to right.
The statement PRIMARY ':' Ident '-' Ident ';' assigns ascending secondary weights
for ascending machine collation order to the named codes.
The statement PRIMARY':' REST ';' sets the primary weight of codes not explicitly
named in the collation section. The secondary weight of the codes is set to ascending
machine collation order. This is a convenient notation for defaulting unspecified
codes to collate after or before all others.
The statement EQUAL ':' Ident_list assigns the same PRIMARY and SECONDARY
weight to all codes in the list.
The statement Ident '=' '(' Ident ',' Ident ')' ';' is reserved for the collation of
diphthongs (one-to-two collation). It implies that the left hand side code collates as
if it were the first right hand code followed by the second right hand code.
In order for the diphthong collation to work correctly, the code named on the left
hand of the statement must be marked as DIPHTONG in at least one property table.
If this property table is not the default table, the statement PROPERTY':'
Property_table_name ';' must be used to identify the property table name to the
compiler. This allows the run-time routines to load a collation-only property table for
use with diphthongs.
Table 4-2 gives three examples of primary and secondary weighting. In Example 1,
all the items have the same primary weight, but have ascending secondary weights. In
Example 2, both primary and secondary weights are used to resolve collation. In
Example 3, all the items have the same secondary weight, but have ascending
primary weights.
If the three alphabetic strings:

•

Abc

•

aac

•

Bbc

were collated using the three examples in Table 4-2, the results would be as follows:
•

Example 1: Abc, aac, Bbc

•

Example 2: aac, Abc, Bbc

•

Example 3: Abc, aac, Bbc

Note that Example 2 is the only way to obtain dictionary collation. Of Examples 1
and 3, Example 3 is the most efficient since only one pass is required. Collation is
resolved on primary weighting, then secondary weighting.

4-6 Language Support Databases

Table 4-2: Examples of Primary and Secondary Weighting

secondary
primary

Example 1
1
2
3
B
1
A
a

4
b

5
C

6
c

Example 2
1
2
1
A
a
2
B
b
C
c
3
Example 3
1
1
A
2
a
B
3
4
b
C
5
6
c

If a code is not given weights in the collation section, it is treated as having the
(otherwise illegal) primary and secondary weight 0 (zero). This results in the code
collating as a "don't care" character.
Double letters (2-to-l collation) must be named in the codeset. They can then be
given a weight in the collation section.
For some examples on collation sequences, refer to Appendix B.

4.4 The String Table
The string table contains the language strings required for formatting date and time,
yes and no, and radix characters. The default string table is introduced by the
keyword STRINGTABLE, and is named STRG_DFLT. The default string table
must exist for i c to compile the database. Other string tables can be introduced by
the keyword STRINGTABLE, followed by the table name. However, names of
string tables must be unique throughout the source.
Each statement in a string table has the form:
Ident '=' value_list' i'

where Ident is an identifier, the name of the string and value_list is a comma
separated list of strings, character constants, and identifiers designating codes. This
allows inclusion of non-ASCII codes in any string table by giving the name of the
code in value list.
Table 4-3 shows the strings that must appear in the string table.

Language Support Databases 4-7

Table 4·3: Mandatory Strings in the String Table
String

Meaning

C locale

Category

NOSTR
YESSTR
D_T_FMT

Negative response
Positive response
Default date and time format

LC_ALL
LC_ALL

D_FMT
T_FMT

Default date format
Default time format

no
yes
%a %b %d
%H:%M:%S %Y
%m/%d/%y
%H:%M:%S

LC_TIME
LC_TIME
LC_TIME

DAY_l
DAY_2

Day name
Day name

Sunday
Monday

LC_TIME
LC_TIME

DAY_7

Day name

Saturday

LC_TIME

ABDAY_l
ABDAY_2
ABDAY_3

Abbreviated day name
Abbreviated day name
Abbreviated day name

Sun
Mon
Tue

LC_TIME
LC_TIME
LC_TIME

ABDAY_7

Abbreviated day name

Sat

LC_TIME

MON_l
MON_2
MON_3

Month name
Month name
Month name

January
February
March

LC_TIME
LC_TIME
LC_TIME

MON_12

Month name

December

LC_TIME

ABMON_l
ABMON_2

Abbreviated month name
Abbreviated month name

Jan
Feb

LC_TIME
LC_TIME

ABMON_12

Abbreviated month name

Dec

LC_TIME

RADIXCHAR
THOUSEP
CRNCYSTR
AM_STR
PM_STR
EXPL_STR
EXPU_STR

Radix character
Thousands separator
Currency format
String for AM
String for PM
Lowercase exponent character
Uppercase exponent character

AM
PM
e
E

LC_NUMERIC
LC_NUMERIC
LC_MONETARY
LC_TIME
LC_TIME
LC_NUMERIC
LC_NUMERIC

4.5 The Conversion Tables
The conversion tables are used to convert characters within the codeset, for example
uppercase converted to lowercase. There must be at least two conversion tables
within the database language source file. These are named toupper and tolower and
are used to convert characters to uppercase and to lowercase respectively.
A statement in a conversion table takes one of three forms in which Ident specifies
a code defined in the codeset, and con ve r s ion_val u e specifies the code or string
value that the left hand side should be converted to.

4-8 Language Support Databases

•

Ident '-> ' conversion_value ';'
For example: a -> A;

•

Ident '-' Ident '->' Ident '-' Ident ';'
For example: a-z -> A-Z;

•

DEFAULT '->' default_value ';'
For example: DEFAULT -> SAME;

The default value for a conversion may be given using the DEFAULT statement. Any
code without a specified conversion, maps to the given value. There are two
predefined values possible in a DEFAULT statement:
•

VOID, which means that all other codes convert to either the ASCII NUL code
(in the case of a code conversion) or to an empty string (in the case of a string
conversion).

•

SAME, which means that a code is converted to itself if there is no explicit
conversion given. This default conversion is not valid for string type
conversions.

The range notation in the conversion section implies an underlying machine collation
sequence and is only valid for code conversions where such a collation sequence is
always defined.
If no DEFAULT clause is given, the default clause is assumed to read
DEFAULT -> VOID ;

Some examples of both types of conversion are given in Appendix B.

Language Support Databases 4-9

Database Source Language Syntax
Description

This appendix describes the database source language you use to create a source file
for a language support database. The appendix explains the syntax elements of the
source files and gives an Extended Backus-Naur Form (EBNF) notation of the syntax
recognized by the ic compiler.

A.1

Rules for Building Identifiers
The rules for building an identifier (Ident) are as follows:
•

Each identifier must start with a letter or a hyphen ( - ).

•

An identifier can be any length and can contain letters (a to z and A to Z), digits
(1 - 9), hyphens ( - ), and periods (.).

•

If you use a period in an identifier, at least one letter, digit, or hyphen must
follow the period.

A.2 Rules for Building Strings
The rules for building a string (String) are as follows:
•

No string can contain more than 255 characters.

•

Each string must be enclosed in quotation marks (" ").

•

Each string must be on one line in the source file.

•

A string can contain the following escape sequences:
\n -

\r \t \b \f \\ \" -

ASCII newline
ASCII return
ASCII tabulator
ASCII backspace
ASCII form feed
escaped backslash
escaped double quotes

A.3 Rules for Building Constants
A constant (Constant) can be any of the following forms:
•

A character constant, such as one character enclosed in single quotation marks
(' '). You can use escape sequences within a character constant by following
the C language rules for using escape sequences. For information on those
rules, see the Guide to VAX C.

•

A hexadecimal constant of the form Oxnnnn, where n designates a hexadecimal
digit (0-9, a to f, and A to F). The hexadecimal constant must be in the range of
o to Ox7FFF. You can omit leading null valued digits.

•

An octal constant of the form Onnnn, where n designates an octal digit (0-7).
The octal constant must be in the range of 0 to 077777. You can omit leading
null valued digits.

•

A character in ISO notation nln, where n designates a decimal number in the
range of 0 to 15.

•

A decimal number n, where n is a positive integer in the range 0 to 32,767.

A.4 Rules for Separating Tokens, Specifying Comments, and
Using Directives
You must separate tokens with spaces or horizontal tabs. You must not include
white space within tokens. White space (for example, " ", newline, horizontal tab)
is significant only as a token separator. The ic compiler ignores white space that
you use to make your source file readable.
As in the C language, comments are delimited by pairs of slashes and asterisks
(!*comment*/). You can include comments anywhere in the source file except within
tokens. If you use a comment within a token, the i c compiler considers the token to
end where the comment begins. Any text that follows the comment begins a new
token.
Because the database source file is preprocessed by the C preprocessor, you can use
the preprocessor directives, such as #include, #define, and #if, thoughout the
source file.

A.S EBNF Description
Example A-I contains the EBNF description of the database source language. If you
are unfamiliar with EBNF notation, you can find a description of it in Compilers,
Principles, Techniques, and Tools.l
The notation in this appendix differs from the description in Compilers, Principles,
Techniques, and Tools in the following ways:
•

In productions, nonterminals on the left side are separated from terminals or
tokens on the right side by a colon (:) instead of an arrow.

•

Terminals appear in single quotation marks (' ') or in uppercase characters,
instead of boldface type.

•

The nonterminals Ident, String, and Constant are not described by a production.
These nonterminals are described by the rules in Section A.l, Section A.2, and
Section A.3, respectively.

1 Alfred V. Abo, Compilers, Principles, Techniques, and Tools (Reading, Mass: Addison-Wesley Publishing Co.,
1986), pp. 26.

A-2 Database Source Language Syntax Description

Example A·1: EBNF Description of the Database Source Language
intl_data_base
: codeset table data tables
data_tables
: data table

data tables data table

property_table
collation_table
format table
conversion_table
codeset table
: CODESET Ident

,:'

code- definition- list END

code_definition_list
code definition
I code_definition_list
code definition
Ident '=' code_value
Ident '=' code_value
property_definition
code_value
: code

, ,

, ,. , code definition
, . , property_list

, , , code

code
: Constant I Ident
property_list
: property I property_list
property_table
: PROPERTY Ident

,:'

, , , property

property_definition_list END

, ,

property_definition_list
property_definition
I property_definition_list ' i' property_definition
property_definition
: Ident ' : ' property_list

property
ARITH I BLANK I CTRL I CURENCY I DIACRIT
DIPHTONG I DOUBLE I FRACTION I HEX I ILLEGAL
LOWER I MISCEL I NUMERAL I PUNCT
SPACE I SUP SUB I UPPER
collation_table
COLLATION':' collation_list END'.'
I COLLATION Ident ' : ' collation list END
collation_list
collation

, ,

collation_list 'i' collation

collation
PRIMARY':' code_value_list
PRIMARY':' Ident '-' Ident
PRIMARY ' : ' REST
EQUAL':' code_value_list

Database Source Language Syntax Description A-3

Example A-1:

(continued)

EQUAL':' Ident '-' Ident
EQUAL ' : ' REST
Ident '=' '('
Ident , , , Ident ')'
PROPERTY':' Ident
code- value- list
: Ident

, , , Ident

format table
STRINGTABLE ' : ' format list END'.'
I STRINGTABLE Ident ' : ' format list END
format list
: format

, ,

format list 'i' format

format
: Ident '=' format value
format value
: code_or_string

format value

,,,

code_or_string
: code I String

conversion table
CONVERSION Ident ' : ' conversion list END'.'
I CODE CONVERSION Ident ' . ' conversion list END
conversion list
: conversion

conversion list 'i' conversion

conversion
DEFAULT '->' default value
Ident '->' conversion value
Ident '-' Ident '->' Ident '-' Ident
default_value
: VOID

SAME I conversion value

conversion value
code_or_string
conversion value

,,,

A-4 Database Source L~hguage Syntax Description
/

, ,

Example Source Language File

Example B-1 illustrates the file structure of a source file for a language support
database. The example omits parts of the source file to save space. To see a
complete database source file, display or print one of the source files in subdirectories
of the / u s r / I ib / in t In directory. For example, the source file for the German
database that uses the ISO Latin 1 codeset is in the
/usr/Iib/intIn/8859/GER_DE. 8859. in file.

Example 8-1: Example of a Language Support Database Source File
/*

* example annotated (partial) source for
* a Language Support Database
*/
CODESET CH ASCIIPLUS :
/*- CH_ASCIIPLUS will be the name of the INTLINFO file */
'include "IS0646"
/* include IS0646 as the predefined ASCII code definition */
/*

* additional definitions for demonstration purposes:
*

* first we have a range of secondary control codes.
* This is not enforced by the ic compiler nor by

* the language but is a common IS 2022 style
* code set extension technique. Note that because
* there are no properties defined below all these

* codes are defined but not legal.
*/
scOO
sc04
sc08
scOc

Ox80;
Ox84;
Ox88;
Ox8c;

sc01
scOS
sc09
scOd

Ox81;
Ox8S;
Ox89;
Ox8d;

sc02
sc06
scOa
scOe

Ox82;
Ox86;
Ox8a;
Ox8e;

sc03
sc07
scOb
scOf

Ox83;
Ox87;
Ox8b;
Ox8f;

* NOTE: this gap in the source will prevent compilation.
* This was done to shorten the example.

* now come some more useful code definitions. These
* definitions are taken from the IS 8859/1
* definition. Note the convention of writing
* uppercase letters in all uppercase, lowercase
* letters and special codes in all lowercase.
* Here the codes are defined directly from their
* ISO notation.

*/
A_GRAVE = 12/0
A AIGU = 12/1
A=CIRCON = 12/2
A TILDE = 12/3
DIA_A = 12/4

UPPER;
UPPER;
UPPER;
UPPER;
UPPER;

Example 8-1:

(continued)

A_CIRCLE = 12/5 : UPPER;

* The following declaration of AE as a diphthong enables

* the correct treatment of diphthongs

(one-to-two

* collation) in the default collation.

*/
AE = 12/6

: UPPER, DIPHTHONG;

* NOTE: this gap in the source will prevent compilation.
* This was done to shorten the example.
*/
/*

* lowercase equivalents of the codes defined
* in the last block
*/
a_grave = 14/0
a_aigu = 14/1
a circon = 14/2
a-tilde = 14/3
di"a_a = 14/4
a circle = 14/5
= 14/6

LOWER;
LOWER;
LOWER;
LOWER;
LOWER;
LOWER;
LOWER, DIPHTHONG;

* special double letters for Spanish

* Note that these "characters" are not defined by
* any standard! They represent an extension
* useful to handle the following problems:

two to one collation
conversions toupper and tolower

L, 1

DOUBLE, UPPER;
DOUBLE, LOWER;

11 = 1, 1
END.

* Collation table that shows most of the possible

* problems in collation but does not make very much

* sense in the real world:
*

* Uppercase and lowercase letters are intermixed and
* within one letter the uppercase comes before the

* lowercase letter.
*
* Accented characters sort after their corresponding
* nonaccented base character.

COLLATION
PRIMARY
PRIMARY
PRIMARY:
PRIMARY:
PRIMARY:
PRIMARY:
PRIMARY:
PRIMARY:

A, A_GRAVE, A_AIGU, A_CIRCON, A_TILDE,
DIA_A, A_CIRCLE;
a, a_grave, a _aigu, a....,;circon, a_tilde,
dia _a, a circle;
B; PRIMARY: b; PRIMARY: C; PRIMARY: C;
D; PRIMARY: d; PRIMARY: E; PRIMARY: e;
F; PRIMARY: f; PRIMARY: G; PRIMARY: g;
H; PRIMARY: h; PRIMARY: I; PRIMARY: ii
J; PRIMARY: j; PRIMARY: K; PRIMARY: k;
L; PRIMARY: 1;

B-2 Example Source Language File

Example 8-1:

(continued)

* TWO-TO-ONE COLLATION:
*

* For Ll and 11 Spanish collation rule says that
* this has to be collated after L or 1.

*/
PRIMARY: Ll; PRIMARY: 11;
PRIMARY: M; PRIMARY: m; PRIMARY: N; PRIMARY: n;

* ONE-TO-TWO COLLATION:

*
* The following two codes are diphthongs, that is

* codes that collate as two characters.
*/
AE = (A, E);
ae

(a, e);

* The rest of the codes defined in the codeset will

* collate as don't care characters.
*/
END.

* This is a sample string table based on the German language.

* Note the mixed uses of ASCII strings and identifiers
* specified in the codeset definition.

*
* The strings for CRNCYSTR, D_T_FMT, D_FMT, T FMT are
* typically specified as ASCII strings.
*
* Each of the items specified is required by the ic
* compiler. Additional items can be specified if so
* desired.
*/
STRINGTABLE :
NOSTR
EXPL STR
EXPU_STR
RADIXCHAR
THOUSEP
YESSTR
CRNCYSTR

"nein";
, e' ;

'E' ;
comma;
dot;
"ja";
"+DM";

D T FMT
D FMT
T FMT
AM STR
PM STR

"%a, %d. %b %Y %H:%M:%S"
"%a, %d. %b %Y";
"%H:%M:%S";

DAY 1
DAY 3
DAY_5
DAY 7

"Sonntag";
"Dienstag" ;
"Donnerstag";
"Samstag";

DAY 2
DAY 4
DAY_6

"Montag" ;
"Mittwoch";
"Freitag";

ABDAY 1
ABDAY_3

"So";

"Di";

ABDAY 2
ABDAY 4

"Mo";
"Mi";

"AM" ;

"PM" ;

Example Source Language File B-3

Example B-1 :

(continued)

ABDAY 5
ABDAY 7

"Do";
"Sa";

ABDAY 6

"Fr";

MON 1
MON- 3
MON 5
MON 7
MON 9
MON 11

"Januar";
M, dia_ a, "rz";
"Mai";
"Juli";
"September";
"November";

MON 2
MON_ 4
MON 6
MON 8
MON_10
MON- 12

"Februar";
"April";
"Juni";
"August";
"Oktober";
"Dezember";

ABMON 1
ABMON 3
ABMON 5
ABMON_ 7
ABMON 9
ABMON_11
END.

STRINGTABLE :
MON_l = "January";
YESSTR = "oui";
END.

B-4 Example Source Language File

"Jan";
M, dia a, r;
= "Mai";
"Jul";
"Sep";
"Nov";

ABMON 2
ABMON 4
ABMON 6
ABMON 8
ABMON_ 10
ABMON- 12

"Feb" ;
"Apr";
= "Jun";
"Aug";
"Okt" ;
"Dez";

Associated Reference Pages

This appendix gives a list of the ULTRIX reference pages associated with the
Internationalization package.
iconv(l)

International codeset conversion

extract(lint)
gencat(lint)
ic(lint)
strextract(lint)
strmerge( 1int)
trans( 1int)

Interactive string extract and replace
Generate a formatted message catalog
Compiler for language support database
Batch string extraction
Batch string replacement
Translation tool for use with message source files

atof(3)
conv(3)
ctype(3)
ecvt(3)
setlocale(3)
strcoll(3)
strfiime(3)
strxfrm(3)

Convert ASCII to numbers
Translate characters
Character classification macros
Output conversion
Set localization for internationalized program
String collation comparison
Convert time and date to string
String transformation

intro(3int)
catgetmsg(3int)

printf(3int)
scanf(3int)
vprintf(3int)

Introduction to the internationalization subroutines
Get message from a message catalog (Provided for X/Open
XPG-2 conformance)
Read a program message
Open/close a message catalog
Language information
Print formatted output (Provided for X/Open XPG-2
conformance)
Convert formatted input (Provided for X/Open XPG-2
conformance)
Print formatted output
Convert formatted input
Print formatted output of a varargs argument list

printf(3s)
scanf(3s)

Print formatted output
Convert formatted input

environ(5int)
lang(5int)
nl_types(5int)
patterns(5int)

NLS environment variables
Language names
Language support database types
Patterns for use with internationalization tools

catgets(3int)
catopen(3int)
nl_langinfo(3int)
nl_printf(3int)
nl_scanf(3int)

Glossary

This glossary defines a number of technical terms that may be encountered. In some
cases, the terms have not been used in the generally accepted way.

ASCII
American Standard Code for Information Interchange.
ASCII is the traditional ULTRIX coded-character set and defines 128
characters, including both control characters and graphic characters, represented
by 7-bit binary values (see also ISO 646).

Character
A member of a set of elements used for the organization, control, or
representation of text.

Character Set
A set of alphabetic or other characters used to construct the words and other
elementary units of a national language or a computer language.

Coded Character Set
A set of unambiguous rules that establishes a character set and the one-to-one
relationship between each character of the set and its bit representation.

Collating Sequence
The ordering sequence applied to characters or a group of characters when they
are sorted.

Composite Graphic Symbol
A graphic symbol consisting of a combination of two or more other graphic
symbols in a single character position, such as a diacritical mark and a basic
letter.

Control Character
A character, other than a graphic character, that affects the recording,
processing, transmission, or interpretation of text.

Downshifting
The conversion of an uppercase character to its lowercase representation.

Graphic Character
A character, other than a control character, that has a visual representation when
hand-written, printed, or displayed.

Internationalization
The provision within a computer program for adapting to the requirements of
different national languages, local customs, and coded character sets.

ISO 646
ISO 7 -bit coded character set for information interchange. The reference version
of ISO 646 contains 95 graphic characters, which are identical to the graphic
characters defined in the ASCII coded character set.

ISO 6937
ISO 7 -bit or 8-bit coded character set for text communication using public
communication networks, private communication networks, or interchange
media such as magnetic tapes and discs.

ISO 8859/1
ISO 8-bit single-byte coded character set Part 1, Latin Alphabet No.1. The ISO
8859/1 character set comprises 191 graphic characters covering the
requirements of most of Western Europe.

LANG
The environment variable LANG, used to announce the user's requirements for
national language, local customs, and coded character set to the computer
system.

Local Customs
Refers to the conventions of a geographical area or territory for such things as
date, time, and currency formats.

Localization
The process of establishing the run-time environment of an internationalized
computer program to meet the requirements of particular national languages,
local customs, and character sets.
MCS
Digital Equipment Corporation's Multinational Character Set. This is based on
ISO 8859/1. It covers the requirements of most Western European languages
but also includes special computer oriented symbols.

2 Glossary

Message Catalog
A file or storage area containing program messages, command prompts, and
responses to prompts for a particular national language, territory, and codeset.

National Language
A computer user's spoken or written language, such as English, French, Italian,
or Spanish.

NLSPATH
An environment variable used to indicate the search path for message catalogs.

Non-spacing Characters
A character, such as a character representing a diacritical mark in the ISO 6937
coded character set, which is used in combination with other characters to form
composite graphic symbols.

Radix Character
The character that separates the integer part of a number from the fractional
part.

Upshifting
The conversion of a lowercase character to its uppercase representation.

Glossary 3

Index

atof library routine, 3-1

ecvt library routine, 3-1
extract command, 2-2, 2-8, 2-2

c
catclose library routine, 2-8, 2-1
catgetmsg library routine, 2-4, 2-6, 2-2
catgets library routine, 2-4, 2-6, 2-8, 2-2
catopen library routine, 2-8, 2-1
codeset definition
combined codes, 4-2
identifiers, 4-2
keyword assignment, 4-2
restrictions, 4-2
simple codes, 4-2
collation table
one-to-one character mappings, 4-5
one-to-two character mappings, 4-5

G
gencat command, 2-3, 2-5, 2-6
creating, 2-1

ic command, 4--1, 4-2, 4--5, 4--7
internationalization
conversion table, 4--8
defined, 1-1
keyboard support, 1-2
list of associated reference pages, C-l
program example, 3-5
purpose of, 1-1

ordering of statements, 4-5
two-to-one character mappings, 4-5

cony library routine, 3-1
ctype library routine, 3-1,4-3

M
message catalog
creating, 2-1
message text source file
deleting message sets, 2-3

database language source file
ASCII characters, 4-1
C-style comments, 4--1
macro definitions, 4-1
white space, 4-1

example code fragments, 2-5
specifying mnemonics, 2-5
specifying set numbers, 2-3

N
nl_langinfo library routine, 3-1

p
printf library routine, 3-1

s
scanf library routine, 3-1
setlocale library routine, 3-1, 3-3, 3-4
source language file, B-1e
strcoll library routine, 3-1
strextract command, 2-2, 2-8, 2-2
strftime library routine, 3-1
string extraction