UNICODE(7) Linux Programmer's Manual UNICODE(7)
NAME
Unicode - the unified 16-bit super character set
DESCRIPTION
The international standard ISO 10646 defines the Universal
Character Set (UCS). UCS contains all characters of all
other character set standards. It also guarantees round-
trip compatibility, i.e., conversion tables can be built
such that no information is lost when a string is con-
verted from any other encoding to UCS and back.
UCS contains the characters required to represent almost
all known languages. Apart from the many languages that
use extensions of the Latin script, this includes the
following scripts and languages: Greek, Cyrillic, Hebrew,
Arabic, Armenian, Georgian, Japanese, Chinese, Hiragana,
Katakana, Korean, Hangul, Devanagari, Bengali, Gurmukhi,
Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai,
Lao, Bopomofo, and a number of others. Work is going on to
include further scripts like Tibetan, Khmer, Runic,
Ethiopian, Hieroglyphics, various Indo-European languages,
and many others. For most of these latter scripts, it was
not yet clear how they could best be encoded when the
standard was published in 1993. In addition to the
characters required by these scripts, a large number of
graphical, typographical, mathematical and scientific
symbols like those provided by TeX, PostScript, MS-DOS,
Macintosh, Videotext, OCR, and many word processing
systems have been included, as well as special codes that
guarantee round-trip compatibility with all other existing
character set standards.
The UCS standard (ISO 10646) describes a 31-bit character
set architecture. However, today only the first 65534 code
positions (0x0000 to 0xfffd), which are called the Basic
Multilingual Plane (BMP), have been assigned characters,
and it is expected that only very exotic characters (e.g.,
Hieroglyphics) for special scientific purposes will ever
get a place outside this 16-bit BMP.
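The range limits above can be expressed as two small checks.
This is only a sketch: the names ucs_in_bmp, ucs_is_valid and
UCS_BMP_LAST are hypothetical and not part of any library.

```c
#include <stdint.h>

/* Last code position that the text above counts as part of the BMP. */
#define UCS_BMP_LAST 0xfffdu

/* Returns 1 if the value lies in the 16-bit BMP range, else 0. */
int ucs_in_bmp(uint32_t code)
{
    return code <= UCS_BMP_LAST;
}

/* Returns 1 if the value fits the 31-bit UCS code space, else 0. */
int ucs_is_valid(uint32_t code)
{
    return code <= 0x7fffffffu;
}
```

With these definitions, ucs_in_bmp(0xfffd) returns 1 and
ucs_in_bmp(0xfffe) returns 0.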
The UCS characters 0x0000 to 0x007f are identical to those
of the classic US-ASCII character set and the characters
in the range 0x0000 to 0x00ff are identical to those in
the ISO 8859-1 Latin-1 character set.
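Because the first 256 UCS positions coincide with ISO 8859-1,
converting Latin-1 text to UCS needs no table, only a widening
of each byte. The following sketch illustrates this; the
function names and the choice of uint16_t for BMP codes are
assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Widen n Latin-1 bytes into 16-bit UCS codes. Each byte value is
 * already the UCS code of the same character. */
void latin1_to_ucs(const unsigned char *src, uint16_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Narrow UCS codes back to Latin-1. Stops at the first code above
 * 0x00ff and returns the number of characters converted. */
size_t ucs_to_latin1(const uint16_t *src, unsigned char *dst, size_t n)
{
    size_t i;

    for (i = 0; i < n && src[i] <= 0xff; i++)
        dst[i] = (unsigned char)src[i];
    return i;
}
```

The reverse direction can fail, since most UCS characters have
no Latin-1 equivalent, which is why ucs_to_latin1 reports how
far it got.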
COMBINING CHARACTERS
Some code points in UCS have been assigned to combining
characters. These are similar to the non-spacing accent
keys on a typewriter. A combining character just adds an
accent to the previous character. The most important
accented characters have codes of their own in UCS;
however, the combining character mechanism allows accents
and other diacritical marks to be added to any character. The
combining characters always follow the character which
they modify. For example, the German character Umlaut-A
("Latin capital letter A with diaeresis") can either be
represented by the precomposed UCS code 0x00c4, or alter-
natively as the combination of a normal "Latin capital
letter A" followed by a "combining diaeresis": 0x0041
0x0308.
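The equivalence of the two representations can be illustrated
with a lookup that maps a base character plus combining mark to
its precomposed code. This is a minimal sketch: the table holds
only three hand-picked entries, and the names compose_table and
ucs_compose are hypothetical. A real implementation would need
the complete composition data of the standard.

```c
#include <stddef.h>
#include <stdint.h>

struct compose_entry {
    uint16_t base;        /* base character */
    uint16_t mark;        /* combining character that follows it */
    uint16_t precomposed; /* equivalent single UCS code */
};

/* Illustrative subset only: capital A, O and U with diaeresis. */
static const struct compose_entry compose_table[] = {
    { 0x0041, 0x0308, 0x00c4 },
    { 0x004f, 0x0308, 0x00d6 },
    { 0x0055, 0x0308, 0x00dc },
};

/* Returns the precomposed code for base followed by mark, or 0 if
 * the table has no entry for that pair. */
uint16_t ucs_compose(uint16_t base, uint16_t mark)
{
    size_t n = sizeof compose_table / sizeof compose_table[0];

    for (size_t i = 0; i < n; i++)
        if (compose_table[i].base == base &&
            compose_table[i].mark == mark)
            return compose_table[i].precomposed;
    return 0;
}
```

For the example from the text, ucs_compose(0x0041, 0x0308)
returns 0x00c4.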
IMPLEMENTATION LEVELS
As not all systems are expected to support advanced mecha-
nisms like combining characters, ISO 10646 specifies the
following three implementation levels of UCS:
Level 1 Combining characters and Hangul Jamo characters
(a special, more complicated encoding of the
Korean script, where Hangul syllables are coded
as two or three subcharacters) are not supported.
Level 2 Like level 1, except that in some scripts, some
combining characters are now allowed (e.g., for
Hebrew, Arabic, Devanagari, Bengali, Gurmukhi,
Gujarati, Oriya, Tamil, Telugu, Kannada,
Malayalam, Thai and Lao).
Level 3 All UCS characters are supported.
The Unicode 1.1 standard published by the Unicode
Consortium contains exactly the UCS Basic Multilingual
Plane at implementation level 3, as described in ISO
10646. Unicode 1.1 also adds semantic definitions for some
characters to the definitions of ISO 10646.
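A program that accepts only level 1 text has to reject combining
characters and Hangul Jamo. The sketch below shows the idea
under a loud assumption: it tests only two code ranges, the
combining diacritical marks (0x0300 to 0x036f) and the Hangul
Jamo (0x1100 to 0x11ff). These ranges are not taken from this
page, combining characters of other scripts are not covered, and
the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the code falls in one of the two ranges this sketch
 * treats as not permitted at level 1. Assumption: only the general
 * combining diacritical marks and the Hangul Jamo are checked. */
int ucs_breaks_level1(uint16_t code)
{
    return (code >= 0x0300 && code <= 0x036f) ||
           (code >= 0x1100 && code <= 0x11ff);
}

/* Returns 1 if none of the n codes in s is rejected by
 * ucs_breaks_level1(), else 0. */
int ucs_string_is_level1(const uint16_t *s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (ucs_breaks_level1(s[i]))
            return 0;
    return 1;
}
```

Under this check the precomposed code 0x00c4 passes, while the
sequence 0x0041 0x0308 is rejected.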
UNICODE UNDER LINUX
Under Linux, only the BMP at implementation level 1 should
be used at the moment, in order to keep the implementation
complexity of combining characters low. The higher imple-
mentation levels are more suitable for special word pro-
cessing formats, but not as a generic system character
set. On Linux, the C type wchar_t is an unsigned 16-bit
integer type, and its values are interpreted as UCS level 1
BMP codes.
The locale setting specifies whether the system character
encoding is, for example, UTF-8 or ISO 8859-1. Library
functions like wctomb, mbtowc, or wprintf can be used to
transform the internal wchar_t characters and strings into
the system character encoding and back.
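The following sketch shows how wctomb and mbtowc might be used
to convert whole strings between wchar_t and the encoding of the
current locale. The wrapper names wide_to_multibyte and
multibyte_to_wide are hypothetical. The caller is assumed to
have selected a locale already, for example with
setlocale(LC_CTYPE, ""); without that call the default "C"
locale applies.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Convert a wide string to the locale's multibyte encoding.
 * Returns the number of bytes written, not counting the final NUL,
 * or (size_t)-1 if a character cannot be represented or dst is
 * too small. */
size_t wide_to_multibyte(const wchar_t *src, char *dst, size_t dstsize)
{
    char tmp[MB_LEN_MAX];
    size_t used = 0;

    for (; *src != L'\0'; src++) {
        int len = wctomb(tmp, *src);

        /* Keep one byte free for the terminating NUL. */
        if (len < 0 || used + (size_t)len >= dstsize)
            return (size_t)-1;
        memcpy(dst + used, tmp, (size_t)len);
        used += (size_t)len;
    }
    if (used >= dstsize)
        return (size_t)-1;
    dst[used] = '\0';
    return used;
}

/* Convert a multibyte string to wide characters. dstlen counts
 * wchar_t elements. Returns the number of characters written, not
 * counting the final L'\0', or (size_t)-1 on an invalid sequence
 * or if dst is too small. */
size_t multibyte_to_wide(const char *src, wchar_t *dst, size_t dstlen)
{
    size_t used = 0;

    while (*src != '\0') {
        wchar_t wc;
        int len = mbtowc(&wc, src, strlen(src));

        /* Keep one element free for the terminating L'\0'. */
        if (len <= 0 || used + 1 >= dstlen)
            return (size_t)-1;
        dst[used++] = wc;
        src += len;
    }
    if (used >= dstlen)
        return (size_t)-1;
    dst[used] = L'\0';
    return used;
}
```

The standard functions wcstombs and mbstowcs perform the same
string conversions; the loops are spelled out here only to show
the per-character calls.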
PRIVATE AREA
In the BMP, the range 0xe000 to 0xf8ff will never be
assigned any characters by the standard and is reserved
for private usage. For the Linux community, this private
area has been subdivided further into the range 0xe000 to
0xefff which can be used individually by any end-user and
the Linux zone in the range 0xf000 to 0xf8ff where exten-
sions are coordinated among all Linux users. The registry
of the characters assigned to the Linux zone is currently
maintained by H. Peter Anvin <Peter.Anvin@linux.org>,
Yggdrasil Computing, Inc. It contains some DEC VT100
graphics characters missing in Unicode, gives direct
access to the characters in the console font buffer and
contains the characters used by a few advanced scripts
like Klingon.
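The subdivision of the private area can be captured in a small
classification function. This is a sketch only; the enum and
function names are hypothetical, and only the range boundaries
come from the text above.

```c
#include <stdint.h>

enum ucs_private_zone {
    UCS_NOT_PRIVATE,   /* outside 0xe000 to 0xf8ff */
    UCS_END_USER_ZONE, /* 0xe000 to 0xefff, free for any end-user */
    UCS_LINUX_ZONE     /* 0xf000 to 0xf8ff, coordinated Linux zone */
};

/* Classify a BMP code according to the private area subdivision. */
enum ucs_private_zone ucs_classify_private(uint16_t code)
{
    if (code >= 0xe000 && code <= 0xefff)
        return UCS_END_USER_ZONE;
    if (code >= 0xf000 && code <= 0xf8ff)
        return UCS_LINUX_ZONE;
    return UCS_NOT_PRIVATE;
}
```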
LITERATURE
* Information technology - Universal Multiple-Octet Coded
Character Set (UCS) - Part 1: Architecture and Basic
Multilingual Plane. International Standard ISO 10646-1,
International Organization for Standardization, Geneva,
1993.
This is the official specification of UCS. Pretty offi-
cial, pretty thick, and pretty expensive. For ordering
information, check www.iso.ch.
* The Unicode Standard - Worldwide Character Encoding Ver-
sion 1.0. The Unicode Consortium, Addison-Wesley, Read-
ing, MA, 1991.
Unicode 1.1.4 is already available. The changes to
the 1.0 book are available from ftp.unicode.org. Unicode
2.0 will be published again as a book in 1996.
* S. Harbison, G. Steele. C - A Reference Manual. Fourth
edition, Prentice Hall, Englewood Cliffs, 1995, ISBN
0-13-326224-3.
A good reference book about the C programming language.
The fourth edition now also covers the 1994 Amendment 1
to the ISO C standard (ISO/IEC 9899:1990) which adds a
large number of new C library functions for handling
wide character sets.
BUGS
At the time when this man page was written, the Linux libc
support for UCS was far from complete.
AUTHOR
Markus Kuhn <mskuhn@cip.informatik.uni-erlangen.de>
SEE ALSO
utf-8(7)
Linux 1995-12-27 1