UNICODE(7)          Linux Programmer's Manual          UNICODE(7)

NAME
       Unicode - the unified 16-bit super character set

DESCRIPTION
       The international standard ISO 10646 defines the Universal
       Character Set (UCS).  UCS contains all characters  of  all
       other  character  set standards. It also guarantees round-
       trip compatibility, i.e., conversion tables can  be  built
       such  that  no  information  is lost when a string is con-
       verted from any other encoding to UCS and back.

       UCS contains the characters required to represent almost
       all known languages. In addition to the many languages
       that use extensions of the Latin script, this includes
       the following scripts and languages: Greek, Cyrillic,
       Hebrew, Arabic, Armenian, Georgian, Japanese, Chinese,
       Hiragana, Katakana, Korean, Hangul, Devanagari, Bengali,
       Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada,
       Malayalam, Thai, Lao, Bopomofo, and a number of others.
       Work is under way to include further scripts such as
       Tibetan, Khmer, Runic, Ethiopic, hieroglyphics, various
       Indo-European languages, and many others. For most of
       these latter scripts, it was not yet clear how best to
       encode them when the standard was published in 1993. In
       addition to the characters required by these scripts, a
       large number of graphical, typographical, mathematical,
       and scientific symbols such as those provided by TeX,
       PostScript, MS-DOS, Macintosh, Videotext, OCR, and many
       word processing systems have been included, as well as
       special codes that guarantee round-trip compatibility
       with all other existing character set standards.

       The UCS standard (ISO 10646) describes a 31-bit character
       set architecture. However, today only the first 65534 code
       positions (0x0000 to 0xfffd), which are called the Basic
       Multilingual Plane (BMP), have been assigned characters,
       and it is expected that only very exotic characters (e.g.,
       hieroglyphics) for special scientific purposes will ever
       get a place outside this 16-bit BMP.

       The UCS characters 0x0000 to 0x007f are identical to those
       of the classic US-ASCII character set and  the  characters
       in  the  range  0x0000 to 0x00ff are identical to those in
       the ISO 8859-1 Latin-1 character set.
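
       Because Latin-1 occupies the first 256 UCS code positions,
       converting ISO 8859-1 text to 16-bit UCS codes amounts to
       widening each byte. The following is a minimal sketch of
       that idea; the function names latin1_to_ucs2 and
       ucs_is_ascii are hypothetical and are not part of any
       library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: widen n ISO 8859-1 bytes to 16-bit UCS codes.
 * Latin-1 maps onto 0x0000-0x00ff, so each byte value is already
 * its own UCS code. Returns the number of codes written. */
size_t latin1_to_ucs2(const unsigned char *src, size_t n, uint16_t *dst)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
    return n;
}

/* Hypothetical helper: nonzero if the UCS code is also a US-ASCII code. */
int ucs_is_ascii(uint16_t c)
{
    return c <= 0x007f;
}
```

       The reverse direction is only lossless for codes up to
       0x00ff; anything above that has no Latin-1 equivalent.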

COMBINING CHARACTERS
       Some code points in UCS have been assigned to combining
       characters. These are similar to the non-spacing accent
       keys on a typewriter. A combining character just adds an
       accent to the previous character. The most important
       accented characters have codes of their own in UCS;
       however, the combining character mechanism makes it
       possible to add accents and other diacritical marks to any
       character. The combining characters always follow the
       character which they modify. For example, the German
       character Umlaut-A ("Latin capital letter A with
       diaeresis") can be represented either by the precomposed
       UCS code 0x00c4, or alternatively as the combination of a
       normal "Latin capital letter A" followed by a "combining
       diaeresis": 0x0041 0x0308.
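
       To illustrate the two representations, the sketch below
       rewrites a base letter followed by a combining diaeresis
       into its precomposed code. It assumes a tiny hand-written
       table covering only six Latin-1 letters; the table and the
       function name compose_diaeresis are hypothetical, and a
       real implementation would need the full composition data
       from the standard.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define COMBINING_DIAERESIS 0x0308

/* Illustrative table only: base letter -> precomposed letter. */
static const uint16_t umlaut_pairs[][2] = {
    { 0x0041, 0x00c4 }, /* A */
    { 0x004f, 0x00d6 }, /* O */
    { 0x0055, 0x00dc }, /* U */
    { 0x0061, 0x00e4 }, /* a */
    { 0x006f, 0x00f6 }, /* o */
    { 0x0075, 0x00fc }, /* u */
};

/* Hypothetical helper: compose "base + combining diaeresis" pairs
 * in place. Pairs whose base is not in the table are left as they
 * are. Returns the new length of the string. */
size_t compose_diaeresis(uint16_t *s, size_t n)
{
    const size_t pairs = sizeof umlaut_pairs / sizeof umlaut_pairs[0];
    size_t out = 0;

    assert(s != NULL);
    for (size_t i = 0; i < n; i++) {
        uint16_t c = s[i];
        if (i + 1 < n && s[i + 1] == COMBINING_DIAERESIS) {
            for (size_t k = 0; k < pairs; k++) {
                if (umlaut_pairs[k][0] == c) {
                    c = umlaut_pairs[k][1];
                    i++; /* consume the combining character */
                    break;
                }
            }
        }
        s[out++] = c;
    }
    return out;
}
```

       Applied to the sequence 0x0041 0x0308, this yields the
       single code 0x00c4, matching the example above.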

IMPLEMENTATION LEVELS
       As not all systems are expected to support advanced mecha-
       nisms like combining characters, ISO 10646  specifies  the
       following three implementation levels of UCS:

       Level 1  Combining  characters  and Hangul Jamo characters
                (a special,  more  complicated  encoding  of  the
                Korean  script,  where Hangul syllables are coded
                as two or three subcharacters) are not supported.

       Level 2  Like level 1, except that in some scripts some
                combining characters are now allowed (e.g., for
                Hebrew, Arabic, Devanagari, Bengali, Gurmukhi,
                Gujarati, Oriya, Tamil, Telugu, Kannada,
                Malayalam, Thai, and Lao).

       Level 3  All UCS characters are supported.

       The Unicode 1.1 standard published by the Unicode
       Consortium contains exactly the UCS Basic Multilingual
       Plane at implementation level 3, as described in ISO
       10646. Unicode 1.1 also adds semantic definitions for some
       characters to the definitions of ISO 10646.

UNICODE UNDER LINUX
       Under Linux, only the BMP at implementation level 1 should
       be used at the moment, in order to keep the implementation
       complexity of combining characters low. The higher
       implementation levels are more suitable for special word
       processing formats than as a generic system character set.
       On Linux, the C type wchar_t is an unsigned 16-bit integer
       type, and its values are interpreted as UCS level 1 BMP
       codes.

       The locale setting specifies whether the system character
       encoding is, for example, UTF-8 or ISO 8859-1. Library
       functions like wctomb, mbtowc, or wprintf can be used to
       transform the internal wchar_t characters and strings into
       the system character encoding and back.
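
       As a rough sketch of the conversion described above, the
       function below encodes one wide character with wctomb and
       decodes it again with mbtowc, using whatever encoding the
       current locale selects. The name roundtrip_ok is
       hypothetical; wctomb, mbtowc, and MB_LEN_MAX are standard
       C.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: nonzero if the non-null wide character wc
 * survives a trip through the locale's multibyte encoding. */
int roundtrip_ok(wchar_t wc)
{
    char buf[MB_LEN_MAX];
    wchar_t back;
    int len;

    assert(wc != L'\0'); /* mbtowc reports 0 for the null character */

    (void)wctomb(NULL, 0);       /* reset any shift state */
    len = wctomb(buf, wc);       /* wide -> multibyte */
    if (len < 0)
        return 0;                /* not representable in this locale */

    (void)mbtowc(NULL, NULL, 0); /* reset any shift state */
    if (mbtowc(&back, buf, (size_t)len) != len)
        return 0;                /* multibyte -> wide failed */

    return back == wc;
}
```

       A program would normally call setlocale(LC_CTYPE, "")
       from <locale.h> first, so that the encoding named in the
       user's environment is used. Without that call the "C"
       locale applies, in which only a limited character set is
       guaranteed to convert.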

PRIVATE AREA
       In the BMP, the range  0xe000  to  0xf8ff  will  never  be
       assigned  any  characters  by the standard and is reserved
       for private usage. For the Linux community,  this  private
       area  has been subdivided further into the range 0xe000 to
       0xefff which can be used individually by any end-user  and
       the  Linux zone in the range 0xf000 to 0xf8ff where exten-
       sions are coordinated among all Linux users. The  registry
       of  the characters assigned to the Linux zone is currently
       maintained  by  H.  Peter  Anvin  <Peter.Anvin@linux.org>,
       Yggdrasil  Computing,  Inc.  It  contains  some  DEC VT100
       graphics  characters  missing  in  Unicode,  gives  direct
       access  to  the  characters in the console font buffer and
       contains the characters used by  a  few  advanced  scripts
       like Klingon.
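
       The subdivision described above can be expressed as a
       simple range check. This is only a sketch: the enum and
       the function name ucs_private_zone are hypothetical, and
       the ranges are taken directly from the text.

```c
#include <assert.h>

enum ucs_zone {
    UCS_NOT_PRIVATE,      /* outside 0xe000-0xf8ff */
    UCS_PRIVATE_END_USER, /* 0xe000-0xefff, free for any end-user */
    UCS_PRIVATE_LINUX     /* 0xf000-0xf8ff, the coordinated Linux zone */
};

/* Hypothetical helper: classify a BMP code by private-area zone. */
enum ucs_zone ucs_private_zone(unsigned int c)
{
    assert(c <= 0xffffu); /* only BMP codes are considered here */

    if (c >= 0xe000u && c <= 0xefffu)
        return UCS_PRIVATE_END_USER;
    if (c >= 0xf000u && c <= 0xf8ffu)
        return UCS_PRIVATE_LINUX;
    return UCS_NOT_PRIVATE;
}
```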

LITERATURE
       * Information  technology - Universal Multiple-Octet Coded
         Character Set (UCS) - Part  1:  Architecture  and  Basic
         Multilingual Plane.  International Standard ISO 10646-1,
         International Organization for Standardization,  Geneva,
         1993.

         This is the official specification of UCS.  Pretty offi-
         cial, pretty thick, and pretty expensive.  For  ordering
         information, check www.iso.ch.

       * The Unicode Standard - Worldwide Character Encoding Ver-
         sion 1.0.  The Unicode Consortium, Addison-Wesley, Read-
         ing, MA, 1991.

         Unicode 1.1.4 is already available. The changes to the
         1.0 book are available from ftp.unicode.org. Unicode 2.0
         will be published again as a book in 1996.

       * S.  Harbison,  G. Steele. C - A Reference Manual. Fourth
         edition, Prentice Hall,  Englewood  Cliffs,  1995,  ISBN
         0-13-326224-3.

         A good reference book about the C programming language.
         The fourth edition now also covers the 1994 Amendment 1
         to the ISO C standard (ISO/IEC 9899:1990), which adds a
         large number of new C library functions for handling
         wide character sets.

BUGS
       At the time when this man page was written, the Linux libc
       support for UCS was far from complete.

AUTHOR
       Markus Kuhn <mskuhn@cip.informatik.uni-erlangen.de>

SEE ALSO
       utf-8(7)

Linux                       1995-12-27                          1