CHARSETS(4)         Linux Programmer's Manual         CHARSETS(4)

NAME
       charsets  - programmer's view of character sets and inter-
       nationalization

DESCRIPTION
       Linux is an international operating system.  Many of its
       utilities and device drivers (including the console driver)
       support multilingual character sets, covering Latin-alphabet
       letters with diacritical marks, accents, and ligatures, as
       well as entire non-Latin alphabets such as Greek, Cyrillic,
       Arabic, and Hebrew.

       This  manual page presents a programmer's-eye view of dif-
       ferent character-set standards and how they  fit  together
       on  Linux.   Standards  discussed include ASCII, ISO 8859,
       KOI8-R, Unicode, ISO 2022 and ISO 4873.

ASCII
       ASCII (American Standard Code for Information Interchange) is
       the original 7-bit character set, designed for American
       English.  It is currently described by the ECMA-6 standard.

       An ASCII variant that replaces the American number sign ('#',
       also called crosshatch, octothorpe, hash or pound sign) with
       the British pound-sterling symbol is used in Great Britain;
       when needed, the American and British variants may be
       distinguished as "US ASCII" and "UK ASCII".

       As  Linux  was written for hardware designed in the US, it
       natively supports US ASCII.

ISO 8859
       ISO 8859 is a series of ten 8-bit character sets, all of
       which have US ASCII in their low (7-bit) half, invisible
       control characters in positions 128 to 159, and 96 fixed-width
       graphics in positions 160 to 255.
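
       The layout just described can be sketched in a few lines of
       Python.  This is only an illustration; the function name
       iso8859_region and the region labels are invented for this
       example and are not part of any standard or library.

```python
def iso8859_region(byte: int) -> str:
    """Classify an 8-bit value by the ISO 8859 layout described above."""
    if not 0 <= byte <= 255:
        raise ValueError("not an 8-bit value")
    if byte < 128:
        return "ascii"      # low (7-bit) half: US ASCII
    if byte < 160:
        return "control"    # positions 128-159: invisible control characters
    return "graphic"        # positions 160-255: 96 graphic characters


# Count how many code positions fall into each region.
counts = {}
for b in range(256):
    region = iso8859_region(b)
    counts[region] = counts.get(region, 0) + 1
print(counts)
```

       Only the 96 positions in the last region differ from one
       member of the series to another.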

       Of  these, the most important is ISO 8859-1 (Latin-1).  It
       is natively supported in the Linux console driver,  fairly
       well  supported in X11R6, and is the base character set of
       HTML.

       Console support for  the  other  8859  character  sets  is
       available under Linux through user-mode utilities (such as
       setfont(8)) that modify  keyboard  bindings  and  the  EGA
       graphics table and employ the "user mapping" font table in
       the console driver.

       Here are brief descriptions of each set:

       8859-1 (Latin-1)
              Latin-1 covers most Western European languages such
              as   Albanian,  Catalan,  Danish,  Dutch,  English,
              Faroese, Finnish, French, German, Galician,  Irish,
              Icelandic, Italian, Norwegian, Portuguese, Spanish,
              and Swedish. The lack of the  ligatures  Dutch  ij,
              French  oe and old-style ,,German`` quotation marks
              is tolerable.

       8859-2 (Latin-2)
              Latin-2 supports most Latin-written Slavic and
              Central European languages: Croatian, Czech, German,
              Hungarian, Polish, Romanian, Slovak, and Slovene.

       8859-3 (Latin-3)
              Latin-3 is popular with authors of Esperanto, Gali-
              cian, Maltese, and Turkish.

       8859-4 (Latin-4)
              Latin-4  introduced  letters for Estonian, Latvian,
              and Lithuanian.  It is  essentially  obsolete;  see
              8859-10 (Latin-6).

       8859-5 Cyrillic letters supporting Bulgarian, Belarusian,
              Macedonian, Russian, Serbian and Ukrainian.
              Ukrainians read the letter `ghe' with downstroke as
              `heh' and would need a ghe with upstroke to write a
              correct ghe.  See the discussion of KOI8-R below.

       8859-6 Supports Arabic.  The 8859-6 glyph table is a fixed
              font of separate letter forms, but a proper display
              engine  should  combine these using the proper ini-
              tial, medial, and final forms.

       8859-7 Supports Modern Greek.

       8859-8 Supports Hebrew.

       8859-9 (Latin-5)
              This is a variant of Latin-1 that replaces  rarely-
              used Icelandic letters with Turkish ones.

       8859-10 (Latin-6)
              Latin-6 adds the last Inuit (Greenlandic) and Sami
              (Lappish) letters that were missing in Latin-4 to
              cover the entire Nordic area.  RFC 1345 listed a
              preliminary and different `latin6'.  Skolt Sami
              still needs a few more accents than these.
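
       To see how two members of the series relate, the following
       sketch compares the upper graphic positions of Latin-1 and
       Latin-5 using Python's built-in codecs "iso8859_1" and
       "iso8859_9".  The helper name differing_positions is made up
       for this example.

```python
def differing_positions(codec_a: str, codec_b: str) -> dict:
    """Map each position in 160-255 where the two codecs disagree
    to the pair of characters they decode it to."""
    diffs = {}
    for b in range(160, 256):
        raw = bytes([b])
        a = raw.decode(codec_a)
        c = raw.decode(codec_b)
        if a != c:
            diffs[b] = (a, c)
    return diffs


latin1_vs_latin5 = differing_positions("iso8859_1", "iso8859_9")
for pos, (l1, l5) in sorted(latin1_vs_latin5.items()):
    # Print code points rather than the characters themselves so the
    # output is plain ASCII regardless of the terminal's character set.
    print(f"0x{pos:02X}: Latin-1 U+{ord(l1):04X}  Latin-5 U+{ord(l5):04X}")
```

       The positions reported are those where Latin-5 substitutes
       Turkish letters for the Icelandic ones of Latin-1.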

KOI8-R
       KOI8-R  is a non-ISO character set popular in Russia.  The
       lower half is US ASCII; the upper is a Cyrillic  character
       set somewhat better designed than ISO 8859-5.
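
       As an illustration, the sketch below encodes the same string
       in KOI8-R and ISO 8859-5 with Python's "koi8_r" and
       "iso8859_5" codecs.  The last step relies on a well-known
       property of KOI8-R that is not stated above: clearing the
       high bit of Cyrillic text leaves a rough Latin transliteration
       with inverted case.  The sample string is arbitrary.

```python
# "Russkij" in Cyrillic followed by an ASCII word; escapes keep the
# source itself pure ASCII.
text = "\u0420\u0443\u0441\u0441\u043a\u0438\u0439 Linux"

koi8 = text.encode("koi8_r")
iso5 = text.encode("iso8859_5")

# The ASCII tail is byte-for-byte identical in both encodings.
ascii_tail_same = koi8[-6:] == iso5[-6:] == b" Linux"

# Every Cyrillic letter lands in the upper half (high bit set) in both
# sets, but the two standards assign different positions.
upper_half = all(b >= 0x80 for b in koi8[:7] + iso5[:7])
positions_differ = koi8[:7] != iso5[:7]

# Clearing the high bit of the KOI8-R bytes yields readable ASCII.
stripped = bytes(b & 0x7F for b in koi8).decode("ascii")
print(ascii_tail_same, upper_half, positions_differ, stripped)
```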

       Console  support  for  KOI8-R  is  available  under  Linux
       through user-mode utilities that modify keyboard  bindings
       and  the EGA graphics table, and employ the "user mapping"
       font table in the console driver.

UNICODE
       Unicode (ISO 10646) is a standard which aims to unambiguously
       represent every known character of every human language.
       Unicode's native encoding is 32-bit (older versions used 16
       bits).  Information on Unicode is available at
       <http://www.unicode.com>.

       Linux represents Unicode using the 8-bit Unicode
       Transformation Format (UTF-8).  UTF-8 is a variable-length
       encoding of Unicode.  It uses 1 byte to code 7 bits, 2 bytes
       for 11 bits, 3 bytes for 16 bits, 4 bytes for 21 bits, 5
       bytes for 26 bits, and 6 bytes for 31 bits.
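
       The size table above translates directly into code.  The
       following is a minimal sketch; the name utf8_length is invented
       for this example.  Note that it models the 31-bit form of
       UTF-8 described here, whereas current UTF-8 (RFC 3629) stops
       at 4 bytes.

```python
def utf8_length(value: int) -> int:
    """Number of bytes the 31-bit form of UTF-8 needs for `value`."""
    if value < 0 or value >= 1 << 31:
        raise ValueError("value does not fit in 31 bits")
    # (payload bits, byte count) pairs, exactly as listed in the text.
    for bits, nbytes in ((7, 1), (11, 2), (16, 3), (21, 4), (26, 5), (31, 6)):
        if value < 1 << bits:
            return nbytes
    raise AssertionError("unreachable: value was range-checked above")


for value in (0x41, 0xE9, 0x20AC, 0x10000, 0x200000, 0x4000000):
    print(f"0x{value:X} needs {utf8_length(value)} byte(s)")
```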

       Let 0,1,x stand for a zero, one, or arbitrary bit.  A byte
       0xxxxxxx  stands  for  the Unicode 00000000 0xxxxxxx which
       codes the same symbol as the ASCII 0xxxxxxx.  Thus,  ASCII
       goes  unchanged into UTF-8, and people using only ASCII do
       not notice any change: not in code, and not in file  size.

       A  byte  110xxxxx  is  the  start  of  a  2-byte code, and
       110xxxxx 10yyyyyy is assembled into 00000xxx xxyyyyyy.   A
       byte  1110xxxx is the start of a 3-byte code, and 1110xxxx
       10yyyyyy 10zzzzzz is  assembled  into  xxxxyyyy  yyzzzzzz.
       (When UTF-8 is used to code the 31-bit ISO 10646 then this
       progression continues up to 6-byte codes.)
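
       The bit patterns above can be written out as shifts and
       masks.  The sketch below handles only the 1-, 2- and 3-byte
       codes described in this section; the function names are
       hypothetical, and real programs would normally use a library
       routine instead.

```python
def encode_utf8_short(cp: int) -> bytes:
    """Encode a value below 0x10000 as a 1- to 3-byte UTF-8 code."""
    if cp < 0:
        raise ValueError("negative value")
    if cp < 0x80:                        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                       # 110xxxxx 10yyyyyy
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                     # 1110xxxx 10yyyyyy 10zzzzzz
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("only 1- to 3-byte codes are handled in this sketch")


def decode_utf8_short(data: bytes) -> int:
    """Assemble the value of one 1- to 3-byte UTF-8 code."""
    first = data[0]
    if first < 0x80:
        return first
    if (first & 0xE0) == 0xC0:
        return ((first & 0x1F) << 6) | (data[1] & 0x3F)
    if (first & 0xF0) == 0xE0:
        return (((first & 0x0F) << 12)
                | ((data[1] & 0x3F) << 6)
                | (data[2] & 0x3F))
    raise ValueError("not the head of a 1- to 3-byte code")


print(encode_utf8_short(0xE9).hex(), encode_utf8_short(0x20AC).hex())
```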

       For ISO-8859-1 users this means that characters with the high
       bit set are now coded with two bytes.  This tends to expand
       ordinary text files by one or two percent.  There are no
       conversion problems, however, since the Unicode value of
       each ISO-8859-1 symbol equals its ISO-8859-1 value (extended
       by eight leading zero bits).  For Japanese users this means
       that the 16-bit codes now in common use will take three
       bytes, and that extensive mapping tables are required.  Many
       Japanese users therefore prefer ISO 2022.
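
       Both claims about ISO-8859-1 can be checked with a short
       Python sketch.  The sample string is arbitrary and chosen only
       because it contains three accented letters.

```python
# "naive cafe creme" with three accented letters, written with escapes
# so that the source itself stays pure ASCII.
sample = "na\u00efve caf\u00e9 cr\u00e8me"

as_latin1 = sample.encode("iso8859_1")
as_utf8 = sample.encode("utf-8")

# Each character with the high bit set costs one extra byte in UTF-8.
high_bit_chars = sum(1 for b in as_latin1 if b >= 0x80)
growth = len(as_utf8) - len(as_latin1)

# The Unicode value of every ISO-8859-1 byte equals the byte value.
same_values = all(ord(bytes([b]).decode("iso8859_1")) == b
                  for b in range(256))

print(len(as_latin1), len(as_utf8), growth, same_values)
```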

       Note that UTF-8 is self-synchronizing: 10xxxxxx is a tail, and
       any other byte is the head of a code.  Note also that ASCII
       bytes occur in a UTF-8 stream only as themselves.  In
       particular, there are no embedded NULs or '/'s that form part
       of some larger code.
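
       Self-synchronization means a program dropped at an arbitrary
       offset can find the start of the current code by skipping
       back over tail bytes.  A minimal sketch, with the invented
       name char_start:

```python
def char_start(buf: bytes, pos: int) -> int:
    """Back up from `pos` to the head byte of the code containing it."""
    while pos > 0 and (buf[pos] & 0xC0) == 0x80:    # 10xxxxxx is a tail
        pos -= 1
    return pos


# 'a', U+20AC (a 3-byte code), '/', 'b'
data = "a\u20ac/b".encode("utf-8")
starts = [char_start(data, i) for i in range(len(data))]
print(starts)

# No byte of a multi-byte code is in the ASCII range, so searching the
# stream for '/' finds only the real slash.
multibyte = "\u20ac".encode("utf-8")
no_ascii_inside = all(b >= 0x80 for b in multibyte)
```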

       Since  ASCII,  and,  in  particular,  NUL  and  '/',   are
       unchanged,  the kernel does not notice that UTF-8 is being
       used. It does not care at all what the bytes  it  is  han-
       dling stand for.

       Rendering  of  Unicode  data  streams is typically handled
       through `subfont' tables which map a subset of Unicode  to
       glyphs.   Internally  the  kernel uses Unicode to describe
       the subfont loaded in video RAM.  This means that in UTF-8
       mode  one  can use a character set with 512 different sym-
       bols.  This  is  not  enough  for  Japanese,  Chinese  and
       Korean, but it is enough for most other purposes.

ISO 2022 AND ISO 4873
       The  ISO  2022  and 4873 standards describe a font-control
       model based on VT100 practice.  This model is  (partially)
       supported by the Linux kernel and by xterm(1).  It is pop-
       ular in Japan and Korea.

       There are 4 graphic character sets, called G0, G1, G2 and G3.
       One of them is the current character set for codes with high
       bit zero (initially G0), and one of them is the current
       character set for codes with high bit one (initially G1).
       Each graphic character set has 94 or 96 characters and is
       essentially a 7-bit character set.  It uses either the octal
       codes 040-0177 (041-0176 for a 94-character set) or the octal
       codes 0240-0377 (0241-0376 for a 94-character set).  G0 always
       has size 94 and uses codes 041-0176.

       Switching  between  character sets is done using the shift
       functions ^N (SO or LS1), ^O (SI or LS0), ESC n (LS2), ESC
       o  (LS3),  ESC  N  (SS2), ESC O (SS3), ESC ~ (LS1R), ESC }
       (LS2R), ESC | (LS3R).  The function  LSn  makes  character
       set  Gn the current one for codes with high bit zero.  The
       function LSnR makes character set Gn the current  one  for
       codes with high bit one.  The function SSn makes character
       set Gn (n=2 or 3) the current one for the  next  character
       only (regardless of the value of its high order bit).
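
       The shift functions can be modelled as a small state machine.
       The sketch below is a simplification: it tracks only which Gn
       applies to each data byte, skips any other two-byte escape
       sequence, and ignores designations and other control
       characters.  All names in it are invented for this example.

```python
ESC = 0x1B

# Locking shifts sent as single control characters.
LOCKING_C0 = {0x0F: ("low", 0),     # ^O  SI / LS0
              0x0E: ("low", 1)}     # ^N  SO / LS1
# Locking shifts sent as ESC plus one final byte.
LOCKING_ESC = {ord("n"): ("low", 2),    # LS2
               ord("o"): ("low", 3),    # LS3
               ord("~"): ("high", 1),   # LS1R
               ord("}"): ("high", 2),   # LS2R
               ord("|"): ("high", 3)}   # LS3R
# Single shifts: apply to the next character only.
SINGLE_ESC = {ord("N"): 2, ord("O"): 3}  # SS2, SS3


def trace_sets(stream: bytes) -> list:
    """Return (byte, n) pairs: each data byte and the Gn it selects."""
    current = {"low": 0, "high": 1}     # initially G0 low, G1 high
    pending = None                      # set by SS2 / SS3
    out = []
    i = 0
    while i < len(stream):
        byte = stream[i]
        if byte == ESC and i + 1 < len(stream):
            final = stream[i + 1]
            if final in LOCKING_ESC:
                half, n = LOCKING_ESC[final]
                current[half] = n
            elif final in SINGLE_ESC:
                pending = SINGLE_ESC[final]
            i += 2                      # other escapes are skipped
            continue
        if byte in LOCKING_C0:
            half, n = LOCKING_C0[byte]
            current[half] = n
            i += 1
            continue
        if pending is not None:         # single shift wins, any high bit
            n, pending = pending, None
        else:
            n = current["high" if byte & 0x80 else "low"]
        out.append((byte, n))
        i += 1
    return out


print(trace_sets(b"A\x0eB\x0fC"))
```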

       A 94-character set is designated as Gn character set by an
       escape sequence ESC ( xx (for G0), ESC ) xx (for G1),  ESC
       *  xx (for G2), ESC + xx (for G3), where xx is a symbol or
       a pair of symbols found in the ISO 2375 International Reg-
       ister  of  Coded  Character  Sets.   For  example, ESC ( @
       selects the ISO 646 character set as G0, ESC (  A  selects
       the  UK standard character set (with pound instead of num-
       ber sign), ESC ( B selects ASCII (with dollar  instead  of
       currency sign), ESC ( M selects a character set for African
       languages, ESC ( ! A selects the Cuban character set, and so
       on.

       A 96-character set is designated as Gn character set by an
       escape sequence ESC - xx (for G1), ESC . xx  (for  G2)  or
       ESC  /  xx  (for  G3).   For  example, ESC - G selects the
       Hebrew alphabet as G1.
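
       The designation sequences for 94- and 96-character sets
       follow a regular pattern, which the following sketch builds
       as byte strings.  The function name designate and its
       parameters are hypothetical; the final symbols passed to it
       are those quoted in the examples above.

```python
ESC = b"\x1b"

# Intermediate byte that selects the target Gn for each set size.
INTERMEDIATE = {
    94: {0: b"(", 1: b")", 2: b"*", 3: b"+"},
    96: {1: b"-", 2: b".", 3: b"/"},    # no 96-character form for G0
}


def designate(n: int, final: str, size: int = 94) -> bytes:
    """Build the escape sequence designating a set as Gn.

    `final` is the symbol (or pair of symbols) that identifies the
    character set, for example "B" or "!A".
    """
    if size not in INTERMEDIATE:
        raise ValueError("size must be 94 or 96")
    if n not in INTERMEDIATE[size]:
        raise ValueError(f"G{n} cannot take a {size}-character set")
    return ESC + INTERMEDIATE[size][n] + final.encode("ascii")


print(designate(0, "B"), designate(1, "G", size=96))
```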

       A multibyte character set is designated  as  Gn  character
       set by an escape sequence ESC $ xx or ESC $ ( xx (for G0),
       ESC $ ) xx (for G1), ESC $ * xx (for G2), ESC $ + xx  (for
       G3).   For example, ESC $ ( C selects the Korean character
       set for G0.  The Japanese character set selected by ESC  $
       B has a more recent version selected by ESC & @ ESC $ B.

       ISO 4873 stipulates a narrower use of character sets, where
       G0 is fixed (always ASCII), so that G1, G2 and G3 can only be
       invoked for codes with the high order bit set.  In
       particular, ^N and ^O are no longer used, ESC ( xx can be
       used only with xx=B, and ESC ) xx, ESC * xx, ESC + xx are
       equivalent to ESC - xx, ESC . xx, ESC / xx, respectively.

SEE ALSO
       console(4), console_ioctl(4), console_codes(4)

Linux                   November 5th, 1996                      1