The University of Auckland - Unix man pages: UTF-8 (7)

UTF-8(7)            Linux Programmer's Manual            UTF-8(7)

NAME
       UTF-8 - an ASCII compatible multibyte Unicode encoding

DESCRIPTION
       The Unicode character set occupies a 16-bit code space. The
       most obvious Unicode encoding (known as UCS-2) consists of a
       sequence of 16-bit words. In such strings, many 16-bit
       characters contain bytes like '\0' or '/', which have a
       special meaning in filenames and other C library function
       parameters. In addition, the majority of UNIX tools expect
       ASCII files and can't read 16-bit words as characters without
       major modifications. For these reasons, UCS-2 is not a
       suitable external encoding of Unicode in filenames, text
       files, environment variables, etc. The ISO 10646 Universal
       Character Set (UCS), a superset of Unicode, occupies an even
       larger 31-bit code space, and the obvious UCS-4 encoding for
       it (a sequence of 32-bit words) has the same problems.

       The UTF-8 encoding of Unicode and UCS does not have these
       problems and is the recommended way to use the Unicode
       character set under Unix-style operating systems.
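
       The difference can be illustrated with a short Python sketch.
       It is an illustration only: Python's "utf-16-be" codec is used
       here as a stand-in for UCS-2 (the two agree for the 16-bit
       characters shown), and the sample string is an arbitrary
       choice, not something taken from this page.

```python
# Illustration only: "utf-16-be" stands in for big-endian UCS-2.
text = "A\u012f"  # 'A' followed by U+012F (small letter i with ogonek)

ucs2 = text.encode("utf-16-be")
utf8 = text.encode("utf-8")


def hexdump(data):
    """Return the bytes of data as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in data)


# UCS-2: 'A' becomes 0x00 0x41 and U+012F becomes 0x01 0x2f, so the
# byte string contains both a NUL byte and a '/' byte.
print(hexdump(ucs2))  # 00 41 01 2f

# UTF-8: 'A' stays 0x41 and U+012F becomes the two bytes 0xc4 0xaf.
print(hexdump(utf8))  # 41 c4 af

has_nul_ucs2 = b"\x00" in ucs2
has_slash_ucs2 = b"/" in ucs2
has_nul_utf8 = b"\x00" in utf8
has_slash_utf8 = b"/" in utf8
```

       A C function that treats '\0' as a string terminator, or '/'
       as a path separator, would misread the UCS-2 bytes but not the
       UTF-8 bytes.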

PROPERTIES
       The UTF-8 encoding has the following nice properties:

       * UCS characters 0x00000000 to 0x0000007f  (the  classical
         US-ASCII characters) are encoded simply as bytes 0x00 to
         0x7f (ASCII compatibility). This means  that  files  and
         strings  which  contain only 7-bit ASCII characters have
         the same encoding under both ASCII and UTF-8.

       * All UCS characters > 0x7f are  encoded  as  a  multibyte
         sequence  consisting  only of bytes in the range 0x80 to
         0xfd, so no ASCII byte can appear  as  part  of  another
         character  and  there  are no problems with e.g. '\0' or
         '/'.

       * The lexicographic sorting order of UCS-4 strings is pre-
         served.

       * All  possible 2^31 UCS codes can be encoded using UTF-8.

       * The bytes 0xfe and 0xff are  never  used  in  the  UTF-8
         encoding.

       * The first byte of a multibyte sequence which represents
         a single non-ASCII UCS character is always in the range
         0xc2 to 0xfd and indicates how long this multibyte
         sequence is (0xc0 and 0xc1 could only begin sequences
         that are longer than necessary, which are not allowed).
         All further bytes in a multibyte sequence are in the
         range 0x80 to 0xbf. This allows easy resynchronization
         and makes the encoding stateless and robust against
         missing bytes.

       * UTF-8 encoded UCS characters may be up to six bytes
         long; however, characters in the 16-bit Unicode subset
         need at most three bytes. As Linux uses only the 16-bit
         Unicode subset of UCS, UTF-8 multibyte sequences under
         Linux can only be one, two or three bytes long.
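
       Several of the properties listed above can be checked with a
       few lines of Python. This is a minimal sketch, not part of any
       library: the helper names is_continuation() and resync() and
       the sample strings are hypothetical, chosen for illustration.

```python
def is_continuation(byte):
    """True for bytes of the form 10xxxxxx (0x80 to 0xbf)."""
    return 0x80 <= byte <= 0xBF


def resync(data, pos):
    """Return the index of the first character start at or after pos."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


# ASCII compatibility: pure 7-bit text has the same bytes either way.
ascii_text = "plain/path.txt"
same_as_ascii = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# No ASCII byte appears inside the encoding of a non-ASCII character.
sample = "\u00a9\u2260\u012f"
no_ascii_bytes = all(b >= 0x80 for b in sample.encode("utf-8"))

# Sorting by code number and sorting by UTF-8 bytes give the same order.
words = ["z", "\u00e9", "a", "\u2260", "\u00a9"]
by_code = sorted(words, key=lambda s: [ord(c) for c in s])
by_bytes = sorted(words, key=lambda s: s.encode("utf-8"))
same_order = by_code == by_bytes

# Resynchronization: start in the middle of U+2260 and skip forward.
data = "a\u2260b".encode("utf-8")  # 61 e2 89 a0 62
start = resync(data, 2)            # index 2 is the 0x89 byte
rest = data[start:]                # the bytes from the next character on
```

       Here resync() skips the two remaining 10xxxxxx bytes and stops
       at index 4, the start of 'b', without needing any state from
       earlier in the string.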

ENCODING
       The following byte sequences are used to represent a char-
       acter. The sequence to be used depends  on  the  UCS  code
       number of the character:

       0x00000000 - 0x0000007F:
           0xxxxxxx

       0x00000080 - 0x000007FF:
           110xxxxx 10xxxxxx

       0x00000800 - 0x0000FFFF:
           1110xxxx 10xxxxxx 10xxxxxx

       0x00010000 - 0x001FFFFF:
           11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

       0x00200000 - 0x03FFFFFF:
           111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

       0x04000000 - 0x7FFFFFFF:
           1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

       The  xxx  bit  positions  are  filled with the bits of the
       character code number in binary representation.  Only  the
       shortest  possible  multibyte sequence which can represent
       the code number of the character can be used.
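
       The table above translates fairly directly into code. The
       following Python sketch is one possible encoder for the
       six-range scheme as described on this page; the name
       encode_ucs() and the RANGES table are hypothetical. It is
       written by hand because Python's built-in "utf-8" codec only
       accepts code numbers up to 0x10FFFF and so cannot produce the
       five- and six-byte forms.

```python
# (upper limit of the range, sequence length, fixed bits of first byte)
RANGES = [
    (0x0000007F, 1, 0x00),  # 0xxxxxxx
    (0x000007FF, 2, 0xC0),  # 110xxxxx
    (0x0000FFFF, 3, 0xE0),  # 1110xxxx
    (0x001FFFFF, 4, 0xF0),  # 11110xxx
    (0x03FFFFFF, 5, 0xF8),  # 111110xx
    (0x7FFFFFFF, 6, 0xFC),  # 1111110x
]


def encode_ucs(code):
    """Encode one UCS code number as the shortest UTF-8 sequence."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("code number outside the 31-bit UCS code space")
    # Pick the first (and therefore shortest) range that fits.
    for limit, length, prefix in RANGES:
        if code <= limit:
            break
    # Fill the 10xxxxxx bytes from the right, six bits at a time.
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (code & 0x3F))
        code >>= 6
    # The remaining high bits go into the first byte.
    first = prefix | code
    return bytes([first] + tail[::-1])


print(encode_ucs(0x41).hex())        # 41
print(encode_ucs(0x7FF).hex())       # dfbf
print(encode_ucs(0x7FFFFFFF).hex())  # fdbfbfbfbfbf
```

       Because the ranges are tried in ascending order, the function
       always returns the shortest sequence that can represent the
       code number.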

EXAMPLES
       The Unicode character 0xa9  =  1010  1001  (the  copyright
       sign) is encoded in UTF-8 as

              11000010 10101001 = 0xc2 0xa9

       and  character  0x2260  =  0010  0010  0110 0000 (the "not
       equal" symbol) is encoded as:

              11100010 10001001 10100000 = 0xe2 0x89 0xa0
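
       Going the other way, the two examples above can be decoded
       with a short Python sketch. The function name decode_one() is
       hypothetical, and the code follows the six-byte scheme as
       described on this page rather than any particular library. It
       also rejects sequences that are longer than necessary.

```python
# Smallest code number allowed for each sequence length (index = length).
MINIMUM = [0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000]


def decode_one(data):
    """Decode one character from the start of data.

    Returns a (code number, sequence length) pair.
    """
    first = data[0]
    if first < 0x80:
        return first, 1
    # The number of leading 1 bits in the first byte is the length.
    length = 0
    mask = 0x80
    while first & mask:
        length += 1
        mask >>= 1
    if length < 2 or length > 6:
        raise ValueError("invalid first byte")
    if len(data) < length:
        raise ValueError("truncated sequence")
    # Bits below the terminating 0 bit are the top bits of the code.
    code = first & (mask - 1)
    for byte in data[1:length]:
        if byte & 0xC0 != 0x80:
            raise ValueError("expected a 10xxxxxx byte")
        code = (code << 6) | (byte & 0x3F)
    if code < MINIMUM[length]:
        raise ValueError("not the shortest possible sequence")
    return code, length


print(decode_one(b"\xc2\xa9"))      # (169, 2), i.e. 0xa9
print(decode_one(b"\xe2\x89\xa0"))  # (8800, 3), i.e. 0x2260
```

       With the shortest-form check in place, a sequence such as
       0xc0 0xaf, which would otherwise decode to '/', raises an
       error instead.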

STANDARDS
       ISO 10646, Unicode 1.1, XPG4, Plan 9.

AUTHOR
       Markus Kuhn <mskuhn@cip.informatik.uni-erlangen.de>

SEE ALSO
       unicode(7)

Linux                       1995-11-26                          1