perluniintro - Perl Unicode introduction
This document gives a general idea of Unicode and how to use Unicode
in Perl.
Unicode is a character set standard which plans to codify all of the
writing systems of the world, plus many other symbols.
Unicode and ISO/IEC 10646 are coordinated standards that provide code
points for characters in almost all modern character set standards,
covering more than 30 writing systems and hundreds of languages,
including all commercially-important modern languages. All characters
in the largest Chinese, Japanese, and Korean dictionaries are also
encoded. The standards will eventually cover almost all characters in
more than 250 writing systems and thousands of languages.
A Unicode character is an abstract entity. It is not bound to any
particular integer width, especially not to the C language char.
Unicode is language-neutral and display-neutral: it does not encode the
language of the text and it does not define fonts or other graphical
layout details. Unicode operates on characters and on text built from
those characters.
Unicode defines characters like LATIN CAPITAL LETTER A or
GREEK SMALL LETTER ALPHA and unique numbers for the characters, in this
case 0x0041 and 0x03B1, respectively. These unique numbers are called
code points.
The Unicode standard prefers using hexadecimal notation for the code
points. If numbers like 0x0041 are unfamiliar to
you, take a peek at a later section, ``Hexadecimal Notation''.
The Unicode standard uses the notation U+0041 LATIN CAPITAL LETTER A,
to give the hexadecimal code point and the normative name of
the character.
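
As a small illustration of code points and names, the following sketch prints both in the U+ notation. It assumes a Perl recent enough to ship the charnames module with its viacode() function; the variable names are only illustrative.

```perl
use strict;
use warnings;
use charnames ();   # load the name tables; nothing is imported

# Print each code point in U+ notation with its normative name.
for my $cp (0x0041, 0x03B1) {
    printf "U+%04X %s\n", $cp, charnames::viacode($cp);
}

# chr() and ord() convert between a code point and its character.
my $alpha = chr(0x03B1);
printf "ord = 0x%04X\n", ord($alpha);
```

This should print U+0041 LATIN CAPITAL LETTER A, U+03B1 GREEK SMALL LETTER ALPHA, and ord = 0x03B1.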
Unicode also defines various properties for the characters, like
``uppercase'' or ``lowercase'', ``decimal digit'', or ``punctuation'';
these properties are independent of the names of the characters.
Furthermore, various operations on the characters like uppercasing,
lowercasing, and collating (sorting) are defined.
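
Properties and case mappings can be tried out directly. The sketch below assumes the \p{...} property syntax of Perl regular expressions and the built-in uc(); the variable names are made up for the example.

```perl
use strict;
use warnings;

my $a_upper = chr(0x0041);   # LATIN CAPITAL LETTER A
my $alpha   = chr(0x03B1);   # GREEK SMALL LETTER ALPHA

# \p{...} tests a property, independently of the character's name.
my $a_is_upper     = $a_upper =~ /\A\p{Lu}\z/ ? 1 : 0;   # uppercase letter
my $alpha_is_lower = $alpha   =~ /\A\p{Ll}\z/ ? 1 : 0;   # lowercase letter
print "A is uppercase: $a_is_upper, alpha is lowercase: $alpha_is_lower\n";

# uc() applies the Unicode uppercase mapping to the character.
my $upper_alpha = uc($alpha);
printf "uc(U+%04X) = U+%04X\n", ord($alpha), ord($upper_alpha);
```

The last line should report that uppercasing U+03B1 yields U+0391, GREEK CAPITAL LETTER ALPHA.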
A Unicode character consists either of a single code point, or a
base character (like LATIN CAPITAL LETTER A), followed by one or
more modifiers (like COMBINING ACUTE ACCENT). This sequence of
base character and modifiers is called a combining character
sequence.
Whether to call these combining character sequences ``characters''
depends on your point of view. If you are a programmer, you probably
would tend towards seeing each element in the sequences as one unit,
or ``character''. The whole sequence could be seen as one ``character'',
however, from the user's point of view, since that's probably what it
looks like in the context of the user's language.
With this ``whole sequence'' view of characters, the total number of
characters is open-ended. But in the programmer's ``one unit is one
character'' point of view, the concept of ``characters'' is more
deterministic. In this document, we take that second point of view:
one ``character'' is one Unicode code point, be it a base character or
a combining character.
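
The two points of view can be compared in a few lines. This is only a sketch: it relies on length() counting code points and on the \X regular expression escape matching one whole combining character sequence, and the variable names are illustrative.

```perl
use strict;
use warnings;

# LATIN CAPITAL LETTER A followed by COMBINING ACUTE ACCENT.
my $sequence = "\x{0041}\x{0301}";

# The programmer's view: length() counts code points.
my $code_points = length $sequence;

# The user's view: \X matches a base character plus its modifiers
# as a single unit, so count how many such units there are.
my $units = () = $sequence =~ /\X/g;

print "code points: $code_points, user-perceived: $units\n";
```

For this string the count should be two code points but only one user-perceived character.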
For some combinations, there are precomposed characters.
LATIN CAPITAL LETTER A WITH ACUTE, for example, is defined as
a single code point. These precomposed characters are, however,
only available for some combinations, and are mainly
meant to support round-trip conversions between Unicode and legacy
standards (like the ISO 8859). In the general case, the composing
method is more extensible. To support conversion between
different compositions of the characters, various normalization forms
of Unicode are defined.
Because of backward compatibility with legacy encodings, the ``a unique
number for every character'' idea breaks down a bit: instead, there is
``at least one number for every character''. The same character could
be represented differently in several legacy encodings. The
converse is also not true: some code points do not have an assigned
character. Firstly, there are unallocated code points within
otherwise used blocks. Secondly, there are special Unicode control
characters that do not represent true characters.
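
To make the earlier point about precomposed characters concrete, the following sketch compares the precomposed and the composed spelling of the same character. It assumes the Unicode::Normalize module, bundled with Perl since 5.8.0, and its NFC() and NFD() functions; the variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $composed   = "\x{00C1}";          # LATIN CAPITAL LETTER A WITH ACUTE
my $decomposed = "\x{0041}\x{0301}";  # A + COMBINING ACUTE ACCENT

# As raw code point sequences the two spellings differ...
my $equal_raw = $composed eq $decomposed ? "yes" : "no";

# ...but they compare equal once both are in the same normalization form.
my $equal_nfc = NFC($composed) eq NFC($decomposed) ? "yes" : "no";

print "equal as-is: $equal_raw, equal after NFC: $equal_nfc\n";
printf "code points after NFD: %d\n", length(NFD($composed));
```

The strings should differ as-is, compare equal after NFC, and the precomposed character should decompose into two code points under NFD.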
A common myth about Unicode is that it would be ``16-bit'', that is,
Unicode is only represented as 0x10000 (or 65536) characters from
0x0000 to 0xFFFF. This is untrue. Since Unicode 2.0, Unicode
has been defined all the way up to 21 bits (0x10FFFF), and since
Unicode 3.1, characters have been defined beyond 0xFFFF. The first
0x10000 characters are called Plane 0, or the Basic Multilingual
Plane (BMP).
Another myth is that the 256-character blocks have something to
do with languages--that each block would define the characters used
by a language or a set of languages. This is also untrue.
The division into blocks exists, but it is almost completely
accidental--an artifact of how the characters have been and
still are allocated. Instead, there is a concept called scripts,
which is more useful: there is Latin script, Greek script, and
so on. Scripts usually span varied parts of several blocks.
For further information see Unicode::UCD.
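
A quick way to see the difference between blocks and scripts is to ask Unicode::UCD for both. The sketch below assumes its charblock() and charscript() functions, each taking a code point; the exact block names depend on the Unicode version shipped with your Perl.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charblock charscript);

# Report the block and the script of two sample code points.
for my $cp (0x0041, 0x03B1) {
    printf "U+%04X block=%s script=%s\n",
        $cp, charblock($cp), charscript($cp);
}
```

On a reasonably recent Perl, U+0041 should be reported in the Latin script and U+03B1 in the Greek script.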
The Unicode code points are just abstract numbers. To input and
output these abstract numbers, the numbers must be encoded somehow.
Unicode defines several character encoding forms, of which UTF-8
is perhaps the most popular. UTF-8 is a variable-length encoding that
encodes Unicode characters as 1 to 4 bytes (the original design allowed
up to 6, but code points up to 0x10FFFF need at most 4). Other encodings
include UTF-16 and UTF-32 and their big- and little-endian variants
(UTF-8 is byte-order independent).
ISO/IEC 10646 defines the UCS-2 and UCS-4 encoding forms.
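
The variable length of UTF-8 and the byte-order dependence of UTF-16 can be observed with the Encode module, bundled with Perl since 5.8.0. This is a minimal sketch; the sample code points are chosen only to hit each UTF-8 length.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One sample code point for each UTF-8 length from 1 to 4 bytes.
for my $cp (0x0041, 0x03B1, 0x20AC, 0x10000) {
    my $bytes = encode('UTF-8', chr($cp));
    printf "U+%04X -> %d byte(s): %s\n",
        $cp, length($bytes),
        join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

# UTF-16 depends on byte order: the two variants swap the bytes.
printf "UTF-16BE: %s\n", unpack 'H*', encode('UTF-16BE', chr(0x03B1));
printf "UTF-16LE: %s\n", unpack 'H*', encode('UTF-16LE', chr(0x03B1));
```

The four code points should encode to 1, 2, 3, and 4 bytes respectively, and the two UTF-16 lines should show 03b1 and b103.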
For more information about encodings--for instance, to learn what
surrogates and byte order marks (BOMs) are--see the perlunicode
manpage.
Starting from Perl 5.6.0, Perl has had the capacity to handle Unicode
natively. Perl 5.8.0, however, is the first recommended release for
serious Unicode work. The maintenance release 5.6.1 fixed many of the
problems of the initial Unicode implementation, but, for example,
regular expressions still do not work with Unicode in 5.6.1.
Starting from Perl 5.8.0, the use of C |