perlunicode (1) Linux Manual Page

NAME

perlunicode – Unicode support in Perl

DESCRIPTION

If you haven’t already, before reading this document, you should become familiar with both perlunitut and perluniintro.

Unicode aims to UNI-fy the en-CODE-ings of all the world’s character sets into a single Standard. For quite a few of the various coding standards that existed when Unicode was first created, converting from each to Unicode essentially meant adding a constant to each code point in the original standard, and converting back meant just subtracting that same constant. For ASCII and ISO-8859-1, the constant is 0. For Cyrillic (ISO-8859-5), the constant is 864; for Hebrew (ISO-8859-8), it’s 1264; for Thai (ISO-8859-11), 3424; and so forth. This made it easy to do the conversions, and facilitated the adoption of Unicode.
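As a rough illustration of the "add a constant" conversion described above, the following sketch converts one ISO-8859-5 byte by hand and checks the result against the core Encode module. The choice of byte 0xB0 (CYRILLIC CAPITAL LETTER A) and the variable names are assumptions made for this example only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Byte 0xB0 is CYRILLIC CAPITAL LETTER A in ISO-8859-5.
my $legacy_byte = "\xB0";

# Convert by hand: add the ISO-8859-5 constant given in the text.
my $by_offset = ord($legacy_byte) + 864;

# Convert with the core Encode module for comparison.
my $by_encode = ord(decode('ISO-8859-5', $legacy_byte));

printf "offset: U+%04X, Encode: U+%04X\n", $by_offset, $by_encode;
# prints "offset: U+0410, Encode: U+0410"
```

The constant applies to the letters of the script; a few punctuation and symbol positions in these legacy sets do not follow it, which is why the text says "essentially".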

And it worked; nowadays, those legacy standards are rarely used. Most everyone uses Unicode.

Unicode is a comprehensive standard. It specifies many things outside the scope of Perl, such as how to display sequences of characters. For a full discussion of all aspects of Unicode, see <http://www.unicode.org>.

Important Caveats

Even though some of this section may not be understandable to you on first reading, we think it’s important enough to highlight some of the gotchas before delving further, so here goes:

Unicode support is an extensive requirement. While Perl does not implement the Unicode standard or the accompanying technical reports from cover to cover, Perl does support many Unicode features.

Also, the use of Unicode may present security issues that aren’t obvious; see “Security Implications of Unicode” below.

Safest if you "use feature 'unicode_strings'"

In order to preserve backward compatibility, Perl does not turn on full internal Unicode support unless the pragma "use feature 'unicode_strings'" is specified. (This is automatically selected if you "use 5.012" or higher.) Failure to do this can trigger surprises. See “The "Unicode Bug"” below.

This pragma doesn’t affect I/O. Nor does it change the internal representation of strings, only their interpretation. There are still several places where Unicode isn’t fully supported, such as in filenames.
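A minimal sketch of what the pragma changes, assuming an ASCII platform and a string that holds the single code point 0xE9 (LATIN SMALL LETTER E WITH ACUTE). The variable names are illustrative only. It also checks, via utf8::is_utf8(), that the string's internal representation is left alone.

```perl
use strict;
use warnings;

my $s = "\xE9";    # one character, stored internally as one byte

my ($native_uc, $unicode_uc, $still_bytes);

{
    no feature 'unicode_strings';
    $native_uc = uc $s;               # no Unicode rules: left as "\xE9"
}

{
    use feature 'unicode_strings';
    $unicode_uc  = uc $s;             # Unicode rules: becomes "\xC9"
    $still_bytes = utf8::is_utf8($s) ? 0 : 1;   # 1: representation untouched
}

printf "%02X %02X %d\n", ord $native_uc, ord $unicode_uc, $still_bytes;
# prints "E9 C9 1"
```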

Input and Output Layers

Use the ":encoding(...)" layer to read from and write to filehandles using the specified encoding. (See open.)
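For instance, the layer can be given directly in the mode argument of open(). The sketch below writes a short string to a temporary file as UTF-8 and reads it back. The use of File::Temp and the sample text are assumptions for the sake of a self-contained example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file that is removed when the program exits.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

my $text = "caf\x{E9} \x{263A}";     # 6 characters, two of them non-ASCII

# Characters are encoded to UTF-8 bytes on the way out...
open(my $out, '>:encoding(UTF-8)', $path) or die "Cannot write $path: $!";
print {$out} $text;
close $out;

# ...and decoded back into characters on the way in.
open(my $in, '<:encoding(UTF-8)', $path) or die "Cannot read $path: $!";
my $read_back = do { local $/; <$in> };
close $in;

my $bytes_on_disk = -s $path;
printf "%d characters, %d bytes on disk\n", length $read_back, $bytes_on_disk;
# prints "6 characters, 9 bytes on disk"
```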

You must convert your non-ASCII, non-UTF-8 Perl scripts to be UTF-8.

The encoding module has been deprecated since perl 5.18 and the perl internals it requires have been removed with perl 5.26.

"use utf8" still needed to enable UTF-8 in scripts

If your Perl script is itself encoded in UTF-8, the "use utf8" pragma must be explicitly included to enable recognition of that (in string or regular expression literals, or in identifier names). This is the only time when an explicit "use utf8" is needed. (See utf8).

If a Perl script begins with the bytes that form the UTF-8 encoding of the Unicode BYTE ORDER MARK ("BOM", see “Unicode Encodings”), those bytes are completely ignored.
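The effect on string literals can be seen with a short experiment. This sketch assumes the file itself is saved as UTF-8, so that the literal below occupies two bytes in the source.

```perl
use strict;
use warnings;

# Without "use utf8", the two UTF-8 bytes of the literal are two characters.
my $without_pragma = do { no utf8; length("é") };

# With "use utf8", the same literal is decoded into a single character.
my $with_pragma = do { use utf8; length("é") };

print "$without_pragma $with_pragma\n";    # prints "2 1"
```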

UTF-16 scripts autodetected

If a Perl script begins with the Unicode "BOM" (UTF-16LE, UTF-16BE), or if the script looks like non-"BOM"-marked UTF-16 of either endianness, Perl will correctly read in the script as the appropriate Unicode encoding.

Byte and Character Semantics

Before Unicode, most encodings used 8 bits (a single byte) to encode each character. Thus a character was a byte, and a byte was a character, and there could be only 256 or fewer possible characters. “Byte Semantics” in the title of this section refers to this behavior. There was no need to distinguish between “Byte” and “Character”.

Then along comes Unicode which has room for over a million characters (and Perl allows for even more). This means that a character may require more than a single byte to represent it, and so the two terms are no longer equivalent. What matter are the characters as whole entities, and not usually the bytes that comprise them. That’s what the term “Character Semantics” in the title of this section refers to.

Perl had to change internally to decouple “bytes” from “characters”. It is important that you too change your ideas, if you haven’t already, so that “byte” and “character” no longer mean the same thing in your mind.

The basic building block of Perl strings has always been a “character”. The changes basically come down to the implementation no longer assuming that a character is always just a single byte.
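The distinction can be made concrete by comparing a string's length in characters with the length of its UTF-8 encoding. This is only a sketch; the sample string and the use of the core Encode module to obtain the bytes are choices made for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $s = "na\x{EF}ve \x{2603}";       # "naive" with i-diaeresis, space, SNOWMAN

my $characters = length $s;                      # counts whole characters
my $octets     = length encode('UTF-8', $s);     # counts bytes of the encoding

print "$characters characters, $octets bytes\n";
# prints "7 characters, 10 bytes"
```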

There are various things to note:

•

String handling functions, for the most part, continue to operate in terms of characters. "length()", for example, returns the number of characters in a string, just as before. But that number no longer is necessarily the same as the number of bytes in the string (there may be more bytes than characters). The other such functions include "chop()", "chomp()", "substr()", "pos()", "index()", "rindex()", "sort()", "sprintf()", and "write()".

The exceptions are:

•

the bit-oriented "vec"

•

the byte-oriented "pack"/"unpack" "C" format

However, the "W" specifier does operate on whole characters, as does the "U" specifier.

•

some operators that interact with the platform’s operating system

Operators dealing with filenames are examples.

•

when the functions are called from within the scope of the "use bytes" pragma

Likely, you should use this only for debugging anyway.

•

Strings—including hash keys—and regular expression patterns may contain characters that have ordinal values larger than 255.

If you use a Unicode editor to edit your program, Unicode characters may occur directly within the literal strings in UTF-8 encoding, or UTF-16. (The former requires a "use utf8", the latter may require a "BOM".)

“Creating Unicode” in perluniintro gives other ways to place non-ASCII characters in your strings.

•

The "chr()" and "ord()" functions work on whole characters.

•

Regular expressions match whole characters. For example, "." matches a whole character instead of only a single byte.

•

The "tr///" operator translates whole characters. (Note that the "tr///CU" functionality has been removed. For similar functionality to that, see "pack('U0', ...)" and "pack('C0', ...)").

•

"scalar reverse()" reverses by character rather than by byte.

•

The bit string operators, "& | ^ ~" and (starting in v5.22) "&. |. ^. ~." can operate on bit strings encoded in UTF-8, but this can give unexpected results if any of the strings contain code points above 0xFF. Starting in v5.28, it is a fatal error to have such an operand. Otherwise, the operation is performed on a non-UTF-8 copy of the operand. If you’re not sure about the encoding of a string, downgrade it before using any of these operators; you can use "utf8::downgrade()".

The bottom line is that Perl has always practiced “Character Semantics”, but with the advent of Unicode, that is now different than “Byte Semantics”.
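Several of the points above can be checked with a short script. The following is a sketch only; the sample string and variable names are made up for the example, and the bit string part uses the utf8 module's downgrade() function on a string known to contain no code points above 0xFF.

```perl
use strict;
use warnings;

my $s = "ab\x{263A}";                  # 'a', 'b', WHITE SMILING FACE

my $code_point  = ord(chr(0x263A));    # chr/ord round-trip a whole character
my $reversed    = scalar reverse $s;   # "\x{263A}ba": reversed by character
my ($last_char) = $s =~ /(.)\z/;       # "." matches the whole smiley
my @ordinals    = unpack 'W*', $s;     # (0x61, 0x62, 0x263A)

# Bit string operators want byte strings; downgrade first if unsure.
my $bits = "AB";
utf8::upgrade($bits);                  # simulate a string that arrived UTF-8 encoded
utf8::downgrade($bits);                # back to bytes (dies if any character > 0xFF)
my $lowered = $bits | "  ";            # OR with 0x20 sets the ASCII lowercase bit: "ab"

printf "%X %d %s\n", $code_point, scalar @ordinals, $lowered;
# prints "263A 3 ab"
```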

ASCII Rules versus Unicode Rules

Before Unicode, when a character was a byte was a character, Perl knew only about the 128 characters defined by ASCII, code points 0 through 127 (except for under "use locale"). That left the code points 128 to 255 as unassigned, and available for whatever use a program might want. The only semantics they have is their ordinal numbers, and that they are members of none of the non-negative character classes. None are considered to match "\w" for example, but all match "\W".

Unicode, of course, assigns each of those code points a particular meaning (along with ones above 255). To preserve backward compatibility, Perl only uses the Unicode meanings when there is some indication that Unicode is what is intended; otherwise the non-ASCII code points remain treated as if they are unassigned.
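The difference between the two sets of rules shows up in character class matching. In this sketch, assuming an ASCII platform, the "/d" modifier spells out the backward-compatible default explicitly and "/u" requests Unicode rules; the test character is 0xE9, which Unicode defines as LATIN SMALL LETTER E WITH ACUTE.

```perl
use strict;
use warnings;

my $e_acute = "\xE9";    # a byte string holding one non-ASCII code point

# Old default rules: the code point is treated as unassigned.
my $word_ascii_rules    = ($e_acute =~ /\w/d) ? 1 : 0;   # 0
my $nonword_ascii_rules = ($e_acute =~ /\W/d) ? 1 : 0;   # 1

# Unicode rules: the code point is a letter.
my $word_unicode_rules  = ($e_acute =~ /\w/u) ? 1 : 0;   # 1

print "$word_ascii_rules $nonword_ascii_rules $word_unicode_rules\n";
# prints "0 1 1"
```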

Here are the ways that Perl knows that a string should be treated as Unicode:

•

Within the scope of "use utf8"

If the whole program is Unicode (signified by using 8-bit Unicode Transformation Format), then all literal strings within it must be Unicode.

•

Within the scope of "use feature 'unicode_strings'"

This pragma was created so you can explicitly tell Perl that operations executed within its scope are to use Unicode rules. More operations are affected with newer perls. See “The "Unicode Bug"”.

•

Within the scope of "use 5.012" or higher

This implicitly turns on "use feature 'unicode_strings'".

•

Within the scope of "use locale 'not_characters'", or "use locale" and the current locale is a UTF-8 locale.

The former is defined to imply Unicode handling; and the latter indicates a Unicode locale, hence a Unicode interpretation of all strings within it.

•

When the string contains a Unicode-only code point

Perl has never accepted code points above 255 without them being Unicode, so their use implies Unicode for the whole string.

•

When the string contains a Unicode named code point "\N{...}"

The "\N{...}" construct explicitly refers to a Unicode code point, even if it is one that is also in ASCII. Therefore the string containing it must be Unicode.

•

When the string has come from an external source marked as Unicode

The "-C" command line option can specify that certain inputs to the program are Unicode, and the values of this can be read by your Perl code; see “${^UNICODE}” in perlvar.

•

When the string has been upgraded to UTF-8

The function "utf8::upgrade()" can be explicitly used to permanently (unless a subsequent "utf8::downgrade()" is called) cause a string to be treated as Unicode.

•

There are additional methods for regular expression patterns

A pattern that is compiled with the "/u" or "/a" modifiers is treated as Unicode (though there are some restrictions with "/a"). Under the "/d" and "/l" modifiers, there are several other indications for Unicode; see “Character set modifiers” in perlre.

Note that all of the above are overridden within the scope of "use bytes"; but you should be using this pragma only for debugging.

Note also that some interactions with the platform’s operating system never use Unicode rules.
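A few of the string-level indications listed above can be compared side by side. This sketch assumes an ASCII platform and uses a hypothetical helper, is_word_start(), that matches under the backward-compatible "/d" rules, so that only the string itself decides whether Unicode rules apply.

```perl
use strict;
use warnings;
use charnames ':full';    # needed for \N{name} before v5.16; harmless later

# Hypothetical helper: does the string start with a \w character under /d rules?
sub is_word_start { return $_[0] =~ /^\w/d ? 1 : 0 }

my $bytes    = "\xE9";                                  # no indication of Unicode
my $named    = "\N{LATIN SMALL LETTER E WITH ACUTE}";   # \N{...} implies Unicode
my $wide     = "\xE9\x{263A}";                          # contains a code point > 255
my $upgraded = "\xE9";
utf8::upgrade($upgraded);                               # explicitly upgraded to UTF-8

my @results = map { is_word_start($_) } $bytes, $named, $wide, $upgraded;
print "@results\n";                                     # prints "0 1 1 1"

# The upgraded string still holds the same character as the byte string.
my $same_characters = ($bytes eq $upgraded) ? 1 : 0;    # 1
```

All four strings begin with the same code point; only the way each string came into being differs, which is the inconsistency that "use feature 'unicode_strings'" is meant to remove.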

When Unicode rules are in effect:

•

Case translation operators use the Unicode case translation tables.

Note that "uc()", or "
