perlunicook (1) Linux Manual Page
NAME
perlunicook – cookbookish examples of handling Unicode in Perl
DESCRIPTION
This manpage contains short recipes demonstrating how to handle common Unicode operations in Perl, plus one complete program at the end. Any undeclared variables in individual recipes are assumed to have a previous appropriate value in them.
EXAMPLES
℞ 0: Standard preamble
Unless otherwise noted, all examples below require this standard preamble to work correctly, with the "#!" adjusted to work on your system:
#!/usr/bin/env perl

use utf8;      # so literals and identifiers can be in UTF-8
use v5.12;     # or later to get "unicode_strings" feature
use strict;    # quote strings, declare variables
use warnings;  # on by default
use warnings  qw(FATAL utf8);    # fatalize encoding glitches
use open      qw(:std :encoding(UTF-8)); # undeclared streams in UTF-8
use charnames qw(:full :short);  # unneeded in v5.16
This does force even Unix programmers to "binmode" their binary streams, or to open them with ":raw", but that’s the only way to get at them portably anyway.
WARNING: "use autodie" (pre 2.26) and "use open" do not get along with each other.
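As a minimal sketch of what handling a binary stream looks like while the preamble's "use open" is in effect, the following uses both approaches mentioned above: "binmode" on a handle that is already open, and ":raw" in the open mode. The scratch file from File::Temp and the four sample bytes are assumptions made purely for illustration.

```perl
use v5.12;
use strict;
use warnings;
use open qw(:std :encoding(UTF-8));   # same default layers as the preamble
use File::Temp qw(tempfile);

# Scratch file for the demonstration (an assumption, not part of the recipe).
my ($out, $path) = tempfile(UNLINK => 1);

# Way 1: binmode an already-open handle before writing bytes to it.
binmode($out, ":raw") or die "can't binmode $path: $!";
print {$out} "\xE2\x82\xAC\xFF";
close($out) or die "can't close $path: $!";

# Way 2: name ":raw" in the open mode, so the default UTF-8 layer is not applied.
open(my $in, "<:raw", $path) or die "can't open $path: $!";
my $bytes = do { local $/; <$in> };
close($in) or die "can't close $path: $!";

# The four bytes come back undecoded.
printf "%d bytes: %vX\n", length($bytes), $bytes;
```

The last byte, 0xFF, can never occur in well-formed UTF-8, which is why a stream like this one has to bypass the decoding layer.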
℞ 1: Generic Unicode-savvy filter
Always decompose on the way in, then recompose on the way out.
use Unicode::Normalize;
while (<>) {
    $_ = NFD($_);   # decompose + reorder canonically
    ...
} continue {
    print NFC($_);  # recompose (where possible) + reorder canonically
}
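To see why the filter normalizes at both ends, here is a small sketch that compares two spellings of the same word. The sample strings are assumptions chosen for illustration; "NFD" and "NFC" are the functions exported by Unicode::Normalize.

```perl
use v5.12;
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);

# Two spellings of the same word: precomposed U+00E9, and "e" plus U+0301.
my $composed   = "caf\x{E9}";
my $decomposed = "cafe\x{301}";

# The raw strings differ, but their normalized forms agree.
my $raw_equal = $composed eq $decomposed           ? 1 : 0;
my $nfd_equal = NFD($composed) eq NFD($decomposed) ? 1 : 0;
my $nfc_equal = NFC($composed) eq NFC($decomposed) ? 1 : 0;

say "raw equal: $raw_equal, NFD equal: $nfd_equal, NFC equal: $nfc_equal";
say "NFD length: ", length(NFD($composed)),
    ", NFC length: ", length(NFC($decomposed));
```

Working on the decomposed form inside the loop means the code sees one consistent representation, whichever spelling arrived on input.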
℞ 2: Fine-tuning Unicode warnings
As of v5.14, Perl distinguishes three subclasses of UTF‑8 warnings.
use v5.14;                  # subwarnings unavailable any earlier
no warnings "nonchar";      # the 66 forbidden non-characters
no warnings "surrogate";    # UTF-16/CESU-8 nonsense
no warnings "non_unicode";  # for codepoints over 0x10_FFFF
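Because these pragmas are lexically scoped, a subclass can be switched off for a single block while the rest stay fatal. The sketch below assumes an in-memory filehandle with a ":utf8" layer as a stand-in for a real output stream; the variable names are illustrative only.

```perl
use v5.14;
use strict;
use warnings;
use warnings qw(FATAL utf8);

my $beyond = chr(0x11_0000);   # first code point past the Unicode range

# An in-memory handle with a UTF-8 layer keeps the demonstration self-contained.
open(my $fh, ">:utf8", \my $buffer)
    or die "can't open in-memory handle: $!";

# With every utf8 subclass fatal, printing the character dies.
my $died_outside = eval { print {$fh} $beyond; 1 } ? 0 : 1;

# Inside this block only the "non_unicode" subclass is switched off.
my $died_inside;
{
    no warnings "non_unicode";
    $died_inside = eval { print {$fh} $beyond; 1 } ? 0 : 1;
}
close($fh) or die "can't close in-memory handle: $!";

say "outside the block: ", ($died_outside ? "died" : "printed");
say "inside the block:  ", ($died_inside  ? "died" : "printed");
```

Keeping the "no warnings" line inside the smallest block that needs it leaves the other encoding checks active everywhere else.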
℞ 3: Declare source in utf8 for identifiers and literals
Without the all-critical "use utf8" declaration, putting UTF‑8 in your literals and identifiers won’t work right. If you used the standard preamble just given above, this already happened. If you did, you can do things like this:
use utf8;
my $measure = "Ångström";
my @μsoft = qw( cp852 cp1251 cp1252 );
my @ὑπέρμεγας = qw( ὑπέρ μεγας );
my @鯉 = qw( koi8-f koi8-u koi8-r );
my $motto = "👪 💗 🐪"; # FAMILY, GROWING HEART, DROMEDARY CAMEL
If you forget "use utf8", high bytes will be misunderstood as separate characters, and nothing will work right.
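The difference can be measured directly. This sketch assumes the file is saved as UTF-8 and uses "no utf8" inside a block to imitate a forgotten declaration; the word is borrowed from the example above.

```perl
use v5.12;
use strict;
use warnings;
use utf8;   # this file is assumed to be saved as UTF-8

# Under "use utf8" the literal is five characters.
my $chars = length("μεγας");

# With the pragma lexically switched off, the same literal is ten separate bytes.
my $bytes = do { no utf8; length("μεγας") };

say "with use utf8: $chars characters";
say "without it:    $bytes bytes";
```

Each Greek letter takes two bytes in UTF-8, so every character is counted twice once the declaration is missing.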
℞ 4: Characters and their numbers
The "ord" and "chr" functions work transparently on all codepoints, not just on ASCII, and in fact not even just on Unicode.
# ASCII characters
ord("A")
chr(65)
# characters from the Basic Multilingual Plane
ord("Σ")
chr(0x3A3)
# beyond the BMP
ord("𝑛") # MATHEMATICAL ITALIC SMALL N
chr(0x1D45B)
# beyond Unicode! (up to MAXINT)
ord("
