perl592delta (1) Linux Manual Page
NAME
perl592delta – what is new for perl v5.9.2
DESCRIPTION
This document describes differences between the 5.9.1 and the 5.9.2 development releases. See perl590delta and perl591delta for the differences between 5.8.0 and 5.9.1.
Incompatible Changes
Packing and UTF-8 strings
The semantics of pack() and unpack() regarding UTF-8-encoded data have been changed. Processing is now by default character by character instead of byte by byte on the underlying encoding. Notably, code that used things like "pack("a*", $string)" to see through the encoding of a string will now simply get back the original $string. Packed strings can also get upgraded during processing when you store upgraded characters. You can get the old behaviour with "use bytes".
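As a minimal sketch of this change, assuming a perl with the new character-mode semantics (5.9.2 or later); the variable names are illustrative only:

```perl
use strict;
use warnings;

# $string holds a character above 255, so perl stores it UTF-8-encoded internally.
my $string = "hi \x{263A}";

# New semantics: "a*" works on characters, so the string comes back unchanged.
my $packed = pack("a*", $string);
print "unchanged\n" if $packed eq $string;

# Old behaviour: under "use bytes", pack() sees the underlying UTF-8 bytes.
my $bytes = do { use bytes; pack("a*", $string) };

# 4 characters ("h", "i", " ", U+263A) versus 6 bytes (3 ASCII + 3 for U+263A).
printf "%d characters, %d bytes\n", length($packed), length($bytes);
```

The "use bytes" pragma is lexically scoped, so wrapping it in a do block keeps the byte semantics from leaking into the rest of the file.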
To be consistent with pack(), the "C0" in unpack() templates indicates that the data is to be processed in character mode, i.e. character by character; conversely, "U0" in unpack() indicates UTF-8 mode, where the packed string is processed in its UTF-8-encoded Unicode form on a byte by byte basis. This is the reverse of the behaviour in perl 5.8.X.
Moreover, "C0" and "U0" can also be used in pack() templates to specify character and byte modes, respectively.
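The two modes can be illustrated with a short sketch, assuming perl 5.9.2 or later; the sample string and variable names are chosen for illustration and are not part of the release notes:

```perl
use strict;
use warnings;

my $chars = "\x{3B1}\x{3C9}";   # GREEK SMALL LETTER ALPHA and OMEGA

# C0: character mode, each element is a whole character.
my ($by_char) = unpack("C0a*", $chars);

# U0: UTF-8 mode, each element is one byte of the UTF-8 encoding.
my ($by_byte) = unpack("U0a*", $chars);

printf "C0: %vX\n", $by_char;   # C0: 3B1.3C9
printf "U0: %vX\n", $by_byte;   # U0: CE.B1.CF.89

# In pack(), U0 treats the supplied data as UTF-8 bytes, so packing the
# encoded bytes yields the two original characters again.
my $decoded = pack("U0a*", "\xCE\xB1\xCF\x89");
print "round trip ok\n" if $decoded eq $chars;
```

The "%vX" format prints the ordinal of every character in the string, which makes the difference between two characters and four bytes visible without emitting any wide characters.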
"C0" and "U0" in the middle of a pack or unpack format now switch to the specified encoding mode, honoring parens grouping. Previously, parens were ignored.
Also, there is a new pack() character format, "W", which is intended to replace the old "C". "C" is kept for unsigned chars coded as bytes in the string's internal representation. "W" represents unsigned (logical) character values, which can be greater than 255. It is therefore more robust when dealing with potentially UTF-8-encoded data (as "C" will wrap values outside the range 0..255, and not respect the string encoding).
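A brief sketch of the difference between "W" and "C", assuming perl 5.9.2 or later; the values are arbitrary examples:

```perl
use strict;
use warnings;

# "W" stores logical character values, including ones above 255.
my $wide   = pack("W*", 0x41, 0x263A);
my @values = unpack("W*", $wide);
printf "W: %s\n", join(" ", map { sprintf "0x%X", $_ } @values);   # W: 0x41 0x263A

# "C" is limited to 0..255: a larger value wraps to its low byte
# (and warns when the 'pack' warning category is enabled).
my $wrapped = do { no warnings 'pack'; pack("C", 0x263A) };
printf "C: 0x%X\n", ord($wrapped);   # C: 0x3A
```

The 'pack' warnings are disabled only around the "C" call here so that the wrap can be shown quietly; in real code the warning is usually worth keeping.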
In practice, that means that pack formats are now encoding-neutral, except "C".
For consistency, "A" in unpack() format now trims all Unicode whitespace from the end of the string. Before perl 5.9.2, it used to strip only the classical ASCII space characters.
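The trimming behaviour of "A" can be sketched as follows, assuming perl 5.9.2 or later; the sample string is illustrative and ends in one ASCII space followed by two non-ASCII Unicode spaces:

```perl
use strict;
use warnings;

# Trailing SPACE, EM SPACE (U+2003) and IDEOGRAPHIC SPACE (U+3000).
my $padded = "abc \x{2003}\x{3000}";

# "A" strips all trailing whitespace, including the Unicode spaces.
my ($trimmed) = unpack("A*", $padded);

# "a" returns the data as is, for comparison.
my ($kept) = unpack("a*", $padded);

printf "A*: %d characters\n", length($trimmed);   # A*: 3 characters
printf "a*: %d characters\n", length($kept);      # a*: 6 characters
```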
Miscellaneous
The internal dump output has been improved, so that non-printable characters such as newline and backspace are output in "
