1N/A=head1 NAME
1N/A
1N/Aperlunicode - Unicode support in Perl
1N/A
1N/A=head1 DESCRIPTION
1N/A
1N/A=head2 Important Caveats
1N/A
1N/AUnicode support is an extensive requirement. While Perl does not
1N/Aimplement the Unicode standard or the accompanying technical reports
1N/Afrom cover to cover, Perl does support many Unicode features.
1N/A
1N/A=over 4
1N/A
1N/A=item Input and Output Layers
1N/A
1N/APerl knows when a filehandle uses Perl's internal Unicode encodings
1N/A(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
1N/Athe ":utf8" layer.  Other encodings can be converted to Perl's
1N/Aencoding on input or from Perl's encoding on output by use of the
1N/A":encoding(...)"  layer.  See L<open>.
1N/A
1N/ATo indicate that Perl source itself is using a particular encoding,
1N/Asee L<encoding>.
1N/A
1N/A=item Regular Expressions
1N/A
1N/AThe regular expression compiler produces polymorphic opcodes.  That is,
1N/Athe pattern adapts to the data and automatically switches to the Unicode
1N/Acharacter scheme when presented with Unicode data--or instead uses
1N/Aa traditional byte scheme when presented with byte data.
1N/A
1N/A=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts
1N/A
1N/AAs a compatibility measure, the C<use utf8> pragma must be explicitly
1N/Aincluded to enable recognition of UTF-8 in the Perl scripts themselves
1N/A(in string or regular expression literals, or in identifier names) on
1N/AASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
1N/Amachines.  B<These are the only times when an explicit C<use utf8>
1N/Ais needed.>  See L<utf8>.
1N/A
1N/AYou can also use the C<encoding> pragma to change the default encoding
1N/Aof the data in your script; see L<encoding>.
1N/A
1N/A=item BOM-marked scripts and UTF-16 scripts autodetected
1N/A
If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF-16BE,
or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either
endianness, Perl will correctly read in the script as Unicode.
(BOMless UTF-8 cannot be effectively recognized or differentiated from
ISO 8859-1 or other eight-bit encodings.)
1N/A
1N/A=item C<use encoding> needed to upgrade non-Latin-1 byte strings
1N/A
By default, there is a fundamental asymmetry in Perl's Unicode model:
implicit upgrading from byte strings to Unicode strings assumes that
they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
downgraded with UTF-8 encoding.  This happens because the first 256
code points in Unicode happen to agree with Latin-1.
1N/A
1N/AIf you wish to interpret byte strings as UTF-8 instead, use the
1N/AC<encoding> pragma:
1N/A
1N/A    use encoding 'utf8';
1N/A
1N/ASee L</"Byte and Character Semantics"> for more details.
1N/A
1N/A=back
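The layer mechanism described under "Input and Output Layers" above can
be sketched as follows.  This is a minimal illustration, not a recipe:
the sample text, the temporary file, and the variable names are
assumptions made for the example, and the byte count assumes an
ASCII-based platform.

```perl

    use strict;
    use warnings;
    use File::Temp qw(tempfile);

    # Hypothetical sample data: "caf", e-acute, a space, and U+263A.
    my $text = "caf\x{E9} \x{263A}";

    my ($tmp, $path) = tempfile(UNLINK => 1);
    close $tmp or die "close: $!";

    # Writing through the layer converts Perl's characters to UTF-8 bytes.
    open(my $out, '>:encoding(UTF-8)', $path) or die "open for write: $!";
    print {$out} $text;
    close $out or die "close: $!";

    # Without a decoding layer the same file is read as raw bytes.
    open(my $raw, '<:raw', $path) or die "open raw: $!";
    my $bytes = do { local $/; <$raw> };
    close $raw;

    # With the layer the bytes are decoded back into characters.
    open(my $in, '<:encoding(UTF-8)', $path) or die "open for read: $!";
    my $chars = do { local $/; <$in> };
    close $in;

    printf "%d bytes on disk, %d characters after decoding\n",
        length($bytes), length($chars);

```

The two lengths differ because the two non-ASCII characters occupy more
than one byte each once encoded.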
1N/A
1N/A=head2 Byte and Character Semantics
1N/A
1N/ABeginning with version 5.6, Perl uses logically-wide characters to
1N/Arepresent strings internally.
1N/A
1N/AIn future, Perl-level operations will be expected to work with
1N/Acharacters rather than bytes.
1N/A
1N/AHowever, as an interim compatibility measure, Perl aims to
1N/Aprovide a safe migration path from byte semantics to character
1N/Asemantics for programs.  For operations where Perl can unambiguously
1N/Adecide that the input data are characters, Perl switches to
1N/Acharacter semantics.  For operations where this determination cannot
1N/Abe made without additional information from the user, Perl decides in
1N/Afavor of compatibility and chooses to use byte semantics.
1N/A
This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations only if
none of the program's inputs were marked as being a source of Unicode
character data.  Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as %ENV),
or from literals and constants in the source text.
1N/A
1N/AThe C<bytes> pragma will always, regardless of platform, force byte
1N/Asemantics in a particular lexical scope.  See L<bytes>.
1N/A
1N/AThe C<utf8> pragma is primarily a compatibility device that enables
1N/Arecognition of UTF-(8|EBCDIC) in literals encountered by the parser.
1N/ANote that this pragma is only required while Perl defaults to byte
1N/Asemantics; when character semantics become the default, this pragma
1N/Amay become a no-op.  See L<utf8>.
1N/A
1N/AUnless explicitly stated, Perl operators use character semantics
1N/Afor Unicode data and byte semantics for non-Unicode data.
1N/AThe decision to use character semantics is made transparently.  If
1N/Ainput data comes from a Unicode source--for example, if a character
1N/Aencoding layer is added to a filehandle or a literal Unicode
1N/Astring constant appears in a program--character semantics apply.
1N/AOtherwise, byte semantics are in effect.  The C<bytes> pragma should
1N/Abe used to force byte semantics on Unicode data.
1N/A
1N/AIf strings operating under byte semantics and strings with Unicode
1N/Acharacter data are concatenated, the new string will be created by
1N/Adecoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
1N/Aold Unicode string used EBCDIC.  This translation is done without
1N/Aregard to the system's native 8-bit encoding.  To change this for
1N/Asystems with non-Latin-1 and non-EBCDIC native encodings, use the
1N/AC<encoding> pragma.  See L<encoding>.
1N/A
Under character semantics, many operations that formerly operated on
bytes now operate on characters. A character in Perl is
logically just a number ranging from 0 to 2**31 or so. Larger
characters may encode into longer sequences of bytes internally, but
this internal detail is mostly hidden from Perl code.
See L<perluniintro> for more.
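As a small sketch of the difference between the two semantics, the
following compares C<length()> with and without the C<bytes> pragma and
shows the Latin-1 upgrade on concatenation.  The variable names are
illustrative only, and the byte count assumes an ASCII-based platform
where the internal encoding is UTF-8.

```perl

    use strict;
    use warnings;

    my $smiley = "\x{263A}";        # one character, code point above 255

    my $char_len = length $smiley;                      # character semantics
    my $byte_len = do { use bytes; length $smiley };    # byte semantics

    # Concatenating a byte string with a Unicode string upgrades the byte
    # string as if it were Latin-1, so 0xE9 stays the one character U+00E9.
    my $latin1 = "\xE9";
    my $joined = $latin1 . $smiley;

    printf "chars=%d bytes=%d joined=%d first=U+%04X\n",
        $char_len, $byte_len, length($joined), ord($joined);

```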
1N/A
1N/A=head2 Effects of Character Semantics
1N/A
1N/ACharacter semantics have the following effects:
1N/A
1N/A=over 4
1N/A
1N/A=item *
1N/A
1N/AStrings--including hash keys--and regular expression patterns may
1N/Acontain characters that have an ordinal value larger than 255.
1N/A
1N/AIf you use a Unicode editor to edit your program, Unicode characters
1N/Amay occur directly within the literal strings in one of the various
1N/AUnicode encodings (UTF-8, UTF-EBCDIC, UCS-2, etc.), but will be recognized
1N/Aas such and converted to Perl's internal representation only if the
1N/Aappropriate L<encoding> is specified.
1N/A
Unicode characters can also be added to a string by using the
C<\x{...}> notation.  The Unicode code for the desired character, in
hexadecimal, should be placed in the braces. For instance, a smiley
face is C<\x{263A}>.  The notation accepts any code point, but only
values of 0x100 or above force the string to hold Unicode characters;
for values below 0x100 Perl may keep an 8-bit encoding internally.
1N/A
1N/AAdditionally, if you
1N/A
1N/A   use charnames ':full';
1N/A
1N/Ayou can use the C<\N{...}> notation and put the official Unicode
1N/Acharacter name within the braces, such as C<\N{WHITE SMILING FACE}>.
1N/A
1N/A
1N/A=item *
1N/A
1N/AIf an appropriate L<encoding> is specified, identifiers within the
1N/APerl script may contain Unicode alphanumeric characters, including
1N/Aideographs.  Perl does not currently attempt to canonicalize variable
1N/Anames.
1N/A
1N/A=item *
1N/A
Regular expressions match characters instead of bytes.  "." matches
a character instead of a byte.  The C<\C> pattern is provided to force
a match of a single byte--a C<char> in C, hence C<\C>.
1N/A
1N/A=item *
1N/A
1N/ACharacter classes in regular expressions match characters instead of
1N/Abytes and match against the character properties specified in the
1N/AUnicode properties database.  C<\w> can be used to match a Japanese
1N/Aideograph, for instance.
1N/A
1N/A(However, and as a limitation of the current implementation, using
1N/AC<\w> or C<\W> I<inside> a C<[...]> character class will still match
1N/Awith byte semantics.)
1N/A
1N/A=item *
1N/A
1N/ANamed Unicode properties, scripts, and block ranges may be used like
1N/Acharacter classes via the C<\p{}> "matches property" construct and
1N/Athe  C<\P{}> negation, "doesn't match property".
1N/A
For instance, C<\p{Lu}> matches any character with the Unicode "Lu"
(Letter, uppercase) property, while C<\p{M}> matches any character
with an "M" (mark--accents and such) property.  Braces are not
required for single-letter properties, so C<\p{M}> is equivalent to
C<\pM>. Many predefined properties are available, such as
C<\p{Mirrored}> and C<\p{Tibetan}>.
1N/A
1N/AThe official Unicode script and block names have spaces and dashes as
1N/Aseparators, but for convenience you can use dashes, spaces, or
1N/Aunderbars, and case is unimportant. It is recommended, however, that
1N/Afor consistency you use the following naming: the official Unicode
1N/Ascript, property, or block name (see below for the additional rules
1N/Athat apply to block names) with whitespace and dashes removed, and the
1N/Awords "uppercase-first-lowercase-rest". C<Latin-1 Supplement> thus
1N/Abecomes C<Latin1Supplement>.
1N/A
1N/AYou can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
1N/A(^) between the first brace and the property name: C<\p{^Tamil}> is
1N/Aequal to C<\P{Tamil}>.
1N/A
1N/AB<NOTE: the properties, scripts, and blocks listed here are as of
1N/AUnicode 3.2.0, March 2002, or Perl 5.8.0, July 2002.  Unicode 4.0.0
1N/Acame out in April 2003, and Perl 5.8.1 in September 2003.>
1N/A
1N/AHere are the basic Unicode General Category properties, followed by their
1N/Along form.  You can use either; C<\p{Lu}> and C<\p{UppercaseLetter}>,
1N/Afor instance, are identical.
1N/A
1N/A    Short       Long
1N/A
1N/A    L           Letter
1N/A    Lu          UppercaseLetter
1N/A    Ll          LowercaseLetter
1N/A    Lt          TitlecaseLetter
1N/A    Lm          ModifierLetter
1N/A    Lo          OtherLetter
1N/A
1N/A    M           Mark
1N/A    Mn          NonspacingMark
1N/A    Mc          SpacingMark
1N/A    Me          EnclosingMark
1N/A
1N/A    N           Number
1N/A    Nd          DecimalNumber
1N/A    Nl          LetterNumber
1N/A    No          OtherNumber
1N/A
1N/A    P           Punctuation
1N/A    Pc          ConnectorPunctuation
1N/A    Pd          DashPunctuation
1N/A    Ps          OpenPunctuation
1N/A    Pe          ClosePunctuation
1N/A    Pi          InitialPunctuation
1N/A                (may behave like Ps or Pe depending on usage)
1N/A    Pf          FinalPunctuation
1N/A                (may behave like Ps or Pe depending on usage)
1N/A    Po          OtherPunctuation
1N/A
1N/A    S           Symbol
1N/A    Sm          MathSymbol
1N/A    Sc          CurrencySymbol
1N/A    Sk          ModifierSymbol
1N/A    So          OtherSymbol
1N/A
1N/A    Z           Separator
1N/A    Zs          SpaceSeparator
1N/A    Zl          LineSeparator
1N/A    Zp          ParagraphSeparator
1N/A
1N/A    C           Other
1N/A    Cc          Control
1N/A    Cf          Format
1N/A    Cs          Surrogate   (not usable)
1N/A    Co          PrivateUse
1N/A    Cn          Unassigned
1N/A
1N/ASingle-letter properties match all characters in any of the
1N/Atwo-letter sub-properties starting with the same letter.
1N/AC<L&> is a special case, which is an alias for C<Ll>, C<Lu>, and C<Lt>.
1N/A
1N/ABecause Perl hides the need for the user to understand the internal
1N/Arepresentation of Unicode characters, there is no need to implement
1N/Athe somewhat messy concept of surrogates. C<Cs> is therefore not
1N/Asupported.
1N/A
1N/ABecause scripts differ in their directionality--Hebrew is
1N/Awritten right to left, for example--Unicode supplies these properties:
1N/A
1N/A    Property    Meaning
1N/A
1N/A    BidiL       Left-to-Right
1N/A    BidiLRE     Left-to-Right Embedding
1N/A    BidiLRO     Left-to-Right Override
1N/A    BidiR       Right-to-Left
1N/A    BidiAL      Right-to-Left Arabic
1N/A    BidiRLE     Right-to-Left Embedding
1N/A    BidiRLO     Right-to-Left Override
1N/A    BidiPDF     Pop Directional Format
1N/A    BidiEN      European Number
1N/A    BidiES      European Number Separator
1N/A    BidiET      European Number Terminator
1N/A    BidiAN      Arabic Number
1N/A    BidiCS      Common Number Separator
1N/A    BidiNSM     Non-Spacing Mark
1N/A    BidiBN      Boundary Neutral
1N/A    BidiB       Paragraph Separator
1N/A    BidiS       Segment Separator
1N/A    BidiWS      Whitespace
1N/A    BidiON      Other Neutrals
1N/A
1N/AFor example, C<\p{BidiR}> matches characters that are normally
1N/Awritten right to left.
1N/A
1N/A=back
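The escape notations and General Category properties from the list above
can be tried out with a short script.  This is only a sketch: the sample
string and variable names are made up for illustration.

```perl

    use strict;
    use warnings;
    use charnames ':full';

    # The same character written by number and by official name.
    my $by_number = "\x{263A}";
    my $by_name   = "\N{WHITE SMILING FACE}";
    my $same      = ($by_number eq $by_name) ? 1 : 0;

    # Hypothetical sample: Latin capital A, Greek small alpha,
    # combining acute accent.
    my $sample = "A\x{3B1}\x{301}";

    my @upper_short = $sample =~ /(\p{Lu})/g;               # short form
    my @upper_long  = $sample =~ /(\p{UppercaseLetter})/g;  # long form
    my @marks       = $sample =~ /(\pM)/g;      # no braces for one letter
    my @not_upper_a = $sample =~ /(\P{Lu})/g;   # capital P negates
    my @not_upper_b = $sample =~ /(\p{^Lu})/g;  # so does the caret

    printf "same=%d upper=%d marks=%d not_upper=%d\n",
        $same, scalar @upper_short, scalar @marks, scalar @not_upper_a;

```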
1N/A
1N/A=head2 Scripts
1N/A
1N/AThe script names which can be used by C<\p{...}> and C<\P{...}>,
1N/Asuch as in C<\p{Latin}> or C<\p{Cyrillic}>, are as follows:
1N/A
1N/A    Arabic
1N/A    Armenian
1N/A    Bengali
1N/A    Bopomofo
1N/A    Buhid
1N/A    CanadianAboriginal
1N/A    Cherokee
1N/A    Cyrillic
1N/A    Deseret
1N/A    Devanagari
1N/A    Ethiopic
1N/A    Georgian
1N/A    Gothic
1N/A    Greek
1N/A    Gujarati
1N/A    Gurmukhi
1N/A    Han
1N/A    Hangul
1N/A    Hanunoo
1N/A    Hebrew
1N/A    Hiragana
1N/A    Inherited
1N/A    Kannada
1N/A    Katakana
1N/A    Khmer
1N/A    Lao
1N/A    Latin
1N/A    Malayalam
1N/A    Mongolian
1N/A    Myanmar
1N/A    Ogham
1N/A    OldItalic
1N/A    Oriya
1N/A    Runic
1N/A    Sinhala
1N/A    Syriac
1N/A    Tagalog
1N/A    Tagbanwa
1N/A    Tamil
1N/A    Telugu
1N/A    Thaana
1N/A    Thai
1N/A    Tibetan
1N/A    Yi
1N/A
1N/AExtended property classes can supplement the basic
1N/Aproperties, defined by the F<PropList> Unicode database:
1N/A
1N/A    ASCIIHexDigit
1N/A    BidiControl
1N/A    Dash
1N/A    Deprecated
1N/A    Diacritic
1N/A    Extender
1N/A    GraphemeLink
1N/A    HexDigit
1N/A    Hyphen
1N/A    Ideographic
1N/A    IDSBinaryOperator
1N/A    IDSTrinaryOperator
1N/A    JoinControl
1N/A    LogicalOrderException
1N/A    NoncharacterCodePoint
1N/A    OtherAlphabetic
1N/A    OtherDefaultIgnorableCodePoint
1N/A    OtherGraphemeExtend
1N/A    OtherLowercase
1N/A    OtherMath
1N/A    OtherUppercase
1N/A    QuotationMark
1N/A    Radical
1N/A    SoftDotted
1N/A    TerminalPunctuation
1N/A    UnifiedIdeograph
1N/A    WhiteSpace
1N/A
1N/Aand there are further derived properties:
1N/A
1N/A    Alphabetic      Lu + Ll + Lt + Lm + Lo + OtherAlphabetic
1N/A    Lowercase       Ll + OtherLowercase
1N/A    Uppercase       Lu + OtherUppercase
1N/A    Math            Sm + OtherMath
1N/A
1N/A    ID_Start        Lu + Ll + Lt + Lm + Lo + Nl
1N/A    ID_Continue     ID_Start + Mn + Mc + Nd + Pc
1N/A
1N/A    Any             Any character
1N/A    Assigned        Any non-Cn character (i.e. synonym for \P{Cn})
1N/A    Unassigned      Synonym for \p{Cn}
1N/A    Common          Any character (or unassigned code point)
1N/A                    not explicitly assigned to a script
1N/A
1N/AFor backward compatibility (with Perl 5.6), all properties mentioned
1N/Aso far may have C<Is> prepended to their name, so C<\P{IsLu}>, for
1N/Aexample, is equal to C<\P{Lu}>.
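As an illustration of script properties, the sketch below classifies a
few characters by script.  The helper C<script_of()> and the choice of
scripts are hypothetical and exist only for this example.

```perl

    use strict;
    use warnings;

    # A few of the script names listed above, each as a compiled pattern.
    my @scripts = (
        [ Latin    => qr/\A\p{Latin}\z/    ],
        [ Greek    => qr/\A\p{Greek}\z/    ],
        [ Cyrillic => qr/\A\p{Cyrillic}\z/ ],
        [ Common   => qr/\A\p{Common}\z/   ],
    );

    # Hypothetical helper: name of the first listed script that matches.
    sub script_of {
        my ($char) = @_;
        for my $entry (@scripts) {
            my ($name, $re) = @$entry;
            return $name if $char =~ $re;
        }
        return 'Other';
    }

    my @report = map { sprintf "U+%04X %s", ord($_), script_of($_) }
        ("a", "\x{3B1}", "\x{430}", "7");

    # The "Is" prefix is accepted as a synonym.
    my $is_prefix_ok = ("A" =~ /\p{IsLu}/ && "A" !~ /\P{IsLu}/) ? 1 : 0;

    print "$_\n" for @report;

```

The digit falls under C<Common> because digits are shared by many
scripts.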
1N/A
1N/A=head2 Blocks
1N/A
1N/AIn addition to B<scripts>, Unicode also defines B<blocks> of
1N/Acharacters.  The difference between scripts and blocks is that the
1N/Aconcept of scripts is closer to natural languages, while the concept
1N/Aof blocks is more of an artificial grouping based on groups of 256
1N/AUnicode characters. For example, the C<Latin> script contains letters
1N/Afrom many blocks but does not contain all the characters from those
1N/Ablocks. It does not, for example, contain digits, because digits are
1N/Ashared across many scripts. Digits and similar groups, like
1N/Apunctuation, are in a category called C<Common>.
1N/A
1N/AFor more about scripts, see the UTR #24:
1N/A
1N/A   http://www.unicode.org/unicode/reports/tr24/
1N/A
1N/AFor more about blocks, see:
1N/A
1N/A   http://www.unicode.org/Public/UNIDATA/Blocks.txt
1N/A
1N/ABlock names are given with the C<In> prefix. For example, the
1N/AKatakana block is referenced via C<\p{InKatakana}>.  The C<In>
1N/Aprefix may be omitted if there is no naming conflict with a script
1N/Aor any other property, but it is recommended that C<In> always be used
1N/Afor block tests to avoid confusion.
1N/A
1N/AThese block names are supported:
1N/A
1N/A    InAlphabeticPresentationForms
1N/A    InArabic
1N/A    InArabicPresentationFormsA
1N/A    InArabicPresentationFormsB
1N/A    InArmenian
1N/A    InArrows
1N/A    InBasicLatin
1N/A    InBengali
1N/A    InBlockElements
1N/A    InBopomofo
1N/A    InBopomofoExtended
1N/A    InBoxDrawing
1N/A    InBraillePatterns
1N/A    InBuhid
1N/A    InByzantineMusicalSymbols
1N/A    InCJKCompatibility
1N/A    InCJKCompatibilityForms
1N/A    InCJKCompatibilityIdeographs
1N/A    InCJKCompatibilityIdeographsSupplement
1N/A    InCJKRadicalsSupplement
1N/A    InCJKSymbolsAndPunctuation
1N/A    InCJKUnifiedIdeographs
1N/A    InCJKUnifiedIdeographsExtensionA
1N/A    InCJKUnifiedIdeographsExtensionB
1N/A    InCherokee
1N/A    InCombiningDiacriticalMarks
1N/A    InCombiningDiacriticalMarksforSymbols
1N/A    InCombiningHalfMarks
1N/A    InControlPictures
1N/A    InCurrencySymbols
1N/A    InCyrillic
1N/A    InCyrillicSupplementary
1N/A    InDeseret
1N/A    InDevanagari
1N/A    InDingbats
1N/A    InEnclosedAlphanumerics
1N/A    InEnclosedCJKLettersAndMonths
1N/A    InEthiopic
1N/A    InGeneralPunctuation
1N/A    InGeometricShapes
1N/A    InGeorgian
1N/A    InGothic
1N/A    InGreekExtended
1N/A    InGreekAndCoptic
1N/A    InGujarati
1N/A    InGurmukhi
1N/A    InHalfwidthAndFullwidthForms
1N/A    InHangulCompatibilityJamo
1N/A    InHangulJamo
1N/A    InHangulSyllables
1N/A    InHanunoo
1N/A    InHebrew
1N/A    InHighPrivateUseSurrogates
1N/A    InHighSurrogates
1N/A    InHiragana
1N/A    InIPAExtensions
1N/A    InIdeographicDescriptionCharacters
1N/A    InKanbun
1N/A    InKangxiRadicals
1N/A    InKannada
1N/A    InKatakana
1N/A    InKatakanaPhoneticExtensions
1N/A    InKhmer
1N/A    InLao
1N/A    InLatin1Supplement
1N/A    InLatinExtendedA
1N/A    InLatinExtendedAdditional
1N/A    InLatinExtendedB
1N/A    InLetterlikeSymbols
1N/A    InLowSurrogates
1N/A    InMalayalam
1N/A    InMathematicalAlphanumericSymbols
1N/A    InMathematicalOperators
1N/A    InMiscellaneousMathematicalSymbolsA
1N/A    InMiscellaneousMathematicalSymbolsB
1N/A    InMiscellaneousSymbols
1N/A    InMiscellaneousTechnical
1N/A    InMongolian
1N/A    InMusicalSymbols
1N/A    InMyanmar
1N/A    InNumberForms
1N/A    InOgham
1N/A    InOldItalic
1N/A    InOpticalCharacterRecognition
1N/A    InOriya
1N/A    InPrivateUseArea
1N/A    InRunic
1N/A    InSinhala
1N/A    InSmallFormVariants
1N/A    InSpacingModifierLetters
1N/A    InSpecials
1N/A    InSuperscriptsAndSubscripts
1N/A    InSupplementalArrowsA
1N/A    InSupplementalArrowsB
1N/A    InSupplementalMathematicalOperators
1N/A    InSupplementaryPrivateUseAreaA
1N/A    InSupplementaryPrivateUseAreaB
1N/A    InSyriac
1N/A    InTagalog
1N/A    InTagbanwa
1N/A    InTags
1N/A    InTamil
1N/A    InTelugu
1N/A    InThaana
1N/A    InThai
1N/A    InTibetan
1N/A    InUnifiedCanadianAboriginalSyllabics
1N/A    InVariationSelectors
1N/A    InYiRadicals
1N/A    InYiSyllables
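To make the script/block distinction concrete, here is a small sketch
that tests three characters against two blocks and one script.  The
choice of characters and the C<%membership> hash are assumptions made
for the example.

```perl

    use strict;
    use warnings;

    my %membership;
    for my $char ("a", "7", "\x{E9}") {
        my $in_basic = ($char =~ /\p{InBasicLatin}/)       ? 1 : 0;  # block
        my $in_supp  = ($char =~ /\p{InLatin1Supplement}/) ? 1 : 0;  # block
        my $is_latin = ($char =~ /\p{Latin}/)              ? 1 : 0;  # script
        $membership{ ord($char) } = "$in_basic$in_supp$is_latin";
        printf "U+%04X basic=%d supplement=%d latin_script=%d\n",
            ord($char), $in_basic, $in_supp, $is_latin;
    }

```

The digit is in the Basic Latin block but not in the Latin script, while
e-acute is in the Latin script but in a different block.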
1N/A
1N/A=over 4
1N/A
1N/A=item *
1N/A
1N/AThe special pattern C<\X> matches any extended Unicode
1N/Asequence--"a combining character sequence" in Standardese--where the
1N/Afirst character is a base character and subsequent characters are mark
1N/Acharacters that apply to the base character.  C<\X> is equivalent to
1N/AC<(?:\PM\pM*)>.
1N/A
1N/A=item *
1N/A
The C<tr///> operator translates characters instead of bytes.  Note
that the C<tr///CU> functionality has been removed.  For similar
functionality see C<pack('U0', ...)> and C<pack('C0', ...)>.
1N/A
1N/A=item *
1N/A
1N/ACase translation operators use the Unicode case translation tables
1N/Awhen character input is provided.  Note that C<uc()>, or C<\U> in
1N/Ainterpolated strings, translates to uppercase, while C<ucfirst>,
1N/Aor C<\u> in interpolated strings, translates to titlecase in languages
1N/Athat make the distinction.
1N/A
1N/A=item *
1N/A
Most operators that deal with positions or lengths in a string will
automatically switch to using character positions, including
C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
C<sprintf()>, C<write()>, and C<length()>.  Operators that
specifically do not switch include C<vec()>, C<pack()>, and
C<unpack()>.  Operators that really don't care include
operators that treat strings as a bucket of bits, such as C<sort()>,
and operators dealing with filenames.
1N/A
1N/A=item *
1N/A
1N/AThe C<pack()>/C<unpack()> letters C<c> and C<C> do I<not> change,
1N/Asince they are often used for byte-oriented formats.  Again, think
1N/AC<char> in the C language.
1N/A
1N/AThere is a new C<U> specifier that converts between Unicode characters
1N/Aand code points.
1N/A
1N/A=item *
1N/A
1N/AThe C<chr()> and C<ord()> functions work on characters, similar to
1N/AC<pack("U")> and C<unpack("U")>, I<not> C<pack("C")> and
1N/AC<unpack("C")>.  C<pack("C")> and C<unpack("C")> are methods for
1N/Aemulating byte-oriented C<chr()> and C<ord()> on Unicode strings.
1N/AWhile these methods reveal the internal encoding of Unicode strings,
1N/Athat is not something one normally needs to care about at all.
1N/A
1N/A=item *
1N/A
The bit string operators, C<& | ^ ~>, can operate on character data.
However, for backward compatibility with bit string operations on
characters whose ordinal values are all less than 256, one should not
use C<~> (the bit complement) on data that mixes characters below 256
with characters above 255.  Most importantly, DeMorgan's laws
(C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>) will not hold.
The reason for this mathematical I<faux pas> is that the complement
cannot return B<both> the 8-bit (byte-wide) bit complement B<and> the
full character-wide bit complement.
1N/A
1N/A=item *
1N/A
C<lc()>, C<uc()>, C<lcfirst()>, and C<ucfirst()> work for the following cases:
1N/A
1N/A=over 8
1N/A
1N/A=item *
1N/A
1N/Athe case mapping is from a single Unicode character to another
1N/Asingle Unicode character, or
1N/A
1N/A=item *
1N/A
1N/Athe case mapping is from a single Unicode character to more
1N/Athan one Unicode character.
1N/A
1N/A=back
1N/A
1N/AThings to do with locales (Lithuanian, Turkish, Azeri) do B<not> work
1N/Asince Perl does not understand the concept of Unicode locales.
1N/A
1N/ASee the Unicode Technical Report #21, Case Mappings, for more details.
1N/A
1N/A=back
1N/A
1N/A=over 4
1N/A
1N/A=item *
1N/A
1N/AAnd finally, C<scalar reverse()> reverses by character rather than by byte.
1N/A
1N/A=back
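Several of the effects listed above can be checked with a short script.
This is a sketch with made-up sample strings; it prints code points
rather than the characters themselves so that no output layer is needed.

```perl

    use strict;
    use warnings;

    my $greek = "\x{3B1}\x{3B2}\x{3B3}";     # alpha, beta, gamma

    # Positions and lengths count characters, not bytes.
    my $len      = length $greek;
    my $middle   = substr $greek, 1, 1;
    my $reversed = scalar reverse $greek;
    my $upper    = uc $greek;

    # uc() gives uppercase, ucfirst() gives titlecase where they differ.
    my $dz       = "\x{1C6}";
    my $uc_cp    = ord(uc $dz);
    my $title_cp = ord(ucfirst $dz);

    # A one-to-many case mapping; the string is marked as Unicode first.
    my $strasse = "stra\x{DF}e";
    utf8::upgrade($strasse);
    my $shout = uc $strasse;

    # pack/unpack "U" convert between code points and characters.
    my $packed = pack "U", 0x263A;
    my @cps    = unpack "U*", $greek;

    # \X keeps a base character together with its combining marks.
    my @graphemes = "e\x{301}x" =~ /(\X)/g;

    printf "len=%d uc=U+%04X title=U+%04X shout=%s graphemes=%d\n",
        $len, $uc_cp, $title_cp, $shout, scalar @graphemes;

```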
1N/A
1N/A=head2 User-Defined Character Properties
1N/A
1N/AYou can define your own character properties by defining subroutines
1N/Awhose names begin with "In" or "Is".  The subroutines must be defined
1N/Ain the C<main> package.  The user-defined properties can be used in the
1N/Aregular expression C<\p> and C<\P> constructs.  Note that the effect
1N/Ais compile-time and immutable once defined.
1N/A
1N/AThe subroutines must return a specially-formatted string, with one
1N/Aor more newline-separated lines.  Each line must be one of the following:
1N/A
1N/A=over 4
1N/A
1N/A=item *
1N/A
Two hexadecimal numbers separated by horizontal whitespace (space or
tab characters) denoting a range of Unicode code points to include.
1N/A
1N/A=item *
1N/A
1N/ASomething to include, prefixed by "+": a built-in character
1N/Aproperty (prefixed by "utf8::"), to represent all the characters in that
1N/Aproperty; two hexadecimal code points for a range; or a single
1N/Ahexadecimal code point.
1N/A
1N/A=item *
1N/A
1N/ASomething to exclude, prefixed by "-": an existing character
1N/Aproperty (prefixed by "utf8::"), for all the characters in that
1N/Aproperty; two hexadecimal code points for a range; or a single
1N/Ahexadecimal code point.
1N/A
1N/A=item *
1N/A
Something to negate, prefixed by "!": an existing character
property (prefixed by "utf8::") for all the characters except the
characters in the property; two hexadecimal code points for a range;
or a single hexadecimal code point.
1N/A
1N/A=back
1N/A
1N/AFor example, to define a property that covers both the Japanese
1N/Asyllabaries (hiragana and katakana), you can define
1N/A
1N/A    sub InKana {
1N/A    return <<END;
1N/A    3040\t309F
1N/A    30A0\t30FF
1N/A    END
1N/A    }
1N/A
1N/AImagine that the here-doc end marker is at the beginning of the line.
1N/ANow you can use C<\p{InKana}> and C<\P{InKana}>.
1N/A
1N/AYou could also have used the existing block property names:
1N/A
1N/A    sub InKana {
1N/A    return <<'END';
1N/A    +utf8::InHiragana
1N/A    +utf8::InKatakana
1N/A    END
1N/A    }
1N/A
1N/ASuppose you wanted to match only the allocated characters,
1N/Anot the raw block ranges: in other words, you want to remove
1N/Athe non-characters:
1N/A
1N/A    sub InKana {
1N/A    return <<'END';
1N/A    +utf8::InHiragana
1N/A    +utf8::InKatakana
1N/A    -utf8::IsCn
1N/A    END
1N/A    }
1N/A
1N/AThe negation is useful for defining (surprise!) negated classes.
1N/A
1N/A    sub InNotKana {
1N/A    return <<'END';
1N/A    !utf8::InHiragana
1N/A    -utf8::InKatakana
1N/A    +utf8::IsCn
1N/A    END
1N/A    }
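The property definitions above can be exercised as in the following
sketch.  To sidestep the here-doc end marker issue it builds the
returned string with C<join>; the name C<InAssignedKana> and the test
code points are assumptions chosen for the example.

```perl

    use strict;
    use warnings;

    # Raw block ranges, one "start<TAB>end" pair per line.
    sub InKana {
        return join "\n",
            "3040\t309F",
            "30A0\t30FF",
            "";
    }

    # The same blocks with unassigned code points removed.
    sub InAssignedKana {
        return join "\n",
            "+utf8::InHiragana",
            "+utf8::InKatakana",
            "-utf8::IsCn",
            "";
    }

    my %result;
    for my $cp (0x3042, 0x30AB, 0x3040, 0x61) {
        my $char = chr $cp;
        $result{$cp} = [
            ($char =~ /\p{InKana}/         ? 1 : 0),
            ($char =~ /\p{InAssignedKana}/ ? 1 : 0),
        ];
        printf "U+%04X InKana=%d InAssignedKana=%d\n", $cp, @{ $result{$cp} };
    }

```

U+3040 lies inside the Hiragana block but is not an assigned character,
so only the raw-range property matches it.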
1N/A
You can also define your own mappings to be used in C<lc()>,
C<lcfirst()>, C<uc()>, and C<ucfirst()> (or their string-inlined versions).
The principle is the same: define subroutines in the C<main> package
with names like C<ToLower> (for C<lc()> and C<lcfirst()>), C<ToTitle> (for
the first character in C<ucfirst()>), and C<ToUpper> (for C<uc()>, and the
rest of the characters in C<ucfirst()>).

The string returned by the subroutines must now consist of three
hexadecimal numbers separated by tabs: start of the source
range, end of the source range, and start of the destination range.
For example:
1N/A
1N/A    sub ToUpper {
1N/A    return <<END;
1N/A    0061\t0063\t0041
1N/A    END
1N/A    }
1N/A
defines a C<uc()> mapping that causes only the characters "a", "b", and
"c" to be mapped to "A", "B", and "C"; all other characters remain
unchanged.
1N/A
If there is no source range to speak of, that is, the mapping is from
a single character to another single character, leave the end of the
source range empty, but keep both tab characters.
For example:
1N/A
1N/A    sub ToLower {
1N/A    return <<END;
1N/A    0041\t\t0061
1N/A    END
1N/A    }
1N/A
defines a C<lc()> mapping that causes only "A" to be mapped to "a"; all
other characters remain unchanged.
1N/A
(For serious hackers only)  If you want to introspect the default
mappings, you can find the data in the directory
C<$Config{privlib}>/F<unicore/To/>.  The mapping data is returned as
the here-document, and the C<utf8::ToSpecFoo> are special exception
mappings derived from C<$Config{privlib}>/F<unicore/SpecialCasing.txt>.
The C<Digit> and C<Fold> mappings that one can see in the directory
are not directly user-accessible; one can use either the
C<Unicode::UCD> module, or just match case-insensitively (that's when
the C<Fold> mapping is used).
1N/A
1N/AA final note on the user-defined property tests and mappings: they
1N/Awill be used only if the scalar has been marked as having Unicode
1N/Acharacters.  Old byte-style strings will not be affected.
1N/A
1N/A=head2 Character Encodings for Input and Output
1N/A
1N/ASee L<Encode>.
1N/A
1N/A=head2 Unicode Regular Expression Support Level
1N/A
1N/AThe following list of Unicode support for regular expressions describes
1N/Aall the features currently supported.  The references to "Level N"
1N/Aand the section numbers refer to the Unicode Technical Report 18,
1N/A"Unicode Regular Expression Guidelines", version 6 (Unicode 3.2.0,
1N/APerl 5.8.0).
1N/A
1N/A=over 4
1N/A
1N/A=item *
1N/A
1N/ALevel 1 - Basic Unicode Support
1N/A
1N/A        2.1 Hex Notation                        - done          [1]
1N/A            Named Notation                      - done          [2]
1N/A        2.2 Categories                          - done          [3][4]
1N/A        2.3 Subtraction                         - MISSING       [5][6]
1N/A        2.4 Simple Word Boundaries              - done          [7]
1N/A        2.5 Simple Loose Matches                - done          [8]
1N/A        2.6 End of Line                         - MISSING       [9][10]
1N/A
1N/A        [ 1] \x{...}
1N/A        [ 2] \N{...}
1N/A        [ 3] . \p{...} \P{...}
1N/A        [ 4] now scripts (see UTR#24 Script Names) in addition to blocks
1N/A        [ 5] have negation
1N/A        [ 6] can use regular expression look-ahead [a]
1N/A             or user-defined character properties [b] to emulate subtraction
1N/A        [ 7] include Letters in word characters
        [ 8] note that Perl does Full case-folding in matching, not Simple:
             for example U+1F88 is equivalent with U+1F00 U+03B9,
             not with U+1F80.  This difference matters for certain Greek
             capital letters with certain modifiers: the Full case-folding
             decomposes the letter, while the Simple case-folding would map
             it to a single character.
1N/A        [ 9] see UTR #13 Unicode Newline Guidelines
1N/A        [10] should do ^ and $ also on \x{85}, \x{2028} and \x{2029}
1N/A             (should also affect <>, $., and script line numbers)
1N/A             (the \x{85}, \x{2028} and \x{2029} do match \s)
1N/A
1N/A[a] You can mimic class subtraction using lookahead.
1N/AFor example, what UTR #18 might write as
1N/A
1N/A    [{Greek}-[{UNASSIGNED}]]
1N/A
1N/Ain Perl can be written as:
1N/A
1N/A    (?!\p{Unassigned})\p{InGreekAndCoptic}
1N/A    (?=\p{Assigned})\p{InGreekAndCoptic}
1N/A
1N/ABut in this particular example, you probably really want
1N/A
    \p{Greek}
1N/A
1N/Awhich will match assigned characters known to be part of the Greek script.
1N/A
Also see the C<Unicode::Regex::Set> module, which implements the full
UTR #18 grouping, intersection, union, and removal (subtraction) syntax.
1N/A
1N/A[b] See L</"User-Defined Character Properties">.
1N/A
1N/A=item *
1N/A
1N/ALevel 2 - Extended Unicode Support
1N/A
        3.1 Surrogates                          - MISSING       [11]
1N/A        3.2 Canonical Equivalents               - MISSING       [12][13]
1N/A        3.3 Locale-Independent Graphemes        - MISSING       [14]
1N/A        3.4 Locale-Independent Words            - MISSING       [15]
1N/A        3.5 Locale-Independent Loose Matches    - MISSING       [16]
1N/A
1N/A        [11] Surrogates are solely a UTF-16 concept and Perl's internal
1N/A             representation is UTF-8.  The Encode module does UTF-16, though.
1N/A        [12] see UTR#15 Unicode Normalization
1N/A        [13] have Unicode::Normalize but not integrated to regexes
1N/A        [14] have \X but at this level . should equal that
1N/A        [15] need three classes, not just \w and \W
1N/A        [16] see UTR#21 Case Mappings
1N/A
1N/A=item *
1N/A
1N/ALevel 3 - Locale-Sensitive Support
1N/A
1N/A        4.1 Locale-Dependent Categories         - MISSING
        4.2 Locale-Dependent Graphemes          - MISSING       [17][18]
        4.3 Locale-Dependent Words              - MISSING
        4.4 Locale-Dependent Loose Matches      - MISSING
        4.5 Locale-Dependent Ranges             - MISSING

        [17] see UTR#10 Unicode Collation Algorithms
        [18] have Unicode::Collate but not integrated to regexes
1N/A
1N/A=back
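The lookahead technique from note [a] can be sketched as follows.  The
pattern names and the three test code points are illustrative
assumptions; U+03A2 is used because it is a reserved, unassigned code
point inside the Greek and Coptic block.

```perl

    use strict;
    use warnings;

    # Two equivalent ways to say "in the block, but not unassigned".
    my $by_negative = qr/\A(?!\p{Unassigned})\p{InGreekAndCoptic}\z/;
    my $by_positive = qr/\A(?=\p{Assigned})\p{InGreekAndCoptic}\z/;

    my %matched;
    for my $cp (0x3B1, 0x3A2, 0x61) {
        my $char = chr $cp;
        my $neg  = ($char =~ $by_negative) ? 1 : 0;
        my $pos  = ($char =~ $by_positive) ? 1 : 0;
        $matched{$cp} = "$neg$pos";
        printf "U+%04X negative=%d positive=%d\n", $cp, $neg, $pos;
    }

```

Only the assigned letter inside the block passes; the unassigned code
point and the Latin letter are both rejected.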
1N/A
1N/A=head2 Unicode Encodings
1N/A
1N/AUnicode characters are assigned to I<code points>, which are abstract
1N/Anumbers.  To use these numbers, various encodings are needed.
1N/A
1N/A=over 4
1N/A
1N/A=item *
1N/A
1N/AUTF-8
1N/A
UTF-8 is a variable-length (1 to 6 bytes; current character allocations
require at most 4 bytes), byte-order independent encoding. For ASCII
(and we really do mean 7-bit ASCII, not another 8-bit encoding), UTF-8
is transparent.
1N/A
1N/AThe following table is from Unicode 3.2.
1N/A
1N/A Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte
1N/A
1N/A   U+0000..U+007F       00..7F
1N/A   U+0080..U+07FF       C2..DF    80..BF
1N/A   U+0800..U+0FFF       E0        A0..BF    80..BF
1N/A   U+1000..U+CFFF       E1..EC    80..BF    80..BF
1N/A   U+D000..U+D7FF       ED        80..9F    80..BF
1N/A   U+D800..U+DFFF       ******* ill-formed *******
1N/A   U+E000..U+FFFF       EE..EF    80..BF    80..BF
1N/A  U+10000..U+3FFFF      F0        90..BF    80..BF    80..BF
1N/A  U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
1N/A U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF
1N/A
Note the C<A0..BF> in C<U+0800..U+0FFF>, the C<80..9F> in
C<U+D000..U+D7FF>, the C<90..BF> in C<U+10000..U+3FFFF>, and the
C<80..8F> in C<U+100000..U+10FFFF>.  The "gaps" are caused by legal
UTF-8 avoiding non-shortest encodings: it is technically possible to
UTF-8-encode a single code point in different ways, but that is
explicitly forbidden, and the shortest possible encoding should always
be used.  So that's what Perl does.
1N/A
1N/AAnother way to look at it is via bits:
1N/A
1N/A Code Points                    1st Byte   2nd Byte  3rd Byte  4th Byte
1N/A
1N/A                    0aaaaaaa     0aaaaaaa
1N/A            00000bbbbbaaaaaa     110bbbbb  10aaaaaa
1N/A            ccccbbbbbbaaaaaa     1110cccc  10bbbbbb  10aaaaaa
1N/A  00000dddccccccbbbbbbaaaaaa     11110ddd  10cccccc  10bbbbbb  10aaaaaa
1N/A
As you can see, the continuation bytes all begin with C<10>, and the
leading bits of the start byte tell how many bytes there are in the
encoded character.
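
The bit layout above can be turned into code fairly directly.  The
following is a minimal sketch, not Perl's internal implementation: the
helper C<utf8_bytes> is a hypothetical name, it covers only
C<U+0000..U+10FFFF>, and it does not reject surrogates.  The result is
cross-checked against the Encode module.

```perl
use strict;
use warnings;
use Encode ();

# Encode one code point into UTF-8 byte values by hand, one branch per
# row of the bit table (hypothetical helper, for illustration only).
sub utf8_bytes {
    my ($cp) = @_;
    return ($cp) if $cp < 0x80;
    return (0xC0 | ($cp >> 6), 0x80 | ($cp & 0x3F)) if $cp < 0x800;
    return (0xE0 | ($cp >> 12),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F)) if $cp < 0x10000;
    return (0xF0 | ($cp >> 18),
            0x80 | (($cp >> 12) & 0x3F),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F));
}

# Format a list of byte values as space-separated hex.
sub hex_bytes { return join ' ', map { sprintf '%02X', $_ } @_ }

for my $cp (0x41, 0xE9, 0x20AC, 0x10348) {
    my $by_hand   = hex_bytes(utf8_bytes($cp));
    my $by_encode = hex_bytes(unpack 'C*', Encode::encode('UTF-8', chr $cp));
    printf "U+%04X  %-12s %s\n", $cp, $by_hand,
        $by_hand eq $by_encode ? 'matches Encode' : 'MISMATCH';
}
```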
1N/A
1N/A=item *
1N/A
1N/AUTF-EBCDIC
1N/A
1N/ALike UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.
1N/A
1N/A=item *
1N/A
1N/AUTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks)
1N/A
The following items are mostly for reference and general Unicode
knowledge; Perl doesn't use these constructs internally.
1N/A
UTF-16 is a 2 or 4 byte encoding.  The Unicode code points
C<U+0000..U+FFFF> are stored in a single 16-bit unit, and the code
points C<U+10000..U+10FFFF> in two 16-bit units.  The latter case
uses I<surrogates>, the first 16-bit unit being the I<high
surrogate>, and the second being the I<low surrogate>.
1N/A
1N/ASurrogates are code points set aside to encode the C<U+10000..U+10FFFF>
1N/Arange of Unicode code points in pairs of 16-bit units.  The I<high
1N/Asurrogates> are the range C<U+D800..U+DBFF>, and the I<low surrogates>
1N/Aare the range C<U+DC00..U+DFFF>.  The surrogate encoding is
1N/A
    $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
    $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
1N/A
1N/Aand the decoding is
1N/A
1N/A    $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
1N/A
1N/AIf you try to generate surrogates (for example by using chr()), you
1N/Awill get a warning if warnings are turned on, because those code
1N/Apoints are not valid for a Unicode character.
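
The surrogate formulas can be checked with a short round trip.  This
is a sketch only: C<to_surrogates> and C<from_surrogates> are
hypothetical helper names, shifts and masks are used in place of
division and modulus so that the results stay integers, and the
Encode module serves as an independent cross-check.

```perl
use strict;
use warnings;
use Encode ();

# Split a code point in U+10000..U+10FFFF into (high, low) surrogates.
sub to_surrogates {
    my ($uni) = @_;
    my $offset = $uni - 0x10000;
    return (0xD800 + ($offset >> 10), 0xDC00 + ($offset & 0x3FF));
}

# Combine a (high, low) surrogate pair back into a code point.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);
}

my $uni = 0x1D11E;    # MUSICAL SYMBOL G CLEF
my ($hi, $lo) = to_surrogates($uni);
printf "U+%X -> %04X %04X -> U+%X\n",
    $uni, $hi, $lo, from_surrogates($hi, $lo);

# Cross-check: UTF-16BE stores the same pair as two big-endian 16-bit units.
my ($enc_hi, $enc_lo) = unpack 'n2', Encode::encode('UTF-16BE', chr $uni);
printf "Encode says:  %04X %04X\n", $enc_hi, $enc_lo;
```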
1N/A
Because of the 16-bitness, UTF-16 is byte-order dependent.  UTF-16
itself can be used for in-memory computations, but if storage or
transfer is required, either UTF-16BE (big-endian) or UTF-16LE
(little-endian) encodings must be chosen.
1N/A
1N/AThis introduces another problem: what if you just know that your data
1N/Ais UTF-16, but you don't know which endianness?  Byte Order Marks, or
1N/ABOMs, are a solution to this.  A special character has been reserved
1N/Ain Unicode to function as a byte order marker: the character with the
1N/Acode point C<U+FEFF> is the BOM.
1N/A
1N/AThe trick is that if you read a BOM, you will know the byte order,
1N/Asince if it was written on a big-endian platform, you will read the
1N/Abytes C<0xFE 0xFF>, but if it was written on a little-endian platform,
1N/Ayou will read the bytes C<0xFF 0xFE>.  (And if the originating platform
1N/Awas writing in UTF-8, you will read the bytes C<0xEF 0xBB 0xBF>.)
1N/A
The way this trick works is that the character with the code point
C<U+FFFE> is guaranteed not to be a valid Unicode character, so the
sequence of bytes C<0xFF 0xFE> is unambiguously "BOM, represented in
little-endian format" and cannot be "C<U+FFFE>, represented in
big-endian format".
1N/A
1N/A=item *
1N/A
1N/AUTF-32, UTF-32BE, UTF-32LE
1N/A
The UTF-32 family is pretty much like the UTF-16 family, except that
the units are 32-bit, and therefore the surrogate scheme is not
needed.  The BOM signatures will be C<0x00 0x00 0xFE 0xFF> for BE and
C<0xFF 0xFE 0x00 0x00> for LE.
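
The BOM signatures given for UTF-8, UTF-16 and UTF-32 can be checked
against the first bytes of a buffer.  The following is a minimal
sketch; C<detect_bom> is a hypothetical helper, not a Perl builtin.
Note that a UTF-16LE stream whose first character after the BOM is
C<U+0000> is indistinguishable from UTF-32LE by signature alone, so
treat the result as a guess.

```perl
use strict;
use warnings;

# Longest signatures first: the UTF-32LE BOM (FF FE 00 00) begins with
# the UTF-16LE BOM (FF FE).
my @BOMS = (
    [ 'UTF-32BE', "\x00\x00\xFE\xFF" ],
    [ 'UTF-32LE', "\xFF\xFE\x00\x00" ],
    [ 'UTF-8',    "\xEF\xBB\xBF"     ],
    [ 'UTF-16BE', "\xFE\xFF"         ],
    [ 'UTF-16LE', "\xFF\xFE"         ],
);

# Return the encoding name suggested by a leading BOM, or nothing.
sub detect_bom {
    my ($bytes) = @_;
    for my $bom (@BOMS) {
        my ($name, $signature) = @$bom;
        return $name
            if substr($bytes, 0, length $signature) eq $signature;
    }
    return;
}

for my $sample ("\xFE\xFF\x00A", "\xFF\xFEA\x00", "\xEF\xBB\xBFA", "plain") {
    my $guess = detect_bom($sample);
    printf "%-10s %s\n", defined $guess ? $guess : 'no BOM',
        join ' ', map { sprintf '%02X', $_ } unpack 'C*', $sample;
}
```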
1N/A
1N/A=item *
1N/A
1N/AUCS-2, UCS-4
1N/A
1N/AEncodings defined by the ISO 10646 standard.  UCS-2 is a 16-bit
1N/Aencoding.  Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>,
1N/Abecause it does not use surrogates.  UCS-4 is a 32-bit encoding,
1N/Afunctionally identical to UTF-32.
1N/A
1N/A=item *
1N/A
1N/AUTF-7
1N/A
1N/AA seven-bit safe (non-eight-bit) encoding, which is useful if the
1N/Atransport or storage is not eight-bit safe.  Defined by RFC 2152.
1N/A
1N/A=back
1N/A
1N/A=head2 Security Implications of Unicode
1N/A
1N/A=over 4
1N/A
1N/A=item *
1N/A
1N/AMalformed UTF-8
1N/A
1N/AUnfortunately, the specification of UTF-8 leaves some room for
1N/Ainterpretation of how many bytes of encoded output one should generate
1N/Afrom one input Unicode character.  Strictly speaking, the shortest
1N/Apossible sequence of UTF-8 bytes should be generated,
1N/Abecause otherwise there is potential for an input buffer overflow at
1N/Athe receiving end of a UTF-8 connection.  Perl always generates the
1N/Ashortest length UTF-8, and with warnings on Perl will warn about
1N/Anon-shortest length UTF-8 along with other malformations, such as the
1N/Asurrogates, which are not real Unicode code points.
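
To see such malformations rejected in practice, one option is to
decode untrusted octets with the strict C<UTF-8> encoding of the
Encode module and ask it to die on errors.  This is a sketch under
that assumption; C<strictly_decodes> is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# True if the octets are well-formed, shortest-form UTF-8.
# LEAVE_SRC keeps decode() from modifying the caller's string.
sub strictly_decodes {
    my ($octets) = @_;
    my $ok = eval { decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC); 1 };
    return $ok ? 1 : 0;
}

my @samples = (
    [ 'shortest form of U+00E9'  => "\xC3\xA9"     ],
    [ 'overlong form of "/"'     => "\xC0\xAF"     ],
    [ 'encoded surrogate U+D800' => "\xED\xA0\x80" ],
);

for my $sample (@samples) {
    my ($label, $octets) = @$sample;
    printf "%-26s %s\n", $label,
        strictly_decodes($octets) ? 'accepted' : 'rejected';
}
```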
1N/A
1N/A=item *
1N/A
Regular expressions behave slightly differently between byte data and
character (Unicode) data.  For example, the "word character" character
class C<\w> will work differently depending on whether the data is
eight-bit bytes or Unicode.
1N/A
1N/AIn the first case, the set of C<\w> characters is either small--the
1N/Adefault set of alphabetic characters, digits, and the "_"--or, if you
1N/Aare using a locale (see L<perllocale>), the C<\w> might contain a few
1N/Amore letters according to your language and country.
1N/A
1N/AIn the second case, the C<\w> set of characters is much, much larger.
1N/AMost importantly, even in the set of the first 256 characters, it will
1N/Aprobably match different characters: unlike most locales, which are
1N/Aspecific to a language and country pair, Unicode classifies all the
1N/Acharacters that are letters I<somewhere> as C<\w>.  For example, your
1N/Alocale might not think that LATIN SMALL LETTER ETH is a letter (unless
1N/Ayou happen to speak Icelandic), but Unicode does.
1N/A
1N/AAs discussed elsewhere, Perl has one foot (two hooves?) planted in
1N/Aeach of two worlds: the old world of bytes and the new world of
1N/Acharacters, upgrading from bytes to characters when necessary.
1N/AIf your legacy code does not explicitly use Unicode, no automatic
1N/Aswitch-over to characters should happen.  Characters shouldn't get
1N/Adowngraded to bytes, either.  It is possible to accidentally mix bytes
1N/Aand characters, however (see L<perluniintro>), in which case C<\w> in
1N/Aregular expressions might start behaving differently.  Review your
1N/Acode.  Use warnings and the C<strict> pragma.
1N/A
1N/A=back
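
The C<\w> difference between byte data and character data can be
observed with a single code point.  The following sketch assumes that
no locale is in effect and that nothing in scope changes the default
matching rules; under those assumptions LATIN SMALL LETTER ETH
(C<0xF0>) typically matches C<\w> only once the string is stored as
characters.

```perl
use strict;
use warnings;

my $as_bytes = "\xF0";        # code point 0xF0, stored as a single byte
my $as_chars = "\xF0";
utf8::upgrade($as_chars);     # same code point, stored as UTF-8

# The two strings compare equal; only the internal storage differs.
my $same       = ($as_bytes eq $as_chars) ? 1 : 0;
my $bytes_word = ($as_bytes =~ /\w/)      ? 1 : 0;
my $chars_word = ($as_chars =~ /\w/)      ? 1 : 0;

printf "equal=%d  word char as bytes=%d  word char as characters=%d\n",
    $same, $bytes_word, $chars_word;
```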
1N/A
1N/A=head2 Unicode in Perl on EBCDIC
1N/A
1N/AThe way Unicode is handled on EBCDIC platforms is still
1N/Aexperimental.  On such platforms, references to UTF-8 encoding in this
1N/Adocument and elsewhere should be read as meaning the UTF-EBCDIC
1N/Aspecified in Unicode Technical Report 16, unless ASCII vs. EBCDIC issues
1N/Aare specifically discussed. There is no C<utfebcdic> pragma or
1N/A":utfebcdic" layer; rather, "utf8" and ":utf8" are reused to mean
1N/Athe platform's "natural" 8-bit encoding of Unicode. See L<perlebcdic>
1N/Afor more discussion of the issues.
1N/A
1N/A=head2 Locales
1N/A
1N/AUsually locale settings and Unicode do not affect each other, but
1N/Athere are a couple of exceptions:
1N/A
1N/A=over 4
1N/A
1N/A=item *
1N/A
You can enable automatic UTF-8-ification of your standard file
handles, default C<open()> layer, and C<@ARGV> by using either
the C<-C> command line switch or the C<PERL_UNICODE> environment
variable; see L<perlrun> for the documentation of the C<-C> switch.
1N/A
1N/A=item *
1N/A
1N/APerl tries really hard to work both with Unicode and the old
1N/Abyte-oriented world. Most often this is nice, but sometimes Perl's
1N/Astraddling of the proverbial fence causes problems.
1N/A
1N/A=back
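
A running program can inspect what C<-C> or C<PERL_UNICODE> asked for
through the C<${^UNICODE}> variable.  The following sketch decodes
only the six lowest bits, using the values documented in L<perlrun>;
C<unicode_settings> is a hypothetical helper name.

```perl
use strict;
use warnings;

# Bit values of the -C switch letters, as listed in perlrun.
my @FLAGS = (
    [ 1,  'STDIN (I)'                ],
    [ 2,  'STDOUT (O)'               ],
    [ 4,  'STDERR (E)'               ],
    [ 8,  'default input layer (i)'  ],
    [ 16, 'default output layer (o)' ],
    [ 32, '@ARGV (A)'                ],
);

# Return the descriptions of all flags set in the given bit mask.
sub unicode_settings {
    my ($bits) = @_;
    return map { $_->[1] } grep { $bits & $_->[0] } @FLAGS;
}

my $summary = join ', ', unicode_settings(${^UNICODE});
print 'UTF-8 enabled for: ', ($summary eq '' ? 'nothing' : $summary), "\n";
```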
1N/A
1N/A=head2 When Unicode Does Not Happen
1N/A
While Perl does have extensive ways to input and output in Unicode,
and a few other 'entry points' like C<@ARGV> that can be interpreted
as Unicode (UTF-8), there are still many places where Unicode (in some
encoding or another) could be given as arguments or received as
results, or both, but it is not.
1N/A
1N/AThe following are such interfaces.  For all of these interfaces Perl
1N/Acurrently (as of 5.8.3) simply assumes byte strings both as arguments
1N/Aand results, or UTF-8 strings if the C<encoding> pragma has been used.
1N/A
One reason why Perl does not attempt to resolve the role of Unicode in
these cases is that the answers are highly dependent on the operating
system and the file system(s).  For example, whether filenames can be
in Unicode, and in exactly what kind of encoding, is not exactly a
portable concept.  Similarly for C<qx> and C<system>: how well will the
'command line interface' (and which of them?) handle Unicode?
1N/A
1N/A=over 4
1N/A
1N/A=item *
1N/A
chdir, chmod, chown, chroot, exec, link, lstat, mkdir,
rename, rmdir, stat, symlink, truncate, unlink, utime, -X
1N/A
1N/A=item *
1N/A
1N/A%ENV
1N/A
1N/A=item *
1N/A
1N/Aglob (aka the <*>)
1N/A
1N/A=item *
1N/A
1N/Aopen, opendir, sysopen
1N/A
1N/A=item *
1N/A
1N/Aqx (aka the backtick operator), system
1N/A
1N/A=item *
1N/A
1N/Areaddir, readlink
1N/A
1N/A=back
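
Since these interfaces deal in byte strings, a program that wants
Unicode filenames has to encode and decode them itself.  The
following is a minimal sketch that B<assumes> the file system stores
names as UTF-8, which is common on Unix-like systems but not
guaranteed anywhere; C<$FS_ENCODING> is an illustrative variable, not
a Perl setting.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use File::Temp qw(tempdir);

my $FS_ENCODING = 'UTF-8';    # assumption about the file system

my $dir  = tempdir(CLEANUP => 1);
my $name = "price-\x{20AC}.txt";    # contains EURO SIGN

# Encode the character string to octets before handing it to open().
my $path = "$dir/" . encode($FS_ENCODING, $name);
open my $fh, '>', $path or die "Cannot create file: $!";
close $fh or die "Cannot close file: $!";

# readdir() returns octets; decode them back into characters.
opendir my $dh, $dir or die "Cannot open directory: $!";
my @found = map  { decode($FS_ENCODING, $_) }
            grep { !/\A\.\.?\z/ } readdir $dh;
closedir $dh;

printf "found %d file(s); name round-trips: %s\n",
    scalar @found, (@found && $found[0] eq $name) ? 'yes' : 'no';
```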
1N/A
1N/A=head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)
1N/A
1N/ASometimes (see L</"When Unicode Does Not Happen">) there are
1N/Asituations where you simply need to force Perl to believe that a byte
1N/Astring is UTF-8, or vice versa.  The low-level calls
1N/Autf8::upgrade($bytestring) and utf8::downgrade($utf8string) are
1N/Athe answers.
1N/A
Do not use them without careful thought, though: Perl may easily get
very confused, angry, or even crash, if you suddenly change the 'nature'
of a scalar like that.  Be especially careful if you use
utf8::upgrade(): a random byte string is not necessarily valid UTF-8.
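
The effect of the two calls can be observed with C<utf8::is_utf8()>.
The following sketch shows that upgrading and downgrading change only
the internal storage, not the characters, and that a downgrade can
fail when the string holds a character above 255.

```perl
use strict;
use warnings;

my $s = "caf\xE9";    # four characters, all below 256

my $flag_before = utf8::is_utf8($s) ? 1 : 0;
utf8::upgrade($s);
my $flag_after  = utf8::is_utf8($s) ? 1 : 0;

# The string still holds the same characters after the upgrade.
my $still_equal = ($s eq "caf\xE9") ? 1 : 0;

utf8::downgrade($s);
my $flag_restored = utf8::is_utf8($s) ? 1 : 0;

# A character above 255 cannot be stored as a byte.  With a true second
# argument, utf8::downgrade() returns false instead of dying.
my $wide       = "\x{263A}";
my $downgraded = utf8::downgrade($wide, 1) ? 1 : 0;

printf "before=%d after=%d equal=%d restored=%d wide downgraded=%d\n",
    $flag_before, $flag_after, $still_equal, $flag_restored, $downgraded;
```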
1N/A
1N/A=head2 Using Unicode in XS
1N/A
1N/AIf you want to handle Perl Unicode in XS extensions, you may find the
1N/Afollowing C APIs useful.  See also L<perlguts/"Unicode Support"> for an
1N/Aexplanation about Unicode at the XS level, and L<perlapi> for the API
1N/Adetails.
1N/A
1N/A=over 4
1N/A
1N/A=item *
1N/A
C<DO_UTF8(sv)> returns true if the C<UTF8> flag is on and the bytes
pragma is not in effect.  C<SvUTF8(sv)> returns true if the C<UTF8>
flag is on; the bytes pragma is ignored.  The C<UTF8> flag being on
does B<not> mean that there are any characters of code points greater
than 255 (or 127) in the scalar or that there are even any characters
in the scalar.  What the C<UTF8> flag means is that the sequence of
octets in the representation of the scalar is the sequence of UTF-8
encoded code points of the characters of a string.  The C<UTF8> flag
being off means that each octet in this representation encodes a
single character with code point 0..255 within the string.  Perl's
Unicode model is not to use UTF-8 until it is absolutely necessary.
1N/A
1N/A=item *
1N/A
1N/AC<uvuni_to_utf8(buf, chr)> writes a Unicode character code point into
1N/Aa buffer encoding the code point as UTF-8, and returns a pointer
1N/Apointing after the UTF-8 bytes.
1N/A
1N/A=item *
1N/A
1N/AC<utf8_to_uvuni(buf, lenp)> reads UTF-8 encoded bytes from a buffer and
1N/Areturns the Unicode character code point and, optionally, the length of
1N/Athe UTF-8 byte sequence.
1N/A
1N/A=item *
1N/A
1N/AC<utf8_length(start, end)> returns the length of the UTF-8 encoded buffer
1N/Ain characters.  C<sv_len_utf8(sv)> returns the length of the UTF-8 encoded
1N/Ascalar.
1N/A
1N/A=item *
1N/A
C<sv_utf8_upgrade(sv)> converts the string of the scalar to its UTF-8
encoded form.  C<sv_utf8_downgrade(sv)> does the opposite, if
possible.  C<sv_utf8_encode(sv)> is like C<sv_utf8_upgrade()> except
that it does not set the C<UTF8> flag.  C<sv_utf8_decode()> does the
opposite of C<sv_utf8_encode()>.  Note that none of these are to be
used as general-purpose encoding or decoding interfaces: C<use Encode>
for that.  C<sv_utf8_upgrade()> is affected by the encoding pragma
but C<sv_utf8_downgrade()> is not (since the encoding pragma is
designed to be a one-way street).
1N/A
1N/A=item *
1N/A
1N/AC<is_utf8_char(s)> returns true if the pointer points to a valid UTF-8
1N/Acharacter.
1N/A
1N/A=item *
1N/A
1N/AC<is_utf8_string(buf, len)> returns true if C<len> bytes of the buffer
1N/Aare valid UTF-8.
1N/A
1N/A=item *
1N/A
C<UTF8SKIP(buf)> will return the number of bytes in the UTF-8 encoded
character in the buffer.  C<UNISKIP(chr)> will return the number of bytes
required to UTF-8-encode the Unicode character code point.  C<UTF8SKIP()>
is useful, for example, for iterating over the characters of a UTF-8
encoded buffer; C<UNISKIP()> is useful, for example, in computing
the size required for a UTF-8 encoded buffer.
1N/A
1N/A=item *
1N/A
1N/AC<utf8_distance(a, b)> will tell the distance in characters between the
1N/Atwo pointers pointing to the same UTF-8 encoded buffer.
1N/A
1N/A=item *
1N/A
C<utf8_hop(s, off)> will return a pointer to a UTF-8 encoded buffer
that is C<off> (positive or negative) Unicode characters displaced
from the UTF-8 buffer C<s>.  Be careful not to overstep the buffer:
C<utf8_hop()> will merrily run off the end or the beginning of the
buffer if told to do so.
1N/A
1N/A=item *
1N/A
1N/AC<pv_uni_display(dsv, spv, len, pvlim, flags)> and
1N/AC<sv_uni_display(dsv, ssv, pvlim, flags)> are useful for debugging the
1N/Aoutput of Unicode strings and scalars.  By default they are useful
1N/Aonly for debugging--they display B<all> characters as hexadecimal code
1N/Apoints--but with the flags C<UNI_DISPLAY_ISPRINT>,
1N/AC<UNI_DISPLAY_BACKSLASH>, and C<UNI_DISPLAY_QQ> you can make the
1N/Aoutput more readable.
1N/A
1N/A=item *
1N/A
C<ibcmp_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)> can be used to
compare two strings case-insensitively in Unicode.  For case-sensitive
comparisons you can just use C<memEQ()> and C<memNE()> as usual.
1N/A
1N/A=back
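
The meaning of the C<UTF8> flag described above can also be explored
from Perl space, without writing any XS.  The following sketch uses
the C<utf8::> functions, which roughly expose the corresponding C
routines (C<utf8::is_utf8()> for C<SvUTF8()>, C<utf8::upgrade()> for
C<sv_utf8_upgrade()>, and so on); C<byte_length> and C<describe> are
hypothetical helper names.

```perl
use strict;
use warnings;

# Length of the internal representation in bytes, not characters.
sub byte_length {
    my ($s) = @_;
    use bytes;    # lexically scoped: affects only this sub body
    return length $s;
}

# Summarize flag, character count and byte count of a string.
sub describe {
    my ($s) = @_;
    return sprintf 'flag=%d chars=%d bytes=%d',
        (utf8::is_utf8($s) ? 1 : 0), length($s), byte_length($s);
}

# The flag can be on even though no character is above 127.
my $ascii = "abc";
utf8::upgrade($ascii);
print 'upgraded ASCII:  ', describe($ascii), "\n";

# Upgrading changes the storage (1 byte to 2) but not the character count.
my $latin = "\xE9";
print 'before upgrade:  ', describe($latin), "\n";
utf8::upgrade($latin);
print 'after upgrade:   ', describe($latin), "\n";

# utf8::encode() yields the UTF-8 octets as separate characters with the
# flag off; utf8::decode() reverses that.
my $octets = "\xE9";
utf8::encode($octets);
print 'after encode:    ', describe($octets), "\n";
utf8::decode($octets);
print 'after decode:    ', describe($octets), "\n";
```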
1N/A
1N/AFor more information, see L<perlapi>, and F<utf8.c> and F<utf8.h>
1N/Ain the Perl source code distribution.
1N/A
1N/A=head1 BUGS
1N/A
1N/A=head2 Interaction with Locales
1N/A
1N/AUse of locales with Unicode data may lead to odd results.  Currently,
1N/APerl attempts to attach 8-bit locale info to characters in the range
1N/A0..255, but this technique is demonstrably incorrect for locales that
1N/Ause characters above that range when mapped into Unicode.  Perl's
1N/AUnicode support will also tend to run slower.  Use of locales with
1N/AUnicode is discouraged.
1N/A
1N/A=head2 Interaction with Extensions
1N/A
1N/AWhen Perl exchanges data with an extension, the extension should be
1N/Aable to understand the UTF-8 flag and act accordingly. If the
1N/Aextension doesn't know about the flag, it's likely that the extension
1N/Awill return incorrectly-flagged data.
1N/A
So if you're working with Unicode data, consult the documentation of
every module you're using to see whether there are any issues with
Unicode data exchange. If the documentation does not talk about
Unicode at all, suspect the worst and probably look at the source to
learn how the module is implemented. Modules written completely in
Perl shouldn't cause problems. Modules that directly or indirectly
access code written in other programming languages are at risk.
1N/A
1N/AFor affected functions, the simple strategy to avoid data corruption is
1N/Ato always make the encoding of the exchanged data explicit. Choose an
1N/Aencoding that you know the extension can handle. Convert arguments passed
1N/Ato the extensions to that encoding and convert results back from that
1N/Aencoding. Write wrapper functions that do the conversions for you, so
1N/Ayou can later change the functions when the extension catches up.
1N/A
1N/ATo provide an example, let's say the popular Foo::Bar::escape_html
1N/Afunction doesn't deal with Unicode data yet. The wrapper function
1N/Awould convert the argument to raw UTF-8 and convert the result back to
1N/APerl's internal representation like so:
1N/A
    use Encode ();

    sub my_escape_html ($) {
      my($what) = shift;
      return unless defined $what;
      Encode::decode_utf8(Foo::Bar::escape_html(Encode::encode_utf8($what)));
    }
1N/A
1N/ASometimes, when the extension does not convert data but just stores
1N/Aand retrieves them, you will be in a position to use the otherwise
1N/Adangerous Encode::_utf8_on() function. Let's say the popular
1N/AC<Foo::Bar> extension, written in C, provides a C<param> method that
1N/Alets you store and retrieve data according to these prototypes:
1N/A
1N/A    $self->param($name, $value);            # set a scalar
1N/A    $value = $self->param($name);           # retrieve a scalar
1N/A
1N/AIf it does not yet provide support for any encoding, one could write a
1N/Aderived class with such a C<param> method:
1N/A
    sub param {
      my($self,$name,$value) = @_;
      utf8::upgrade($name);     # make sure it is UTF-8 encoded
      if (defined $value) {
        utf8::upgrade($value);  # make sure it is UTF-8 encoded
        return $self->SUPER::param($name,$value);
      } else {
        my $ret = $self->SUPER::param($name);
        Encode::_utf8_on($ret); # we know, it is UTF-8 encoded
        return $ret;
      }
    }
1N/A
Some extensions provide filters on data entry/exit points, such as
DB_File::filter_store_key and family. Look out for such filters in
the documentation of your extensions; they can make the transition to
Unicode data much easier.
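
As an illustration of such filters, the following sketch installs
encode and decode filters on a tied DBM hash.  It uses the core
SDBM_File module, which offers the same four C<filter_*> methods, so
that the example does not depend on DB_File being installed; the file
name and the stored data are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);
use Fcntl qw(O_RDWR O_CREAT);
use File::Temp qw(tempdir);
use SDBM_File;

my $dir = tempdir(CLEANUP => 1);
my %db;
my $handle = tie %db, 'SDBM_File', "$dir/demo", O_RDWR | O_CREAT, 0666
    or die "Cannot tie SDBM file: $!";

# Characters become UTF-8 octets on the way in ...
$handle->filter_store_key(  sub { $_ = encode_utf8($_) });
$handle->filter_store_value(sub { $_ = encode_utf8($_) });
# ... and octets become characters again on the way out.
$handle->filter_fetch_key(  sub { $_ = decode_utf8($_) });
$handle->filter_fetch_value(sub { $_ = decode_utf8($_) });

$db{"\x{263A}"} = "caf\x{E9} \x{20AC}5";

my ($key) = keys %db;
my $value = $db{$key};

printf "key is U+%04X, value has %d characters\n", ord $key, length $value;

undef $handle;    # drop the extra reference before untying
untie %db;
```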
1N/A
1N/A=head2 Speed
1N/A
1N/ASome functions are slower when working on UTF-8 encoded strings than
1N/Aon byte encoded strings.  All functions that need to hop over
1N/Acharacters such as length(), substr() or index(), or matching regular
1N/Aexpressions can work B<much> faster when the underlying data are
1N/Abyte-encoded.
1N/A
In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1
a caching scheme was introduced which will hopefully make the slowness
somewhat less spectacular, at least for some operations.  In general,
operations with UTF-8 encoded strings are still slower. As an example,
the Unicode properties (character classes) like C<\p{Nd}> are known to
be quite a bit slower (5-20 times) than their simpler counterparts
like C<\d> (then again, there are 268 Unicode characters matching C<Nd>
compared with the 10 ASCII characters matching C<\d>).
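
Whether and how much this matters for a given program and Perl
version is best measured.  The following is a rough sketch using the
core Benchmark module; the string size and iteration count are
arbitrary, C<count_matches> is a hypothetical helper, and the timings
will vary widely between Perl versions and machines.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# 50,000 ASCII digits, once byte-encoded and once UTF-8 flagged.
my $bytes = join '', map { $_ % 10 } 0 .. 49_999;
my $chars = $bytes;
utf8::upgrade($chars);

# Count how many times the pattern matches in the text.
sub count_matches {
    my ($text, $re) = @_;
    my $n = () = $text =~ /$re/g;
    return $n;
}

# Each case scans the same characters; only storage and pattern differ.
timethese(10, {
    '\d on bytes'          => sub { count_matches($bytes, qr/\d/) },
    '\d on characters'     => sub { count_matches($chars, qr/\d/) },
    '\p{Nd} on characters' => sub { count_matches($chars, qr/\p{Nd}/) },
});
```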
1N/A
1N/A=head2 Porting code from perl-5.6.X
1N/A
Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer
was required to use the C<utf8> pragma to declare that a given scope
was expected to deal with Unicode data and had to make sure that only
Unicode data were reaching that scope. If you have code that is
working with 5.6, you will need some of the following adjustments to
your code. The examples are written such that the code will continue
to work under 5.6, so you should be safe to try them out.
1N/A
1N/A=over 4
1N/A
1N/A=item *
1N/A
1N/AA filehandle that should read or write UTF-8
1N/A
1N/A  if ($] > 5.007) {
1N/A    binmode $fh, ":utf8";
1N/A  }
1N/A
1N/A=item *
1N/A
1N/AA scalar that is going to be passed to some extension
1N/A
1N/ABe it Compress::Zlib, Apache::Request or any extension that has no
1N/Amention of Unicode in the manpage, you need to make sure that the
1N/AUTF-8 flag is stripped off. Note that at the time of this writing
1N/A(October 2002) the mentioned modules are not UTF-8-aware. Please
1N/Acheck the documentation to verify if this is still true.
1N/A
1N/A  if ($] > 5.007) {
1N/A    require Encode;
1N/A    $val = Encode::encode_utf8($val); # make octets
1N/A  }
1N/A
1N/A=item *
1N/A
1N/AA scalar we got back from an extension
1N/A
1N/AIf you believe the scalar comes back as UTF-8, you will most likely
1N/Awant the UTF-8 flag restored:
1N/A
1N/A  if ($] > 5.007) {
1N/A    require Encode;
1N/A    $val = Encode::decode_utf8($val);
1N/A  }
1N/A
1N/A=item *
1N/A
1N/ASame thing, if you are really sure it is UTF-8
1N/A
1N/A  if ($] > 5.007) {
1N/A    require Encode;
1N/A    Encode::_utf8_on($val);
1N/A  }
1N/A
1N/A=item *
1N/A
1N/AA wrapper for fetchrow_array and fetchrow_hashref
1N/A
1N/AWhen the database contains only UTF-8, a wrapper function or method is
1N/Aa convenient way to replace all your fetchrow_array and
1N/Afetchrow_hashref calls. A wrapper function will also make it easier to
1N/Aadapt to future enhancements in your database driver. Note that at the
1N/Atime of this writing (October 2002), the DBI has no standardized way
1N/Ato deal with UTF-8 data. Please check the documentation to verify if
1N/Athat is still true.
1N/A
1N/A  sub fetchrow {
1N/A    my($self, $sth, $what) = @_; # $what is one of fetchrow_{array,hashref}
1N/A    if ($] < 5.007) {
1N/A      return $sth->$what;
1N/A    } else {
1N/A      require Encode;
1N/A      if (wantarray) {
1N/A        my @arr = $sth->$what;
1N/A        for (@arr) {
1N/A          defined && /[^\000-\177]/ && Encode::_utf8_on($_);
1N/A        }
1N/A        return @arr;
1N/A      } else {
1N/A        my $ret = $sth->$what;
1N/A        if (ref $ret) {
1N/A          for my $k (keys %$ret) {
1N/A            defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret->{$k};
1N/A          }
1N/A          return $ret;
1N/A        } else {
1N/A          defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret;
1N/A          return $ret;
1N/A        }
1N/A      }
1N/A    }
1N/A  }
1N/A
1N/A
1N/A=item *
1N/A
1N/AA large scalar that you know can only contain ASCII
1N/A
1N/AScalars that contain only ASCII and are marked as UTF-8 are sometimes
1N/Aa drag to your program. If you recognize such a situation, just remove
1N/Athe UTF-8 flag:
1N/A
1N/A  utf8::downgrade($val) if $] > 5.007;
1N/A
1N/A=back
1N/A
1N/A=head1 SEE ALSO
1N/A
1N/AL<perluniintro>, L<encoding>, L<Encode>, L<open>, L<utf8>, L<bytes>,
1N/AL<perlretut>, L<perlvar/"${^UNICODE}">
1N/A
1N/A=cut