1056N/ACompound Text Encoding
1056N/A
1056N/AX Consortium Standard
1056N/A
1056N/ARobert W. Scheifler
1056N/A
1276N/AX Consortium
1276N/A
1276N/AX Version 11, Release 7.7
1056N/A
1056N/AVersion 1.1
1056N/A
1276N/ACopyright © 1989 X Consortium
1056N/A
1056N/APermission is hereby granted, free of charge, to any person obtaining a copy of
1056N/Athis software and associated documentation files (the "Software"), to deal in
1056N/Athe Software without restriction, including without limitation the rights to
1056N/Ause, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
1056N/Aof the Software, and to permit persons to whom the Software is furnished to do
1056N/Aso, subject to the following conditions:
1056N/A
1056N/AThe above copyright notice and this permission notice shall be included in all
1056N/Acopies or substantial portions of the Software.
1056N/A
1276N/ATHE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
1056N/AIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
1056N/AFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE X
1056N/ACONSORTIUM BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
1056N/AACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
1056N/AWITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
1056N/A
1056N/AExcept as contained in this notice, the name of the X Consortium shall not be
1056N/Aused in advertising or otherwise to promote the sale, use or other dealings in
1056N/Athis Software without prior written authorization from the X Consortium.
1056N/A
1276N/AX Window System is a trademark of The Open Group.
1276N/A
1056N/A━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1056N/A
1056N/ATable of Contents
1056N/A
1056N/AOverview
1056N/AValues
1056N/AControl Characters
1056N/AStandard Character Set Encodings
1056N/AApproved Standard Encodings
1056N/ANon-Standard Character Set Encodings
1056N/ADirectionality
1056N/AResources
1056N/AFont Names
1056N/AExtensions
1056N/AErrors
1056N/A
1056N/AOverview
1056N/A
1056N/ACompound Text is a format for multiple character set data, such as
1056N/Amulti-lingual text. The format is based on ISO standards for encoding and
1056N/Acombining character sets. Compound Text is intended to be used in three main
1056N/Acontexts: inter-client communication using selections, as defined in the
1056N/AInter-Client Communication Conventions Manual (ICCCM); window properties (e.g.,
1056N/Awindow manager hints as defined in the ICCCM); and resources (e.g., as defined
1056N/Ain Xlib and the Xt Intrinsics).
1056N/A
1056N/ACompound Text is intended as an external representation, or interchange format,
1056N/Anot as an internal representation. It is expected (but not required) that
1056N/Aclients will convert Compound Text to some internal representation for
1056N/Aprocessing and rendering, and convert from that internal representation to
1056N/ACompound Text when providing textual data to another client.
1056N/A
1056N/AValues
1056N/A
1056N/AThe name of this encoding is "COMPOUND_TEXT". When text values are used in the
1056N/AICCCM-compliant selection mechanism or are stored as window properties in the
1056N/Aserver, the type used should be the atom for "COMPOUND_TEXT".
1056N/A
1056N/AOctet values are represented in this document as two decimal numbers in the
1056N/Aform col/row. This means the value (col * 16) + row. For example, 02/01 means
1056N/Athe value 33.
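
The notation can be illustrated with a small Python sketch. The helper names `octet` and `colrow` are hypothetical and are not part of any X library; they only restate the arithmetic given above.

```python
def octet(colrow_text: str) -> int:
    """Convert a "col/row" string such as "02/01" to its octet value."""
    col, row = (int(part) for part in colrow_text.split("/"))
    if not (0 <= col <= 15 and 0 <= row <= 15):
        raise ValueError(f"column and row must be in 0..15: {colrow_text!r}")
    return col * 16 + row


def colrow(value: int) -> str:
    """Format an octet value in the col/row notation used in this document."""
    if not 0 <= value <= 255:
        raise ValueError(f"not an octet: {value}")
    return f"{value // 16:02d}/{value % 16:02d}"


print(octet("02/01"))  # 33
print(colrow(0x1B))    # 01/11
```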
1056N/A
1056N/AFor our purposes, the octet encoding space is divided into four ranges:
1056N/A
1056N/AC0 octets from 00/00 to 01/15
1056N/AGL octets from 02/00 to 07/15
1056N/AC1 octets from 08/00 to 09/15
1056N/AGR octets from 10/00 to 15/15
1056N/A
1056N/AC0 and C1 are "control character" sets, while GL and GR are "graphic character"
1056N/Asets. Only a subset of C0 and C1 octets are used in the encoding, and depending
1056N/Aon the character set encoding defined as GL or GR, a subset of GL and GR octets
1056N/Amay be used; see below for details. All octets (00/00 to 15/15) may appear
1056N/Ainside the text of extended segments (defined below).
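
As an informal illustration, the four ranges can be told apart with simple comparisons. The function name `octet_range` is an assumption made for this sketch.

```python
def octet_range(value: int) -> str:
    """Return "C0", "GL", "C1", or "GR" for an octet value."""
    if not 0 <= value <= 0xFF:
        raise ValueError(f"not an octet: {value}")
    if value <= 0x1F:   # 00/00 to 01/15
        return "C0"
    if value <= 0x7F:   # 02/00 to 07/15
        return "GL"
    if value <= 0x9F:   # 08/00 to 09/15
        return "C1"
    return "GR"         # 10/00 to 15/15


print(octet_range(0x1B))  # C0
print(octet_range(0xA0))  # GR
```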
1056N/A
1056N/A[For those familiar with ISO 2022, we will use only an 8-bit environment, and
1056N/Awe will always use G0 for GL and G1 for GR.]
1056N/A
1056N/AControl Characters
1056N/A
1056N/AIn C0, only the following values will be used:
1056N/A
00/09 HT HORIZONTAL TABULATION
00/10 NL NEW LINE
01/11 ESC ESCAPE
1056N/A
1056N/AIn C1, only the following value will be used:
1056N/A
1056N/A09/11 CSI CONTROL SEQUENCE INTRODUCER
1056N/A
1056N/A[The alternate 7-bit CSI encoding 01/11 05/11 is not used in Compound Text.]
1056N/A
1056N/ANo control sequences are defined in Compound Text for changing the C0 and C1
1056N/Asets.
1056N/A
1056N/AA horizontal tab can be represented with the octet 00/09. Specification of
1056N/Atabulation width settings is not part of Compound Text and must be obtained
1056N/Afrom context (in an unspecified manner).
1056N/A
1056N/A[Inclusion of horizontal tab is for consistency with the STRING type currently
1056N/Adefined in the ICCCM.]
1056N/A
1056N/AA newline (line separator/terminator) can be represented with the octet 00/10.
1056N/A
1056N/A[Note that 00/10 is normally LINEFEED, but is being interpreted as NEWLINE.
1056N/AThis can be thought of as using the (deprecated) NEW LINE mode, E.1.3, in ISO
1056N/A6429. Use of this value instead of 08/05 (NEL, NEXT LINE) is for consistency
1056N/Awith the STRING type currently defined in the ICCCM.]
1056N/A
1056N/AThe remaining C0 and C1 values (01/11 and 09/11) are only used in the control
1056N/Asequences defined below.
1056N/A
1056N/AStandard Character Set Encodings
1056N/A
1056N/AThe default GL and GR sets in Compound Text correspond to the left and right
1056N/Ahalves of ISO 8859-1 (Latin 1). As such, any legal instance of a STRING type
1056N/A(as defined in the ICCCM) is also a legal instance of type COMPOUND_TEXT.
1056N/A
[The implied initial state in ISO 2022 is defined with the sequence:

01/11 02/00 04/03  G0 and G1 in an 8-bit environment only. Designation also invokes.
01/11 02/00 04/07  In an 8-bit environment, C1 represented as 8 bits.
01/11 02/00 04/09  Graphic character sets can be 94 or 96.
01/11 02/00 04/11  8-bit code is used.
01/11 02/08 04/02  Designate ASCII into G0.
01/11 02/13 04/01  Designate right-hand part of ISO Latin-1 into G1.]
1056N/A
1056N/ATo define one of the approved standard character set encodings to be the GL
1056N/Aset, one of the following control sequences is used:
1056N/A
01/11 02/08 {I} F 94 character set
01/11 02/04 02/08 {I} F 94^N character set
1056N/A
1056N/ATo define one of the approved standard character set encodings to be the GR
1056N/Aset, one of the following control sequences is used:
1056N/A
1056N/A01/11 02/09 {I} F 94 character set
1056N/A01/11 02/13 {I} F 96 character set
1056N/A01/11 02/04 02/09 {I} F 94^N character set
1056N/A
The "F" in the control sequences above stands for "Final character", which is
always in the range 04/00 to 07/14. The "{I}" stands for zero or more
"intermediate characters", which are always in the range 02/00 to 02/15, with
the first intermediate character always in the range 02/01 to 02/03. The
registration authority has defined an "{I} F" sequence for each registered
character set encoding.
1056N/A
1056N/A[Final characters for private encodings (in the range 03/00 to 03/15) are not
1056N/Apermitted here in Compound Text.]
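
The designation sequences above can be recognized with a short parser. The following Python sketch is one possible reading of the rules, not a reference implementation; the name `parse_designation` and the shape of its return value are assumptions made for illustration.

```python
ESC = 0x1B  # 01/11


def parse_designation(data: bytes, pos: int = 0):
    """Parse one GL/GR designation sequence starting at data[pos].

    Returns (target, kind, intermediates, final, next_pos), where target is
    "GL" or "GR" and kind is "94", "96", or "94^N".
    """
    if len(data) - pos < 3 or data[pos] != ESC:
        raise ValueError("not a designation sequence")
    i = pos + 1
    multibyte = data[i] == 0x24               # 02/04 introduces a 94^N set
    if multibyte:
        i += 1
    selector = data[i]
    i += 1
    if selector == 0x28:                      # 02/08
        target, kind = "GL", "94"
    elif selector == 0x29:                    # 02/09
        target, kind = "GR", "94"
    elif selector == 0x2D and not multibyte:  # 02/13
        target, kind = "GR", "96"
    else:
        raise ValueError("unknown designation")
    if multibyte:
        kind = "94^N"
    start = i
    while i < len(data) and 0x20 <= data[i] <= 0x2F:
        i += 1                                # zero or more intermediates
    intermediates = data[start:i]
    if intermediates and not 0x21 <= intermediates[0] <= 0x23:
        raise ValueError("first intermediate must be in 02/01 to 02/03")
    if i >= len(data) or not 0x40 <= data[i] <= 0x7E:
        raise ValueError("final character must be in 04/00 to 07/14")
    return target, kind, intermediates, data[i], i + 1


print(parse_designation(b"\x1b\x2d\x41"))  # ('GR', '96', b'', 65, 3)
```

Because the final character is checked against 04/00 to 07/14, a private final such as 03/00 is rejected, in line with the note above.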
1056N/A
1056N/AFor GL, octet 02/00 is always defined as SPACE, and octet 07/15 (normally
1056N/ADELETE) is never used. For a 94-character set defined as GR, octets 10/00 and
1056N/A15/15 are never used.
1056N/A
1056N/A[This is consistent with ISO 2022.]
1056N/A
1056N/AA 94^N character set uses N octets (N > 1) for each character. The value of N
1056N/Ais derived from the column value for F:
1056N/A
1056N/Acolumn 04 or 05 2 octets
1056N/Acolumn 06 3 octets
1056N/Acolumn 07 4 or more octets
1056N/A
In a 94^N encoding, the octet values 02/00 and 07/15 (in GL) and 10/00 and
15/15 (in GR) are never used.
1056N/A
1056N/A[The column definitions come from ISO 2022.]
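
The column rule can be sketched as a small lookup on the final character. The function name `octets_per_char` is hypothetical, and because column 07 only means "4 or more", the sketch returns 4 as a lower bound for that column.

```python
def octets_per_char(final: int) -> int:
    """Return N for a 94^N set, derived from the column of the final character.

    For column 07 the value 4 is a minimum, not an exact count.
    """
    if not 0x40 <= final <= 0x7E:
        raise ValueError("final character must be in 04/00 to 07/14")
    column = final >> 4
    if column in (4, 5):
        return 2
    if column == 6:
        return 3
    return 4  # column 07: 4 or more octets


print(octets_per_char(0x42))  # 2
```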
1056N/A
Once a GL or GR set has been defined, all further octets in that range (except
within control sequences and extended segments) are interpreted with respect to
that character set encoding, until the GL or GR set is redefined. GL and GR
sets can be defined independently; they do not have to be defined in pairs.
1056N/A
1056N/ANote that when actually using a character set encoding as the GR set, you must
1056N/Aforce the most significant bit (08/00) of each octet to be a one, so that it
1056N/Afalls in the range 10/00 to 15/15.
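
As a minimal sketch of this rule, the following Python fragment sets the most significant bit of each octet of a 7-bit code before emitting it in GR. The helper name `to_gr` and the sample character code are assumptions chosen for illustration.

```python
def to_gr(code: bytes) -> bytes:
    """Set the most significant bit of each 7-bit octet so it falls in GR."""
    if any(not 0x20 <= b <= 0x7F for b in code):
        raise ValueError("expected octets in the range 02/00 to 07/15")
    return bytes(b | 0x80 for b in code)


# JIS X0208 designated as GR (01/11 02/04 02/09 04/02), followed by one
# two-octet character whose 7-bit code is 03/00 02/01.
segment = b"\x1b\x24\x29\x42" + to_gr(b"\x30\x21")
print(segment.hex(" "))  # 1b 24 29 42 b0 a1
```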
1056N/A
1056N/A[Control sequences to specify character set encoding revisions (as in section
1056N/A6.3.13 of ISO 2022) are not used in Compound Text. Revision indicators do not
1056N/Aappear to provide useful information in the context of Compound Text. The most
1056N/Arecent revision can always be assumed, since revisions are upward compatible.]
1056N/A
1056N/AApproved Standard Encodings
1056N/A
1056N/AThe following are the approved standard encodings to be used with Compound
1056N/AText. Note that none have Intermediate characters; however, a good parser will
1056N/Astill deal with Intermediate characters in the event that additional encodings
1056N/Aare later added to this list.
1056N/A
┌─────┬─────┬───────────────────────────────────────────────────────────────────┐
│{I} F│94/96│Description                                                        │
├─────┼─────┼───────────────────────────────────────────────────────────────────┤
│04/02│94   │7-bit ASCII graphics (ANSI X3.4-1968), Left half of ISO 8859 sets  │
├─────┼─────┼───────────────────────────────────────────────────────────────────┤
│04/09│94   │Right half of JIS X0201-1976 (reaffirmed 1984), 8-Bit              │
│     │     │Alphanumeric-Katakana Code                                         │
├─────┼─────┼───────────────────────────────────────────────────────────────────┤
│04/10│94   │Left half of JIS X0201-1976 (reaffirmed 1984), 8-Bit               │
│     │     │Alphanumeric-Katakana Code                                         │
├─────┼─────┼───────────────────────────────────────────────────────────────────┤
│04/01│96   │Right half of ISO 8859-1, Latin alphabet No. 1                     │
├─────┼─────┼───────────────────────────────────────────────────────────────────┤
│04/02│96   │Right half of ISO 8859-2, Latin alphabet No. 2                     │
├─────┼─────┼───────────────────────────────────────────────────────────────────┤
│04/03│96   │Right half of ISO 8859-3, Latin alphabet No. 3                     │
├─────┼─────┼───────────────────────────────────────────────────────────────────┤
│04/04│96   │Right half of ISO 8859-4, Latin alphabet No. 4                     │
├─────┼─────┼───────────────────────────────────────────────────────────────────┤
│04/06│96   │Right half of ISO 8859-7, Latin/Greek alphabet                     │
├─────┼─────┼───────────────────────────────────────────────────────────────────┤
│04/07│96   │Right half of ISO 8859-6, Latin/Arabic alphabet                    │
├─────┼─────┼───────────────────────────────────────────────────────────────────┤
│04/08│96   │Right half of ISO 8859-8, Latin/Hebrew alphabet                    │
├─────┼─────┼───────────────────────────────────────────────────────────────────┤
│04/12│96   │Right half of ISO 8859-5, Latin/Cyrillic alphabet                  │
├─────┼─────┼───────────────────────────────────────────────────────────────────┤
│04/13│96   │Right half of ISO 8859-9, Latin alphabet No. 5                     │
├─────┼─────┼───────────────────────────────────────────────────────────────────┤
│04/01│94^2 │GB2312-1980, China (PRC) Hanzi                                     │
├─────┼─────┼───────────────────────────────────────────────────────────────────┤
│04/02│94^2 │JIS X0208-1983, Japanese Graphic Character Set                     │
├─────┼─────┼───────────────────────────────────────────────────────────────────┤
│04/03│94^2 │KS C5601-1987, Korean Graphic Character Set                        │
└─────┴─────┴───────────────────────────────────────────────────────────────────┘
1056N/A
1056N/AThe sets listed as "Left half of ..." should always be defined as GL. The sets
1056N/Alisted as "Right half of ..." should always be defined as GR. Other sets can be
1056N/Adefined either as GL or GR.
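
One way to apply the list and the GL/GR rule above is a lookup keyed on the set size and final character. This Python sketch is illustrative only; the names `APPROVED` and `designation_allowed`, and the use of `None` to mean "either GL or GR", are assumptions.

```python
# (set size, final character) -> (description, required target or None)
APPROVED = {
    ("94", 0x42): ("ASCII, left half of ISO 8859 sets", "GL"),
    ("94", 0x49): ("Right half of JIS X0201-1976", "GR"),
    ("94", 0x4A): ("Left half of JIS X0201-1976", "GL"),
    ("96", 0x41): ("Right half of ISO 8859-1", "GR"),
    ("96", 0x42): ("Right half of ISO 8859-2", "GR"),
    ("96", 0x43): ("Right half of ISO 8859-3", "GR"),
    ("96", 0x44): ("Right half of ISO 8859-4", "GR"),
    ("96", 0x46): ("Right half of ISO 8859-7", "GR"),
    ("96", 0x47): ("Right half of ISO 8859-6", "GR"),
    ("96", 0x48): ("Right half of ISO 8859-8", "GR"),
    ("96", 0x4C): ("Right half of ISO 8859-5", "GR"),
    ("96", 0x4D): ("Right half of ISO 8859-9", "GR"),
    ("94^2", 0x41): ("GB2312-1980", None),
    ("94^2", 0x42): ("JIS X0208-1983", None),
    ("94^2", 0x43): ("KS C5601-1987", None),
}


def designation_allowed(kind: str, final: int, target: str) -> bool:
    """Return True if the set is approved and may be defined as target."""
    entry = APPROVED.get((kind, final))
    if entry is None:
        return False
    _, required = entry
    return required is None or required == target


print(designation_allowed("94", 0x42, "GL"))  # True
print(designation_allowed("94", 0x42, "GR"))  # False
```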
1056N/A
1056N/ANon-Standard Character Set Encodings
1056N/A
1056N/ACharacter set encodings that are not in the list of approved standard encodings
1056N/Acan be included using "extended segments". An extended segment begins with one
1056N/Aof the following sequences:
1056N/A
01/11 02/05 02/15 03/00 M L variable number of octets per character
01/11 02/05 02/15 03/01 M L 1 octet per character
01/11 02/05 02/15 03/02 M L 2 octets per character
01/11 02/05 02/15 03/03 M L 3 octets per character
01/11 02/05 02/15 03/04 M L 4 octets per character
1056N/A
1056N/A[This uses the "other coding system" of ISO 2022, using private Final
1056N/Acharacters.]
1056N/A
The "M" and "L" octets represent a 14-bit unsigned value giving the number of
octets that appear in the remainder of the segment. The number is computed as
((M - 128) * 128) + (L - 128). The most significant bit of M and of L is always
set to one. The remainder of the segment consists of two parts: the name of the
character set encoding and the actual text. The name of the encoding comes
first and is separated from the text by the octet 00/02 (STX, START OF TEXT).
Note that the length defined by M and L includes the encoding name and
separator.
1056N/A
1056N/A[The encoding of the length is chosen to avoid having zero octets in Compound
1056N/AText when possible, because embedded NUL values are problematic in many C
1056N/Alanguage routines. The use of zero octets cannot be ruled out entirely however,
1056N/Asince some octets in the actual text of the extended segment may have to be
1056N/Azero.]
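
The length arithmetic and segment layout can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the function names are hypothetical, and the encoding name used in the example ("EXAMPLE-0") is invented and not a registered name.

```python
STX = 0x02  # 00/02, separates the encoding name from the text


def encode_extended_segment(name: bytes, text: bytes, width: int = 0) -> bytes:
    """Build an extended segment; width 0 means variable octets per character."""
    if not 0 <= width <= 4:
        raise ValueError("width must be 0 (variable) or 1 to 4")
    if STX in name:
        raise ValueError("encoding name must not contain STX")
    body = name + bytes([STX]) + text
    if len(body) > 0x3FFF:
        raise ValueError("segment body exceeds the 14-bit length field")
    m = (len(body) >> 7) | 0x80    # high 7 bits, most significant bit set
    l = (len(body) & 0x7F) | 0x80  # low 7 bits, most significant bit set
    return bytes([0x1B, 0x25, 0x2F, 0x30 + width, m, l]) + body


def decode_extended_segment(data: bytes, pos: int = 0):
    """Return (width, name, text, next_pos) for the segment at data[pos]."""
    header = data[pos:pos + 6]
    if (len(header) < 6 or header[:3] != b"\x1b\x25\x2f"
            or not 0x30 <= header[3] <= 0x34):
        raise ValueError("not an extended segment")
    m, l = header[4], header[5]
    if m < 0x80 or l < 0x80:
        raise ValueError("M and L must have the most significant bit set")
    length = (m - 128) * 128 + (l - 128)
    body = data[pos + 6:pos + 6 + length]
    if len(body) != length:
        raise ValueError("truncated extended segment")
    name, sep, text = body.partition(bytes([STX]))
    if not sep:
        raise ValueError("missing STX separator")
    return header[3] - 0x30, name, text, pos + 6 + length


seg = encode_extended_segment(b"EXAMPLE-0", b"\x00\x81abc", width=1)
print(decode_extended_segment(seg)[1:3])  # (b'EXAMPLE-0', b'\x00\x81abc')
```

The text part is treated as opaque, so zero octets and octets with either value of the most significant bit pass through unchanged.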
1056N/A
The name of the encoding should be registered with the X Consortium to avoid
conflicts and should, when appropriate, match the CharSet Registry and Encoding
registration used in the X Logical Font Description. The name itself should be
encoded using ISO 8859-1 (Latin 1), should not use question mark (03/15) or
asterisk (02/10), and should use hyphen (02/13) only in accordance with the X
Logical Font Description.
1056N/A
1056N/AExtended segments are not to be used for any character set encoding that can be
1056N/Aconstructed from a GL/GR pair of approved standard encodings. For example, it
1056N/Ais incorrect to use an extended segment for any of the ISO 8859 family of
1056N/Aencodings.
1056N/A
1056N/AIt should be noted that the contents of an extended segment are arbitrary; for
1056N/Aexample, they may contain octets in the C0 and C1 ranges, including 00/00, and
1056N/Aoctets comprising a given character may differ in their most significant bit.
1056N/A
1056N/A[ISO-registered "other coding systems" are not used in Compound Text; extended
1056N/Asegments are the only mechanism for non-2022 encodings.]
1056N/A
1056N/ADirectionality
1056N/A
1056N/AIf desired, horizontal text direction can be indicated using the following
1056N/Acontrol sequences:
1056N/A
1056N/A09/11 03/01 05/13 begin left-to-right text
1056N/A09/11 03/02 05/13 begin right-to-left text
1056N/A09/11 05/13 end of string
1056N/A
1056N/A[This is a subset of the SDS (START DIRECTED STRING) control in the Draft
1056N/ABidirectional Addendum to ISO 6429.]
1056N/A
1056N/ADirectionality can be nested. Logically, a stack of directions is maintained.
1056N/AEach of the first two control sequences pushes a new direction on the stack,
1056N/Aand the third sequence (revert) pops a direction from the stack. The stack
1056N/Astarts out empty at the beginning of a Compound Text string. When the stack is
1056N/Aempty, the directionality of the text is unspecified.
1056N/A
1056N/ADirectionality applies to all subsequent text, whether in GL, GR, or an
1056N/Aextended segment. If the desired directionality of GL, GR, or extended segments
1056N/Adiffers, then directionality control sequences must be inserted when switching
1056N/Abetween them.
1056N/A
1056N/ANote that definition of GL and GR sets is independent of directionality;
1056N/Adefining a new GL or GR set does not change the current directionality, and
1056N/Apushing or popping a directionality does not change the current GL and GR
1056N/Adefinitions.
1056N/A
1056N/ASpecification of directionality is entirely optional; text direction should be
1056N/Aclear from context in most cases. However, it must be the case that either all
1056N/Acharacters in a Compound Text string have explicitly specified direction or
1056N/Athat all characters have unspecified direction. That is, if directionality
1056N/Acontrol sequences are used, the first such control sequence must precede the
1056N/Afirst graphic character in a Compound Text string, and graphic characters are
1056N/Anot permitted whenever the directionality stack is empty.
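
The stack model and the all-or-nothing rule can be sketched over a pre-tokenized string. In this Python sketch the tokenization itself is assumed to have been done already; the name `check_directionality` is hypothetical, and treating a pop from an empty stack as an error is an assumption, since the text above does not address that case.

```python
LTR = b"\x9b\x31\x5d"  # 09/11 03/01 05/13, begin left-to-right text
RTL = b"\x9b\x32\x5d"  # 09/11 03/02 05/13, begin right-to-left text
END = b"\x9b\x5d"      # 09/11 05/13, end of string


def check_directionality(tokens) -> bool:
    """Check the directionality rules over a sequence of tokens.

    Each token is LTR, RTL, END, or any other bytes object standing for a
    run of graphic characters.
    """
    stack = []
    saw_direction = False
    saw_undirected_text = False
    for token in tokens:
        if token in (LTR, RTL):
            stack.append(token)
            saw_direction = True
        elif token == END:
            if not stack:
                return False  # assumption: popping an empty stack is an error
            stack.pop()
        elif not stack:
            saw_undirected_text = True
        if saw_direction and saw_undirected_text:
            return False      # mixes directed and undirected characters
    return True


print(check_directionality([RTL, b"abc", LTR, b"123", END, b"def", END]))  # True
print(check_directionality([b"abc", LTR, b"def", END]))                    # False
```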
1056N/A
1056N/AResources
1056N/A
1056N/ATo use Compound Text in a resource, you can simply treat all octets as if they
1056N/Awere ASCII/Latin-1 and just replace all "\" octets (05/12) with the two octets
1056N/A"\\", all newline octets (00/10) with the two octets "\n", and all zero octets
1056N/Awith the four octets "\000". It is up to the client making use of the resource
1056N/Ato interpret the data as Compound Text; the policy by which this is ascertained
1056N/Ais not constrained by the Compound Text specification.
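
The three substitutions can be sketched in a few lines of Python. The function name `escape_for_resource` is hypothetical; the sketch only performs the replacements listed above and leaves all other octets untouched.

```python
def escape_for_resource(compound_text: bytes) -> bytes:
    """Escape backslash, newline, and zero octets for use in a resource."""
    out = bytearray()
    for b in compound_text:
        if b == 0x5C:        # 05/12, backslash
            out += b"\\\\"
        elif b == 0x0A:      # 00/10, newline
            out += b"\\n"
        elif b == 0x00:      # zero octet
            out += b"\\000"
        else:
            out.append(b)
    return bytes(out)


print(escape_for_resource(b"a\\b\nc\x00"))  # b'a\\\\b\\nc\\000'
```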
1056N/A
1056N/AFont Names
1056N/A
1056N/AThe following CharSet names for the standard character set encodings are
1056N/Aregistered for use in font names under the X Logical Font Description:
1056N/A
┌───────────────┬──────────────────────────────┬──────────────────────────────┐
│Name           │Encoding Standard             │Description                   │
├───────────────┼──────────────────────────────┼──────────────────────────────┤
│ISO8859-1      │ISO 8859-1                    │Latin alphabet No. 1          │
├───────────────┼──────────────────────────────┼──────────────────────────────┤
│ISO8859-2      │ISO 8859-2                    │Latin alphabet No. 2          │
├───────────────┼──────────────────────────────┼──────────────────────────────┤
│ISO8859-3      │ISO 8859-3                    │Latin alphabet No. 3          │
├───────────────┼──────────────────────────────┼──────────────────────────────┤
│ISO8859-4      │ISO 8859-4                    │Latin alphabet No. 4          │
├───────────────┼──────────────────────────────┼──────────────────────────────┤
│ISO8859-5      │ISO 8859-5                    │Latin/Cyrillic alphabet       │
├───────────────┼──────────────────────────────┼──────────────────────────────┤
│ISO8859-6      │ISO 8859-6                    │Latin/Arabic alphabet         │
├───────────────┼──────────────────────────────┼──────────────────────────────┤
│ISO8859-7      │ISO 8859-7                    │Latin/Greek alphabet          │
├───────────────┼──────────────────────────────┼──────────────────────────────┤
│ISO8859-8      │ISO 8859-8                    │Latin/Hebrew alphabet         │
├───────────────┼──────────────────────────────┼──────────────────────────────┤
│ISO8859-9      │ISO 8859-9                    │Latin alphabet No. 5          │
├───────────────┼──────────────────────────────┼──────────────────────────────┤
│JISX0201.1976-0│JIS X0201-1976 (reaffirmed    │8-bit Alphanumeric-Katakana   │
│               │1984)                         │Code                          │
├───────────────┼──────────────────────────────┼──────────────────────────────┤
│GB2312.1980-0  │GB2312-1980, GL encoding      │China (PRC) Hanzi             │
├───────────────┼──────────────────────────────┼──────────────────────────────┤
│JISX0208.1983-0│JIS X0208-1983, GL encoding   │Japanese Graphic Character Set│
├───────────────┼──────────────────────────────┼──────────────────────────────┤
│KSC5601.1987-0 │KS C5601-1987, GL encoding    │Korean Graphic Character Set  │
└───────────────┴──────────────────────────────┴──────────────────────────────┘
1056N/A
1056N/AExtensions
1056N/A
1056N/AThere is no absolute requirement for a parser to deal with anything but the
1056N/Aparticular encoding syntax defined in this specification. However, it is
1056N/Apossible that Compound Text may be extended in the future, and as such it may
1056N/Abe desirable to construct the parser to handle 2022/6429 syntax more generally.
1056N/A
1056N/AThere are two general formats covering all control sequences that are expected
1056N/Ato appear in extensions:
1056N/A
1056N/A01/11 {I} F
1056N/A
1056N/AFor this format, I is always in the range 02/00 to 02/15, and F is always in
1056N/Athe range 03/00 to 07/14.
1056N/A
1056N/A09/11 {P} {I} F
1056N/A
1056N/AFor this format, P is always in the range 03/00 to 03/15, I is always in the
1056N/Arange 02/00 to 02/15, and F is always in the range 04/00 to 07/14.
1056N/A
1056N/AIn addition, new (singleton) control characters (in the C0 and C1 ranges) might
1056N/Abe defined in the future.
1056N/A
1056N/AFinally, new kinds of "segments" might be defined in the future using syntax
1056N/Asimilar to extended segments:
1056N/A
1056N/A01/11 02/05 02/15 F M L
1056N/A
For this format, F is in the range 03/05 to 03/15. M and L are as defined in
extended segments. Such a segment will always be followed by the number of
octets defined by M and L. These octets can have arbitrary values and need not
follow the internal structure defined for current extended segments.
1056N/A
1056N/AIf extensions to this specification are defined in the future, then any string
1056N/Aincorporating instances of such extensions must start with one of the following
1056N/Acontrol sequences:
1056N/A
1056N/A01/11 02/03 V 03/00 ignoring extensions is OK
1056N/A01/11 02/03 V 03/01 ignoring extensions is not OK
1056N/A
1056N/AIn either case, V is in the range 02/00 to 02/15 and indicates the major
1056N/Aversion minus one of the specification being used. These version control
1056N/Asequences are for use by clients that implement earlier versions, but have
1056N/Aimplemented a general parser. The first control sequence indicates that it is
1056N/Aacceptable to ignore all extension control sequences; no mandatory information
1056N/Awill be lost in the process. The second control sequence indicates that it is
1056N/Aunacceptable to ignore any extension control sequences; mandatory information
1056N/Awould be lost in the process. In general, it will be up to the client
1056N/Agenerating the Compound Text to decide which control sequence to use.
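
A general parser might recognize the version control sequence along the following lines. This Python sketch is a hedged illustration; the name `parse_version_sequence` and its return convention are assumptions.

```python
def parse_version_sequence(data: bytes):
    """Return (major_version, may_ignore_extensions), or None if absent.

    The sequence is 01/11 02/03 V F, where V encodes the major version
    minus one and F is 03/00 (ignoring is OK) or 03/01 (ignoring is not OK).
    """
    if len(data) < 4 or data[0] != 0x1B or data[1] != 0x23:
        return None
    v, flag = data[2], data[3]
    if not 0x20 <= v <= 0x2F or flag not in (0x30, 0x31):
        return None
    return (v - 0x20) + 1, flag == 0x30


print(parse_version_sequence(b"\x1b\x23\x21\x30text"))  # (2, True)
print(parse_version_sequence(b"\x1b\x23\x20\x31text"))  # (1, False)
```

A client that implements only this version of the specification could use the second element of the result to decide between skipping unknown extension sequences and rejecting the string.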
1056N/A
1056N/AErrors
1056N/A
1056N/AIf a Compound Text string does not match the specification here (e.g., uses
1056N/Aundefined control characters, or undefined control sequences, or incorrectly
1056N/Aformatted extended segments), it is best to treat the entire string as invalid,
1056N/Aexcept as indicated by a version control sequence.
1056N/A