sgmldecl.htm revision 7c478bd95313f5f23a4c958a745db2134aa03244
<!-- SCCS keyword
#pragma ident "%Z%%M% %I% %E% SMI"
-->
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML Strict//EN">
<HTML>
<HEAD>
<TITLE>SP - SGML declaration</TITLE>
</HEAD>
<BODY>
<H1>Handling of the SGML declaration in SP</H1>
<H2>Default SGML declaration</H2>
<P>
If the SGML declaration is omitted
and there is no applicable
<A HREF="catalog.htm#sgmldecl"><SAMP>SGMLDECL</SAMP></A>
entry in a catalog,
the following declaration will be implied:
<PRE>
&lt;!SGML "ISO 8879:1986"
CHARSET
BASESET "ISO 646-1983//CHARSET
International Reference Version (IRV)//ESC 2/5 4/0"
DESCSET 0 9 UNUSED
9 2 9
11 2 UNUSED
13 1 13
14 18 UNUSED
32 95 32
127 1 UNUSED
CAPACITY PUBLIC "ISO 8879:1986//CAPACITY Reference//EN"
SCOPE DOCUMENT
SYNTAX
SHUNCHAR CONTROLS 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
18 19 20 21 22 23 24 25 26 27 28 29 30 31 127 255
BASESET "ISO 646-1983//CHARSET International Reference Version
(IRV)//ESC 2/5 4/0"
DESCSET 0 128 0
FUNCTION RE 13
RS 10
SPACE 32
TAB SEPCHAR 9
NAMING LCNMSTRT ""
UCNMSTRT ""
LCNMCHAR "-."
UCNMCHAR "-."
NAMECASE GENERAL YES
ENTITY NO
DELIM GENERAL SGMLREF
SHORTREF SGMLREF
NAMES SGMLREF
QUANTITY SGMLREF
ATTCNT 99999999
ATTSPLEN 99999999
DTEMPLEN 24000
ENTLVL 99999999
GRPCNT 99999999
GRPGTCNT 99999999
GRPLVL 99999999
LITLEN 24000
NAMELEN 99999999
PILEN 24000
TAGLEN 99999999
TAGLVL 99999999
FEATURES
MINIMIZE DATATAG NO
OMITTAG YES
RANK YES
SHORTTAG YES
LINK SIMPLE YES 1000
IMPLICIT YES
EXPLICIT YES 1
OTHER CONCUR NO
SUBDOC YES 99999999
FORMAL YES
APPINFO NONE>
</PRE>
<P>
with the exception that all characters that are neither significant
nor shunned will be assigned to DATACHAR.
<H2>Character sets</H2>
<P>
A character in a base character set is described either by giving its
number in a universal character set, or by specifying a minimum
literal. The constraints on the choice of universal character set are
that characters that are significant in the SGML reference concrete
syntax must be in the universal character set and must have the same
number in the universal character set as in ISO 646 and that each
character in the character set must be represented by exactly one
number; that character numbers in the range 0 to 31 and 127 to 159 are
control characters (for the purpose of enforcing SHUNCHAR CONTROLS).
It is recommended that ISO 10646 (Unicode) be used as the universal
character set, except in environments where the normal document
character sets are large character set which cannot be compactly
described in terms of ISO 10646.
The public identifier of a base character set can be associated
with an entity that describes it by using a
<SAMP>PUBLIC</SAMP>
entry in the catalog entry file.
The entity must be a fragment
of an SGML declaration
consisting of the
portion of a character set description,
following the DESCSET keyword,
that is, it must be a sequence of character descriptions,
where each character description specifies a described character
number, the number of characters and
either a character number in the universal character set, a minimum literal
or the keyword
<SAMP>UNUSED</SAMP>.
Character numbers in the universal character set can be as big as
99999999.
<P>
In addition SP has built in knowledge of a few character sets.
These are identified using the designating sequence in the
public identifier. The following designating sequences are
recognized:
<DL>
<DT>
<SAMP>ESC 2/5 4/0</SAMP>
<DD>
The full set of ISO 646 IRV.
This is not a registered character set,
but is recommended by ISO 8879 (clause 10.2.2.4).
<DT>
<SAMP>ESC 2/8 4/0</SAMP>
<DD>
G0 set of ISO 646 IRV,
ISO Registration Number 2.
<DT>
<SAMP>ESC 2/8 4/2</SAMP>
<DD>
G0 set of ASCII,
ISO Registration Number 6.
<DT>
<SAMP>ESC 2/1 4/0</SAMP>
<DD>
C0 set of ISO 646,
ISO Registration Number 1.
</DL>
<P>
All the above character sets will be treated as mapping character numbers
0 to 127 inclusive as in ISO 646.
<P>
It is not necessary for every character set used in the SGML
declaration to be known to SP
provided that characters in the document character set that are
significant both in the reference concrete syntax and in the described
concrete syntax are described using known base character sets and that
characters that are significant in the described concrete syntax are
described using the same base character sets or the same minimum
literals in both the document character set description and the syntax
reference character set description.
<H2>Concrete syntaxes</H2>
<P>
The public identifier for a public concrete syntax can be associated
with an entity that describes using a
<SAMP>PUBLIC</SAMP>
entry in the catalog entry file.
The entity must be a fragment of an SGML declaration
consisting of a concrete syntax description
starting with the
<SAMP>SHUNCHAR</SAMP>
keyword
as in an SGML declaration.
The entity can also make use of the following extensions:
<UL>
<LI>
An
<I>added function</I>
can be expressed as a parameter literal
instead of a name.
<LI>
The replacement for a reference reserved name
can be expressed as a parameter literal instead of a name.
<LI>
The
<SAMP>LCNMSTRT</SAMP>,
<SAMP>UCNMSTRT</SAMP>,
<SAMP>LCNMCHAR</SAMP>
and
<SAMP>UCNMCHAR</SAMP>
keywords may each be followed by more than one parameter literal. A
sequence of parameter literals has the same meaning as a single
parameter literal whose content is the concatenation of the content of
each of the literals in the sequence. This extension is useful
because of the restriction on the length of a parameter literal in the
SGML declaration to 240 characters.
<LI>
The total number of characters specified for
<SAMP>UCNMCHAR</SAMP>
or
<SAMP>UCNMSTRT</SAMP>
may exceed the total number of characters specified for
<SAMP>LCNMCHAR</SAMP>
or
<SAMP>LCNMSTRT</SAMP>
respectively.
Each character in
<SAMP>UCNMCHAR</SAMP>
or
<SAMP>UCNMSTRT</SAMP>
which does not have a corresponding character in the same position in
<SAMP>LCNMCHAR</SAMP>
or
<SAMP>LCNMSTRT</SAMP>
is simply assigned to <SAMP>UCNMCHAR</SAMP> or <SAMP>UCNMSTRT</SAMP>
without making it the upper-case form of any character.
<LI>
A parameter following any of
<SAMP>LCNMSTRT</SAMP>,
<SAMP>UCNMSTRT</SAMP>,
<SAMP>LCNMCHAR</SAMP>
and
<SAMP>UCNMCHAR</SAMP>
keywords may be followed by
the name token <SAMP>...</SAMP>
(three periods) and another parameter literal.
This has the same meaning as the two parameter literals
with a parameter literal in between
containing in order each character whose number
is greater than the number of the last character in
the first parameter literal and less than the
number of the first character in the second
parameter literal.
A parameter literal must contain at least one character for each
<SAMP>...</SAMP>
to which it is adjacent.
<LI>
A number may be used as a parameter following the
<SAMP>LCNMSTRT</SAMP>,
<SAMP>UCNMSTRT</SAMP>,
<SAMP>LCNMCHAR</SAMP>
and
<SAMP>UCNMCHAR</SAMP>
keywords or as a delimiter in the
<SAMP>DELIM</SAMP>
section with the same meaning as a parameter literal
containing just a numeric character reference with that number.
<LI>
The parameters following the
<SAMP>LCNMSTRT</SAMP>,
<SAMP>UCNMSTRT</SAMP>,
<SAMP>LCNMCHAR</SAMP>
and
<SAMP>UCNMCHAR</SAMP>
keywords may be omitted.
This has the same meaning as specifying
an empty parameter literal.
<LI>
Within the specification of the short reference delimiters,
a parameter literal containing exactly one character
may be followed by the name token <SAMP>...</SAMP>
and another parameter literal containing exactly one character.
This has the same meaning as a sequence of parameter literals
one for each character number that is greater than or equal
to the number of the character in the first parameter literal
and less than or equal to the number of the character in the
second parameter literal.
</UL>
<H2>Capacity sets</H2>
<P>
The public identifier for a public capacity set can be associated
with an entity that describes using a
<SAMP>PUBLIC</SAMP>
entry in the catalog entry file.
The entity must be a fragment of an SGML declaration
consisting of a sequence of capacity names and numbers.
<P>
<ADDRESS>
James Clark<BR>
jjc@jclark.com
</ADDRESS>
</BODY>
</HTML>