SGML::DTD - SGML DTD parser
use SGML::DTD;

$dtd = new SGML::DTD;
$dtd->read_dtd(\*FILEHANDLE);

$dtd = new SGML::DTD \*FILEHANDLE;

SGML::DTD->set_ent_manager($entity_manager);
$dtd = new SGML::DTD;
$dtd->read_dtd(\*FILEHANDLE);

$dtd = new SGML::DTD \*FILEHANDLE, $entity_manager;
SGML::DTD is an SGML DTD parser. Either during object construction or through the read_dtd method, you pass SGML::DTD a filehandle for the DTD you want parsed. To avoid package scoping problems, pass a reference to the filehandle. If a filehandle is passed during object construction, undef is returned if a parsing error occurs. If the read_dtd method is used, 1 is returned when no errors occur and 0 is returned on an error.
When parsing the DTD, SGML::DTD builds up data structures that represent the information contained in the DTD. Various methods are provided to access DTD information. See Object Methods for the methods available.
To resolve external entity references, SGML::DTD uses an SGML::EntMan object. If no entity manager is passed to SGML::DTD, SGML::DTD uses the default construction rule of SGML::EntMan to create one. Normally, this will not be sufficient. Therefore, an SGML::EntMan object should be created first and loaded with the DTD-specific catalogs. Then instantiate an SGML::DTD object and pass the SGML::EntMan object to it. The SGML::EntMan object can be specified during SGML::DTD construction, or through the set_ent_manager class method.
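The sequence above can be sketched as follows. This is only an illustration under stated assumptions: the helper name parse_dtd_with_entman and the DTDFILE handle are hypothetical, the modules are loaded at run time so that the sketch also runs where perlSGML is not installed, and the catalog loading step is left as a comment because it is not described in this document.

```perl
use strict;
use warnings;

# Hypothetical helper: parse the DTD in $path with an explicitly created
# entity manager. Returns the SGML::DTD object, or undef if the perlSGML
# modules are not installed, the file cannot be opened, or a parsing
# error occurs.
sub parse_dtd_with_entman {
    my ($path) = @_;

    # Load at run time so this sketch still runs without perlSGML.
    return undef unless eval { require SGML::EntMan; require SGML::DTD; 1 };

    # Default construction rule. Load the DTD-specific catalogs into
    # $entman at this point (catalog loading is not covered here).
    my $entman = SGML::EntMan->new;

    local *DTDFILE;
    open(DTDFILE, '<', $path) or return undef;

    # Passing the filehandle reference to new parses immediately;
    # undef is returned on a parsing error.
    my $dtd = SGML::DTD->new(\*DTDFILE, $entman);
    close(DTDFILE);
    return $dtd;
}
```

A caller would test the result with defined() before invoking any object methods.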
The following describes the current limitations of SGML::DTD:
Concurrent DTDs are not distinguished. However, multiple SGML::DTD instances can be created by a program. Also, if the input contains a DOCTYPE declaration, SGML::DTD will terminate parsing at the close of the DOCTYPE declaration. Therefore, another SGML::DTD instance can be created if another DOCTYPE declaration is in the input stream.
LINKTYPE, SHORTREF, USEMAP declarations are ignored.
Rank element declarations are not supported.
SGML::DTD assumes the reference concrete syntax with the following exceptions: generic identifiers and entity names can be of any length and include the '_' character. Variant syntaxes can be supported by modifying variable definitions in the SGML::Syntax module.
SGML::DTD is not designed to be a DTD syntax validator. When a parsing error occurs, parsing is terminated and the error message is not very descriptive. For validation, a program like nsgmls should be used.
The entity manager is shared across all SGML::DTD instances. This can be a problem if multiple SGML::DTD instances are needed and the DTDs use the same external identifiers, but those identifiers should resolve to different system identifiers. This can be handled by changing the entity manager before parsing each DTD.
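A minimal sketch of that workaround, using only set_ent_manager, new, and read_dtd as documented here. The helper name parse_each_with_entman, the DTDFILE handle, and the [path, entity manager] job layout are hypothetical, and the module is loaded at run time so the sketch also runs where perlSGML is not installed.

```perl
use strict;
use warnings;

# Hypothetical helper: parse several DTDs, switching the shared entity
# manager before each one. Takes a list of [ $path, $entman ] pairs and
# returns a hash of path => SGML::DTD object for the DTDs that parsed.
sub parse_each_with_entman {
    my @jobs = @_;
    my %parsed;

    # Load at run time so this sketch still runs without perlSGML.
    return %parsed unless eval { require SGML::DTD; 1 };

    for my $job (@jobs) {
        my ($path, $entman) = @$job;

        local *DTDFILE;
        open(DTDFILE, '<', $path) or next;

        # The entity manager is class-wide, so set it right before parsing.
        SGML::DTD->set_ent_manager($entman);

        my $dtd = SGML::DTD->new;
        $parsed{$path} = $dtd if $dtd->read_dtd(\*DTDFILE);   # 1 on success
        close(DTDFILE);
    }
    return %parsed;
}
```

Each DTD is fully parsed before the next entity manager is installed, so the shared setting never changes in the middle of a parse.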
Element names are treated case-insensitively, but entity names are case-sensitive.
Class methods are methods that apply at the class level. Therefore, they may affect all instances of the SGML::DTD class. Class methods can be invoked like the following:
SGML::DTD->set_ent_manager($entman);
or,
set_ent_manager SGML::DTD $entman;
The following class methods are defined:
new creates a new SGML::DTD object. An optional filehandle argument can be specified to cause new to automatically parse the DTD represented by the filehandle. If a filehandle is specified, an optional SGML::EntMan object may also be specified for resolving any external entity references.
is_attr_keyword returns 1 if $word is an attribute content reserved value, otherwise, it returns 0. In the reference concrete syntax, the following values of $word will return 1:
Character case is ignored.
is_elem_keyword returns 1 if $word is an element content reserved value, otherwise, it returns 0. In the reference concrete syntax, the following values of $word will return 1:
Character case is ignored.
is_group_connector returns 1 if $char is a group connector, otherwise, it returns 0. The following values of $char will return 1:
is_occur_indicator returns 1 if $char is an occurrence indicator, otherwise, it returns 0. The following values of $char will return 1:
is_tag_name returns 1 if $string is a legal tag name, otherwise, it returns 0. Legal characters in a tag name are defined by the SGML::Syntax::$namechars variable. By default, a tag name may only contain the characters "A-Za-z_.-".
Set a function to be called during parsing when a comment declaration is encountered. The comment callback function is invoked as follows:
&$coderef(\$comment_txt);
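As an illustration of this calling convention, the sketch below defines a callback that takes the reference, works on a trimmed copy of the text, and collects it. The names @comments, $on_comment, and the sample text are hypothetical. The last statements simulate the invocation shown above; during a real parse, SGML::DTD makes that call itself.

```perl
use strict;
use warnings;

my @comments;

# Callback matching the documented convention: one argument, a reference
# to the comment text.
my $on_comment = sub {
    my ($txt_ref) = @_;
    (my $txt = $$txt_ref) =~ s/^\s+|\s+$//g;   # trim a copy, leave the original
    push @comments, $txt;
};

# Simulated invocation with sample text (hypothetical data).
my $comment_txt = "  Element declarations follow  ";
&$on_comment(\$comment_txt);

print "$_\n" for @comments;
```

Because the callback receives a reference, it can read the text without copying it; the sketch copies only because it trims the string.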
Set a function to be called when a debugging message is generated. The debug callback function is invoked as follows:
&$coderef(@string_list);
Debugging messages are only generated if verbosity is set to true.
Set the filehandle to send debugging messages. Messages are not sent to the filehandle if a debug callback function is registered. The default filehandle is STDERR.
Set the entity manager. The entity manager will be used to resolve any external identifiers during parsing. The entity manager should be of type SGML::EntMan.
Set a function to be called during parsing when an error occurs. The error callback function is invoked as follows:
&$coderef(@string_list);
Set the filehandle to send error messages. Messages are not sent to the filehandle if an error callback function is registered. The default filehandle is STDERR.
Set a function to be called during parsing when a processing instruction is encountered. The pi callback function is invoked as follows:
&$coderef(\$pi_txt);
Set callback for printing a tree entry when the print_tree object method is invoked. The tree entry callback function is invoked as follows:
&$coderef($iselem_flag, $string);
This method allows you to modify the text output of the print_tree method. However, it requires some understanding of the string passed into the callback to do anything interesting with it. The method mainly exists for the use of a specific application, so its use is discouraged.
Set whether SGML::DTD should output debugging messages as it parses a DTD.
Parse a DTD from FILEHANDLE.
The following methods are applicable after a DTD has been parsed:
get_base_children returns an array of the elements in the base model group of $element. The $andcon flag determines whether connector characters are included in the returned array: 0 => no connectors, 1 (non-zero) => connectors.
Example:
<!ELEMENT foo (x | y | z) +(a | b) -(m | n)>
The call
$dtd->get_base_children('foo')
will return
('x', 'y', 'z')
The call
$dtd->get_base_children('foo', 1)
will return
('(','x', '|', 'y', '|', 'z', ')')
Retrieve the attributes defined for $element. The return value is a hash where the keys are the attribute names and the values are the definitions of the attributes. Each definition is stored as a list. The first list value is the default value for the attribute (which may be an SGML reserved word). If the default value equals "#FIXED", then the next list value is the #FIXED value. The remaining list values are all the possible values for the attribute.
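The list layout described above can be unpacked as in the following sketch. The helper name split_attr_def and the sample %attrs data are hypothetical, and the sketch assumes each hash value is an array reference; in real use the hash would come from the SGML::DTD object.

```perl
use strict;
use warnings;

# Hypothetical helper: split one attribute definition list into its parts.
# Layout: default value first, then the fixed value when the default is
# "#FIXED", then the possible values.
sub split_attr_def {
    my @def     = @_;
    my $default = shift @def;
    my $fixed;
    $fixed = shift @def if defined $default && uc($default) eq '#FIXED';
    return { default => $default, fixed => $fixed, values => [@def] };
}

# Sample data shaped like the hash described above (hypothetical values).
my %attrs = (
    'align'   => [ 'left', 'left', 'center', 'right' ],
    'version' => [ '#FIXED', '1.0', 'CDATA' ],
    'id'      => [ '#IMPLIED', 'ID' ],
);

for my $name (sort keys %attrs) {
    my $info = split_attr_def(@{ $attrs{$name} });
    printf "%-8s default=%s%s values=%s\n",
        $name,
        $info->{default},
        defined $info->{fixed} ? " fixed=$info->{fixed}" : '',
        join('|', @{ $info->{values} });
}
```

Returning a hash reference keeps the three parts named, so callers do not have to remember the positional layout of the list.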
Retrieve all elements defined in the DTD. If $nosort is true, the elements are returned in the order they were defined in the DTD. Otherwise, they are in sorted order.
Retrieve all elements that have an attribute $attr_name defined in the DTD.
get_exc_children returns an array of the elements in the exclusion model group of $element. The $andcon flag determines whether connector characters are included in the returned array: 0 => no connectors, 1 (non-zero) => connectors.
Example:
<!ELEMENT foo (x | y | z) +(a | b) -(m | n)>
The call
$dtd->get_exc_children('foo')
will return
('m', 'n')
get_gen_ents returns an array of general entities. An optional flag argument can be passed to the routine to determine whether the returned entities are sorted: 0 => sorted, 1 => not sorted.
get_gen_data_ents returns an array of general data entities defined in the DTD. Data entities cover the following:
get_inc_children returns an array of the elements in the inclusion model group of $element. The $andcon flag determines whether connector characters are included in the returned array: 0 => no connectors, 1 (non-zero) => connectors.
Example:
<!ELEMENT foo (x | y | z) +(a | b) -(m | n)>
The call
$dtd->get_inc_children('foo')
will return
('a', 'b')
Get all elements that may be a parent of $element.
Get the top-most elements defined in the DTD. Top-most elements are those elements that cannot be contained within another element or can only be contained within themselves.
is_child returns 1 if $child can be a legal child of $element. Otherwise, 0 is returned.
is_element returns 1 if $element is defined in the DTD. Otherwise, 0 is returned.
print_tree outputs an ASCII tree structure of $element's content hierarchy to a depth of $depth to FILEHANDLE. See Element Trees for information on output created by print_tree.
Clear object data structures. Use this method if you want to use the same object to parse another DTD.
Once a DTD is parsed, the print_tree method can be used to output ASCII formatted trees of content hierarchies of elements. The print_tree method is invoked as follows:
$dtd->print_tree($element, $depth, \*FILEHANDLE)
$element is the element to print the tree for. $depth specifies the maximum depth of the tree. The root of the tree has a depth of 1. FILEHANDLE specifies where the output goes to.
The tree shows the overall content hierarchy for an element. Content hierarchies of descendants are also shown. Elements that already exist at a higher (or equal) level, or that lie beyond the maximum depth, are pruned. The string "..." is appended to an element if it has been pruned due to pre-existence at a higher (or equal) level. The content of a pruned element can be determined by searching for the complete tree of the element (i.e., the entry without "..."). Elements pruned because the maximum depth has been reached do not have "..." appended.
Example:
|__section+)
    |_(effect?, ...
    |__title, ...
    |__toc?, ...
    |__epc-fig*,
    |   |_(effect?, ...
    |   |__figure,
    |   |   |_(effect?, ...
    |   |   |__title, ...
    |   |   |__graphic+, ...
    |   |   |__assoc-text?)
Pruning must be done to avoid a combinatorial explosion. It is common for DTDs to define content hierarchies of infinite depth. Even with a predefined maximum depth, the generated tree can become very large.
Since the tree output is static, the inclusion and exclusion sets of elements are treated specially. Inclusion and exclusion elements inherited from ancestors are not propagated down to determine which elements are printed, but special markup is presented at a given element if inclusion or exclusion elements are inherited from ancestors. Inclusions and exclusions are not propagated down because of the pruning. Since an element may occur in multiple contexts -- and have different ancestral inclusions and exclusions in effect -- an element without "..." may be the only place of reference to see the content hierarchy of the element.
Example:
D1
 |  {+} idx needbegin needend newline
 |
 |_(head,
 |   |  {A+} idx needbegin needend newline
 |   |  {-} needbegin needend
 |   |
 |   |_(((#PCDATA |
 |   |____((acro |
 |   |   |  {A+} idx needbegin needend newline
 |   |   |  {A-} needbegin needend
 |   |   |
 |   |   |_(((#PCDATA |
 |   |   |____((super |
 ...
 |   |   |______sub)))*)) ...
Ignoring the lines starting with {}'s, one gets the content hierarchy of an element as defined by the DTD without concern for where it may occur in the overall structure. The {} lines give additional information regarding the element with respect to its existence within a specific context. For example, when an ACRO element occurs within D1,HEAD -- along with its normal content -- it can contain IDX and NEWLINE elements due to inclusions from ancestors. However, it cannot contain NEEDBEGIN and NEEDEND regardless of its defined content, since an ancestor excludes them. NEEDBEGIN and NEEDEND are excluded from ACRO.
Explanation of {}'s keys:
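{+}: inclusion elements defined by the element itself; {+} appended to the subelement entry.
{A+}: inclusion elements inherited from ancestors.
{-}: exclusion elements defined by the element itself; {-} appended to the subelement listing.
{A-}: exclusion elements inherited from ancestors.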
perl(1)
This software is part of the perlSGML package; see (http://www.oac.uci.edu/indiv/ehood/perlSGML.html)