A document (or Web site) written in HTML may comprise many "chapters" (HTML source files), typically with a title page and a Table of Contents that links to all of the chapter files. An example is the XTRAN User's Manual, which comprises many chapters.
In some situations, we may have variants of the document, in which not all chapters participate in every variant. For example, each XTRAN licensee receives a variant of the XTRAN User's Manual specific to the licensed activity. Someone licensing XTRAN for, say, translation of Pascal to C would get an XTRAN User's Manual containing only those chapters relevant to that activity. Someone else licensing XTRAN for, say, analysis of VAX assembler would receive a different variant, containing a different selection of chapters.
Each variant must have its own title page and table of contents, which links to the chapters that participate in that variant and to the index we will create for it.
For the reader's convenience, we want each occurrence of a significant term anywhere in a variant of the document to be cross linked to that term's definition, which is likely to be in a different chapter. However, we don't want to cross link an occurrence of a term whose definition is in a chapter not included in that variant.
We also want a thorough alphabetical index at the end of the document. It should include only entries whose bookmarks are in chapters included in our variant. For the reader's convenience, each index entry should also show the section and chapter in which it occurs, for context, and the index should start with a letter-by-letter index to the index.
As any professional indexer will tell you, the hard work is deciding how and where to index terms in the document. Although many attempts have been made to automate this process, with some success, the only way to guarantee a really good job is for a knowledgeable human being to carefully control the indexing process and "fine tune" the final product.
However, once the basic indexing of the chapters has been done, we can automate the rest of the work: cross referencing item occurrences in a document variant's chapters and generating the document variant's index.
XTRAN treats HTML as a computer language: it represents each tag, each segment of nonmarkup text, and each end tag as a "statement", and it represents each attribute of a tag as a "statement attribute".
XTRAN's internal representation of HTML is essentially the same as for all other computer languages XTRAN manipulates, including assemblers, 3GLs such as Pascal and PL/I, meta-data languages such as XML, scripting languages, and database languages. This means that the full power of XTRAN's rules language is available to manipulate HTML.
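To make the "statement" view concrete, here is a minimal Python sketch that flattens HTML into a list of tag, text, and end tag records. It is not XTRAN's internal representation, which the text does not show; the names `Statement`, `StatementParser`, and `parse_statements` are hypothetical, and the sketch relies on the Python standard library's `html.parser`, which reports tag and attribute names in lower case and produces a flat list rather than a nested structure.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class Statement:
    """One HTML "statement" (hypothetical record, not XTRAN's own format)."""
    kind: str    # "tag", "text", or "endtag"
    value: str   # tag name, or the text itself
    attrs: dict = field(default_factory=dict)  # "statement attributes" of a tag


class StatementParser(HTMLParser):
    """Collect tags, nonmarkup text, and end tags in document order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.statements = []

    def handle_starttag(self, tag, attrs):
        # Valueless attributes such as NOINDEX arrive with the value None.
        self.statements.append(Statement("tag", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.statements.append(Statement("endtag", tag))

    def handle_data(self, data):
        if data.strip():  # ignore whitespace-only text between tags
            self.statements.append(Statement("text", data))


def parse_statements(html):
    parser = StatementParser()
    parser.feed(html)
    parser.close()
    return parser.statements


for statement in parse_statements('<A NAME="XTRAN" NOINDEX>XTRAN</A>'):
    print(statement.kind, statement.value, statement.attrs)
```

Each printed line corresponds to one statement; the bookmark's `NAME` and `NOINDEX` attributes travel with the tag statement rather than appearing as statements of their own.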
This example shows how we can use several versions of XTRAN to automatically cross link and index all occurrences of specified text items. The example assumes that the target to which each item is to be cross linked and indexed is marked with a bookmark that has an appropriate name. Such a bookmark may or may not enclose related text.
We use two (illegal) HTML <A> attributes in the bookmarks, NOLINK and NOINDEX, to control, in the original HTML files, whether each bookmark is cross linked and/or indexed. These attributes are removed in the process of cross linking, so they don't show up in the final document variant.
We also use a set of XTRAN styling rules for HTML output that specify, for each tag, whether it and its end tag (if any) are to get preceding and/or following line breaks in the output.
Our strategy is as follows:
<A NAME="xxx" [[NOLINK]] [[NOINDEX]]>
where the optional NOLINK and/or NOINDEX attributes control subsequent cross linking and indexing.
All changes to the document must be made to these versions.
<variant>.nam that lists all of the chapter HTML files that comprise that variant.
NOLINK and/or NOINDEX attributes in the original HTML
For example, assuming that there is, in chapter1.html:
<A NAME="XTRAN">XTRAN</A>
then an occurrence, perhaps in a different chapter file, of
XTRAN
in nonmarkup text would be replaced with
<A HREF="chapter1.html#XTRAN">XTRAN</A>
creating a link the reader can follow to the bookmark in chapter1.html.
However, if the bookmark's text occurs in the most recent header, the rules don't insert a cross link, since it would probably just link to the same immediate area.
This step also removes the (illegal) NOLINK and NOINDEX attributes from the bookmarks. We write the resulting HTML to the variant's subdirectory.
The following is a data flowchart for this overall process, in which the elements are color coded:
The following is an English paraphrase of the major XTRAN rules used for this example.
These rules are run, with a version of XTRAN that analyzes HTML, on each HTML file in the Master Document, after the Master Document has been changed or new chapters have been added to it. We first delete the bookmark data file, so that it will be recreated "from scratch".
Read and parse HTML to be analyzed
Open bookmark data file to append
For each HTML "statement", recursively
    If <H1> (chapter heading) tag
        Remember its text
    Else if <Hn> (heading) tag
        Remember its text
    Else if <A NAME="xxx"> tag
        Write bookmark information to data file
Close bookmark data file
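The extraction pass above can be sketched in Python as follows. This is an illustration under stated assumptions, not the XTRAN rules themselves: the record fields (`file`, `name`, `text`, `link`, `index`, `chapter`, `section`) and the names `BookmarkExtractor` and `extract_bookmarks` are hypothetical, and because the format of the bookmark data file is not described here, the sketch returns the records instead of writing them. It also assumes that a new <H1> resets the current section.

```python
import re
from html.parser import HTMLParser

HEADING = re.compile(r"h[1-6]$")


class BookmarkExtractor(HTMLParser):
    """Collect <A NAME="xxx"> bookmarks with their chapter and section."""

    def __init__(self, filename):
        super().__init__(convert_charrefs=True)
        self.filename = filename
        self.chapter = ""     # text of the most recent <H1>
        self.section = ""     # text of the most recent <H2>..<H6>
        self.bookmarks = []
        self._heading = None  # heading tag currently open, if any
        self._open = None     # bookmark whose enclosed text is being read
        self._pending = []    # bookmarks opened inside the open heading

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if HEADING.match(tag):
            self._heading = tag
            if tag == "h1":
                self.chapter = ""
            self.section = ""  # assumption: any new heading starts a new section
        elif tag == "a" and "name" in attrs:
            self._open = {
                "file": self.filename,
                "name": attrs["name"],
                "text": "",
                "link": "nolink" not in attrs,    # NOLINK suppresses cross links
                "index": "noindex" not in attrs,  # NOINDEX suppresses index entry
                "chapter": self.chapter,
                "section": self.section,
            }
            self.bookmarks.append(self._open)
            if self._heading:
                self._pending.append(self._open)

    def handle_data(self, data):
        if self._heading == "h1":
            self.chapter += data
        elif self._heading:
            self.section += data
        if self._open is not None:
            self._open["text"] += data

    def handle_endtag(self, tag):
        if tag == self._heading:
            self._heading = None
            self.chapter, self.section = self.chapter.strip(), self.section.strip()
            for record in self._pending:  # heading text is complete only now
                record["chapter"], record["section"] = self.chapter, self.section
            self._pending.clear()
        elif tag == "a":
            self._open = None


def extract_bookmarks(filename, html):
    parser = BookmarkExtractor(filename)
    parser.feed(html)
    parser.close()
    return parser.bookmarks
```

Calling `extract_bookmarks("chapter1.html", source)` once per chapter and concatenating the results plays the role of appending to the bookmark data file.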
These rules are run, with a version of XTRAN that only evaluates rules, after bookmark data have been extracted from all HTML files in the Master Document, to check the bookmark data for name duplication.
Read bookmark data from file for all chapters
Create bookmark duplication output file
For each bookmark
    If bookmark is not to be cross linked
        Continue
    For each bookmark following this one
        If bookmark is not to be cross linked
            Continue
        If same bookmark name
            Write information to output file
        If bookmark name contains our bookmark name
            Write information to output file
Close output file
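As a rough Python equivalent of the duplication check above, the sketch below compares every linkable bookmark with each bookmark that follows it. It assumes bookmark records are dictionaries with `name` and `link` keys (hypothetical field names) and returns the findings rather than writing an output file. Unlike the paraphrase, it reports an exact duplicate only once instead of also reporting it as a containment.

```python
def find_name_conflicts(bookmarks):
    """Return (kind, ours, theirs) for each pair of conflicting bookmark names."""
    conflicts = []
    for i, ours in enumerate(bookmarks):
        if not ours["link"]:
            continue
        for theirs in bookmarks[i + 1:]:
            if not theirs["link"]:
                continue
            if theirs["name"] == ours["name"]:
                conflicts.append(("duplicate", ours, theirs))
            elif ours["name"] in theirs["name"]:
                # A later name that contains ours could be linked ambiguously.
                conflicts.append(("contains", ours, theirs))
    return conflicts


sample = [
    {"name": "XTRAN", "link": True},
    {"name": "XTRAN rules", "link": True},
    {"name": "XTRAN", "link": True},
    {"name": "XTRAN", "link": False},
]
for kind, ours, theirs in find_name_conflicts(sample):
    print(kind, repr(ours["name"]), repr(theirs["name"]))
```

The check is quadratic in the number of bookmarks, which is acceptable for a pass that runs once over the whole Master Document.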
These rules are run, with a version of XTRAN that re-engineers HTML, on each HTML chapter in a document variant, to recreate that variant after changes or additions to the Master Document.
Read list of chapters for our variant from file
Read bookmark data from file for our variant only
Read and parse HTML from Master Document version of chapter
For each HTML "statement", recursively
    If it's <A NAME="xxx"> tag with NOLINK and/or NOINDEX attributes
        Remove them
        Continue
    If it isn't non-markup text
        Continue
    If it's already enclosed in <A> tag (it's a bookmark or already indexed)
        Continue
    For each bookmark name to be cross linked
        If the bookmark's text occurs in the most recent header
            Continue
        For each occurrence of bookmark name in this text
            Replace occurrence with a link to item's bookmark
            Set to continue with 1st replacement "statement"
Output re-engineered HTML for chapter to variant's subdirectory
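The cross-linking pass above might look roughly like this in Python. It is a sketch under several assumptions, not the XTRAN rules: bookmark records are dictionaries with hypothetical `name`, `file`, and `link` keys; a bookmark's name is also the text to search for, as in the `XTRAN` example; matching is plain substring matching; and all names are replaced in a single regular-expression pass (longest name first) instead of one pass per bookmark. The sketch additionally leaves text inside <TITLE>, <SCRIPT>, and <STYLE> alone, which the paraphrase does not mention, and it writes end tags in upper case.

```python
import re
from html import escape
from html.parser import HTMLParser

CONTROL_ATTRS = {"nolink", "noindex"}
NO_LINK_TAGS = {"a", "title", "script", "style"}  # last three are an assumption
HEADING = re.compile(r"h[1-6]$")


class CrossLinker(HTMLParser):
    """Rebuild a chapter, linking bookmark names found in nonmarkup text."""

    def __init__(self, bookmarks):
        super().__init__(convert_charrefs=False)
        # name -> file holding the bookmark, for bookmarks that may be linked
        self.targets = {b["name"]: b["file"] for b in bookmarks if b["link"]}
        self.out = []
        self.blocked = 0        # > 0 while inside a tag whose text is not linked
        self.in_heading = False
        self.heading_text = ""  # text of the most recent heading

    def handle_starttag(self, tag, attrs):
        names = {key for key, _ in attrs}
        if tag in NO_LINK_TAGS:
            self.blocked += 1
        if HEADING.match(tag):
            self.in_heading, self.heading_text = True, ""
        if tag == "a" and "name" in names and names & CONTROL_ATTRS:
            # Rebuild the bookmark tag without NOLINK / NOINDEX.
            parts = [key.upper() if value is None
                     else f'{key.upper()}="{escape(value)}"'
                     for key, value in attrs if key not in CONTROL_ATTRS]
            self.out.append("<A " + " ".join(parts) + ">")
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in NO_LINK_TAGS and self.blocked:
            self.blocked -= 1
        if HEADING.match(tag):
            self.in_heading = False
        self.out.append(f"</{tag.upper()}>")

    def handle_data(self, data):
        if self.in_heading:
            self.heading_text += data
        # Skip names that already appear in the most recent heading.
        names = [n for n in self.targets if n not in self.heading_text]
        if self.blocked or not names:
            self.out.append(data)
            return
        names.sort(key=len, reverse=True)  # prefer the longest matching name
        pattern = re.compile("|".join(map(re.escape, names)))
        self.out.append(pattern.sub(self._link, data))

    def _link(self, match):
        name = match.group(0)
        href = escape(f"{self.targets[name]}#{name}")
        return f'<A HREF="{href}">{name}</A>'

    # Everything else is passed through unchanged.
    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def cross_link(html, bookmarks):
    linker = CrossLinker(bookmarks)
    linker.feed(html)
    linker.close()
    return "".join(linker.out)
```

Passing only the bookmarks whose files belong to the variant keeps links to excluded chapters from being inserted. Substring matching also links a name that appears inside a longer word; adding word boundaries to the pattern is one possible refinement.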
These rules are run, with a version of XTRAN that re-engineers and generates HTML, once for each document variant, to create its index, after changes or additions to the Master Document.
Read list of chapters for our variant from file
Read bookmark data from file for our variant only
Generate HTML index header
For each bookmark item
    If bookmark is not to be indexed
        Continue
    Record bookmark text and sequential number (for sorting)
    If new starting character
        Record bookmark text's starting character (for sorting)
Sort starting characters
Generate alphabetical "index to the index" using starting characters
Sort bookmark data texts
Generate start of HTML table
For each bookmark item, alphabetically
    If item is not to be indexed
        Continue
    If new starting character
        Generate header for it as table row, including bookmark for "index to the index"
    Generate index entry as table row, including its section and chapter
Generate end of HTML table
Write out HTML we've generated as document index, to variant's subdirectory
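The index generation above can be approximated in Python as shown below. This is a sketch only: the record keys (`name`, `text`, `file`, `chapter`, `section`, `index`), the function name `build_index`, the `index-X` bookmark names, and the table layout are all assumptions, and bookmarks that enclose no text are skipped because the paraphrase does not say how they should be indexed.

```python
from html import escape
from itertools import groupby


def first_letter(bookmark):
    return bookmark["text"][0].upper()


def build_index(bookmarks, title="Index"):
    """Return an HTML index page for the given (variant-only) bookmarks."""
    entries = sorted(
        (b for b in bookmarks if b["index"] and b["text"]),
        key=lambda b: (b["text"].casefold(), b["file"]),
    )
    letters = [letter for letter, _ in groupby(entries, key=first_letter)]

    lines = [f"<H1>{escape(title)}</H1>"]
    # Letter-by-letter "index to the index".
    lines.append("<P>" + " ".join(
        f'<A HREF="#index-{escape(c)}">{escape(c)}</A>' for c in letters) + "</P>")
    lines.append("<TABLE>")
    for letter, group in groupby(entries, key=first_letter):
        # Header row carrying the bookmark that the letter list links to.
        lines.append(f'<TR><TH COLSPAN="3"><A NAME="index-{escape(letter)}">'
                     f"{escape(letter)}</A></TH></TR>")
        for b in group:
            href = escape(f'{b["file"]}#{b["name"]}')
            lines.append(f'<TR><TD><A HREF="{href}">{escape(b["text"])}</A></TD>'
                         f'<TD>{escape(b["section"])}</TD>'
                         f'<TD>{escape(b["chapter"])}</TD></TR>')
    lines.append("</TABLE>")
    return "\n".join(lines)
```

Sorting on the case-folded text keeps upper- and lower-case entries under the same letter, so `groupby` sees each starting character exactly once.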
NOTE
Normally, we generate each document variant into its own subdirectory, so the HTML filenames need not change. In this example, however, all of the HTML files live in the same directory, so we manually adjusted their filenames and links to them. These are the only changes made by hand, and would not normally be necessary.
This example uses a "mini" version of the XTRAN User's Manual to show the effects of the cross linking and indexing procedures. This "mini" version is not proprietary; the actual XTRAN User's Manual is proprietary and requires a nondisclosure agreement. Also, this "mini" version is for demonstration purposes only and is not necessarily current. Therefore, its contents do not constitute any representation by Pennington about XTRAN.
The actual XTRAN User's Manual (including all variants) has more than 1,200 bookmarks in over 50 HTML files containing more than 20,000 nonmarkup text items; cross linking it involves as many as 25 million checks for cross links to insert.
These small text files are read by the XTRAN rules to determine which chapters participate in which document variants.
; master.nam — Chapters in XTRAN User's Manual, all variants
; Revised 2002-02-09.1258 by S. F. Heffner
;
var1ttl.html
var2ttl.html
chapter1.html
chapter2.html
chapter3.html
;
; End of master.nam

; variant1.nam — Chapters in XTRAN User's Manual variant 1
; Revised 2002-02-09.1258 by S. F. Heffner
;
var1ttl.html
chapter1.html
chapter2.html
;
; End of variant1.nam

; variant2.nam — Chapters in XTRAN User's Manual variant 2
; Revised 2002-02-09.1258 by S. F. Heffner
;
var2ttl.html
chapter1.html
chapter2.html
chapter3.html
;
; End of variant2.nam
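A reader for these chapter lists could be as small as the Python sketch below. It assumes that a `;` starts a comment running to the end of the line and that filenames are separated by whitespace or line breaks, which is inferred from the listings rather than documented; the function names `read_chapter_list` and `bookmarks_for_variant`, and the `file` key on bookmark records, are hypothetical.

```python
def read_chapter_list(text):
    """Return the chapter filenames named in the text of a .nam file."""
    chapters = []
    for line in text.splitlines():
        line = line.split(";", 1)[0]   # drop the comment, if any
        chapters.extend(line.split())  # keep whatever filenames remain
    return chapters


def bookmarks_for_variant(bookmarks, chapters):
    """Keep only bookmarks defined in chapters that belong to the variant."""
    wanted = set(chapters)
    return [b for b in bookmarks if b["file"] in wanted]


variant1 = """\
; variant1.nam
;
var1ttl.html
chapter1.html
chapter2.html
;
; End of variant1.nam
"""
print(read_chapter_list(variant1))
```

Filtering the bookmark data through `bookmarks_for_variant` before cross linking or indexing is what keeps a variant free of links to chapters it does not contain.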
COPYRIGHT 2008; reproduction prohibited without permission. Revised 2007-08-20
XTRAN is a trademark of Pennington Systems Incorporated.
Pennington Systems Incorporated
8655 East Via de Ventura, Suite G200
Scottsdale, Arizona 85258-3321
Phone: +1(480)626-5503
Fax: +1(480)626-7618
Email: Info@Pennington.com
Web: http://WWW.Pennington.com