Summary

This document discusses the approach I use to publish XML content to the Web using a blind automation application called a black box.

Black Box Publishing is a means of using a build process to produce and distribute content. There are already numerous applications that have similar features, such as most Web Development Studios and "Page Maker"-type applications. The goal of Black Box Publishing was to help the non-enterprise Web author deploy XML-based content by providing a one-step build process that would generate indices, transform XML, and distribute the content to a specified location.

Most developers who are familiar with XML probably have their own version of a black box, or are already steeped in XML publication frameworks and have no need for such a tool. The XML black box is an automation process intended to leverage the structured content of XML. Where XML content should scale, the black box should not. It is a straightforward implementation composed of basic components and configurations, and is a one-step means of publishing XML-based content.

Part I: Approach

Foreword

This article describes a vanilla process for automating XML transformations in order to deliver content to the Web. The black box discussed within is a basic tool that drives XML and XSL processors from a user-defined configuration, and adds a few automation routines to the transformations.

Introduction

The purpose of this article is to describe a simple process for publishing XML-based content to the Web. Many applications exist that facilitate the publication of HTML-based content, or that transform and serve XML content from the Web server. What these applications do not address is the more fundamental need to perform batch transformations of XML.

Developers familiar with XML are most likely capable of finding and implementing server-based solutions for live transformations, or of writing their own automation jobs. Non-developers, though, can only turn to applications that include some form of XML transformation. This is unfortunate because there are a lot of advantages to basing content in XML, but many smaller sites won't do this because they are missing a fundamental component: a black box that implements particular XML and XSL parsers, and enables the author to quickly transform the content to HTML or XHTML.

What This Document Covers

This document describes a basic XML document structure and a process for publishing that XML for the Web.

What This Document Does Not Cover

This document does not cover more robust publication or transformation frameworks and applications.

Also, it does not discuss the internal mechanisms of the black box itself. If there is enough interest, the actual application implementation will be discussed in part II.

Document Structure

To get started, at least one XML document - preferably several documents - is needed. The process described in this article will use a very generic document structure.

A common document structure is needed to make a black box process work. If more than one document structure is used then the black box will be that much more complex. Unless otherwise specified, all documents that contain some kind of content adhere to a specific document type definition, or DTD.

Document structure refers to the base structure into which the content is added, not the formatting or structure of the content. The content can be whatever the author wants. While document structures will certainly vary, my opinion is that if the base document structure is not uniform then the content may as well be in HTML.

The standard DTD should represent a document template that describes all of the pieces of data needed to create the published document, and an association of content types. Creating a web site that is intelligent about what content is available is fairly straightforward with this information. The two important points are that the structure needs to make sense, and that the author(s) must be comfortable with that structure.

Document Template

The interesting aspect of developing a document structure is that it will suddenly balloon to include everybody, and everybody has their own opinion on what it should look like. I am not going to promote one structure over another simply because I think such an argument is inane. The important issue to keep in mind is that a document and its structure can scale as long as a uniform structure was used from the beginning.

The following structure is an example, and does not address namespaces, or even the DTD (which is really my bad; it should be used), but it describes all of the elements needed to transform the document to a solitary HTML page, and to correlate that document with other documents on the site.

<?xml version="1.0" ?>
<?xml-stylesheet type="text/xsl" href="../xsl/documentTemplate.xsl"?>
<document>
   <metadata>
      <author>Stephen W. Cote</author>
      <title>[ title ]</title>
      <source>[ sourceFile.xml ]</source>
      <copyright>2002</copyright>
      <content-created>[ date document was created ]</content-created>
      <class type="6" />
      <distribution reprint="approve" />
   </metadata>
   <content>
      <summary>
         <p>
            This is a document template.
         </p>
      </summary>
      <p>
         [ This content goes here. ]
      </p>
   </content>
   <production />   
</document>

The above document structure was based on the XML document structure used for Microsoft's online SDK documentation. It's not copyrighted in any way that I could see (or remember), but I wanted to acknowledge it nonetheless.

The previous template consists of three primary child elements: metadata, content, and production. The metadata element contains information that describes the contents of the document, such as the title. The content element may contain a summary, used for the indices, and the actual content. The production element is used by the black box to include information relevant to a specific production cycle, such as the time the document was transformed.

Manual, Dynamic, and Automatic Transformations

Given some XML and an XSL, there are three ways to transform it: manually, dynamically, and automatically.

  • A manual transformation takes place when you physically direct the output to a specific location or stream, on a file-by-file basis. This is the most tedious way to do it.
  • A dynamic transformation takes place when an application interprets the XML, loads the XSL, and directs the output to some location for you. Examples are viewing XML in Internet Explorer, slurping XML results from a database and transforming the dynamic data, or using AxKit or Cocoon.
  • An automatic transformation takes place when some number of XML files are associated with an XSL file and a destination. This becomes automatic as no input is required for successive transforms.

Publishing XML through a black box will use automatic transformations.

Simple Automatic Transformations

The most fundamental automatic transformation requires three distinct pieces of information: the XML source, the XSL stylesheet, and, assuming the output is HTML, the destination path for the HTML. One logical format for the configuration file is XML itself. The basic principle of performing an automatic transformation is to load the configuration file, the source XML, and the XSL into their own respective parsers, transform the XML with the XSL, and write the result to a new file at the specified destination.

The following configuration format describes these three types of data. Note that the XSL is associated with the XML by using an id/rid.

<build>
   <documents>
      <document src="xml/templates/documentTemplate.xml">
         <template rid="1" dest="html/templates/documentTemplate.html" />
      </document>
   </documents>
   <templates>
      <template id="1" src="xsl/documentTemplate.xsl" />
   </templates>
</build>   

How the XML and XSL parsers are instantiated is up to the developer. The black box I devised was written in Java.
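
As a rough illustration of the principle described above, the following Java sketch reads a build configuration in the format shown, resolves each rid to its stylesheet, and writes the transformed result to the configured destination. It is not the black box's actual code. It assumes the standard JAXP classes that ship with the JDK, and the class name SimpleBuild and the method run are hypothetical names chosen for this example. Paths in the configuration are assumed to be relative to the directory that holds the configuration file.

```java
import java.io.File;
import java.io.OutputStream;
import java.nio.file.Files;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical minimal builder: reads a <build> file and runs each configured transform. */
class SimpleBuild {

    /** Runs every document/template pairing and returns the number of files written. */
    static int run(File buildFile) throws Exception {
        // Assumption: src and dest paths are relative to the build file's directory.
        File baseDir = buildFile.getAbsoluteFile().getParentFile();
        Document config = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(buildFile);

        // Collect template id -> stylesheet path from the <templates> section.
        // The <template rid="..."> references carry no id attribute, so they are skipped.
        Map<String, String> stylesheets = new HashMap<>();
        NodeList templates = config.getElementsByTagName("template");
        for (int i = 0; i < templates.getLength(); i++) {
            Element t = (Element) templates.item(i);
            if (t.hasAttribute("id")) {
                stylesheets.put(t.getAttribute("id"), t.getAttribute("src"));
            }
        }

        TransformerFactory factory = TransformerFactory.newInstance();
        int written = 0;
        NodeList documents = config.getElementsByTagName("document");
        for (int i = 0; i < documents.getLength(); i++) {
            Element doc = (Element) documents.item(i);
            File xml = new File(baseDir, doc.getAttribute("src"));
            NodeList refs = doc.getElementsByTagName("template");
            for (int j = 0; j < refs.getLength(); j++) {
                Element ref = (Element) refs.item(j);
                String xsl = stylesheets.get(ref.getAttribute("rid"));
                if (xsl == null) {
                    throw new IllegalStateException(
                            "No template with id " + ref.getAttribute("rid"));
                }
                File dest = new File(baseDir, ref.getAttribute("dest"));
                if (dest.getParentFile() != null) {
                    dest.getParentFile().mkdirs(); // create the output directory if needed
                }
                Transformer transformer =
                        factory.newTransformer(new StreamSource(new File(baseDir, xsl)));
                try (OutputStream out = Files.newOutputStream(dest.toPath())) {
                    transformer.transform(new StreamSource(xml), new StreamResult(out));
                }
                written++;
            }
        }
        return written;
    }
}
```

Calling SimpleBuild.run(new File("build.xml")) would process every configured document in one step. This sketch reloads the stylesheet for each document, which is fine for a handful of files.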

Document Correlation

Creating an XML file and transforming it with a specific XSL is no feat of magic. But, imagine a Web site that contains an index of all pages available on the site. If you add a new page, you now need to update at least two pages: the new one and the index.

Correlation by Type

Documents can be associated with other documents of a similar type by identifying the type of content contained in each document. This basic identifier makes it possible to create dynamic indices of all content available on a system by correlating documents based on their content type.

Consider the following configuration file that represents a few types of content.

<types>
   <type id="6" name="Indices" file-ref="indices">
      <desc>
         <p>Indices are generated files.</p>
      </desc>
   </type>
   <type id="100" name="Technical" file-ref="technical">
      <desc>
         <p>Topics relating to software engineering and Web design.</p>
      </desc>
      <type id="101" name="DHTML Projects" file-ref="dhtml">
         <desc>
            <p>DHTML-related projects.</p>
         </desc>
      </type>
      <type id="102" name="Design" file-ref="design">
         <desc>
            <p>These articles discuss design and code architecture.</p>
         </desc>
      </type>
   </type>
</types>

In the previous example, five pieces of data describe each content type.

  • id: the id of a specific content type. The id is used by specific documents.
  • name: the name of the content type. The name is used for display purposes.
  • file-ref: the file reference is used for auto generating summary files of documents that contain content of a specific type.
  • desc: the description is a short blurb that describes what the type of content represents.
  • location: the location of each content type node, especially its parentage, is used for creating simple organizations of documents by content type.
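
To make the role of location concrete, here is a small Java sketch that walks a types configuration like the one above and records, for each id, the chain of parent names that leads to it. It is only an illustration under a few assumptions: the class name ContentTypes, the method paths, and the "/" separator are hypothetical choices for this example, and the black box may organize this data differently.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Hypothetical helper that turns nested <type> elements into id -> name paths. */
class ContentTypes {

    /** Maps each type id to the names of its ancestors and itself, joined with "/". */
    static Map<String, String> paths(String typesXml) throws Exception {
        Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(typesXml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
        Map<String, String> result = new LinkedHashMap<>();
        walk(root, "", result);
        return result;
    }

    /** Visits the direct <type> children of parent, then recurses into each of them. */
    private static void walk(Element parent, String prefix, Map<String, String> out) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && "type".equals(n.getNodeName())) {
                Element type = (Element) n;
                String name = type.getAttribute("name");
                String path = prefix.isEmpty() ? name : prefix + "/" + name;
                out.put(type.getAttribute("id"), path);
                walk(type, path, out); // nested types inherit the parent's path
            }
        }
    }
}
```

With the configuration above, a document whose class type is 102 would resolve to the path Technical/Design, which is enough to place it under the right index page.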

Correlation by Configuration

Correlating documents based on the content type is a short-term solution. It is very straightforward and easy to implement, but it does not address more complex associations. For example, a novel is composed of multiple chapters. Correlating the novel chapters by content type would require a specific content type for that novel.

A more robust solution would be to specify an additional configuration of document relationships.

Automation: Indices

When an XML document is ready for publication, the build information is added to the configuration document. Executing the build process transforms the document and saves the result to the configured location. Correlating the documents is much easier when all of the relevant document data, such as title, content type, and summary, are located in the same place.

When I am ready to make a full build of every document on my web site, I start by creating the index. This is a very basic tool that creates a list of every XML file in a particular location, opens each file and extracts some metadata such as the title and content type, and saves the list.
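
The indexing step could be sketched in Java roughly as follows. This is an assumption-laden illustration and not the tool itself: the names DocumentIndexer, Entry, scan, and countByType are hypothetical, the scan covers a single directory without recursing, and documents are assumed to follow the template shown earlier (a title element and a class element with a type attribute).

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical indexer: lists the XML files in one directory and pulls out metadata. */
class DocumentIndexer {

    /** One index entry; the fields mirror the metadata block of the document template. */
    static class Entry {
        final String src;
        final String title;
        final String type;
        final long fileSize;

        Entry(String src, String title, String type, long fileSize) {
            this.src = src;
            this.title = title;
            this.type = type;
            this.fileSize = fileSize;
        }
    }

    /** Parses every .xml file in dir (non-recursive) and returns entries sorted by file name. */
    static List<Entry> scan(File dir) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        List<Entry> entries = new ArrayList<>();
        File[] files = dir.listFiles((d, name) -> name.endsWith(".xml"));
        if (files == null) {
            return entries; // not a directory, or it could not be read
        }
        Arrays.sort(files);
        for (File f : files) {
            Document doc = builder.parse(f);
            NodeList titles = doc.getElementsByTagName("title");
            NodeList classes = doc.getElementsByTagName("class");
            String title = titles.getLength() == 0 ? "" : titles.item(0).getTextContent().trim();
            String type = classes.getLength() == 0
                    ? "" : ((Element) classes.item(0)).getAttribute("type");
            entries.add(new Entry(f.getName(), title, type, f.length()));
        }
        return entries;
    }

    /** Counts documents per content type, similar to the type-summary data in the index. */
    static Map<String, Integer> countByType(List<Entry> entries) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Entry e : entries) {
            counts.merge(e.type, 1, Integer::sum);
        }
        return counts;
    }
}
```

Saving the list is left out of the sketch. The entries could be written back out as XML in whatever index format the rest of the build expects.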

The next step is to build the content indices. These are the introductory pages to the content for specific content types. This process creates the XML that will be used for the indices, and creates some production information on document linkage. This data is saved into the content index file.

The transformation process opens up each XML file in the configuration and transforms it with a specified XSL. Production data gathered in the content index is added back into the XML file before it is transformed.

The last step is to transform the document indices. Data used by the indices, such as document summaries, is drawn from the document index, rather than each specific document. The result is that access to the content follows a meaningful structure.

The Black Box

I chose to base all of my documents for my Web Site in XML. Whenever I change something, I have to rebuild that one document. When I add or delete a document, I have to build the new document, and rebuild the indices as well, or rebuild the entire site. Either way, the purpose of using a black box is to automate the entire process. I execute a single command and everything is updated.

I could have put the external data into a database, or implemented a different solution in ASP/JSP/PHP/CGI with AxKit or Cocoon, or implemented the XML and XSL parsers myself. These solutions are certainly scalable and robust, but they are also a lot more involved. It is all a matter of preference, really, because with XML-based content, these distribution options will always be available.

Application Requirements

So far, document templates, content types, correlation, and indices have been discussed. The missing element is the actual program used to do all of the work. This article highlighted four distinct tasks that the black box must be able to perform. Those tasks are:

  1. Create a document index. This is a list of every XML document that may or may not be configured for distribution. Since my documents are all based on a standard template, the indexer opens each document, extracts the title, the content type, and the summary if specified, and adds some file statistics such as the file size. The index also contains a summary of all content types that are currently defined, and a count of how many documents are configured for each type. The index file has the following structure:

    <?xml version="1.0"?>
    <document-index>
       <documents>
          <document content-created="02/22/2002" file-size="23859"
                id="_xml_71da17761ae919f3a768c0d4e4900f81"
                title="Black Box Publishing with XML" type="102"
                xml-src="{..}/blackboxpublishing.xml">
             <summary>
                <p>
                   [ summary ]
                </p>
             </summary>
          </document>
       </documents>
       <production>
          <type-summary count="1" id="102" />
       </production>
    </document-index>
    

    Notice that an id was automatically inserted. For a more scalable structure, each document would have its own static id. This would allow authors to easily add hyperlinks between documents without knowing the actual location of where the document will reside.

  2. Create the content indices. This is a multi-tiered list of the content types. It is also where the indices are auto-generated and the linkage (by content type) is auto-summarized. The actual XML is nothing more than the document template, appropriate metadata, and one XML element in the content, <summarize />. When the indices are transformed, the summarize element is used to pull the document index and the build configuration together, and then to create the content of the indices from the linkage. In addition to the auto-generated XML, the document index is modified to include the linkage. The linkage has the following structure:

    <?xml version="1.0"?>
    <document-index>
       <production>
          <type-summary count="1" id="102" />
          <linkage>
             <link dest="index_technical.html" lid="100"
                src="xml/indices/index_technical.xml" title="Technical">
                <link dest="index_design.html" lid="102"
                   src="xml/indices/index_design.xml" title="Design" />
             </link>
          </linkage>
       </production>
    </document-index>
    

    The linkage data contains both the document correlation and the reference links to the transformed indices (which have not been made yet). An XSL can use the document content type and the document index to quickly identify where that particular document should reside in the content tree.

  3. Build the content. For a simple implementation, the XML file could be transformed with an XSL, and the result saved somewhere to disk. In this black box implementation, additional production information and file information is desired. The build application first loads the document configuration, and the index. Next, it loops through the configured XSLs, and creates a list of all the XML documents that will be transformed with that particular XSL. The XSL file is loaded once for all of the corresponding XML files, and then each XML file is loaded in turn, and transformed with the loaded XSL. The results are saved to disk at the configured location. As each file is transformed, the index is updated to reflect that the document was transformed. File statistics are also kept, such as the total number of bytes that were built.

  4. Build the indices. Even though the indices were auto-generated in step 2, they should be built last. The reason is that some of the production data they draw from the document index is not set until step 3. Plus, it is helpful to know whether or not a document was transformed, so that links are only auto-generated for documents that do in fact exist. Steps 3 and 4 use the same application to build, but separate configurations. The configuration used to build the indices is auto-generated in step 2.
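
The grouping described in step 3, where each XSL is loaded once and then applied to all of its XML documents, might look something like the following Java sketch. It is a simplified illustration rather than the real build application: the names BatchBuild, Job, groupByStylesheet, and build are hypothetical, the documents and stylesheets are held as strings instead of files so the example stays self-contained, and the index update is reduced to a running byte total. It relies on the standard JAXP Templates interface, which represents a compiled stylesheet that can be reused across transformations.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical batch builder that compiles each stylesheet once per build. */
class BatchBuild {

    /** A configured pairing of one XML source (as text) with one stylesheet id. */
    static class Job {
        final String xml;
        final String xslId;

        Job(String xml, String xslId) {
            this.xml = xml;
            this.xslId = xslId;
        }
    }

    /** Groups jobs by stylesheet id, keeping the order in which ids first appear. */
    static Map<String, List<Job>> groupByStylesheet(List<Job> jobs) {
        Map<String, List<Job>> groups = new LinkedHashMap<>();
        for (Job job : jobs) {
            groups.computeIfAbsent(job.xslId, k -> new ArrayList<>()).add(job);
        }
        return groups;
    }

    /**
     * Transforms every job, one stylesheet group at a time.
     * Returns the outputs in processing order and adds their UTF-8 size to totalBytes[0].
     */
    static List<String> build(List<Job> jobs, Map<String, String> stylesheets,
                              long[] totalBytes) throws Exception {
        TransformerFactory factory = TransformerFactory.newInstance();
        List<String> outputs = new ArrayList<>();
        for (Map.Entry<String, List<Job>> group : groupByStylesheet(jobs).entrySet()) {
            String xsl = stylesheets.get(group.getKey());
            if (xsl == null) {
                throw new IllegalStateException("Unknown stylesheet id " + group.getKey());
            }
            // Compile the stylesheet once, then reuse it for every document in the group.
            Templates compiled = factory.newTemplates(new StreamSource(new StringReader(xsl)));
            for (Job job : group.getValue()) {
                StringWriter out = new StringWriter();
                compiled.newTransformer().transform(
                        new StreamSource(new StringReader(job.xml)), new StreamResult(out));
                outputs.add(out.toString());
                totalBytes[0] += out.toString().getBytes(StandardCharsets.UTF_8).length;
            }
        }
        return outputs;
    }
}
```

Because the jobs are grouped first, the outputs come back ordered by stylesheet rather than in the order the documents were configured. A file-based version would write each result to its configured destination instead of collecting strings.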

About the Black Box Application

This system was not meant to reinvent any wheels, or to compete with content applications built for XML, such as AxKit and Cocoon. Instead, it is more similar to Apache's Ant, applied to building XML. The objective was straightforward: given some number of XML documents, and some number of style sheets, auto-transform the XML with one or more style sheets, put the output into the correct directory, and build an index based upon the type of content in the document.

This is certainly not an enterprise level application, nor was it ever meant to be. Instead, it is an exercise in using XML as the source for every document on a Web Site, and demonstrating that it can be deployed in an efficient and effective manner.

The goal of Black Box Publishing was to help the non-enterprise Web author with deploying XML-based content by providing a one-step build process that would generate indices, transform XML, and distribute the content to a specified location. Implementation is made easier and overhead is reduced by not imposing server-side requirements.

While the content should be scalable, the actual black box should not be. It is a straightforward implementation composed of basic components and configurations. It is a one-step means of publishing XML-based content.

Conclusion

In general, black boxes are either throw-away/one-time applications, or batch/cron jobs in their simplest forms. The black box discussed in this article is nothing more than that. The four tasks I identified as requirements are all very basic objectives. Most of the transformation work is left up to the XML and XSL parsers, and the XSL I devised for this project.

Appendix A

The Term "Black Box"

The term Black Box is used to describe many different things. In airplanes, it is the flight recorder. Software testers use black box and white box testing. My use of the phrase is loosely defined as a mystery box. XML content should go in and HTML content should come out. From an author's perspective, how and why it works doesn't matter. From a developer's perspective, it should be as straightforward and sensible as possible.

XML and XSL Processors

The following are some XML and XSL processors that can be used to load, parse, and transform an XML file from within a script or application.

MSXML. If you installed Internet Explorer, you probably already have this. Otherwise, you can download it from Microsoft. The MSXML parser can be easily instantiated within Windows applications, Windows Script, ASP, and even IE web pages.

Xerces and Xalan. Refer to http://xml.apache.org for more information and downloads. I have been using the Java version of the processors.
