Tuesday, December 11, 2012

Manipulate the DOCTYPE to validate subsets of XML documents

A common misconception, especially among people new to XML, is that once you specify a DOCTYPE for a data set, that is the one you must always use, even if you're working with a subset of the document. This often leads to extracting the subset you want, adding upper-level wrappers to match the DOCTYPE, and finagling things to get it to parse. Others write custom DTDs to handle the subset.

It isn't necessary to jump through these hoops. Remember that the DOCTYPE declaration names the root element and points to the corresponding DTD. The key here is that it names the root element of the *document*, not the first element defined in the DTD.

For example, a journal DTD may define its upper content model like this:

<!ELEMENT journal (front, article+, appendix) >

So the DOCTYPE for the full data set would be:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE journal SYSTEM "journal.dtd">


However, say you want to send just a single article to a customer, and they want to be able to validate it against the DTD. You don't need to modify the DTD or add upper-level wrappers to the article; just change the root element name in the DOCTYPE from "journal" to "article".

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article SYSTEM "journal.dtd">


The parser reads the same DTD, but validation starts at the element named in the DOCTYPE, so only the declarations reachable from that element come into play. The rest simply go unused. That's all there is to it. This is true even if you include the DTD declarations inside the DOCTYPE itself (an internal subset); you can still name any declared element as the root.
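
As a rough sketch of the inline case, here is a single article carrying its declarations inside the DOCTYPE. Only the journal content model comes from the example above; the content models for front, article, and appendix, and the title and para elements, are assumptions made up for illustration, since the real journal.dtd isn't shown here.

```xml
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article [
  <!-- Top-level model taken from the journal example above -->
  <!ELEMENT journal  (front, article+, appendix)>
  <!-- Everything below is hypothetical, for illustration only -->
  <!ELEMENT front    (title)>
  <!ELEMENT article  (title, para+)>
  <!ELEMENT appendix (para+)>
  <!ELEMENT title    (#PCDATA)>
  <!ELEMENT para     (#PCDATA)>
]>
<article>
  <title>A single article</title>
  <para>No journal, front, or appendix element appears in this document.</para>
</article>
```

A validating parser should accept this document, because the name in the DOCTYPE matches the root element and the article content matches its declaration. The journal, front, and appendix declarations are still read, but nothing in the document uses them.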

