Java StAX Tips

StAX stands for Streaming API for XML, meaning the XML reader and writer that are contained in package javax.xml.stream of the Java standard library. The first tip is to know that this API exists, and that you should use it! The older APIs, Simple API for XML (SAX) and the W3C Document Object Model (DOM), are much better known but suffer from significant downsides. SAX provides only a “push” parser (no writer) that requires a cumbersome event handling setup to obtain the parsed data, and DOM replicates the entire XML document in memory, which is not feasible for large files.

In contrast, StAX is a “streaming” API like SAX with minimal memory footprint regardless of data size, but its “pull” model is nearly as easy to use and understand as DOM – and StAX also features a writer. Unless you have a compelling reason to do otherwise, you should probably use StAX by default when dealing with XML in Java. That said, it does have a few drawbacks, such as lack of pretty printing and schema validation. This post shows how to work around them.
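To make the “pull” model concrete, here is a minimal sketch of a StAX reader loop. The class name PullReaderSketch and the element name "item" are made up for illustration; only the javax.xml.stream calls come from the standard library.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class PullReaderSketch {
    /** Collects the text of every "item" element (a hypothetical element name). */
    public static List<String> readItems(String xml) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        List<String> items = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                // The caller asks for the next event; nothing is pushed to a handler.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "item".equals(reader.getLocalName())) {
                    // Reads the text content and moves to the matching end tag.
                    items.add(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
        return items;
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(readItems("<list><item>one</item><item>two</item></list>"));
    }
}
```

The loop keeps only the current event in memory, so the same pattern should work for arbitrarily large inputs when the reader is created from a file stream instead of a string.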

Pretty Printing

One disadvantage of the StAX writer is its lack of support for “pretty printing,” i.e. formatting XML with line breaks and perhaps indentation. The StAX writer outputs XML files as a single enormous line with no added whitespace whatsoever, making manual inspection very cumbersome. You could of course run a reformatter on the generated file, but that might add a lot of processing time for large files.

One clever idea is to wrap the writer in a proxy, but if you don’t need perfect formatting, it’s much simpler to manually emit a line feed after each XML element end tag. This achieves the main purpose of breaking the XML file into conveniently short lines. Example:

void writeXml(XMLStreamWriter writer) throws XMLStreamException {
    writer.writeStartElement("element");
    …
    writer.writeEndElement();
    writer.writeCharacters(System.lineSeparator());
}

This way you can also elect to only emit line feeds after longer elements, keeping shorter ones together in a single line. Speaking of elements, note that writeEndElement always writes the complete end tag (e.g. </element>), even if the element is empty or contains only attributes. This is undesirable but there’s a simple solution, even though it’s not obvious from the minimal Javadoc entries:

  1. For any XML element that does not contain text or nested elements, replace writeStartElement with writeEmptyElement.
  2. Now write any number of attributes. StAX will not close the element until some non-attribute (and non-namespace) is written!
  3. Writing character data or another element automatically first emits the closing sequence />.

For example, the following code will emit <element a="1" b="2" c="3"/> followed by a line feed:

void writeXml(XMLStreamWriter writer) throws XMLStreamException {
    writer.writeEmptyElement("element");
    writer.writeAttribute("a", Integer.toString(1));
    writer.writeAttribute("b", Integer.toString(2));
    writer.writeAttribute("c", Integer.toString(3));
    // no writeEndElement here!
    writer.writeCharacters(System.lineSeparator());
}
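To see the difference between the two calls side by side, the following sketch writes the same attribute-only element both ways into a StringWriter. The class name EmptyElementDemo is made up for illustration, and the output described below assumes the StAX implementation bundled with the JDK; other implementations may behave differently.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

public class EmptyElementDemo {
    /** Writes one attribute-only element and returns the generated XML text. */
    public static String render(boolean useEmptyElement) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        if (useEmptyElement) {
            writer.writeEmptyElement("element");
            writer.writeAttribute("a", "1");
            // no writeEndElement here: the next write closes the tag with "/>"
        } else {
            writer.writeStartElement("element");
            writer.writeAttribute("a", "1");
            writer.writeEndElement();
        }
        writer.writeCharacters("\n");
        writer.flush(); // make sure buffered output reaches the StringWriter
        writer.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.print(render(true));
        System.out.print(render(false));
    }
}
```

With the JDK's bundled writer, the first call should produce the short form <element a="1"/> and the second the long form <element a="1"></element>.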

Schema Validation

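One disadvantage of the StAX reader is its lack of support for W3C XML Schema (XSD) validation. Happily, the Java standard library contains a stand-alone Validator in package javax.xml.validation. The API defines schema language constants for both W3C XML Schema and RELAX NG, but only W3C XML Schema support is guaranteed by the platform; RELAX NG generally requires a third-party SchemaFactory implementation.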

Validator internally runs on top of SAX but that doesn’t need to concern us. Indeed, if you don’t need to extract detailed information on the parsing process and just want to check whether an XML file is valid or not, you can simply run Validator as a black box and catch any SAXException. If one occurs it will contain information on the first discovered validation error.

The following sample shows the complete code for validating some arbitrary InputStream with XML data against an XML schema that’s embedded as a JAR file resource. To use it in conjunction with StAX, simply run the input through validate before reopening the stream and sending it to the StAX parser.

import java.io.*;
import java.net.URL;
import org.xml.sax.SAXException;

import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.*;

public class ValidatorSample {
    /**
     * Validates the specified {@link InputStream} against the embedded XML schema.
     * Verifies that {@code stream} contains XML data that conforms to the
     * embedded "schema.xsd". Does nothing on success.
     * 
     * @param stream an {@link InputStream} containing XML data
     * @throws IOException if an error occurred while reading {@code stream}
     * @throws NullPointerException if {@code stream} is {@code null}
     * @throws SAXException if {@code stream} failed to validate against "schema.xsd"
     */
    public static void validate(InputStream stream) throws IOException, SAXException {
        if (stream == null)
            throw new NullPointerException("stream");
    
        final SchemaFactory schemaFactory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        final URL schemaUrl = ValidatorSample.class.getResource("/resources/schema.xsd");
        final Schema schema = schemaFactory.newSchema(schemaUrl);
        final Validator validator = schema.newValidator();
        validator.validate(new StreamSource(stream));
    }
}
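The validate-then-parse sequence described above might look like the following sketch. To keep it self-contained, it uses an inline schema string and an in-memory byte array instead of the embedded "schema.xsd" and a reopened stream; the class name ValidateThenParse, the schema, and the element names are all made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

public class ValidateThenParse {
    // Hypothetical schema: a "list" root holding any number of "item" strings.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='list'><xs:complexType><xs:sequence>"
      + "<xs:element name='item' type='xs:string' minOccurs='0' maxOccurs='unbounded'/>"
      + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    /** Returns the number of "item" elements, or -1 if the data fails validation. */
    public static int countItems(byte[] xml)
            throws IOException, SAXException, XMLStreamException {
        // A SAXException thrown here means the schema itself could not be compiled.
        Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
        try {
            // First pass: the validator consumes its own stream over the data.
            schema.newValidator().validate(new StreamSource(new ByteArrayInputStream(xml)));
        } catch (SAXException e) {
            // Black-box use: the message describes the first validation error.
            System.err.println("Invalid XML: " + e.getMessage());
            return -1;
        }
        // Second pass: open the data again for the StAX reader.
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new ByteArrayInputStream(xml));
        int count = 0;
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "item".equals(reader.getLocalName())) {
                    count++;
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }
}
```

Holding the document in a byte array is only reasonable for small inputs. For large files, opening the file twice (once for the validator, once for the StAX reader) keeps the memory footprint low.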
