Thursday, April 28, 2011

Using StAX to extract records from an XML file

When processing large XML files, building a DOM of the whole file is usually not an option, because you are likely to run out of memory. Java 6 includes StAX (it was around earlier, of course), and its streaming, pull-based model makes it a good fit for building a scalable processor for large XML files. However, it can still be desirable to build a document model when processing an individual record, because decisions in the code may depend on information found deeper in the record's node tree. For that reason, we use StAX to extract each record from our gargantuan XML files and then pass it to a different class that builds a DOM for processing that record. Here is the very simple code, with an explanation of what we did.


   1: public class StaxExtracter {
   2:     XMLEventReader reader;
   3:     boolean headerRead = false;
   4:
   5:     public void openStream(InputStream s) throws XMLStreamException {
   6:         XMLInputFactory inputFactory = XMLInputFactory.newInstance();
   7:         reader = inputFactory.createXMLEventReader(s);
   8:     }
   9:
  10:     public void close() throws XMLStreamException {
  11:         reader.close();
  12:     }
  13:
  14:     //function to allow us to read part of a document that does not include the header
  15:     public void setHeaderRead(boolean val) {
  16:         headerRead = val;
  17:     }


The above is some boilerplate code needed for our class. The heavy lifting is done by the XMLEventReader, which is created in the openStream function on line 7.
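
To make the event model concrete, here is a minimal standalone sketch of how an XMLEventReader walks a document. The class name EventReaderSketch and the method startElementNames are made up for this illustration and are not part of our extractor; the sketch only collects the names of the start elements it sees.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper class, for illustration only.
class EventReaderSketch {

    /** Returns the local names of all start elements, in document order. */
    static List<String> startElementNames(String xml) {
        List<String> names = new ArrayList<String>();
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        try {
            XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
            // The reader hands out one event at a time; nothing is kept in memory
            // beyond what we choose to store ourselves.
            while (reader.hasNext()) {
                XMLEvent e = reader.nextEvent();
                if (e.isStartElement()) {
                    names.add(e.asStartElement().getName().getLocalPart());
                }
            }
            reader.close();
        } catch (XMLStreamException ex) {
            // Wrapped so that callers of this sketch need no checked exception handling.
            throw new IllegalStateException("could not read XML", ex);
        }
        return names;
    }
}
```

Because the caller pulls events one by one, it can stop, skip, or buffer at any point, which is what the rest of the class relies on.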

Following this we have two main functions. The first reads the opening tag of the document's root element, along with any important attributes on that tag; the second reads an actual record. Since the class reads all nested sub-elements of a record, it is important that we not treat the root element's opening tag as a regular record, since we would then wind up reading our entire document as one record! However, since there may be a need to work on part of a file, I allowed an override (setHeaderRead, on line 15 above) so we can skip reading the header. The opening tag is obviously not a complete XML element on its own, so we do some heavier work here: we return the attributes as a map for use by our application, in addition to the raw text, which can be used upstream in our application.

Here is the actual code:


    /**
     * The header (the document's opening tag) is not a complete element, so it gets
     * special treatment: its attributes are copied into the supplied map, and the raw
     * text of the document start and opening tag is returned.
     *
     * @param attributes map that receives the opening tag's attributes, keyed by local name
     * @return the raw data of the document start and header, or null if no opening tag was found
     * @throws XMLStreamException if the underlying stream cannot be read as XML
     */
    public String getHeaderAndAttributesAsMap(Map<String, String> attributes) throws XMLStreamException {
        boolean documentStart = false;
        StringBuilder rawData = new StringBuilder();
        while (reader.hasNext() && !headerRead) {
            XMLEvent e = reader.nextEvent();
            if (e.isStartDocument()) {
                documentStart = true;
            } else if (e.isStartElement()) {
                // the first start element is the header; stop after this iteration
                headerRead = true;
                StartElement startElement = e.asStartElement();
                for (Iterator<?> i = startElement.getAttributes(); i.hasNext();) {
                    Attribute attr = (Attribute) i.next();
                    attributes.put(attr.getName().getLocalPart(), attr.getValue());
                }
            }
            //skip whitespace before document
            if (documentStart) {
                rawData.append(e.toString());
            }
        }
        if (!headerRead) {
            return null;
        }
        return rawData.toString();
    }
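
To show what ends up in the map, here is a small standalone sketch of the same idea. The class HeaderAttributesSketch, the method rootAttributes and the sample document in the note below are invented for this example; the sketch reads only as far as the first start element and then stops.

```java
import java.io.StringReader;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper class, for illustration only.
class HeaderAttributesSketch {

    /** Returns the attributes of the first start element, keyed by local name. */
    static Map<String, String> rootAttributes(String xml) {
        Map<String, String> attributes = new LinkedHashMap<String, String>();
        try {
            XMLEventReader reader =
                    XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
            while (reader.hasNext()) {
                XMLEvent e = reader.nextEvent();
                if (e.isStartElement()) {
                    Iterator<?> i = e.asStartElement().getAttributes();
                    while (i.hasNext()) {
                        Attribute attr = (Attribute) i.next();
                        attributes.put(attr.getName().getLocalPart(), attr.getValue());
                    }
                    break; // the rest of the document is never read
                }
            }
            reader.close();
        } catch (XMLStreamException ex) {
            // Wrapped so that callers of this sketch need no checked exception handling.
            throw new IllegalStateException("could not read XML header", ex);
        }
        return attributes;
    }
}
```

For a made-up document starting with `<feed version="2" source="demo">`, the map would hold `version` and `source`. One thing to keep in mind: keying by local name means that two attributes with the same local name in different namespaces would overwrite each other.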


Now, on to the main function for reading each element of our XML file:


    public enum RecordStatus {BODY, NO_DATA, CLOSE_BODY_TAG, END_DOCUMENT}

    // StringEscapeUtils comes from Apache Commons Lang
    RecordStatus getNextRecord(StringBuffer recordData) throws XMLStreamException {
        if (!headerRead) {
            throw new IllegalStateException("need to read header line before calling this function");
        }
        RecordStatus rc = RecordStatus.NO_DATA;
        String recordName = null;
        boolean finishedReadingElement = false;
        boolean startOfRecord = false;
        while (reader.hasNext() && !finishedReadingElement) {
            XMLEvent e = reader.nextEvent();
            if (e.isStartElement()) {
                if (!startOfRecord) {
                    // the first start element we meet names the record
                    startOfRecord = true;
                    recordName = e.asStartElement().getName().getLocalPart();
                }
            } else if (e.isEndElement()) {
                EndElement endElement = e.asEndElement();
                // note: a nested element with the same name as the record would end the record early
                if (endElement.getName().getLocalPart().equals(recordName)) {
                    finishedReadingElement = true;
                    rc = RecordStatus.BODY;
                } else if (!startOfRecord) {
                    // an end tag before any record started is the close of the document body
                    //todo we could make this better by adding validation here to save the START_BODY_TAG
                    rc = RecordStatus.CLOSE_BODY_TAG;
                    finishedReadingElement = true;
                }
            } else if (e.isEndDocument()) {
                rc = RecordStatus.END_DOCUMENT;
            }
            //skip whitespace before element
            if (startOfRecord || rc == RecordStatus.CLOSE_BODY_TAG) {
                if (e.isCharacters()) {
                    //need to re-escape characters so the downstream parser won't choke on them
                    recordData.append(StringEscapeUtils.escapeXml(e.toString()));
                } else {
                    recordData.append(e.toString());
                }
            }
        }
        return rc;
    }


We use an enum to give us maximal information about what was read, while returning the actual raw data in a StringBuffer that is allocated outside our function.
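
One limitation of ending a record when an end tag matches the record's name is that a nested element with the same name would cut the record short. A possible alternative, sketched below, is to track element depth instead. This is not the code we run: the class DepthSplitterSketch and the method splitRecords are invented for this example, and it serializes the events with an XMLEventWriter rather than with toString() and manual re-escaping.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper class, for illustration only.
class DepthSplitterSketch {

    /** Serializes each direct child of the root element into its own string. */
    static List<String> splitRecords(String xml) {
        List<String> records = new ArrayList<String>();
        try {
            XMLEventReader reader =
                    XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
            XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();
            int depth = 0;              // 1 = inside the root, 2 or more = inside a record
            StringWriter buffer = null;
            XMLEventWriter writer = null;
            while (reader.hasNext()) {
                XMLEvent e = reader.nextEvent();
                if (e.isStartElement()) {
                    depth++;
                    if (depth == 2) {   // a direct child of the root starts a new record
                        buffer = new StringWriter();
                        writer = outputFactory.createXMLEventWriter(buffer);
                    }
                }
                if (writer != null) {
                    writer.add(e);      // copy the event into the current record
                }
                if (e.isEndElement()) {
                    if (depth == 2) {   // the record's own end tag, whatever its name
                        writer.flush();
                        writer.close();
                        records.add(buffer.toString());
                        writer = null;
                    }
                    depth--;
                }
            }
            reader.close();
        } catch (XMLStreamException ex) {
            // Wrapped so that callers of this sketch need no checked exception handling.
            throw new IllegalStateException("could not split records", ex);
        }
        return records;
    }
}
```

Unlike our extractor, this sketch collects every record in a list, which is only reasonable for small inputs; for a large file you would hand each record to the DOM step as soon as it is complete.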

Of course, you could do ALL the parsing in your StAX implementation as well, as described in other places, but this code allows us to continue to use the more elegant DOM for our actual work.
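
For completeness, here is a rough sketch of what the DOM side could look like for a single extracted record. Our real record-processing class is not shown in this post, so the class RecordDomSketch, the method parseRecord and the sample record in the note below are assumptions made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, for illustration only.
class RecordDomSketch {

    /** Builds a DOM for one record, given as a well-formed XML string. */
    static Document parseRecord(String recordXml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Only this one record is held in memory, never the whole file.
            return builder.parse(new InputSource(new StringReader(recordXml)));
        } catch (ParserConfigurationException | SAXException | IOException ex) {
            // Wrapped so that callers of this sketch need no checked exception handling.
            throw new IllegalStateException("could not parse record", ex);
        }
    }
}
```

With a made-up record such as `<item id="7"><name>a &amp; b</name></item>`, the code can look at the id attribute on the document element and at the text of the nested name element before deciding what to do, which is exactly the kind of decision that is awkward to make in a pure streaming pass.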