import java.io.InputStream;
import java.util.Iterator;
import java.util.Map;

import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.EndElement;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Assumed to come from Apache Commons Lang 2.x, which provides escapeXml
import org.apache.commons.lang.StringEscapeUtils;

public class StaxExtracter {
    XMLEventReader reader;
    boolean headerRead = false;

    public void openStream(InputStream s) throws XMLStreamException {
        XMLInputFactory inputFactory = XMLInputFactory.newInstance();
        reader = inputFactory.createXMLEventReader(s);
    }

    public void close() throws XMLStreamException {
        reader.close();
    }

    // Lets us read part of a document that does not include the header
    public void setHeaderRead(boolean val) {
        headerRead = val;
    }

Following this we have two main functions. The first reads the document's opening tag (the start tag of the root element) along with any important attributes it carries; the second reads a single record. Because the class reads every sub-element of a record recursively, we must not treat the opening tag as a regular record, or we would wind up reading the entire document as one record! However, since there may be a need to work on just part of a file, I added the setHeaderRead override above so that reading the header can be skipped. The opening tag on its own is obviously not a complete XML element, so we do some heavier work here: the attributes are returned in a map for use by our application, along with the raw text of the tag, which can be used upstream in our application.
Here is the actual code:

    /**
     * Since the header is not a complete element it needs special treatment: we read the
     * header attributes into a map, in addition to returning the raw data.
     *
     * @param attributes map that receives the attributes of the opening tag
     * @return the raw data of the document start and header, or null if no header was read
     * @throws XMLStreamException
     */
    public String getHeaderAndAttributesAsMap(Map<String, String> attributes) throws XMLStreamException {
        boolean documentStart = false;
        StringBuilder rawData = new StringBuilder();
        while (reader.hasNext() && !headerRead) {
            XMLEvent e = reader.nextEvent();
            if (e.isStartDocument()) {
                documentStart = true;
            } else if (e.isStartElement()) {
                headerRead = true;
                StartElement startElement = e.asStartElement();
                for (Iterator<?> i = startElement.getAttributes(); i.hasNext();) {
                    Attribute attr = (Attribute) i.next();
                    attributes.put(attr.getName().getLocalPart(), attr.getValue());
                }
            }
            // skip whitespace before document
            if (documentStart) {
                rawData.append(e.toString());
            }
        }
        if (!headerRead) {
            return null;
        }
        return rawData.toString();
    }

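To see the header step in isolation, here is a minimal, self-contained sketch that reads only the attributes of the root element and then stops. The class name RootAttributesDemo, the method readRootAttributes, and the sample document are made up for illustration; they are not part of the class above.

```java
import java.io.StringReader;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.XMLEvent;

// Hypothetical demo class, not part of StaxExtracter.
final class RootAttributesDemo {

    /** Reads events up to the first start element and returns that element's attributes. */
    static Map<String, String> readRootAttributes(String xml) throws XMLStreamException {
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        Map<String, String> attributes = new LinkedHashMap<>();
        try {
            while (reader.hasNext()) {
                XMLEvent e = reader.nextEvent();
                if (e.isStartElement()) {
                    Iterator<?> it = e.asStartElement().getAttributes();
                    while (it.hasNext()) {
                        Attribute attr = (Attribute) it.next();
                        attributes.put(attr.getName().getLocalPart(), attr.getValue());
                    }
                    // Stop here: everything after the opening tag belongs to the records.
                    break;
                }
            }
        } finally {
            reader.close();
        }
        return attributes;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Sample document invented for this sketch.
        String xml = "<?xml version=\"1.0\"?>"
                + "<orders batch=\"42\" source=\"web\"><order id=\"1\"/></orders>";
        System.out.println(readRootAttributes(xml));
    }
}
```

The order in which the parser hands back attributes is not guaranteed, so callers should look values up by name rather than rely on iteration order.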
    public enum RecordStatus {BODY, NO_DATA, CLOSE_BODY_TAG, END_DOCUMENT}

    RecordStatus getNextRecord(StringBuffer recordData) throws XMLStreamException {
        if (!headerRead) {
            throw new RuntimeException("need to read header line before calling this function");
        }
        RecordStatus rc = RecordStatus.NO_DATA;
        boolean finishedReadingElement = false;
        boolean startOfRecord = false;
        // nesting depth inside the current record; tracking depth instead of comparing
        // names keeps a nested element named like the record from ending it early
        int depth = 0;
        while (reader.hasNext() && !finishedReadingElement) {
            XMLEvent e = reader.nextEvent();
            if (e.isStartElement()) {
                startOfRecord = true;
                depth++;
            } else if (e.isEndElement()) {
                if (startOfRecord) {
                    if (--depth == 0) {
                        finishedReadingElement = true;
                        rc = RecordStatus.BODY;
                    }
                } else {
                    // todo we could make this better by adding validation here to save the START_BODY_TAG
                    rc = RecordStatus.CLOSE_BODY_TAG;
                    finishedReadingElement = true;
                }
            } else if (e.isEndDocument()) {
                rc = RecordStatus.END_DOCUMENT;
            }
            // skip whitespace before element
            if (startOfRecord || rc == RecordStatus.CLOSE_BODY_TAG) {
                if (e.isCharacters()) {
                    // need to re-escape characters so the SAX parser won't choke on them
                    recordData.append(StringEscapeUtils.escapeXml(e.toString()));
                } else {
                    recordData.append(e.toString());
                }
            }
        }
        return rc;
    }
}
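The re-escaping step deserves a quick illustration. A StAX parser hands back character data with entity references already decoded, so writing that text straight into a record would produce XML that a downstream parser rejects. The sketch below shows the effect with a hand-written escapeText helper; that helper, the class name ReEscapeDemo, and the sample input are assumptions for illustration only, standing in for the StringEscapeUtils.escapeXml call used above.

```java
import java.io.StringReader;

import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical demo class, not part of StaxExtracter.
final class ReEscapeDemo {

    /** Concatenates all character data in the document, as decoded by the parser. */
    static String textContent(String xml) throws XMLStreamException {
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        StringBuilder text = new StringBuilder();
        try {
            while (reader.hasNext()) {
                XMLEvent e = reader.nextEvent();
                if (e.isCharacters()) {
                    // Text may arrive in several events, so append each piece.
                    text.append(e.asCharacters().getData());
                }
            }
        } finally {
            reader.close();
        }
        return text.toString();
    }

    /** Minimal escaping for element text; '&' must be replaced first. */
    static String escapeText(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) throws XMLStreamException {
        String decoded = textContent("<note>a &lt; b &amp; c</note>");
        System.out.println(decoded);             // text as the parser reports it
        System.out.println(escapeText(decoded)); // text that is safe to embed in XML again
    }
}
```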
Of course, you could do all the parsing in your StAX implementation as well, as described here and in other places, but this code lets us keep using the more elegant DOM for the actual work.
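As a rough sketch of how the two halves fit together, the example below splits a document into records with StAX and hands each record to a DOM parser. It is a simplified stand-in rather than the class above: it serializes events with the JDK's XMLEventWriter, which takes care of escaping, instead of appending e.toString(). The class name StaxToDomDemo, its methods, and the sample document are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of StaxExtracter.
final class StaxToDomDemo {

    /** Returns each direct child of the root element as a standalone XML string. */
    static List<String> splitRecords(String xml) throws Exception {
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();
        List<String> records = new ArrayList<>();
        int depth = 0;              // 1 = root element, 2 = record element
        StringWriter buffer = null;
        XMLEventWriter writer = null;
        while (reader.hasNext()) {
            XMLEvent e = reader.nextEvent();
            if (e.isStartElement() && ++depth == 2) {
                // A new record starts: collect its events in a fresh buffer.
                buffer = new StringWriter();
                writer = outputFactory.createXMLEventWriter(buffer);
            }
            if (writer != null) {
                writer.add(e);
            }
            if (e.isEndElement() && depth-- == 2) {
                // The record's own end tag was just written: finish this record.
                writer.flush();
                writer.close();
                records.add(buffer.toString());
                writer = null;
            }
        }
        reader.close();
        return records;
    }

    /** Parses one record into a DOM document for the "actual work". */
    static Document parseRecord(String recordXml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(recordXml)));
    }

    public static void main(String[] args) throws Exception {
        // Sample document invented for this sketch.
        String xml = "<orders batch=\"42\">"
                + "<order id=\"1\"><name>A &amp; B</name></order>"
                + "<order id=\"2\"><name>C</name></order>"
                + "</orders>";
        for (String record : splitRecords(xml)) {
            Document doc = parseRecord(record);
            System.out.println(doc.getDocumentElement().getAttribute("id")
                    + ": " + doc.getDocumentElement().getTextContent());
        }
    }
}
```

Only one record is held in memory as a DOM tree at a time, which is the point of the approach: the streaming parser deals with the size of the file, and DOM deals with the structure of each record.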