Sunday, May 17, 2015

XML Parsing & Validation

In my previous post, I discussed how to create an XML document from POJOs with namespaces. In this post I'm going to describe how to parse and validate an XML document. This is the XML generated by the previous example.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="students.xsl"?>
<studentsns:students xmlns:studentsns="http://www.charitha.org/students"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="StudentsSchema.xsd">
 <studentsns:student id="1">
  <studentsns:name>Charitha</studentsns:name>
  <studentsns:year>2009</studentsns:year>
  <studentsns:nic>1234567890v</studentsns:nic>
 </studentsns:student>
 <studentsns:student id="2">
  <studentsns:name>Achithra</studentsns:name>
  <studentsns:year>2009</studentsns:year>
  <studentsns:nic>9876543210v</studentsns:nic>
 </studentsns:student>
 <studentsns:student id="3">
  <studentsns:name>Malan</studentsns:name>
  <studentsns:year>2011</studentsns:year>
  <studentsns:nic>1597532632v</studentsns:nic>
 </studentsns:student>
</studentsns:students>

I'm also using the following XML schema to validate the XML above.

<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://www.w3.org/2001/XMLSchema" targetNamespace="http://www.charitha.org/students"
 xmlns:studentsns="http://www.charitha.org/students" elementFormDefault="qualified">
 <element name="students">
  <complexType>
   <sequence>
    <element ref="studentsns:student" minOccurs="0" maxOccurs="unbounded" />
   </sequence>
  </complexType>
 </element>

 <element name="student">
  <complexType>
   <sequence>
    <element name="name" type="string" />
    <element name="year" type="string" />
    <element name="nic" type="string" />
   </sequence>
   <attribute name="id" type="integer" use="required"/>
  </complexType>
 </element>
</schema>

Here I'm going to explain three parser APIs for XML: DOM, SAX, and StAX. Each of them has its own advantages and drawbacks, so first I'd like to list the three APIs and briefly compare them.
  • DOM - Pulls the whole document into memory as a tree that you can walk around in. Good for comparatively small pieces of XML that need complex processing. XSLT processors typically work on an in-memory tree of this kind.
  • SAX - Walks the XML as it arrives and fires events for the things that fly past. Good for large amounts of data or comparatively simple processing.
  • StAX - Much like SAX, but instead of responding to events pushed by the parser, you pull them by iterating through the XML yourself.

DOM

First let's look at the DOM parser. javax.xml.parsers.DocumentBuilderFactory defines a factory API that enables applications to obtain a parser that produces DOM object trees from XML documents. Since we are going to validate the XML against a schema that has a target namespace, the factory must be made namespace-aware. An instance of javax.xml.parsers.DocumentBuilder can be obtained from the documentBuildFactory.newDocumentBuilder() method. Once an instance of this class is obtained, XML can be parsed from a variety of input sources: InputStreams, Files, URLs, and SAX InputSources. In the previous post I used the builder only to create a new, empty document; here I'm using it to parse the existing XML file into an org.w3c.dom.Document.

DocumentBuilderFactory documentBuildFactory = DocumentBuilderFactory.newInstance();
documentBuildFactory.setNamespaceAware(true);
DocumentBuilder documentBuilder = documentBuildFactory.newDocumentBuilder();
Document document = documentBuilder.parse("students.xml");

Now we can validate the parsed document against the XML schema shown above. javax.xml.validation.SchemaFactory is a schema compiler: it reads external representations of schemas and prepares them for validation. When we ask for a new SchemaFactory instance we have to specify the schema language, which for W3C XML Schema is the namespace URI in XMLConstants.W3C_XML_SCHEMA_NS_URI. We then use the factory to create a javax.xml.validation.Schema object from the schema file. A Schema object represents a set of constraints that can be checked or enforced against an XML document; it is an immutable in-memory representation of the grammar. Finally, a Validator obtained from the schema checks the document and throws a SAXException if it is not valid.

SchemaFactory schemaFactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
Schema schema = schemaFactory.newSchema(new File("studentsSchema.xsd"));
Validator validator = schema.newValidator();
validator.validate(new DOMSource(document));
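
To see what a failed validation looks like, here is a minimal, self-contained sketch. It is not part of the original project: the class name DomValidationSketch, the helper isValid, and the cut-down inline schema and documents are assumptions made for illustration. With no ErrorHandler set, the validator throws a SAXException on the first problem it finds, and the sketch simply turns that into a boolean.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class DomValidationSketch {
    static final String NS = "http://www.charitha.org/students";

    // Cut-down, single-element schema inlined for the sketch (hypothetical).
    static final String XSD =
        "<schema xmlns='http://www.w3.org/2001/XMLSchema' targetNamespace='" + NS + "'"
      + " elementFormDefault='qualified'>"
      + "<element name='student'><complexType><sequence>"
      + "<element name='name' type='string'/><element name='year' type='string'/>"
      + "</sequence><attribute name='id' type='integer' use='required'/>"
      + "</complexType></element></schema>";

    static final String VALID =
        "<s:student xmlns:s='" + NS + "' id='1'>"
      + "<s:name>Charitha</s:name><s:year>2009</s:year></s:student>";

    // Uses <student> where the schema expects <year>.
    static final String INVALID =
        "<s:student xmlns:s='" + NS + "' id='1'>"
      + "<s:name>Charitha</s:name><s:student>2009</s:student></s:student>";

    /** Returns true if the XML parses and satisfies the inline schema. */
    static boolean isValid(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document document = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
            schema.newValidator().validate(new DOMSource(document));
            return true;
        } catch (SAXException e) {
            // Thrown for malformed XML and for schema violations alike.
            System.err.println("Not valid: " + e.getMessage());
            return false;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid(VALID));
        System.out.println(isValid(INVALID));
    }
}
```

In a real application you would probably let the exception propagate, or register an ErrorHandler on the validator, instead of printing the message.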

Then we can extract the elements from the parsed document, starting from its root element. Since our XML contains a list of students, I'm using an ArrayList to hold each student record as a POJO and a NodeList to hold the student elements from the parsed XML.

ArrayList<Student> students = new ArrayList<Student>();

// Extract nodes list from the xml file
NodeList nodes = document.getDocumentElement().getChildNodes();

// Iterates through all nodes
for (int i = 0; i < nodes.getLength(); i++) {
 Node node = nodes.item(i);
 if (node.getNodeType() == Node.ELEMENT_NODE) {
  Student student = new Student();
  student.setStudentId(Integer.parseInt(node.getAttributes()
   .getNamedItem("id")
   .getNodeValue()));

  NodeList childNodes = node.getChildNodes();

  for (int j = 0; j < childNodes.getLength(); j++) {
   Node childNode = childNodes.item(j);
   if (childNode.getNodeType() == Node.ELEMENT_NODE) {
    // extract the text content of the element
    String elmBody = childNode.getTextContent();
    // dispatch on the local name so the code does not depend on the prefix
    switch (childNode.getLocalName()) {
     case "name":
      student.setStudentName(elmBody);
      break;
     case "year":
      student.setRegYear(Integer.parseInt(elmBody));
      break;
     case "nic":
      student.setNic(elmBody);
      break;
     default:
      break;

    }
   }
  }
  students.add(student);
 }
}
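
As an alternative to walking the child nodes by hand, a namespace-aware document can also be queried with getElementsByTagNameNS. The following is only a sketch under my own assumptions: the class DomLookupSketch, the method studentNames, and the inline document (which deliberately binds the namespace to a different prefix) are hypothetical and not part of the original example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomLookupSketch {
    static final String NS = "http://www.charitha.org/students";

    // Same namespace as the sample document, bound to another prefix.
    static final String XML =
        "<x:students xmlns:x='" + NS + "'>"
      + "<x:student id='1'><x:name>Charitha</x:name><x:year>2009</x:year></x:student>"
      + "<x:student id='2'><x:name>Achithra</x:name><x:year>2009</x:year></x:student>"
      + "</x:students>";

    /** Returns "id:name" for every student element in the namespace. */
    static List<String> studentNames(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document document = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            List<String> names = new ArrayList<>();
            // Matches on namespace URI + local name, whatever the prefix is.
            NodeList students = document.getElementsByTagNameNS(NS, "student");
            for (int i = 0; i < students.getLength(); i++) {
                Element student = (Element) students.item(i);
                Element name = (Element) student.getElementsByTagNameNS(NS, "name").item(0);
                names.add(student.getAttribute("id") + ":" + name.getTextContent());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(studentNames(XML)); // [1:Charitha, 2:Achithra]
    }
}
```

Because the lookup uses the namespace URI rather than the prefix, it keeps working if the document is serialized with a different prefix.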

SAX

Implementing SAX validation is simpler than implementing DOM validation. We use the schema factory in the same way as in the DOM validation, but we don't need to build a DOM source from our XML first. Instead we pass a javax.xml.transform.stream.StreamSource created from the XML file to the validator.

SchemaFactory schemaFactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
Schema schema = schemaFactory.newSchema(new File("studentsSchema.xsd"));
Validator validator = schema.newValidator();
validator.validate(new StreamSource(new File("students.xml")));

Since SAX parsing is event-driven, we need an event handler to handle the events and build POJOs from the XML. So we create a separate handler class by extending the org.xml.sax.helpers.DefaultHandler class. When the parser starts parsing an XML stream, the startDocument method is called, and once the stream has ended, the endDocument method is called. When the parser detects the start or end of an element, it calls the startElement and endElement methods respectively, with the following parameters.

  • uri - The Namespace URI, or the empty string if the element has no Namespace URI or if Namespace processing is not being performed.
  • localName - The local name (without prefix), or the empty string if Namespace processing is not being performed.
  • qName - The qualified name (with prefix), or the empty string if qualified names are not available.

The startElement method has an additional parameter for the attributes attached to the element. If there are no attributes, it is an empty Attributes object. The body content of each element is delivered through the characters method. Finally, if a recoverable error such as a validation error occurs during parsing, the error method is called.

import java.util.ArrayList;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class StudentsSAXHandler extends DefaultHandler {

 private ArrayList<Student> students;
 private Student student;
 // buffer for the text of the current element; the parser may deliver
 // the text of a single element in several characters() calls
 private final StringBuilder elmBody = new StringBuilder();

 @Override
 public void startDocument() throws SAXException {
  students = new ArrayList<Student>();
 }

 @Override
 public void endDocument() throws SAXException {
  // printing student details
  for (Student s : students) {
   System.out.println(s.toString());
  }
 }

 @Override
 public void startElement(String uri, String localName, String qName,
                          Attributes attributes) throws SAXException {
  // start collecting text for the new element
  elmBody.setLength(0);
  if ("student".equals(localName)) {
   student = new Student();
   student.setStudentId(Integer.parseInt(attributes.getValue("id")));
  }
 }

 @Override
 public void endElement(String uri, String localName, String qName) throws SAXException {
  String text = elmBody.toString().trim();
  switch (localName) {
   case "student":
    students.add(student);
    break;
   case "name":
    student.setStudentName(text);
    break;
   case "year":
    student.setRegYear(Integer.parseInt(text));
    break;
   case "nic":
    student.setNic(text);
    break;
   default:
    break;
  }
 }

 @Override
 public void characters(char ch[], int start, int length) throws SAXException {
  // append instead of overwrite, because text can arrive in chunks
  elmBody.append(ch, start, length);
 }

 @Override
 public void error(SAXParseException e) {
  // called for recoverable errors such as schema validation failures
  System.err.println("Parsing error: " + e.getMessage());
 }

}
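
The handler above matches on localName, which is only filled in when the parser is namespace-aware. The following sketch makes that visible by recording the three name parameters for each start tag. The class SaxNamesSketch, the method elementNames, and the tiny inline document are my own assumptions for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxNamesSketch {
    static final String XML =
        "<s:students xmlns:s='http://www.charitha.org/students'>"
      + "<s:student id='1'/></s:students>";

    /** Records uri, localName and qName for every start tag. */
    static List<String> elementNames(String xml, boolean namespaceAware) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(namespaceAware);
            final List<String> seen = new ArrayList<>();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes attributes) {
                        seen.add("uri=" + uri + " localName=" + localName
                                + " qName=" + qName);
                    }
                });
            return seen;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Compare the parameters with and without namespace processing.
        for (String line : elementNames(XML, true)) {
            System.out.println("aware:   " + line);
        }
        for (String line : elementNames(XML, false)) {
            System.out.println("unaware: " + line);
        }
    }
}
```

Running it with both settings shows why the factory in this post is always made namespace-aware before the parser is created.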

Now that we have a handler for the SAX parser, we can create a parser that uses it. We create an instance of javax.xml.parsers.SAXParserFactory, make it namespace-aware so that the handler receives local names, and attach the schema so that the document is also validated while it is being parsed; validation problems are then reported to the handler's error method. An instance of javax.xml.parsers.SAXParser can be obtained from the SAXParserFactory.newSAXParser() method. Once an instance of this class is obtained, XML can be parsed from a variety of input sources: InputStreams, Files, URLs, and SAX InputSources. Here I'm using the XML file as the input source.

// Get SAX Parser Factory
SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setNamespaceAware(true);
factory.setSchema(schema);
SAXParser parser = factory.newSAXParser();
parser.parse(new File("students.xml"), new StudentsSAXHandler());
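
When a schema is attached to the factory, validation problems do not stop the parse by themselves; they arrive in the handler's error callback. Here is a small sketch of that behaviour, assuming a cut-down inline schema and a hypothetical class SaxValidationSketch with a countErrors helper, none of which are part of the original code.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class SaxValidationSketch {
    static final String NS = "http://www.charitha.org/students";

    // Cut-down, single-element schema inlined for the sketch (hypothetical).
    static final String XSD =
        "<schema xmlns='http://www.w3.org/2001/XMLSchema' targetNamespace='" + NS + "'"
      + " elementFormDefault='qualified'>"
      + "<element name='student'><complexType><sequence>"
      + "<element name='name' type='string'/><element name='year' type='string'/>"
      + "</sequence><attribute name='id' type='integer' use='required'/>"
      + "</complexType></element></schema>";

    static final String VALID =
        "<s:student xmlns:s='" + NS + "' id='1'>"
      + "<s:name>Charitha</s:name><s:year>2009</s:year></s:student>";

    // The required id attribute is missing.
    static final String INVALID =
        "<s:student xmlns:s='" + NS + "'>"
      + "<s:name>Charitha</s:name><s:year>2009</s:year></s:student>";

    /** Parses with validation switched on and counts error() callbacks. */
    static int countErrors(String xml) {
        try {
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setSchema(schema);

            final int[] errors = {0};
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void error(SAXParseException e) {
                        // Recoverable error: count it and let parsing continue.
                        errors[0]++;
                    }
                });
            return errors[0];
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("valid document errors:   " + countErrors(VALID));
        System.out.println("invalid document errors: " + countErrors(INVALID));
    }
}
```

If an invalid document should abort the parse, the error method can rethrow the SAXParseException instead of counting it.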

StAX

In StAX validation, javax.xml.stream.XMLInputFactory is used to create readers for XML coming from an input stream or a file. Using the XML input factory we create a javax.xml.stream.XMLStreamReader, which is then wrapped in a javax.xml.transform.stax.StAXSource that can be validated by the validator.

XMLInputFactory factory = XMLInputFactory.newInstance();
SchemaFactory schemaFactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
XMLStreamReader xmlStreamReader = factory.createXMLStreamReader(new FileInputStream("students.xml"));
Schema schema = schemaFactory.newSchema(new File("studentsSchema.xsd"));
Validator validator = schema.newValidator();
validator.validate(new StAXSource(xmlStreamReader));

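The validator reads the stream reader through to the end of the document, so the same reader cannot be reused for parsing. We therefore open a fresh reader and then iterate through its events to build our POJOs from the XML.

// The validated reader is exhausted; open a fresh one for parsing
xmlStreamReader = factory.createXMLStreamReader(new FileInputStream("students.xml"));

ArrayList<Student> students = null;
Student student = null;
String elmBody = null;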

while (xmlStreamReader.hasNext()) {
 switch (xmlStreamReader.next()) {
  case XMLStreamConstants.START_ELEMENT:
   if ("students".equals(xmlStreamReader.getLocalName())) {
    // the root element starts the list; a new reader is already positioned
    // at START_DOCUMENT, so next() never reports that event
    students = new ArrayList<>();
   } else if ("student".equals(xmlStreamReader.getLocalName())) {
    student = new Student();
    // look the attribute up by name rather than by position
    student.setStudentId(Integer.parseInt(xmlStreamReader.getAttributeValue(null, "id")));
   }
   break;
  case XMLStreamConstants.CHARACTERS:
   elmBody = xmlStreamReader.getText().trim();
   break;
  case XMLStreamConstants.END_ELEMENT:
   switch (xmlStreamReader.getLocalName()) {
    case "student":
     students.add(student);
     break;
    case "name":
     student.setStudentName(elmBody);
     break;
    case "year":
     student.setRegYear(Integer.parseInt(elmBody));
     break;
    case "nic":
     student.setNic(elmBody);
     break;
    default:
     break;
   }
   break;
  default:
   break;
 }
}
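
The whole StAX flow, validating with one reader and parsing with a second, can be tried out with the following self-contained sketch. The class StaxSketch, the method validateThenRead, and the cut-down inline schema and document are assumptions for illustration only. It uses getElementText() to read text-only elements, which is a slightly different technique from tracking CHARACTERS events as the loop above does.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;

public class StaxSketch {
    static final String NS = "http://www.charitha.org/students";

    // Cut-down schema with locally declared elements (hypothetical).
    static final String XSD =
        "<schema xmlns='http://www.w3.org/2001/XMLSchema' targetNamespace='" + NS + "'"
      + " elementFormDefault='qualified'>"
      + "<element name='students'><complexType><sequence>"
      + "<element name='student' maxOccurs='unbounded'><complexType><sequence>"
      + "<element name='name' type='string'/>"
      + "</sequence><attribute name='id' type='integer' use='required'/>"
      + "</complexType></element>"
      + "</sequence></complexType></element></schema>";

    static final String XML =
        "<s:students xmlns:s='" + NS + "'>"
      + "<s:student id='1'><s:name>Charitha</s:name></s:student>"
      + "<s:student id='2'><s:name>Achithra</s:name></s:student>"
      + "</s:students>";

    /** Validates with one reader, then collects the names with a second one. */
    static List<String> validateThenRead(String xml) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));

            // First reader: handed to the validator, which reads it to the end.
            XMLStreamReader forValidation = factory.createXMLStreamReader(new StringReader(xml));
            schema.newValidator().validate(new StAXSource(forValidation));
            forValidation.close();

            // Second reader: starts again from the beginning of the document.
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            List<String> names = new ArrayList<>();
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "name".equals(reader.getLocalName())) {
                    // Reads the text and moves to the matching end tag.
                    names.add(reader.getElementText());
                }
            }
            reader.close();
            return names;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(validateThenRead(XML)); // [Charitha, Achithra]
    }
}
```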
