I have been looking into the XML parsers available for Erlang. I found three main ones.

* Xmerl from Erlang distribution
* Erlsom
* Linked-in driver based on libexpat from ejabberd

I benchmarked all three parsers by parsing a 42K XML file.

xmerl took 124 ms, erlsom took 28 ms, and the linked-in driver based parser took 7 ms.
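For anyone who wants to repeat this kind of measurement, here is a minimal sketch of a timing harness built on timer:tc/1 from OTP. The module name xml_bench and both function names are made up for illustration, and this is not necessarily how the numbers above were obtained. It only wires up xmerl; the other parsers could be timed by passing a different fun to avg_ms/2.

```erlang
-module(xml_bench).
-export([avg_ms/2, xmerl_ms/2]).

%% Run Fun() N times and return the mean wall-clock time per run
%% in milliseconds (always a float).
avg_ms(Fun, N) when is_function(Fun, 0), is_integer(N), N > 0 ->
    {Micros, _} = timer:tc(fun() -> repeat(Fun, N) end),
    Micros / (N * 1000).

repeat(_Fun, 0) -> ok;
repeat(Fun, N) -> Fun(), repeat(Fun, N - 1).

%% Time xmerl on an XML document held in a string (list of characters).
xmerl_ms(XmlString, N) ->
    avg_ms(fun() -> xmerl_scan:string(XmlString) end, N).
```

Averaging over several runs smooths out garbage collection and scheduling noise, which matters when the individual timings are only a few milliseconds.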

The libexpat based parser is the fastest, but it has some drawbacks.
* It cannot make callbacks the way a SAX parser does, because a linked-in driver cannot make an RPC call into the host VM. So it is not good for parsing huge XML data.
* One loses platform independence with it. The linked-in driver has to be compiled for the platform one is working on.

Erlsom seems to be better than the default xmerl parser.
* It also generates an Erlang data structure (tuples and lists).
* It provides a continuation function callback for when there is not enough data to continue parsing.
* It can convert the XML to an Erlang data structure as per the XSD.
* It supports SAX based parsing.
* Some limitations are listed here
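To give a feel for the SAX style, here is a small sketch that counts start tags without building a tree. It assumes erlsom is on the code path and that erlsom:parse_sax/3 takes (Xml, Acc0, EventFun) and returns {ok, AccOut, Rest}, with start tags delivered as {startElement, Uri, LocalName, Prefix, Attributes}. That is how I recall the erlsom documentation, so check it against the version you have. The module and function names are made up for illustration.

```erlang
-module(erlsom_sax_count).
-export([count_elements/1, on_event/2]).

%% Count the start tags in Xml. The accumulator is a plain integer.
%% ASSUMPTION: erlsom:parse_sax/3 returns {ok, AccOut, Rest}.
count_elements(Xml) ->
    {ok, Count, _Rest} = erlsom:parse_sax(Xml, 0, fun on_event/2),
    Count.

%% Event handler: bump the counter on every start tag and
%% ignore every other event (characters, end tags, and so on).
on_event({startElement, _Uri, _LocalName, _Prefix, _Attributes}, Count) ->
    Count + 1;
on_event(_Other, Count) ->
    Count.
```

Because the handler only carries a counter, memory use stays flat no matter how large the document is, which is the main reason to prefer SAX for huge files.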

The linked-in driver libexpat based parser is the fastest one, but it is not flexible enough. It returns a list of tuples where the first element of each tuple is an integer that indicates the beginning of an element, the end of an element, or cdata/character content. A further parsing step is necessary to convert this to the xmerl structure or any other desired structure.
* Since it does not provide callbacks, it is not desirable for parsing huge files.
* The DOM it generates needs re-parsing.
* Since it is a libexpat based parser, it checks for UTF-8 validity, etc.
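The conversion step can be sketched as a fold over the event list with a stack of open elements. The exact tuple shapes here are an assumption: I use {0, {Name, Attrs}} for a start tag, {1, Name} for an end tag and {2, Text} for character data, and the module name expat_events is made up. Adjust the macros and patterns to whatever the driver actually emits.

```erlang
-module(expat_events).
-export([to_tree/1]).

%% ASSUMPTION: integer tags used by the driver; change as needed.
-define(XML_START, 0).
-define(XML_END, 1).
-define(XML_CDATA, 2).

%% Turn a flat event list into a nested {Name, Attrs, Children} tuple.
%% Returns {ok, Root} or {error, Reason}.
to_tree(Events) ->
    build(Events, []).

%% The stack holds the open elements, innermost first, each as
%% {Name, Attrs, ChildrenInReverseOrder}.
build([{?XML_START, {Name, Attrs}} | Rest], Stack) ->
    build(Rest, [{Name, Attrs, []} | Stack]);
build([{?XML_CDATA, Text} | Rest], [{Name, Attrs, Kids} | Stack]) ->
    build(Rest, [{Name, Attrs, [Text | Kids]} | Stack]);
build([{?XML_END, Name} | Rest], [{Name, Attrs, Kids} | Stack]) ->
    %% The end tag matches the innermost open element: close it.
    Element = {Name, Attrs, lists:reverse(Kids)},
    case Stack of
        [] when Rest =:= [] ->
            {ok, Element};
        [{PName, PAttrs, PKids} | Outer] ->
            %% Attach the closed element to its parent.
            build(Rest, [{PName, PAttrs, [Element | PKids]} | Outer]);
        [] ->
            {error, trailing_events}
    end;
build(_, _) ->
    {error, malformed_events}.
```

For example, the events for `<a><b id="1">hi</b></a>` come out as {ok, {"a", [], [{"b", [{"id", "1"}], ["hi"]}]}}. Note that this still needs the whole event list in memory, so it does not help with huge files.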

I have heard that the next version of xmerl is going to be faster than the current one. I have not gotten the latest xmerl parser yet. Once I have it, I will run the benchmark again and write another post on my findings.

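UPDATE: xmerl also supports validating XML against an XSD via the xmerl_xsd module.
% xmerl_scan:string/1 returns {Element, Rest}, not {ok, Element}
{Xml, _Rest} = xmerl_scan:string(XmlString),
{ok, Schema} = xmerl_xsd:process_schema(XsdFile),
% returns {ValidElement, GlobalState} on success or {error, Reasons}
xmerl_xsd:validate(Xml, Schema)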

UPDATE: I tried xmerl_scan:file/1 with the R13 release. It is a significant improvement in performance: it can now parse a 44K file in 35-40 ms. That is still slower than erlsom and the expat based parser.
