Processing MediaWiki XML with STX

From Meta, a Wikimedia project coordination wiki

Very large XML files exported by MediaWiki are often too big to be processed with powerful transformation languages such as XSLT, whose processors typically build the whole document in memory. You can use a SAX parser, which is available for almost any programming language. You can also try to parse parts of the XML code directly, but this approach is very difficult to maintain. An alternative is Streaming Transformations for XML (STX), a one-pass transformation language for XML documents. You can also combine STX and XSLT.
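For comparison, the SAX approach mentioned above can be sketched in a few lines of Python using the standard `xml.sax` module. This is only an illustration, not part of the original page: the `TitleCollector` class, the `collect_titles` helper and the tiny `SAMPLE` document are hypothetical names and data, and the sketch assumes that each `<page>` element of the export contains a `<title>` child.

```python
import io
import xml.sax


class TitleCollector(xml.sax.ContentHandler):
    """Collect the text of every <title> element found inside a <page>."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_page = False
        self._buffer = None  # list of text chunks while inside <title>

    def startElement(self, name, attrs):
        if name == "page":
            self._in_page = True
        elif name == "title" and self._in_page:
            self._buffer = []

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so accumulate them.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title" and self._buffer is not None:
            self.titles.append("".join(self._buffer))
            self._buffer = None
        elif name == "page":
            self._in_page = False


def collect_titles(stream):
    """Parse an XML export from a binary file-like object in one pass."""
    handler = TitleCollector()
    xml.sax.parse(stream, handler)
    return handler.titles


# Hypothetical, heavily shortened stand-in for a real export file.
SAMPLE = b"""<mediawiki>
  <siteinfo><sitename>Example</sitename></siteinfo>
  <page><title>Main Page</title><revision><text>Hello</text></revision></page>
  <page><title>Sandbox</title><revision><text>Test</text></revision></page>
</mediawiki>"""

if __name__ == "__main__":
    print(collect_titles(io.BytesIO(SAMPLE)))
```

For a real compressed dump, the same function could presumably be fed a stream opened with `gzip.open("pages_full.xml.gz", "rb")`. The handler keeps only the current title in memory, which is what makes this style workable for huge files, but every extraction rule has to be hand-coded as state in the handler.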

Because STX works in a single pass, it can process large documents and streams. For instance, you can decompress a dump and pipe it directly into the STX processor Joost:

zcat pages_full.xml.gz | java -jar joost.jar - myscript.stx

STX Scripts