Processing MediaWiki XML with STX
XML files exported by MediaWiki, especially full dumps, are often too large to be processed with powerful transformation tools like XSLT, which typically load the whole document into memory. You can use a SAX parser instead; one is available for almost any programming language. You can also try to parse parts of the XML code directly, but this approach is very difficult to maintain. An alternative is Streaming Transformations for XML (STX), a one-pass transformation language for XML documents. You can also combine STX and XSLT.
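To illustrate the SAX route mentioned above, here is a minimal sketch in Python using the standard `xml.sax` module. It collects page titles while reading the input as a stream, so memory use does not grow with the size of the dump (apart from the list of titles itself). The class name `PageTitleHandler`, the function `extract_titles`, and the tiny `SAMPLE` document are made up for this illustration; a real dump carries more elements and a namespace declaration, so treat this as a starting point rather than a complete tool.

```python
import io
import xml.sax


class PageTitleHandler(xml.sax.ContentHandler):
    """Hypothetical handler: collects the text of every <title> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        # Start buffering text when a <title> element opens.
        if name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # SAX may deliver text in several chunks, so accumulate them.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        # Store the complete title once the element closes.
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._in_title = False


def extract_titles(stream):
    """Parse a file-like object event by event and return all titles."""
    handler = PageTitleHandler()
    xml.sax.parse(stream, handler)
    return handler.titles


# Assumed, heavily simplified stand-in for a MediaWiki export.
SAMPLE = b"""<mediawiki>
  <page><title>Main Page</title><id>1</id></page>
  <page><title>Help:Contents</title><id>2</id></page>
</mediawiki>"""

if __name__ == "__main__":
    print(extract_titles(io.BytesIO(SAMPLE)))
```

For a compressed dump, the same function should accept a stream such as `gzip.open("pages_full.xml.gz", "rb")`. Handlers like this tend to grow quickly as more elements are tracked, which is one reason a dedicated streaming language like STX can be easier to maintain.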
STX enables the processing of large documents and streams. For instance, you can process a compressed dump by piping it into Joost, a Java implementation of STX:
zcat pages_full.xml.gz | java -jar joost.jar - myscript.stx
- /Add namespaces - a first example
- /Page ids - get page ids (useful to analyse link tables)
- /Extract templates - used to extract special templates