Rxr for WikiXRay


Copy and paste the following code into a text file and save it as rxr.py. Don't forget to make the file executable.

This program reads the XML of a MediaWiki dump (by default from standard input), looking for the page whose id is given in the string variable firstPageId. It copies the dump header, skips every page that comes before that one, and from that page onwards writes everything through unchanged, so that the output (by default standard output) is a valid XML stream the WikiXRay parser can consume.

This program makes it possible to resume the analysis of a wiki when it was interrupted before completion.
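
The only structural assumption the script makes is that each page element in the dump starts with a <page> line immediately followed by a <title> line and an <id> line. The snippet below is an illustration only (the sample lines are invented) of that layout and of the regular expression the script uses to pull out the page id:

import re

# Illustration only: the per-page layout rxr.py expects and the same
# regular expression it uses to extract the page id.
sample_lines = [
    '  <page>\n',
    '    <title>Example</title>\n',
    '    <id>12345</id>\n',
]

idP = re.compile(r'\s*<id>(.+)</id>\s*')
print idP.match(sample_lines[2]).group(1)   # prints: 12345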

Its typical usage is:

  7za e -so enwiki-20100130-pages-meta-history.xml.7z |
           python RikyXRay/rxr.py |
           python dump_sax.py
   

rxr fits transparently into the middle of the pipe, between the decompression step and the WikiXRay parser.
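
If you prefer to drive the same pipe from Python rather than from the shell, a minimal sketch (reusing the file and script names from the example above, which you will need to adapt) could look like this:

import subprocess

# Sketch only: the same decompress | rxr | parse pipeline driven from Python.
# File and script paths are taken from the example above; adapt as needed.
decompress = subprocess.Popen(
    ['7za', 'e', '-so', 'enwiki-20100130-pages-meta-history.xml.7z'],
    stdout=subprocess.PIPE)
rxr = subprocess.Popen(
    ['python', 'RikyXRay/rxr.py'],
    stdin=decompress.stdout, stdout=subprocess.PIPE)
parser = subprocess.Popen(
    ['python', 'dump_sax.py'],
    stdin=rxr.stdout)
parser.wait()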

#############################################
#      rxr: a preprocessor for WikiXRay                                              
#############################################
# This program is free software. You can redistribute it and/or modify    
# it under the terms of the GNU General Public License as published by 
# the Free Software Foundation; either version 2 or later of the GPL.     
#############################################
# Author: Riccardo Tasso                         

import sys,codecs,re

def main(input):

	# Id of the page from which the output should restart; edit this value
	# before running
	firstPageId = '12345'

	# Progress and status messages are written to error.log
	error = open('error.log', 'w')
	
	# Matches the opening <page> tag of a page element
	pageP = re.compile(r'\s*<page>\s*')
	lastPageTag = None
	pageCount = 0
	
	lastTitleTag = None
	
	# Extracts the page id from an <id>...</id> line
	idP = re.compile(r'\s*<id>(.+)</id>\s*')
	lastIdTag = None
	
	# True once the first <page> tag has been seen (i.e. the header is over)
	firstPageFound = False
	
	# True once the page with firstPageId has been reached
	startWriting = False
	
	line = input.readline()
	while line != '':
		
		# A new <page> starts while we are still looking for the target page
		if not startWriting and pageP.match(line):
			
			firstPageFound = True
			
			lastPageTag = line
			pageCount += 1
			
			if pageCount % 10000 == 0:
				error.write(str(pageCount) + ' pages found\n')
				error.flush()
			
			# In the dump the <title> and <id> lines immediately follow <page>
			lastTitleTag = input.readline()
			
			lastIdTag = input.readline()
			pageId = idP.match(lastIdTag).group(1)
			
			# Target page reached: write its opening lines and switch to
			# pass-through mode
			if str(firstPageId) == str(pageId):
				startWriting = True
				error.write('work reprise for page id: ' + str(pageId) + '\n')
				print lastPageTag.strip('\n')
				print lastTitleTag.strip('\n')
				print lastIdTag.strip('\n')
				line = input.readline()
				if line == '':
					break
		
		# Copy the XML header (everything before the first <page>) and, once
		# the target page has been reached, everything else
		if not firstPageFound or startWriting:
			print line.strip('\n')
		
		line = input.readline()
				
	error.close()
	return 0

if __name__ == '__main__':
	
    # Adapt stdout to Unicode UTF-8
    sys.stdout = codecs.EncodedFile(sys.stdout, 'utf-8')
    
    input = sys.stdin
    
    main(input)
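
To check the behaviour on a toy input you can import main() and feed it a small in-memory dump. The following is only a sketch: the XML fragment and the page ids are invented, rxr.py is assumed to be importable from the current directory, and firstPageId is left at its default value of '12345'.

from StringIO import StringIO
import rxr

# Invented miniature dump: a header, one page to skip, and the target page.
sample = (
    '<mediawiki>\n'
    '  <siteinfo><sitename>Test</sitename></siteinfo>\n'
    '  <page>\n'
    '    <title>Skipped</title>\n'
    '    <id>1</id>\n'
    '  </page>\n'
    '  <page>\n'
    '    <title>Target</title>\n'
    '    <id>12345</id>\n'
    '  </page>\n'
    '</mediawiki>\n'
)

# The header is copied, the page with id 1 is skipped, and everything from
# the page with id 12345 onwards is written to standard output.
rxr.main(StringIO(sample))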