User:Tbayer (WMF)/Converting Google Docs to wikitext

From Meta, a Wikimedia project coordination wiki
  1. Export the document from Google Drive as ODT (File -> Download as -> OpenDocument Format (.odt))
  2. Install (LibreOffice and) the "Wiki Publisher" extension for LibreOffice (available e.g. in the Ubuntu software store)
  3. Export the document from LibreOffice as MediaWiki wikitext (File -> Export -> File type: MediaWiki)
  4. The resulting wikitext file should have most of the formatting preserved, even tables. But there is an annoying bug/feature: a link that looks like "look like this" in GDocs is split into several adjacent links to the same URL in the exported wikitext, which render as "[1]looklikethis". To fix these, and also remove extraneous blank lines, run this little script (Python needs to be installed):
python gdocodtwikimultilfix.py gdocodtwiki.txt gdocodtwiki_fixed.txt

where gdocodtwikimultilfix.py is the following (save it locally as a text file in the directory where you want to do the conversion):

#!/usr/bin/python
# Short script to take wikitext generated by the "Wiki Publisher" extension
# for LibreOffice from an ODT file exported from Google Docs
# and fix duplicated external links, as well as remove extraneous blank lines
# By T. Bayer ([[user:HaeB]])

import sys
import re
import codecs


class gdocodtmultilfixerror(Exception):
    def __init__(self, value):
        self.value = value
    def __str__(self):
        return repr(self.value)

if len(sys.argv) != 3:
    raise gdocodtmultilfixerror('needs exactly two command line arguments:  1. input file (non-fixed wikitext, output of Wiki Publisher) 2. output file (fixed wikitext)')

urlpattern = r'https?://[^\ ]*'

inputfilename = sys.argv[1]
outputfilename = sys.argv[2]

inputfile = codecs.open(inputfilename, mode='r', encoding='utf-8')
outputfile = codecs.open(outputfilename, mode='w', encoding='utf-8')


precedingline = '\n'

for line in inputfile:

    m = line
  
    urls = set(re.findall(urlpattern, m))

    for url in urls:
        urle = re.escape(url)
        # Somehow, the space before an external link gets moved into an empty
        # link during the export process. Move it back out
        # ('x[http://www.example.com  ][http://www.example.com' --> 'x [http://www.example.com'):
        old = r'([^\]])\[' + urle + r'  \]\[' + urle
        new = r'\1 [' + url
        m = re.sub(old, new, m)
        # Collapse duplicated links
        # ('[URL text1][URL text2]' --> '[URL text1text2]'):
        old = r'\[' + urle + r'( [^\]]*)\]\[' + urle + ' '
        new = '[' + url + r'\1'
        while re.search(old, m):
            m = re.sub(old, new, m)

    # Collapse multiple blank lines to one:
    if not (precedingline == '\n' and m == '\n'):
        outputfile.write(m)

    precedingline = m

    
inputfile.close()
outputfile.close()
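
To see what the two substitutions do, the per-line logic of the script can be pulled into a function and run on a sample line. This is only a minimal sketch, assuming Python 3. The names fix_line and URL_PATTERN and the sample wikitext are made up for illustration and are not part of the script above. It also passes a function rather than a string as the replacement to re.sub, so that a backslash in a URL would not be interpreted as a group reference.

```python
import re

# Same URL pattern as in the script: everything up to the next space.
URL_PATTERN = r'https?://[^\ ]*'


def fix_line(line):
    """Apply the script's two link fixes to a single line of wikitext."""
    for url in set(re.findall(URL_PATTERN, line)):
        urle = re.escape(url)
        # 'x[URL  ][URL' -> 'x [URL': move the space back out of the empty link.
        empty = r'([^\]])\[' + urle + r'  \]\[' + urle
        line = re.sub(empty, lambda mo: mo.group(1) + ' [' + url, line)
        # '[URL a][URL b' -> '[URL ab': merge adjacent links to the same URL,
        # repeating until no duplicate is left.
        dup = r'\[' + urle + r'( [^\]]*)\]\[' + urle + ' '
        while re.search(dup, line):
            line = re.sub(dup, lambda mo: '[' + url + mo.group(1), line)
    return line


# Hypothetical sample of what the exported wikitext might look like.
sample = ('See[http://www.example.com  ][http://www.example.com look ]'
          '[http://www.example.com like ][http://www.example.com this] for details.\n')

print(fix_line(sample), end='')
# See [http://www.example.com look like this] for details.
```

Note that the link text fragments are joined exactly as they are, so spaces between words survive only if the fragments themselves contain them.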

There may still be other formatting errors (e.g. bolded text that is not bolded in the original, or vice versa), but for longer documents this solution can save a lot of time compared to manual conversion.

One may consider turning off "smart quotes" in Google Docs ("Tools" -> "Preferences" -> uncheck "Use smart quotes").
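
If the document was already written with smart quotes turned on, the curly quotes could instead be replaced after the export. The following is a rough sketch of that idea, assuming Python 3; the function name straighten_quotes and the choice of the four characters to replace are assumptions for illustration, not part of the script above.

```python
# Typographic quotation marks and the ASCII characters to use instead.
SMART_QUOTES = {
    '\u2018': "'",   # left single quotation mark
    '\u2019': "'",   # right single quotation mark (also used as apostrophe)
    '\u201c': '"',   # left double quotation mark
    '\u201d': '"',   # right double quotation mark
}

# str.maketrans accepts a dict mapping single characters to replacements.
_TABLE = str.maketrans(SMART_QUOTES)


def straighten_quotes(text):
    """Return text with curly quotes replaced by straight ones."""
    return text.translate(_TABLE)


print(straighten_quotes('\u201cIt\u2019s done\u201d'))
# "It's done"
```

This could be applied to each line before it is written to the output file, or run separately over the fixed wikitext.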
