Wikimedia Blog/Converting wiki pages to blog posts

How to convert a wiki page on a Wikimedia wiki into a WordPress posting on the Wikimedia blog

This is now also available as an online tool on Tool Labs: https://tools.wmflabs.org/blogconverter/

Run the Python cleanup script below as follows, replacing "https://meta.wikimedia.org/wiki/Main_Page" with the URL of the wiki page you want to convert:

python blogfix.py https://meta.wikimedia.org/wiki/Main_Page wikipageclean.html

or

./blogfix.py https://meta.wikimedia.org/wiki/Main_Page wikipageclean.html

Then paste the contents of "wikipageclean.html" into WordPress.

If you already know the URL of the resulting blog post ("https://blog.wikimedia.org/.../"), you can optionally pass it as a third parameter. The script will then automatically fix internal anchor links (such as those from a table of contents to the corresponding sections, or between footnotes and the main text).
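As a rough illustration of what the third parameter does, the following sketch rewrites relative anchor links into absolute ones. It is written for Python 3 and is not part of blogfix.py: the helper name absolutize_anchors and the example post URL are made up for this illustration, and appending a missing trailing slash is an assumption that goes beyond what the script itself does.

```python
import re

def absolutize_anchors(html, targeturl):
    """Turn href="#..." into href="<targeturl>#..." (hypothetical helper)."""
    # Assumption: the blog post URL is meant to end in "/"; add it if missing.
    if targeturl and not targeturl.endswith('/'):
        targeturl += '/'
    # A function as replacement keeps backslashes in targeturl from being
    # interpreted as group references by re.sub.
    return re.sub(r'href="#', lambda match: 'href="' + targeturl + '#', html)

# Hypothetical footnote link and post URL, for illustration only
line = '<a href="#cite_note-1">[1]</a>'
print(absolutize_anchors(line, 'https://blog.wikimedia.org/2012/09/27/example-post'))
```

With an empty target URL the input is returned unchanged, which mirrors the case where anchor links have to be fixed by hand.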

blogfix.py is the following script, written for Python 2 (it uses urllib2). Save it locally as a text file in the directory where you want to do the conversion, and then make it executable:

#!/usr/bin/python

# Short script to take the HTML from a Wikimedia wiki and turn it into HTML suitable for posting to the Wikimedia blog
# Copyright 2011-2017 by T. Bayer ([[user:HaeB]])
# Rewritten from the original script by [[user:RobLa]], modified by [[user:guillom]]: https://www.mediawiki.org/w/index.php?title=Wikimedia_Engineering/Report&oldid=414389#Porting_the_report_to_WordPress
# Go to https://meta.wikimedia.org/wiki/Wikimedia_Blog/Converting_wiki_pages_to_blog_posts for the current version and some documentation

    # This program is free software; you can redistribute it and/or
    # modify it under the terms of the GNU General Public License
    # as published by the Free Software Foundation; either version 2
    # of the License, or (at your option) any later version.
    #
    # This program is distributed in the hope that it will be useful,
    # but WITHOUT ANY WARRANTY; without even the implied warranty of
    # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    # GNU General Public License for more details.
    # 
    # You should have received a copy of the GNU General Public License
    # along with this program; if not, write to the Free Software
    # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.

import os
import sys
import re
import urllib2
import codecs


usageexplanation = 'usage: blogfix.py sourceurl outputfile [targeturl]\n\
 * \"sourceurl\" denotes a page on a MediaWiki wiki such as meta.wikimedia.org\n\
 * \"outputfile\" is the name of the output file (which will contain a "clean" version of the page\n\
   suitable for (e.g.) usage on a Wordpress blog such as blog.wikimedia.org)\n\
 * "targeturl" (optional) denotes the page (e.g. blog post) where the output will live\n\
   (for converting internal, relative anchor links into absolute anchor links)'
# TODO: nicer formatting


class blogfixerror(Exception):
    def __init__(self, value):
        self.value = value
    def __str__(self):
        return repr(self.value)

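# Both sourceurl and outputfile are required (sys.argv[0] is the script name); targeturl is optional
if len(sys.argv) < 3 or len(sys.argv) > 4: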
    raise blogfixerror(usageexplanation)

sourceurl = sys.argv[1]

sourceurlpattern = r'https?://([^\/]*)/wiki/(.*)'

if not re.match(sourceurlpattern, sourceurl):
    raise blogfixerror('Wrong URL format: "'+sourceurl+'"\n'+usageexplanation)

wikidomain = re.match(sourceurlpattern, sourceurl).group(1)
# wiki where the source page is taken from,
# e.g. wikidomain = 'meta.wikimedia.org' or wikidomain = 'www.mediawiki.org'


sourcepagename = re.match(sourceurlpattern, sourceurl).group(2)
# wiki page to be converted 
# e.g. sourcepagename = 'Research:Newsletter/2012/September'


outputfilename = sys.argv[2]


if len(sys.argv) == 4:
    targeturl = sys.argv[3]
else:
    targeturl = ''
# URL where the converted page will live
# e.g. targeturl = 'https://blog.wikimedia.org/2012/09/27/wikimedia-research-newsletter-september-2012/'
# if left empty, anchor links to sections or footnotes within the page may need to be fixed by hand

nolocalimages = True
# Set nolocalimages = True if all images are from Wikimedia Commons
# If nolocalimages = False, images will link to local file description
# pages instead of Commons


User_agent = 'blogfix.py'
headers = { 'User-Agent' : User_agent }

renderedsourceurl = 'https://'+wikidomain+'/w/index.php?title='+sourcepagename+'&action=render'
# cf. https://www.mediawiki.org/wiki/Manual:Parameters_to_index.php#render
# (alternative?: urlresponse = urllib2.urlopen('https://'+wikidomain+'/w/api.php?action=parse&page='+sourcepagename)
# cf. https://www.mediawiki.org/wiki/API:Parse )

req = urllib2.Request(renderedsourceurl, None, headers)
urlresponse = urllib2.urlopen(req)
sourcehtml = urlresponse.read()
f = unicode(sourcehtml, 'utf-8').splitlines()


outputfile = codecs.open(outputfilename, mode='w', encoding='utf-8')

outputfile.write('<meta charset="UTF-8" /><!--for testing this output file standalone in the browser - this line should be removed when importing into Wordpress-->\n')

relanchors = 0


for line in f:

    m = line.strip()
    
    # remove edit section links:
    # This relies on the presence of the "mw-editsection-bracket" span
    # (activated with https://gerrit.wikimedia.org/r/#/c/64365/ in summer 2013)
    # See previous versions of this script for more complicated code that
    # handles section edit links without this span
    # See also https://meta.wikimedia.org/wiki/Change_to_section_edit_links#Technical_information (April 2013)

    old = r'<span class="mw-editsection">.*<span class="mw-editsection-bracket">]</span></span>'
    new = ''
    m = re.sub(old, new, m)


    # Specifically for the WMF Engineering reports:
    old = r'<span class="plainlinks noprint mw-statushelper-editlink" [^\>]*>\[<a class="external text" href="[^\"]*">edit</a>]</span>'
    new = ''
    m = re.sub(old, new, m)

    
    # Make internal links (relative to the domain) into external (absolute) links:
    old = r'href="/wiki'
    new = r'href="https://'+wikidomain+'/wiki'
    m = re.sub(old, new, m)
    old = r'( rel="nofollow" class="external text"| class="external text" rel="nofollow"| rel="nofollow" class="external autonumber"| rel="nofollow" class="external free"| class="external free" rel="nofollow"| class="mw-headline"| class="extiw")'
    new = r''
    m = re.sub(old, new, m)
    old = r'( class="external text"| class="external free"| class="external autonumber")' # matches external links to Wikimedia sites without Nofollow
    new = r''
    m = re.sub(old, new, m)
    
    # thumbnail layout (originally from http://bits.wikimedia.org/meta.wikimedia.org/load.php?debug=false&lang=en&modules=ext.wikihiero|mediawiki.legacy.commonPrint%2Cshared|skins.vector&only=styles&skin=vector&* , 
    # borders removed to match the blog's styles )
    old = r'class="center"'
    new = r'style="text-align:center;"'
    m = re.sub(old, new, m)
    old = r'class="thumb tright"'
    new = r'style="text-align:center;border:0 solid #ccc;margin:.5em 0 .8em 1.4em;float:right;clear:right;"'
    m = re.sub(old, new, m)
    old = r'class="thumb tleft"'
    new = r'style="text-align:center;border:0 solid #ccc;margin:0.5em 1.4em 0.8em 0;float:left;clear:left;"'
    m = re.sub(old, new, m)
    old = r'class="thumbinner" style="'
    new = r'style="padding:3px!important;border:0 solid #cccccc;text-align:center;overflow:hidden;font-size:94%;background-color:white;display:block;margin-left:auto;margin-right:auto;'
       # 'display: block; margin-left: auto; margin-right: auto;'  taken from aligncenter" in Wordpress CSS
    m = re.sub(old, new, m)
    # blog now has an "image" class too, which clashes with MediaWiki's:
    old = r'class="image"'
    new = r''
    m = re.sub(old, new, m)
    old = r'class="thumbimage"'
    new = r'style="border:1px solid #ccc;"'
    m = re.sub(old, new, m)
    old = r'class="thumbcaption"'
    new = r'style="border:none;text-align:left;line-height:1.4em;padding:3px !important;font-size:94%;"'
    m = re.sub(old, new, m)
    old = r'class="magnify"'
    new = r'style="float:right;border:none !important;background:none !important;"'
    m = re.sub(old, new, m)
    # replace unstable version of magnify icon from bits (URL contains {{STYLEPATH}} which depends on MW revision) with stable version from Commons
    old = r'src="(http)?s?:?//bits.wikimedia.org/static-([^\/]*)/skins/common/images/magnify-clip.png"'
    new = r'src="https://upload.wikimedia.org/wikipedia/commons/6/6b/Magnify-clip.png"'
    m = re.sub(old, new, m)
    # replace unstable version of (old) video player icon from bits (URL contains {{STYLEPATH}} which depends on MW revision) with stable version from Commons
    #  (<button>...</button> for player still needs to be fixed by hand, in case it's present in the HTML)
    old = r'src="(http)?s?:?//bits.wikimedia.org/static-([^\/]*)/extensions/OggHandler/play.png"'
    new = r'src="https://upload.wikimedia.org/wikipedia/commons/9/96/Crystal_Project_Player_play.png"'
    m = re.sub(old, new, m)
    # video player popup:
    # from https://bits.wikimedia.org/meta.wikimedia.org/load.php?debug=false&lang=en&modules=ext.gadget.CentralAuthInterlinkFixer%2Cteahouse%2Cwm-portal-preview%7Cext.uls.nojs%7Cext.wikihiero%7Cmediawiki.legacy.commonPrint%2Cshared%7Cmw.PopUpMediaTransform%7Cskins.vector&only=styles&skin=vector&*
    # todo: still need to make the actual (popup) lightbox work (involving the mediaContainer and kskin classes?), currently this just plays the video in new tab
    old = r'class="PopUpMediaTransform" style="'
    new = r'style="position:relative;display:inline-block;'
    m = re.sub(old, new, m)    
    # video player button, using stable icon from Commons (CSS contains an unstable icon URL that contains {{STYLEPATH}} which depends on MW revision)
    # from https://bits.wikimedia.org/meta.wikimedia.org/load.php?debug=false&lang=en&modules=ext.gadget.CentralAuthInterlinkFixer%2Cteahouse%2Cwm-portal-preview%7Cext.uls.nojs%7Cext.wikihiero%7Cmediawiki.legacy.commonPrint%2Cshared%7Cmw.PopUpMediaTransform%7Cskins.vector&only=styles&skin=vector&*
    old = r'<span class="play-btn-large">'
    new = r'<span style="position:absolute;top:50%;left:50%;width:70px;height:53px;margin-left:-35px;margin-top:-25px;background-image:url(); background: url(https://upload.wikimedia.org/wikipedia/commons/b/ba/Player_big_play_button.png); cursor: pointer; border: none !important; z-index: 1;">'
    m = re.sub(old, new, m)
    # gallery layout from https://bits.wikimedia.org/meta.wikimedia.org/load.php?debug=false&lang=en&modules=ext.gadget.wm-portal-preview%7Cext.wikihiero%7Cmediawiki.legacy.commonPrint%2Cshared%7Cskins.vector&only=styles&skin=vector&*
    old = r'class="gallery"'
    new = r'style="margin:2px;padding:2px;display:block;"'
    m = re.sub(old, new, m)
    old = r'class="gallerybox"'
    new = r'style="vertical-align:top;display:-moz-inline-box;display:inline-block;"'
    m = re.sub(old, new, m)
    old = r'class="gallerytext"'
    new = r'style="overflow:hidden;font-size:94%;padding:2px 4px;word-wrap:break-word;"'
    m = re.sub(old, new, m)
    old = r'class="thumb"'   # thumb within gallerybox (may not apply elsewhere)
    new = r'style="text-align:center;border:1px solid #ccc;margin:2px;"'
    m = re.sub(old, new, m)
    # from https://meta.wikimedia.org/w/load.php?debug=false&lang=en&modules=ext.cite.styles%7Cext.echo.badgeicons%7Cext.echo.styles.badge%7Cext.gadget.CentralAuthInterlinkFixer%2CCurIDLink%2CNavigation_popups%2CUTCLiveClock%2CaddMe%2CformWizard%2Cteahouse%2Cwm-portal-preview%7Cext.tmh.thumbnail.styles%7Cext.uls.nojs%7Cext.visualEditor.desktopArticleTarget.noscript%7Cext.wikimediaBadges%7Cmediawiki.legacy.commonPrint%2Cshared%7Cmediawiki.sectionAnchor%7Cmediawiki.skinning.interface%7Cskins.vector.styles%7Cwikibase.client.init&only=styles&skin=vector :
    old = r'class="floatright"'
    new = r'style="float:right;clear:right;position:relative;margin:0.5em 0 0.8em 1.4em;"'
    m = re.sub(old, new, m)
    old = r'class="floatleft"'
    new = r'style="float:left;clear:left;position:relative;margin:0.5em 1.4em 0.8em 0;"'
    m = re.sub(old, new, m)
        


    # <syntaxhighlight lang=html5> on mediawiki.org:
    old = r'class="sc2"' 
    new = r'style="color: #009900;"'
    m = re.sub(old, new, m)
    old = r'class="kw2"' 
    new = r'style="color: #000000; font-weight: bold;"'
    m = re.sub(old, new, m)
    old = r'class="kw3"' 
    new = r'style="color: #000066;"'
    m = re.sub(old, new, m)
    old = r'class="sy0"' 
    new = r'style="color: #66cc66;"'
    m = re.sub(old, new, m)
    old = r'class="st0"' 
    new = r'style="color: #ff0000;"'
    m = re.sub(old, new, m)
    
   
       
    # protocol-relative URLs
    old = r'href="//'
    new = r'href="https://'
    m = re.sub(old, new, m)
    old = r'src="//'
    new = r'src="https://'
    m = re.sub(old, new, m)
    # for srcset="...:
    old = r'"//upload.wikimedia.org/wikipedia/commons/'
    new = r'"https://upload.wikimedia.org/wikipedia/commons/'
    m = re.sub(old, new, m)
    old = r' //upload.wikimedia.org/wikipedia/commons/'
    new = r' https://upload.wikimedia.org/wikipedia/commons/'
    m = re.sub(old, new, m)
    

    if nolocalimages: # link directly to file description page on Commons, instead of local file description page
        old = r'://'+re.escape(wikidomain)+'/wiki/File:'   # escape the dots in the domain so they match literally
        new = r'://commons.wikimedia.org/wiki/File:'
        m = re.sub(old, new, m)

    
    # relative anchor links - to sections or footnotes within the page itself - wouldn't always work
    #   (e.g. if only an excerpt of the blog post is displayed on the main page of the blog, or on en.planet.wikimedia.org)
    # so replace them with absolute anchor links to the target page
    old = r'href="#'
    if re.search(old, m):
        relanchors = relanchors + 1 # (todo: if relanchors>0 in the end, should warn user later that there are anchor links that need to be replaced by hand)
        new = r'href="'+targeturl+'#'
        m = re.sub(old, new, m)
        
    # escape magic word that would generate an archive list of blog postings in Wordpress
    old = r'\[archives\]'
    new = r'[<!--  -->archives]'
    m = re.sub(old, new, m)
    
    # TODO: Strip HTML comments at the end of the file, starting from "NewPP limit report"
    
    
    outputfile.write(m+'\n')

outputfile.close()

(released under the GPL v2)
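The script as written relies on urllib2 and unicode, which exist only in Python 2. For readers who only have Python 3 available, the fetching step might look roughly like the sketch below. This is an untested adaptation rather than an official port; the function names render_url and fetch_rendered_lines are hypothetical, while the URL pattern and the action=render URL are taken from the script.

```python
import re
import urllib.request

SOURCE_URL_PATTERN = r'https?://([^/]*)/wiki/(.*)'

def render_url(sourceurl):
    """Build the action=render URL for a wiki page URL, as the script does."""
    match = re.match(SOURCE_URL_PATTERN, sourceurl)
    if not match:
        raise ValueError('Wrong URL format: "%s"' % sourceurl)
    wikidomain, pagename = match.group(1), match.group(2)
    return 'https://' + wikidomain + '/w/index.php?title=' + pagename + '&action=render'

def fetch_rendered_lines(sourceurl):
    """Download the rendered HTML and return it as a list of text lines."""
    request = urllib.request.Request(
        render_url(sourceurl), None, {'User-Agent': 'blogfix.py'})
    with urllib.request.urlopen(request) as response:
        # read() returns bytes in Python 3; decode replaces unicode(..., 'utf-8')
        return response.read().decode('utf-8').splitlines()

print(render_url('https://meta.wikimedia.org/wiki/Main_Page'))
```

Calling fetch_rendered_lines('https://meta.wikimedia.org/wiki/Main_Page') would then return the lines that the main loop of the script iterates over.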

Known issues

  • The script should in principle work for all public Wikimedia wikis, but has only been tested with Meta and MediaWiki.org.
  • The script assumes that all images are hosted on Commons and rewrites the links to image description pages to point to Commons instead of the local wiki. If this is not desired, set nolocalimages = False as described in the code. Depending on taste, one may also want to remove the "magnify" icon on thumbnails, or otherwise adjust the wiki layout for the blog. The URLs of the embedded thumbnails are assumed to be reasonably stable; some users nevertheless prefer to upload local copies of the thumbnails to the blog, which this script does not do.
  • Embedded videos play fine, opening in a new browser tab, but the lightbox popup of the current (November 2013) video player is not preserved for videos less than 800px wide. An alternative is to embed videos using iframes, see the instructions here (this makes for shorter, cleaner HTML, but does not allow selecting the thumbnail still with the thumbtime= parameter). The start=... / end=... parameters are not supported in either version.
  • Galleries appear (as of March 2015) to display correctly in Firefox but not in Chrome, probably due to a newer user agent stylesheet that is not taken into account yet.
  • The script adds a line at the beginning ("<meta charset="UTF-8" /> ...") which only serves to enable standalone testing of the file (locally in the browser). This line should be removed when importing the HTML into WordPress.
  • As of May 2015, the script leaves in the server-generated HTML comments at the end of the file, starting from "NewPP limit report". One may want to remove them manually for cosmetic reasons.
  • When using the optional third parameter for the URL where the post will eventually be published, do not leave out the trailing slash "/", or some browsers will not recognize the resulting anchor links (e.g. [1] vs. [2]).
  • If the name (URL) of the wiki page contains certain special characters (e.g. "!"), this may cause problems. Consider renaming (moving) the page.
  • Tables may lose their borders (this appears to be a general issue with the standard CSS in WordPress). They can be re-added using inline styles (e.g.: [3] / [4]). Ideally this should be automated by the script too.
  • If the wiki page was a joint work by several authors (check the edit history of the wiki page for the actual list), you may consider adding a footer which lists them, e.g. in order to honor the attribution requirements of a CC-BY-SA license. Example:
<hr /><em>This article was written by Wikimedia engineers &amp; managers. See <a title="revision history" href="http://www.mediawiki.org/w/index.php?title=Wikimedia_engineering_report/2011/April&action=history">full revision history</a>. A <a href="http://www.mediawiki.org/wiki/Wikimedia_engineering_report/2011/April" title="report on mediawiki.org">wiki version</a> is also available.</em>
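Two of the manual cleanup steps listed above (removing the standalone-testing <meta> line and the trailing "NewPP limit report" comments) could be automated along the following lines. This is only a sketch in Python 3, separate from blogfix.py; the function name clean_for_wordpress is hypothetical, and it assumes that everything worth keeping after the "NewPP limit report" comment is ordinary markup rather than further HTML comments.

```python
import re

def clean_for_wordpress(html):
    """Hypothetical post-processing of blogfix.py output before pasting."""
    # Drop the first line if it is the <meta> line added for standalone testing.
    first, _, rest = html.partition('\n')
    if first.startswith('<meta charset="UTF-8" />'):
        html = rest
    # Remove all HTML comments from the "NewPP limit report" comment onwards,
    # keeping any non-comment markup (such as a closing </div>) that follows.
    marker = html.find('NewPP limit report')
    if marker != -1:
        start = html.rfind('<!--', 0, marker)
        if start != -1:
            tail = re.sub(r'<!--.*?-->', '', html[start:], flags=re.DOTALL)
            html = html[:start] + tail
    return html
```

It could be applied to the output file by reading "wikipageclean.html" with open(..., encoding='utf-8'), passing the text through the function, and pasting the result into WordPress.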

See also