User:Jimbojw/Wiki2PDF

This article describes the Wiki2PDF extension - a MediaWiki extension for exporting PDF documents.

Overview

This extension does two things:

  1. It adds a 'pdf' link to the content actions bar (next to 'history', 'edit', etc.).
  2. It hooks the 'UnknownAction' extension point to implement 'action=pdf'.

Other Similar Projects

I had hoped to be able to use an existing MediaWiki to PDF extension, but I was unable to find one that met my needs. Here are some others that I found:

  • HTML2FPDF and Mediawiki - The initial inspiration for this extension. Requires MediaWiki hacking (not a true extension), and doesn't address a number of quirks in HTML2FPDF. (see #Quirks)
  • MediaWiki PDF export and PDF Export - Has an extension-style version; however, it requires HTMLDOC, which (while open source) is not available from its creators as a free Windows binary. Also, since it is not native PHP, your web server must have rights to execute third-party executables.
  • wiki2PDF - A standalone PHP/Python project to rip external wiki pages into PDF. Project appears to be defunct.
    ...If you know of others, please tell me about them on my talk page.

Quirks

There are some very serious quirks you should be aware of when choosing to use this extension. Most stem from the HTML2FPDF library, which is its prerequisite. They include:

  1. Inline CSS style attributes are the only styles even partially supported.
  2. Many inline styles are not respected at all.
  3. Tables inside <li> items render before the item which is supposed to contain them.
  4. First level <ol> and <ul> lists both use bullets (no numbering).
  5. Any <span> tag inside an <li> will force the parent to 'display:inline'. Fixed by converting all spans to <font> tags.
  6. Lists (<ol> and <ul>) inside tables render as though they had display:inline and list-style-type:none.
  7. Links with HREFs without a single '.' character are modified during PDF creation. The resulting links still have link style (appear in blue with underline on hover), but lack any target. For example, the links http://localhost or http://a/b/c/ wouldn't work while http://a.com/, http://a/./c or http://./ would. I have tracked this down to a design bug in HTML2FPDF. On line 1927 of the revised version, we see
    if (strpos($vetor[1],".") === false) //assuming every external link has a dot indicating extension (e.g: .html .txt .zip www.somewhere.com
    Since this is a false assumption, this line should be changed to:
    if (!preg_match('/^(https?|ftp|file):\\/\\//',$vetor[1])) //assuming every external link begins with a protocol
    Though the above works experimentally, there may be better ways to test for external links. Whether this bug can be fixed at the source, or whether it can be easily overridden has yet to be determined.
  8. PNG images with Alpha channels choke the FPDF engine. It is not clear at this time whether this can be corrected at the source or easily fixed via extending the FPDF/HTML2FPDF classes.
    ...More to come as I discover them, or as they are discussed on my talk page.
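
The difference between the two tests in quirk 7 can be seen in a small standalone sketch. The function names below are hypothetical and not part of HTML2FPDF; each one wraps one of the two conditions and returns true when the href fails that line's test for an external link.

```php
<?php
// Hypothetical wrappers around the two conditions discussed in quirk 7.
// They are not part of HTML2FPDF; they only isolate the tests for comparison.

// Original test: an href without any '.' is assumed not to be external.
function failsDotTest($href) {
    return strpos($href, '.') === false;
}

// Proposed test: an href without a leading protocol is assumed not to be external.
function failsProtocolTest($href) {
    return !preg_match('/^(https?|ftp|file):\\/\\//', $href);
}

$hrefs = array('http://localhost', 'http://a/b/c/', 'http://a.com/', 'http://a/./c', '#Quirks');
foreach ($hrefs as $href) {
    printf("%-18s dot test fails: %-3s  protocol test fails: %s\n",
        $href,
        failsDotTest($href) ? 'yes' : 'no',
        failsProtocolTest($href) ? 'yes' : 'no');
}
```

With the dot test, http://localhost and http://a/b/c/ fail even though they are external links. With the protocol test, only the in-page anchor #Quirks fails.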

Future Enhancements

As time permits, I hope to implement the following:

  1. Revision History section - Either at the bottom of the document or immediately following the TOC. Could be set to ignore minor edits or restrict to only the latest X revisions.
  2. Linking TOC - Have TOC entries link internally to the headings in the document.
  3. More elaborate footer - Expand on the footer to have more information, like the normal MW footer (times visited etc).
  4. Use Objectcache to store PDFs rather than generating them every time.
  5. Integrate site Copyright info or separate PDF Copyright if necessary.
  6. Utilize CSS - Have HTML2FPDF read an article called "MediaWiki:Wiki2PDF.css" and apply styles.
  7. ObjectOrientify - Leverage OO concepts to make extensibility cleaner.
    ...More to come as I discover them or as they are suggested on my talk page.
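
For item 4, one possible shape of the caching logic is sketched below. It is only an illustration: ArrayCache is a hypothetical stand-in for MediaWiki's object cache, and the key format, getCachedPdf(), and the generator callback are assumptions rather than part of the extension. Keying on the revision ID means that an edit to the page produces a new key, so a stale PDF is never served.

```php
<?php
// Hypothetical in-memory stand-in for a real object cache.
class ArrayCache {
    private $data = array();
    public function get($key) {
        return isset($this->data[$key]) ? $this->data[$key] : false;
    }
    public function set($key, $value) {
        $this->data[$key] = $value;
    }
}

// Returns the cached PDF bytes for a page revision, generating them on a miss.
// $generate is any callable that builds the PDF and returns it as a string.
function getCachedPdf($cache, $title, $revId, $generate) {
    $key = 'wiki2pdf:' . md5($title) . ':' . $revId;
    $pdf = $cache->get($key);
    if ($pdf === false) {
        $pdf = call_user_func($generate);
        $cache->set($key, $pdf);
    }
    return $pdf;
}

// Usage: the second request for the same revision does not regenerate.
$calls = 0;
$generate = function () use (&$calls) {
    $calls++;
    return '%PDF-placeholder';
};
$cache = new ArrayCache();
getCachedPdf($cache, 'Main Page', 42, $generate);
getCachedPdf($cache, 'Main Page', 42, $generate);
getCachedPdf($cache, 'Main Page', 43, $generate); // new revision, new entry
echo $calls, "\n";
```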

wiki2pdf.php

<?php
/*
 * wiki2pdf.php - A MediaWiki extension for adding PDF Exportability.
 * @author Jim R. Wilson
 * @copyright Copyright (C) 2006 Jim R. Wilson
 * @license http://www.opensource.org/licenses/mit-license.php MIT License
 * -----------------------------------------------------------------------
 * Description:
 *     This is a MediaWiki (http://www.mediawiki.org/) extension script 
 *     which adds support for exporting pages as PDF documents.
 *     It relies on html2fpdf (http://sourceforge.net/projects/html2fpdf/).
 * Requirements:
 *     1. This extension is designed to work with MediaWiki 1.7.1 and higher.
 *     2. This extension leverages functionality of the HTML2FPDF library.
 * Installation:
 *     1. Create a directory in $IP/extensions called "html2fpdf"
 *         Note: $IP is your MediaWiki install dir.
 *     2. Extract the contents of the html2fpdf library to this directory.
 *     3. Drop this script (wiki2pdf.php) in $IP/extensions
 *     4. Enable the extension by adding this line to LocalSettings.php:
 *            require_once('extensions/wiki2pdf.php');
 * Usage:
 *     Once installed, this extension will add a 'pdf' link to the link-bar
 *     of every page.  Clicking this link will cause the PDF version of the
 *     page to be created and returned. 
 * -----------------------------------------------------------------------
 * Copyright (c) 2006 Jim R. Wilson
 * 
 * Permission is hereby granted, free of charge, to any person obtaining a copy 
 * of this software and associated documentation files (the "Software"), to deal 
 * in the Software without restriction, including without limitation the rights to 
 * use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 
 * the Software, and to permit persons to whom the Software is furnished to do 
 * so, subject to the following conditions:
 * 
 * The above copyright notice and this permission notice shall be included in all 
 * copies or substantial portions of the Software.
 * 
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 
 * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES 
 * OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND 
 * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT 
 * HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, 
 * WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 
 * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR 
 * OTHER DEALINGS IN THE SOFTWARE. 
 * -----------------------------------------------------------------------
 */

# Confirm MW environment
if (!defined('MEDIAWIKI')) die();

# Bring in the HTML2FPDF library
require_once('html2fpdf/fpdf.php');
require_once('html2fpdf/htmltoolkit.php');
require_once('html2fpdf/html2fpdf.php');

# Attach Hooks
$wgHooks['UnknownAction'][] = 'wfWiki2PDFAction';
$wgHooks['SkinTemplateContentActions'][] = 'wgAddWiki2PDFContentAction';

/**
 * Injects handling of the 'pdf' and 'pdf-test' actions.
 * Usage: $wgHooks['UnknownAction'][] = 'wfWiki2PDFAction';
 * @param $action Handle to an action string (presumably same as global $action).
 * @param $article Article to be converted to PDF  (presumably same as $wgArticle).
 */
function wfWiki2PDFAction($action, $article) {
    global $wgOut, $wgOutputEncoding, $wgScript;
    if( in_array($action, array('pdf','pdf-test')) ) {
        $header = '<h1>'.$article->getTitle()->getPrefixedText().'</h1>';
        $content = $wgOut->parse($article->getContent());
        if ( preg_match('/(.*)(<table[^\\>]*id=["\']toc["\'].*?<\\/table>)(.*)/ms', $content, $matches)) {
            list($pretoc, $origtoc, $posttoc) = array_slice($matches, 1);
            preg_match_all(
                '/<li class=["\']toclevel-(\\d+)["\']>.*?'.
                '<span class=["\']tocnumber["\']>\\s*(\\d[\\.\\d]*)\\s*<\\/span>\\s*'.
                '<span class=["\']toctext["\']>(.*?)<\\/span>/ms',
                $origtoc, $matches, PREG_SET_ORDER);
            $toc = "<h2>Table of Contents</h2>\n<hr /><table>\n";
            foreach ($matches as $match) {
                $d1 = ($match[1]==1 ? '<b>': ($match[1]>2 ? '<i>': ''));
                $d2 = ($match[1]==1 ? '</b>': ($match[1]>2 ? '</i>': ''));
                $toc .= '<tr><td>'.$d1.$match[2].$d2.'</td><td>'.$d1.$match[3].$d2.'</td></tr>'."\n";
            }
            $toc .= '</table><hr />';
        } else {
            $pretoc = '';
            $toc = '';
            $posttoc = $content;
        }
        $r = array(
            '/<big(\\s+[^>]*)?>/mis' => '',
            '/<\\/big>/mis' => '',
            '/<small(\\s+[^>]*)?>/mis' => '',
            '/<\\/small>/mis' => '',
            '/<span(\\s+[^>]*)?>/mis' => '<font${1}>',
            '/<\\/span>/mis' => '</font>',
            '/<dl(\\s+[^>]*)?>/mis' => '<ul${1} style="list-style-type:none">',
            '/<\\/dl>/mis' => '</ul>',
            '/<dd(\\s+[^>]*)?>/mis' => '<li${1}>',
            '/<\\/dd>/mis' => '</li>',
            '/<script(\\s+[^>]*)?>.*?<\\/script>/mis' => '',
            '/<div class="editsection".*?>/mis' => '<div style="display:none">',
            '/<table(.*?)>/mis' => '<table border="1"${1}>',
        );
        $pretoc = preg_replace(array_keys($r), array_values($r), $pretoc);
        $posttoc = preg_replace(array_keys($r), array_values($r), $posttoc);
        $footer = $article->getTitle()->getFullURL();
        $footer = "<p><b>Note:</b> The original content for this document was retrieved from:<br />\n".
                  '<a href="'.$footer.'">'.$footer.'</a></p>';

        # Run any hooks - give consumers a chance to fight back
        wfRunHooks('Wiki2PDF', array($article, &$header, &$pretoc, &$toc, &$posttoc, &$footer));
        
        # Create the PDF Document or output debugging info (depending on 'pdf' or 'pdf-test' respectively)
        $pdf = new HTML2FPDF();
        $pdf->AddPage();
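        # $_SERVER['HTTPS'] may be unset, or set to 'off' on some servers
        $pdf->urlbasepath = 'http'.((!empty($_SERVER['HTTPS']) && $_SERVER['HTTPS']!=='off')?'s':'').'://'.$_SERVER['SERVER_NAME'].'/';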
        $pdf->setBasePath("./../wi");
        $pdf->WriteHTML($header.$pretoc.$toc.$posttoc.$footer);
        if ($action=='pdf') {
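            # Disable normal MediaWiki output and set the content type before
            # the PDF is streamed; headers cannot be sent once output has begun.
            $wgOut->disable();
            header("Content-type: application/pdf");
            $pdf->Output('doc.pdf','I');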
        } else {
            $wgOut->setPageTitle(
                $article->getTitle()->getPrefixedText().
                ' - PDF Rendering Test' );
            $wgOut->addWikiText(";Header\n <pre>".htmlspecialchars($header)."</"."pre>\n");
            $wgOut->addWikiText(";PreTOC\n <pre>".htmlspecialchars($pretoc)."</"."pre>\n");
            $wgOut->addWikiText(";TOC\n <pre>".htmlspecialchars($toc)."</"."pre>\n");
            $wgOut->addWikiText(";PostTOC\n <pre>".htmlspecialchars($posttoc)."</"."pre>\n");
            $wgOut->addWikiText(";Footer\n <pre>".htmlspecialchars($footer)."</"."pre>\n");
            ob_start();
            $pdf->Output('doc.pdf','I');
            $c = ob_get_contents();
            ob_end_clean();
            $wgOut->addWikiText(";PDF\n <pre>".htmlspecialchars($c)."</"."pre>\n");
        }
        return false;
    }
    return true;
}

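/**
 * Adds the 'pdf' link to the content actions bar.
 * Usage: $wgHooks['SkinTemplateContentActions'][] = 'wgAddWiki2PDFContentAction';
 * @param $content_actions Array of content actions, modified by reference.
 */
function wgAddWiki2PDFContentAction(&$content_actions) {
    global $action, $wgTitle;
    $text = wfMsg('pdf');
    $text = (wfEmptyMsg('pdf',$text)?'pdf':$text);
    $content_actions['pdf'] = array(
        'class' => ($action=='pdf'||$action=='pdf-test') ? 'selected' : false,
        'text' => $text,
        'href' => $wgTitle->getLocalUrl( 'action=pdf' )
        );
    # Return true so that other handlers on this hook still run
    return true;
}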

?>
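
The 'Wiki2PDF' hook run inside wfWiki2PDFAction() lets other code adjust the pieces of the document before they are rendered. A minimal sketch of a consumer follows; the function name wfExamplePdfFooterNote and the added text are hypothetical, and the parameter list simply mirrors the array passed to wfRunHooks() in the script.

```php
<?php
// Hypothetical handler for the 'Wiki2PDF' hook; name and text are illustrative.
// The string parameters arrive by reference, so changes here reach the PDF.
function wfExamplePdfFooterNote($article, &$header, &$pretoc, &$toc, &$posttoc, &$footer) {
    // Append a note to the footer and leave the other pieces untouched.
    $footer .= '<p><i>Exported with Wiki2PDF.</i></p>';
    // Return true so that any other 'Wiki2PDF' handlers still run.
    return true;
}

// Register the handler, e.g. in LocalSettings.php after wiki2pdf.php is loaded.
$wgHooks['Wiki2PDF'][] = 'wfExamplePdfFooterNote';
```

Because the handler only appends to $footer, it should coexist with other handlers that modify the header or TOC.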