PDF Export

wiki2pdf

From the first wiki I began hosting for folks in my company, I received requests for a way to combine articles into a single PDF for printing. It turned out to be a lot easier than I thought it would be...

You'll need to have HTMLDOC installed. I use MandrakeLinux and installed it from the latest RPM. This open source utility can take multiple HTML pages and turn them into a PDF and is the key to making this work!
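
If you want a quick way to confirm PHP can actually see the binary before wiring everything up, a check along these lines works (a minimal sketch; /usr/bin/htmldoc matches the path used in the code further down, so adjust it if yours differs):

// Sketch: confirm the htmldoc binary that the script will call is present and executable.
$htmldoc = '/usr/bin/htmldoc';
if (!is_executable($htmldoc)) {
    die("htmldoc not found at $htmldoc - install it before continuing");
}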

Start off by making a copy of index.php and call it PrintArticles.php. Near the bottom (between the big switch/case and $wgOut->output();), add the script below. The script looks for special coding on the page it is viewing (which it renders the same way index.php does).

Also, I first tested this in /images and then created a folder called /printouts, giving it the same privileges as /images as well as the same .htaccess file.
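
If you would rather script that one-time setup than do it by hand, a rough sketch (assuming that mirroring /images' permission bits and copying its .htaccess is all you need) could look like this:

// Sketch: create /printouts with the same permissions and .htaccess as /images.
$images    = $_SERVER["DOCUMENT_ROOT"] . '/images';
$printouts = $_SERVER["DOCUMENT_ROOT"] . '/printouts';
if (!is_dir($printouts)) {
    mkdir($printouts, fileperms($images) & 0777);            // mirror the permission bits of /images
    copy($images . '/.htaccess', $printouts . '/.htaccess'); // reuse the same .htaccess
}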

The coding is very simple and works like this: (we'll call this page "Test Print")

  • Put articles that will appear in sequence in curly braces:
{Help-style indexing}
{Email Digest}
  • and put articles to combine into a single set of curly braces separated by the | pipe symbol.
{Word doc search | PDF doc search | Excel doc search}
  • These articles will not have a page break in the PDF file. This is really useful for articles that are short and related, such as a function list.
  • Then add a link to this page, but using the new PrintArticles.php file:
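
For example, assuming PrintArticles.php sits next to index.php in your web root and the page is called Test Print, an external-style link of this shape should do (an illustrative guess at the URL - adjust the host and path to your own wiki):

  [http://yoursite/PrintArticles.php?title=Test_Print Print these articles as a PDF]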

Now when a user browses to this page on your site and clicks the above link, the page is re-rendered through the PrintArticles.php file instead of index.php. The page will be changed from looking like this:

{Help-style indexing} {Email Digest} {Word doc search | PDF doc search | Excel doc search}

to this:

{Help-style indexing} {Email Digest} {Word doc search | PDF doc search | Excel doc search}

Creating: (Help-style indexing) Creating: (Email Digest) Creating: (Word doc search) Creating: (PDF doc search) Creating: (Excel doc search)

A couple of important notes about the following code:

  • Hard-coded site URLs... yes yes, I'm still lazy...
  • Hard-coded /tmp folder for temporary files...
    • And I'm not deleting the /tmp files either...
  • I'm removing certain links and such - kinda clunky, and it might not work with all skins. I'm doing it to make the resulting PDF look cleaner (for example, by removing the edit links). It's okay if they are left in - except that if you do leave all img tags in, you need to make sure to give read rights on the templates, skins, and all image folders to other users - otherwise HTMLDOC can't import some of them.
  • Funky caching issues
  • Curly quotes print out as mojibake: an a with an accent, a euro symbol, followed by the trademark symbol (â€™) - a sign that HTMLDOC isn't handling the UTF-8 text.
  • To take UTF-8 characters into account, run the 'iconv' tool on each temporary file inside the loop, for example:
 exec("iconv --from-code=UTF8 --to-code=ISO_8859-1 -o /tmp/toto_" . str_replace(" ","_",$art) . " /tmp/" . str_replace("'","_",str_replace(" ","_",$art)) . ".htm");
 exec("cp /tmp/toto_" . str_replace(" ","_",$art) . " /tmp/" . str_replace("'","_",str_replace(" ","_",$art)) . ".htm");
  • Images inserted in the article can be handled with something like this (replacing the corresponding line in the code in the next section):
$NewBodyText .= "<h1>" . $art . "</h1><hr>" . str_replace('href="/wiki', 'href="http://' . $_SERVER["SERVER_NAME"] . '/wiki', 
                    str_replace('src="/wiki', 'src="http://' . $_SERVER["SERVER_NAME"] . '/wiki',
                    str_replace('<a href="/index.php', '<a href="http://' . $_SERVER["SERVER_NAME"] . '/index.php',
                    str_replace('<img src="/images/thumb',
                    '<img src="' . $_SERVER["DOCUMENT_ROOT"] . '/images/thumb', $bodyText))));

Here's the code that does all the dirty work:

$PDFFile = $_SERVER["DOCUMENT_ROOT"] . '/printouts/' . str_replace("'","_",
       str_replace(" ","_",$wgTitle->getText())) . ".pdf";
$PDFExec = "/usr/bin/htmldoc --webpage -f " . $PDFFile;
$addedText = "";

$SaveText = $wgOut->mBodytext;
$wgOut->mBodytext = "";

// Pull each {...} block out of the rendered page text; note strpos() returns false (not -1) when nothing is found.
$i = strpos($SaveText,"{");
while ($i !== false && $SaveText != "") {
  $j = strpos($SaveText,"}");
  if ($j === false || $j <= $i) break;
  $multi_art = explode('|',substr($SaveText, $i+1, $j-$i-1));
  if (strlen($SaveText) > $j+1)
    $SaveText = substr($SaveText, $j+1);
  else
    $SaveText = "";
  $NewBodyText = "";
  foreach ($multi_art as $one_art) {
    $wgOut->mBodytext = "";
    $art = trim($one_art);
    $addedText .= "Creating: (" . $art . ")<br>";
    $PDFTitle = Title::newFromURL( $art );
    $PDFArticle = new Article($PDFTitle);
    $PDFArticle->view();
    $bodyText = str_replace('<img src="/stylesheets/images/magnify-clip.png" width="15" height="11" alt="Enlarge" />',
                '',
                str_replace('<div class="editsection" style="float:right;margin-left:5px;">[',
                '',
                str_replace('>edit</a>]</div>',
                '></a>', 
                $wgOut->mBodytext)));
    $NewBodyText .= "<h1>" . $art . "</h1><hr>" . str_replace('<a href="/index.php',
                    '<a href="http://' . $_SERVER["SERVER_NAME"] . '/index.php',
                    str_replace('<img src="/images/thumb',
                    '<img src="' . $_SERVER["DOCUMENT_ROOT"] . '/images/thumb',
                    $bodyText));
  }
  // Write this set of articles out to a temporary HTML file, named after the last article in the set.
  $h = fopen("/tmp/" . str_replace("'","_",str_replace(" ","_",$art)) . ".htm" ,"w");
  fwrite($h,"<html><body>");
  fwrite($h,$NewBodyText);
  fwrite($h,"</body></html>");
  fclose($h);
  $PDFExec .= " " . "/tmp/" . str_replace("'","_",str_replace(" ","_",$art)) . ".htm";
  $i = strpos($SaveText,"{");
}

// Run HTMLDOC over all the temporary HTML files to produce the PDF.
exec($PDFExec, $results);
foreach ($results as $line)
  $addedText .= $line . "<br>";

$addedText .= "<br><a href='http://" . $_SERVER["SERVER_NAME"] . '/printouts/' .
              str_replace("'","_",str_replace(" ","_",$wgTitle->getText())) . ".pdf'>" . 
              $wgTitle->getText() . ".pdf</a>";
$wgOut->mBodytext = "";

$wgArticle->view();
$wgOut->addHTML($addedText);

And there you have it. Enjoy. --MHart 20:26, 11 May 2005 (UTC)
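
One hardening note on the code above (my own suggestion rather than part of the original recipe): the temporary file names and the output PDF path are concatenated straight into a shell command, so wrapping them with PHP's escapeshellarg() protects against titles containing quotes or other shell metacharacters. Roughly:

// Sketch: quote each path before appending it to the htmldoc command line.
$htmFile  = "/tmp/" . str_replace("'","_",str_replace(" ","_",$art)) . ".htm";
$PDFExec .= " " . escapeshellarg($htmFile);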

Software fixes --Michael Bushey 10:17, 20 May 2005 (PST)



Q. Umm, where am I supposed to put this code? Or more like, what version of MediaWiki is this for?

A. Line 269 on my version of MediaWiki (1.5.3). Near the end of the file (13 lines from the EOF), just above "$wgOut->output();".

Q: Where do I put this in MediaWiki 1.6.6 (June 2006)?

A: Under 1.8.2 (Nov 06) I got it to work by commenting out the $mediaWiki->finalCleanup( ... line and putting the code below it. It also needed a putenv("HTMLDOC_NOCGI=1"); added before the call to htmldoc.
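
In other words, relative to the code at the top of this page, the tweak is just (a minimal sketch):

putenv("HTMLDOC_NOCGI=1");  // stop HTMLDOC from switching into CGI mode when run from the web server
exec($PDFExec, $results);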

Q: Where and how do I "add a link to this page, but using the new PrintArticles.php file"?

A: The lazy solution is to rewrite the entire URL as an external link.

Q: What does it look like for a French version, installed in /mediawiki, with images and UTF-8?

A: Like this:

$wgLoadBalancer->commitAll();

$PDFFile = $_SERVER["DOCUMENT_ROOT"] . '/printouts/' . str_replace("'","_",
       str_replace(" ","_",$wgTitle->getText())) . ".pdf";
$PDFExec = "/usr/bin/htmldoc --webpage -f " . $PDFFile;
$addedText = "";

$SaveText = $wgOut->mBodytext;
$wgOut->mBodytext = "";

$i = strpos($SaveText,"{");
while ($i !== false && $SaveText != "") {
  $j = strpos($SaveText,"}");
  if ($j === false || $j <= $i) break;
  $multi_art = explode('|',substr($SaveText, $i+1, $j-$i-1));
  if (strlen($SaveText) > $j+1)
    $SaveText = substr($SaveText, $j+1);
  else
    $SaveText = "";
  $NewBodyText = "";
  foreach ($multi_art as $one_art) {
    $wgOut->mBodytext = "";
    $art = trim($one_art);
    $addedText .= "Creating: (" . $art . ")<br>";
    $PDFTitle = Title::newFromURL( $art );
    $PDFArticle = new Article($PDFTitle);
    $PDFArticle->view();
    $bodyText = str_replace('<img src="/stylesheets/images/magnify-clip.png" width="15" height="11" alt="Enlarge" />',
                '',
                str_replace('<div class="editsection" style="float:right;margin-left:5px;">[',
                '',
                str_replace('>modifier</a>]</div>',
                '></a>', 
                $wgOut->mBodytext)));
    $NewBodyText .= "<h1>" . $art . "</h1><hr>" . str_replace('href="/mediawiki', 'href="http://' . $_SERVER["SERVER_NAME"] . '/mediawiki', 
                    str_replace('src="/mediawiki', 'src="http://' . $_SERVER["SERVER_NAME"] . '/mediawiki',
                    str_replace('<a href="/index.php', '<a href="http://' . $_SERVER["SERVER_NAME"] . '/index.php',
                    str_replace('<img src="/images/thumb',
                    '<img src="' . $_SERVER["DOCUMENT_ROOT"] . '/images/thumb', $bodyText))));
  }
  $h = fopen("/tmp/" . str_replace("'","_",str_replace(" ","_",$art)) . ".htm" ,"w");
  fwrite($h,"<html><body>");
  fwrite($h,$NewBodyText);
  fwrite($h,"</body></html>");
  fclose($h);
  exec("iconv --from-code=UTF8 --to-code=ISO_8859-1 -o /tmp/toto_" . str_replace(" ","_",$art) . " /tmp/" . str_replace("'","_",str_replace(" ","_",$art)) . ".htm");
  exec("mv -f /tmp/toto_" . str_replace(" ","_",$art) . " /tmp/" . str_replace("'","_",str_replace(" ","_",$art)) . ".htm");
  $PDFExec .= " " . "/tmp/" . str_replace("'","_",str_replace(" ","_",$art)) . ".htm";
  $i = strpos($SaveText,"{");
}

exec($PDFExec, $results);
foreach ($results as $line)
  $addedText .= $line . "<br>";

$addedText .= "<br><a href='http://" . $_SERVER["SERVER_NAME"] . '/printouts/' .
              str_replace("'","_",str_replace(" ","_",$wgTitle->getText())) . ".pdf'>" . 
              $wgTitle->getText() . ".pdf</a>";
$wgOut->mBodytext = "";

$wgArticle->view();
$wgOut->addHTML($addedText);

$wgOut->output();

Slightly Condensed 1.6.7 Version

For 1.6.7 try this:

  1. Create a new directory in your web root called 'pdf'.
  2. Create a new file in the extensions directory called ExtraActions.php and add the text below into it.
  3. Open LocalSettings.php and add the line include('extensions/ExtraActions.php'); at the bottom.
  4. Go to the article you want to turn into a PDF and call the page with action=pdf (e.g. http://localhost/wiki/index.php?title=Main_Page&action=pdf).
<?php

########################################################
#### Please, someone Include UNICODE utf-8 Support #####
########################################################
# GLOBALS
$wgHTMLDocPath = "C:\\path\to\htmldoc\htmldoc.exe";  #WARNING!!! Only for WINDOWS Users!!!
#$wgHTMLDocPath = "/usr/bin/htmldoc";                 #WARNING!!! for linux/unix  users

# REGISTER HOOK 
global $wgHooks;
$wgHooks['UnknownAction'][] = 'wfExtraActions';

# HOOK FUNCTION 
function wfExtraActions($action, $article) {
  switch( $action ) {
    case 'pdf':
      $pdf = new PDFPage( $article );
      $pdf->view();
      break;
    default:
      return true;
  }
  return false;
}


class PDFPage {
  var $mArticle;
  var $mFile;
  var $mCommand;
  
  function PDFPage( $article ) {
    global $wgHTMLDocPath;
    $this->mArticle = $article;
    $this->mFile = $_SERVER["DOCUMENT_ROOT"] . '/pdf/' . str_replace(' ','_',$this->mArticle->mTitle->getText()) . '.pdf';
    $this->mCommand = $wgHTMLDocPath . ' --webpage -f "' . $this->mFile . '"';
  }
  
  function view() {
    // Let HTMLDOC fetch the article's rendered page by URL, write the PDF into /pdf,
    // then redirect the browser to the generated file.
    $this->mCommand .= ' "' . $this->mArticle->mTitle->getFullURL() . '"';
    exec($this->mCommand, $results);
    header('Location: http://' . $_SERVER["SERVER_NAME"] . '/pdf/' . str_replace(' ','_',$this->mArticle->mTitle->getText()) . '.pdf');
  }
}
?>

Notes:

  • This only makes one page into a PDF and doesn't combine pages like the original method. A fun project would be adding an option to SpecialExport to export multiple documents to PDF.
  • There is no intermediate step of saving the document to a tmp directory so if you need to run iconv you will have to add that step back in.
  • If you want, you can make a tab at the top of the page next to edit, history, etc. for PDFs. That's a project for another day (a rough sketch follows below).
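
For anyone who wants to try it, a rough, untested sketch using the old-style SkinTemplateTabs hook (hook names and signatures varied between MediaWiki releases of that era, so treat this as a starting point only) might look like:

# Sketch: add a "pdf" tab next to edit/history via the SkinTemplateTabs hook.
$wgHooks['SkinTemplateTabs'][] = 'wfPDFTab';

function wfPDFTab( $skin, &$content_actions ) {
  global $wgTitle;
  $content_actions['pdf'] = array(
    'class' => false,
    'text'  => 'pdf',
    'href'  => $wgTitle->getLocalURL( 'action=pdf' )
  );
  return true;
}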

Linux Fix

A gotcha exists in the PHP if there is a slash in the title of the page: it is carried into the temporary filename, and the embedded slash is seen by the OS as a path separator (pointing at a directory that does not exist). The solution is to replace the code snippet in the filename lines, wherever it appears,

  "/tmp/" . str_replace("'","_",str_replace(" ","_",$art)) . ".htm"

with

  "/tmp/" . preg_replace("/[\'\s\/]/","_",$art) . ".htm"


Hope this helps.