PDF doc search

From Meta, a Wikimedia project coordination wiki

See PDF doc search II for another solution to this problem.

  • Apache
  • Mandrakelinux

I modified the standard /includes/SpecialUpload.php so that the text of uploaded files with a .pdf extension becomes searchable.

I started by downloading and installing Xpdf. Xpdf includes a Linux command-line utility, pdftotext, that extracts a PDF document's text and writes it out as plain text.
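As a rough sketch of calling the converter from PHP: the code below assumes the utility is pdftotext installed at /usr/bin/pdftotext (the path used in the modification on this page). The function names buildPdftotextCommand and extractPdfLines are hypothetical helpers for illustration, not part of MediaWiki or Xpdf.

```php
<?php
// Hypothetical helper: builds the pdftotext command line for a given PDF path.
function buildPdftotextCommand($pdfPath, $binary = '/usr/bin/pdftotext') {
    // escapeshellarg() quotes the path so spaces or shell metacharacters are safe.
    // The trailing "-" tells pdftotext to write the text to standard output.
    return $binary . ' ' . escapeshellarg($pdfPath) . ' -';
}

// Hypothetical helper: returns the extracted text as an array of lines,
// or an empty array if the command exits with a non-zero status.
function extractPdfLines($pdfPath) {
    $lines = array();
    $status = 0;
    exec(buildPdftotextCommand($pdfPath), $lines, $status);
    return $status === 0 ? $lines : array();
}
```

Checking the exit status means a missing binary or an unreadable PDF yields no text instead of partial output.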

Then I modified SpecialUpload.php at the point where it tests for a successful upload, just before it inserts the uploaded file's information into the database. The change appends the text of the PDF document, wrapped in an HTML comment block, to the description text of the image's file page.

To search the image's page, a user must enable the Image namespace in their search preferences (alternatively, add the Image namespace to the default namespace search).
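Adding images to the default namespace search might look like the following line in LocalSettings.php. This is a sketch that assumes your MediaWiki version provides the $wgNamespacesToBeSearchedDefault setting and the NS_IMAGE constant (later releases name the constant NS_FILE); check the DefaultSettings.php of your installation before relying on it.

```php
# LocalSettings.php (sketch): include the Image namespace in default searches.
# Assumption: NS_IMAGE is defined in this MediaWiki version.
$wgNamespacesToBeSearchedDefault[NS_IMAGE] = true;
```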

if( $this->saveUploadedFile( $this->mUploadSaveName,
                              $this->mUploadTempName,
                              !empty( $this->mSessionKey ) ) ) {
 /**
  * Update the upload log and create the description page
  * if it's a new file.
  */
  # MHART replace $textdesc with <!-- text from doc if .d
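  if (strtolower($finalExt) == "pdf") {
      # Append the extracted text to the description inside an HTML comment
      $NewDesc = $this->mUploadDescription . "\r\n" . "<!-- ";
      # Quote the path so spaces or shell metacharacters cannot break the command
      $toexec = "/usr/bin/pdftotext " . escapeshellarg($this->mSavedFile) . " -";
      exec($toexec, $DocText);
      foreach ($DocText as $DocLine) {
          # Strip "-->" so the document text cannot close the comment early
          $NewDesc .= "\r\n" . str_replace("-->","",$DocLine);
      }
      $NewDesc .= "\r\n" . " -->";
  }
  else {
      $NewDesc = $this->mUploadDescription;
  }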
  ####
  wfRecordUpload( $this->mUploadSaveName,
                  $this->mUploadOldVersion,
                  $this->mUploadSize, 
                  $NewDesc, # MHART - this line has been changed
                  $this->mUploadCopyStatus,
                  $this->mUploadSource );
  $this->showSuccess();
 }
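One detail of the comment wrapping is worth a standalone sketch. A single str_replace pass can leave a new "-->" behind when the removal joins neighbouring characters (for example, "---->>" becomes "-->"). The version below repeats the replacement until none remains. The function name wrapInHtmlComment is hypothetical, and the helper is an illustration of the idea, not the code used on the original wiki.

```php
<?php
// Hypothetical helper: appends extracted text lines to a description,
// wrapped in an HTML comment, using the same "\r\n" layout as the hack above.
function wrapInHtmlComment($description, $lines) {
    $out = $description . "\r\n" . "<!-- ";
    foreach ($lines as $line) {
        // Repeat the removal: deleting one "-->" can join the surrounding
        // characters into a new "-->" that would end the comment early.
        while (strpos($line, '-->') !== false) {
            $line = str_replace('-->', '', $line);
        }
        $out .= "\r\n" . $line;
    }
    return $out . "\r\n" . " -->";
}
```

Called with the description and the array filled by exec, it returns a string that could be passed to wfRecordUpload in place of $NewDesc.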

My actual script is a bit different because I'm handling other file types in a similar fashion. Here's the combined documentation.

--MHart 17:15, 10 May 2005 (UTC)

Modification to fix broken description

I think you need to modify the $this->mUploadSize line too, otherwise you will truncate your new description. I've changed it to strlen($NewDesc).

So replace this function call

  wfRecordUpload( $this->mUploadSaveName,
                  $this->mUploadOldVersion,
                  $this->mUploadSize, 
                  $NewDesc,  # MHART - this line has been changed
                  $this->mUploadCopyStatus,
                  $this->mUploadSource );

with

  wfRecordUpload( $this->mUploadSaveName,
                  $this->mUploadOldVersion,
                  strlen($NewDesc),  # MARKSW - this line has been changed
                  $NewDesc,  # MHART - this line has been changed
                  $this->mUploadCopyStatus,
                  $this->mUploadSource );

Other than that, it's great! Thanks for this :) Marksw 11:58, 7 October 2005 (UTC)