Word doc search

From Meta, a Wikimedia project coordination wiki
  • Apache
  • Mandrakelinux

I modified the standard /includes/SpecialUpload.php page so that the contents of uploaded Microsoft Word documents with the .doc extension become searchable.

I started by downloading and installing the Antiword tool (I installed it from this RPM). It is a Linux command-line utility that converts a Word document's text to ASCII and writes it to standard output.
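
As a rough sketch of how antiword can be called from PHP, the helpers below build the command line and collect the output lines. The function names antiwordCommand and extractDocText are made up for illustration, and the /usr/bin/antiword path is an assumption that matches the location used in the snippet further down this page. The sketch also assumes antiword follows the usual convention of returning a non-zero exit status on failure.

```php
<?php
// Hypothetical helpers, not part of MediaWiki or antiword.

// Build the shell command. escapeshellarg() quotes the path so that
// spaces or shell metacharacters in the file name cannot break the command.
function antiwordCommand( $docPath, $binary = '/usr/bin/antiword' ) {
    // -s asks antiword to include text marked as hidden
    return $binary . ' -s ' . escapeshellarg( $docPath );
}

// Run antiword and return its output as an array of lines,
// or false if the command exited with a non-zero status.
function extractDocText( $docPath ) {
    $lines = array();
    $status = 0;
    exec( antiwordCommand( $docPath ), $lines, $status );
    return ( $status === 0 ) ? $lines : false;
}
```

Checking the exit status lets the caller fall back to the plain upload description when the conversion fails, instead of storing an empty comment block.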

Then I modified SpecialUpload.php at the point where it tests for a successful upload, just before it inserts the uploaded file information into the database. The change embeds the text of the Word document (including 'hidden' text, which is what antiword's -s parameter adds) as an HTML comment block in the description text of the image's file page.

To search these pages, a user must change their preferences so that searches include the Image namespace.

if( $this->saveUploadedFile( $this->mUploadSaveName,
                              $this->mUploadTempName,
                              !empty( $this->mSessionKey ) ) ) {
 /**
  * Update the upload log and create the description page
  * if it's a new file.
  */
  # MHART replace $textdesc with <!-- text from doc if .d
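  if (strtolower($finalExt) == "doc") {
      $NewDesc = $this->mUploadDescription . "\r\n" . "<!-- ";
      # Quote the path so spaces or shell metacharacters in the file name are safe
      $toexec = "/usr/bin/antiword -s " . escapeshellarg($this->mSavedFile);
      $DocText = array();
      exec($toexec, $DocText);
      foreach ($DocText as $DocLine) {
          # Strip "-->" so the document text cannot close the comment early
          $NewDesc .= "\r\n" . str_replace("-->","",$DocLine);
      }
      $NewDesc .= "\r\n" . " -->";
  }
  else {
      $NewDesc = $this->mUploadDescription;
  }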
  ####
  wfRecordUpload( $this->mUploadSaveName,
                  $this->mUploadOldVersion,
                  $this->mUploadSize, 
                  $NewDesc, # MHART - this line has been changed
                  $this->mUploadCopyStatus,
                  $this->mUploadSource );
  $this->showSuccess();
 }
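
The comment-wrapping step can also be pulled out into a standalone function, which makes it easier to try without a wiki or antiword installed. This is only a sketch: appendHiddenText is a hypothetical name, not a MediaWiki function, and it differs from the snippet above in one respect. A single str_replace pass can splice a new "-->" together from the surrounding characters, so the sketch repeats the removal until none remain.

```php
<?php
// Hypothetical helper, not part of MediaWiki.
// Appends $textLines to $description inside an HTML comment block.
function appendHiddenText( $description, $textLines ) {
    $out = $description . "\r\n" . "<!-- ";
    foreach ( $textLines as $line ) {
        // Removing "-->" once can leave a new "-->" behind
        // (for example "---->>" becomes "-->"), so repeat until none remain.
        while ( strpos( $line, '-->' ) !== false ) {
            $line = str_replace( '-->', '', $line );
        }
        $out .= "\r\n" . $line;
    }
    return $out . "\r\n" . " -->";
}
```

Calling appendHiddenText( $this->mUploadDescription, $DocText ) would then produce the same kind of description text as the loop above, with the only "-->" being the one that closes the block.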

My actual script is a bit different because I handle other file types in a similar fashion. Here's the combined documentation.

--MHart 17:54, 23 Apr 2005 (UTC)