Excel doc search

From Meta, a Wikimedia project coordination wiki
  • Apache
  • Mandrakelinux

I modified the standard /includes/SpecialUpload.php page so that uploaded Excel documents with the .xls extension will have their contents indexable.

I started by downloading and installing the Catdoc tool. Catdoc includes a Linux command line utility (xls2csv) that will convert an Excel doc's text to ASCII and output it.

Then I modified SpecialUpload.php at the point where it tests for a successful upload, just before it inserts the uploaded file's information into the database. The change appends the text of the Excel document, wrapped in an HTML comment block, to the description text of the image's file page. The comment is not displayed on the page, but it is part of the page text, so the search can find it.

To search these pages, a user must change their preferences to include the Image namespace in searches (or the Image namespace must be added to the default search namespaces).
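For the second option, a minimal sketch of a LocalSettings.php fragment follows. It assumes your MediaWiki version provides the $wgNamespacesToBeSearchedDefault setting; check the DefaultSettings.php of your installation before relying on it.

```php
<?php
# LocalSettings.php fragment (assumption: this setting exists in your version).
# 6 is the number of the Image namespace, so this adds image description
# pages to the namespaces searched by default.
$wgNamespacesToBeSearchedDefault[6] = true;
```

Users who have already saved their own search preferences may still need to tick the Image namespace themselves.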

if( $this->saveUploadedFile( $this->mUploadSaveName,
                              !empty( $this->mSessionKey ) ) ) {
  /**
   * Update the upload log and create the description page
   * if it's a new file.
   */
  # MHART replace $textdesc with <!-- text from doc if .d
  if (strtolower($finalExt) == "xls") {
      $NewDesc = $this->mUploadDescription . "\r\n" . "<!-- ";
      # Quote the path so spaces or shell metacharacters in it are safe
      $toexec = "/usr/bin/xls2csv " . escapeshellarg($this->mSavedFile);
      exec($toexec, $DocText);
      foreach ($DocText as $DocLine) {
          # Strip "-->" so the document text cannot end the comment early
          $NewDesc .= "\r\n" . str_replace("-->","",$DocLine);
      }
      $NewDesc .= "\r\n" . " -->";
  } else {
      # Other file types keep the description unchanged
      $NewDesc = $this->mUploadDescription;
  }
  wfRecordUpload( $this->mUploadSaveName,
                  $NewDesc, # MHART - this line has been changed
                  $this->mUploadSource );
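The comment-building step can also be pulled out into a small function so it can be tried without MediaWiki or xls2csv. This is only a sketch: the function names buildIndexableDescription and stripCommentTerminators are hypothetical and not part of MediaWiki, and the sample rows are made up. Unlike the single str_replace call, it repeats the removal until no "-->" is left, because one pass over a string such as "---->>" would produce a new "-->".

```php
<?php
/**
 * Hypothetical helper: remove every "-->" from a line of extracted text.
 */
function stripCommentTerminators( $line ) {
    // Repeat until stable: removing one "-->" can join the characters
    // around it into another one.
    while ( strpos( $line, '-->' ) !== false ) {
        $line = str_replace( '-->', '', $line );
    }
    return $line;
}

/**
 * Hypothetical helper: append extracted text lines to an upload
 * description, wrapped in an HTML comment.
 */
function buildIndexableDescription( $description, array $lines ) {
    $newDesc = $description . "\r\n" . "<!-- ";
    foreach ( $lines as $line ) {
        $newDesc .= "\r\n" . stripCommentTerminators( $line );
    }
    return $newDesc . "\r\n" . " -->";
}

// Made-up rows standing in for real xls2csv output.
$rows = array( '"Item","Cost"', '"Desk --> chair","120"' );
$description = buildIndexableDescription( "Office budget", $rows );
```

Inside SpecialUpload.php, the array filled by exec() would be passed as the second argument in place of $rows.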

My actual script is a bit different because I'm handling other file types in a similar fashion. Here's the combined documentation.

--MHart 18:33, 10 May 2005 (UTC)

Note: I installed catdoc 0.94 and found the executable in /usr/local/bin rather than /usr/bin as shown above. -- Erik Heidt, 31 July 2005
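Since the install location differs between systems, one option is to look for the converter in both places instead of hard-coding a path. The following is a sketch under that assumption; findExecutable is a hypothetical helper name, and the two candidate paths are simply the ones mentioned on this page.

```php
<?php
/**
 * Hypothetical helper: return the first candidate path that is an
 * executable file, or null if none of them is.
 */
function findExecutable( array $candidates ) {
    foreach ( $candidates as $path ) {
        if ( is_file( $path ) && is_executable( $path ) ) {
            return $path;
        }
    }
    return null;
}

// The two locations mentioned on this page.
$xls2csv = findExecutable( array( '/usr/bin/xls2csv', '/usr/local/bin/xls2csv' ) );
if ( $xls2csv === null ) {
    // Converter not found: keep the plain upload description.
}
```

The returned path could then replace the fixed "/usr/bin/xls2csv" string when the command is built.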