User:Archivist

From Meta, a Wikimedia project coordination wiki

'Full text indexing and spell checker for Wikipedia'

Note: this type of indexer runs in the background so that user response times stay short. Real-time updates may be possible later.

Add to the part that stores an article:

 clear indexed flag when article has changed

Cron job (each run could either be complete (daily) or, better, cover only the pages modified since the last run (hourly)):

 select next changed article
  read diffs and apply to index
   for each word in diff {
     word = spellcheck(word) (assume correct for ignored words)
     if not banned word then { (no banned words at present)
       if + { add word to word count
              add word and source to index }
       else { subtract word from word count
              remove word and source from index }
     }
   }
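
The update loop above can be sketched in Python roughly as follows. This is only a minimal sketch under stated assumptions: the names word_count, index, BANNED_WORDS and apply_diff are hypothetical, the tables are plain in-memory dictionaries rather than database tables, the diff is assumed to arrive as (sign, word) pairs, and the spellcheck step is left out.

```python
# Hypothetical in-memory stand-ins for the word-count and index tables.
word_count = {}        # word -> total occurrences across all articles
index = {}             # word -> {article_id: occurrences in that article}
BANNED_WORDS = set()   # empty: no banned words at present


def apply_diff(article_id, diff):
    """Apply a diff given as (sign, word) pairs, where sign is '+' or '-'."""
    for sign, word in diff:
        word = word.lower()
        if word in BANNED_WORDS:
            continue
        if sign == "+":
            # Word was added to the article: bump both tables.
            word_count[word] = word_count.get(word, 0) + 1
            postings = index.setdefault(word, {})
            postings[article_id] = postings.get(article_id, 0) + 1
        elif sign == "-":
            postings = index.get(word, {})
            if article_id not in postings:
                continue  # word was never indexed for this article
            postings[article_id] -= 1
            word_count[word] -= 1
            if postings[article_id] == 0:
                del postings[article_id]  # article no longer contains the word
            if word_count[word] == 0:
                # Last occurrence anywhere: drop the word from both tables.
                del word_count[word]
                del index[word]
```

Keeping a per-article count next to each index entry is a design choice made for this sketch, so that removing one occurrence of a word does not drop an article that still contains it elsewhere.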

During indexing an alphabetical word list could also be built. Page sizes and cut points should be predetermined so that easy-to-browse index pages are created (these can, and probably should, be static HTML for Apache to cache).
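
One way the static index pages might be cut is sketched below. The function name build_index_pages, the page size of 200 words and the index_NNNN.html file naming are all assumptions made for illustration. Nothing here is an existing MediaWiki feature.

```python
import html


def build_index_pages(words, page_size=200):
    """Split an alphabetical word list into fixed-size static HTML pages.

    Returns a list of (filename, html_text) pairs. The page size and the
    file naming scheme are illustrative assumptions.
    """
    ordered = sorted(set(words), key=str.lower)  # de-duplicate, then sort
    pages = []
    for start in range(0, len(ordered), page_size):
        chunk = ordered[start:start + page_size]
        # Title each page with its first and last word to ease browsing.
        title = html.escape(f"{chunk[0]} to {chunk[-1]}")
        items = "\n".join(f"<li>{html.escape(w)}</li>" for w in chunk)
        body = (
            f"<html><head><title>{title}</title></head>\n"
            f"<body><ul>\n{items}\n</ul></body></html>"
        )
        pages.append((f"index_{len(pages) + 1:04d}.html", body))
    return pages
```

Each returned pair could then be written to disk by the cron job, so that Apache serves and caches the pages without touching the database.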

  function spellcheck {
     automagically change thier to their etc
     if not automagic word then
        add word and source to specialtypo table
  }
  function while we are here {
      check and repair links // a sensible thing to do on a regular basis
      any other check deemed useful
  }
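
A minimal Python sketch of the spellcheck step, under some assumptions: AUTO_CORRECT, KNOWN_WORDS and typo_table are hypothetical in-memory stand-ins for database tables, and only words missing from the word table are logged as possible typos (the pseudocode above does not say how known-good words are recognised).

```python
# Hypothetical stand-ins for the real tables.
AUTO_CORRECT = {"thier": "their"}            # "automagic" corrections
KNOWN_WORDS = {"their", "index", "article"}  # the word table (assumed)
typo_table = []                              # (word, source) pairs for special:typo


def spellcheck(word, source):
    """Return the word to index, logging unknown words as possible typos."""
    lowered = word.lower()
    if lowered in AUTO_CORRECT:
        return AUTO_CORRECT[lowered]         # fix silently
    if lowered not in KNOWN_WORDS:
        typo_table.append((lowered, source)) # leave for a human to review
    return lowered
```

For example, spellcheck("Thier", "Page A") returns "their" without logging anything, while an unknown word is returned unchanged and recorded together with its source page.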


special:typo

 list typos
  user selects typo and jump to source and edit 
       or adds word as an ok word
       or word added to vf add to word table
      and auto remove from typo table
   this option could be at the bottom of every page: "Spellcheck"
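
The listing and "add word as an ok word" actions might look something like the sketch below. The names known_words, typo_table, list_typos and accept_word are hypothetical, the sample rows are made up for illustration, and the jump-to-source-and-edit action is not covered.

```python
# Hypothetical in-memory tables with made-up sample rows.
known_words = {"their", "index"}
typo_table = [("wikify", "Page A"), ("indexx", "Page B"), ("wikify", "Page C")]


def list_typos():
    """Return (word, [sources]) pairs, sorted by word, for display."""
    grouped = {}
    for word, source in typo_table:
        grouped.setdefault(word, []).append(source)
    return sorted(grouped.items())


def accept_word(word):
    """Mark a word as OK: add it to the word table and auto-remove it
    from the typo table, for every page on which it was reported."""
    known_words.add(word)
    typo_table[:] = [(w, s) for w, s in typo_table if w != word]
```

Accepting a word clears all of its reports at once, so the same word does not need to be approved page by page.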

search.php

  explode incoming words
   while words in word count {
    get a count ordered list of the search terms
   } 

If the array size is less than the incoming query word count, then not all words were found: either exit or fall back to the shorter list (partial success 'may' be useful).

 select (lowest count word first) from index
  select next term word 'and' previous article no
   till success or fail
 deliver 50 answers to user
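
The search steps above can be sketched in Python as follows. This is a sketch under assumptions, not the actual search.php: the function name search is hypothetical, word_count maps each word to its count, index maps each word to a set of article numbers, and the sketch takes the "exit" branch when any query word is unknown rather than falling back to a shorter list.

```python
def search(query, word_count, index, limit=50):
    """AND-search: intersect article sets, starting with the rarest word."""
    terms = [w.lower() for w in query.split()]   # "explode" the query
    if not terms:
        return []
    found = [w for w in terms if w in word_count]
    if len(found) < len(terms):
        return []  # some word is not indexed, so no article can match all
    # Lowest count first keeps the candidate set as small as possible.
    found.sort(key=lambda w: word_count[w])
    result = set(index[found[0]])
    for word in found[1:]:
        result &= set(index[word])               # word 'and' previous articles
        if not result:
            break                                # fail early
    return sorted(result)[:limit]
```

Starting from the rarest word means every later intersection works on a set no larger than that word's article list, which is the disk-access saving the word count is meant to provide.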

Notes

  • The word table can be linked to Wiktionary (which will force Wiktionary to be kept up to date). We will need a full set of international Wiktionaries.
  • The word count drastically reduces the disk access count during the search
  • Doing it ourselves removes MySQL's dumb word list and character count restrictions.
The stopword list and minimum character count are run-time configurable in MySQL 4.0.x. --Brion VIBBER 10:30, 19 Nov 2003 (UTC)
  • MySQL has a habit: when a query would touch more than 50% of a table, it reads the entire table and ignores the index! With a LIMIT clause applied this is nonsense. I had an 11 million row table with an integer primary index;
'select * from table limit 11million -30,11million' (was attempting to browse the table)
takes 7 minutes on a 2gig Athlon with nobody else on the box! Archivist 00:26, 20 Nov 2003 (UTC)
  • There might be a scheduler problem in MySQL that causes the brain-dead periods.