User:Archivist
From Meta, a Wikimedia project coordination wiki
'Full text indexing and spell checker for Wikipedia'
Note: this type of indexer works in the background so that user response times remain short. Real-time updating may become possible later.
Add to the part that stores an article:
clear indexed flag when article has changed
Cron job (each run could either be complete (daily) or, better, cover just the pages modified since the last run (hourly)):
select next changed article
read diffs and apply to index
for each word in diff {
word=spellcheck(word) (assume correct for ignored words)
if not banned word then { (no banned words at present)
if + { add word to word count
add word and source to index }
else { subtract word from word count
subtract word and source from index }
}
}
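The diff loop above can be sketched in Python. This is only an illustration under assumptions: the names `word_count`, `index`, `BANNED` and `apply_diff` are hypothetical, a diff is assumed to arrive as a list of `(sign, word)` pairs, in-memory dictionaries stand in for the database tables, and the spellcheck step is left out to keep the sketch short.

```python
from collections import Counter, defaultdict

# Hypothetical in-memory stand-ins for the word count and index tables.
word_count = Counter()          # word -> total occurrences across all articles
index = defaultdict(Counter)    # word -> {article_id: occurrences in that article}
BANNED = set()                  # no banned words at present


def apply_diff(article_id, diff):
    """Apply a diff given as (sign, word) pairs, where sign is '+' or '-'."""
    for sign, word in diff:
        word = word.lower()
        if word in BANNED:
            continue
        if sign == '+':
            # Word was added to the article.
            word_count[word] += 1
            index[word][article_id] += 1
        else:
            # Word was removed from the article.
            word_count[word] -= 1
            index[word][article_id] -= 1
            # Drop entries that have fallen to zero so the index stays small.
            if index[word][article_id] <= 0:
                del index[word][article_id]
            if not index[word]:
                del index[word]
            if word_count[word] <= 0:
                del word_count[word]
```

For example, `apply_diff(1, [('+', 'wiki'), ('-', 'wikki')])` would record one more occurrence of "wiki" for article 1 and remove one occurrence of "wikki".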
During indexing an alphabetical list could be made. Page sizing and cutting would be predetermined so that easy-to-browse index pages are created (these can/should be static HTML for Apache to cache).
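One possible way to cut the alphabetical list into static pages is sketched below. The function name `build_index_pages`, the page size of 200 and the bare HTML layout are all assumptions made for illustration; writing the pages to disk for Apache to serve is left out.

```python
from html import escape


def build_index_pages(words, page_size=200):
    """Return (title, html) pairs, one per fixed-size slice of the sorted word list."""
    ordered = sorted(set(words), key=str.lower)
    pages = []
    for start in range(0, len(ordered), page_size):
        chunk = ordered[start:start + page_size]
        # Title each page by its first and last word to make browsing easy.
        title = f"{chunk[0]} - {chunk[-1]}"
        items = "\n".join(f"<li>{escape(w)}</li>" for w in chunk)
        html = (
            f"<html><body><h1>{escape(title)}</h1>\n"
            f"<ul>\n{items}\n</ul></body></html>"
        )
        pages.append((title, html))
    return pages
```

Because the cut points depend only on the sorted list and the page size, the pages only need regenerating when the word list itself changes.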
function spellcheck
automagically change thier to their etc
if not automagic word then
add word and source to specialtypo table
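A minimal sketch of the spellcheck step follows. It assumes that only words missing from the word table are sent to the typo table, which the outline above does not state explicitly. `AUTOCORRECT`, `KNOWN_WORDS`, `IGNORED` and `typo_table` are hypothetical in-memory stand-ins for the real tables, and the sample known words are placeholders.

```python
# Hypothetical stand-ins for the real tables.
AUTOCORRECT = {"thier": "their"}           # automagic corrections
KNOWN_WORDS = {"their", "index", "article"}  # placeholder for the word table
IGNORED = set()                            # ignored words are assumed correct
typo_table = []                            # (word, source) rows for special:typo


def spellcheck(word, source):
    """Return the corrected word; record unknown words as possible typos."""
    key = word.lower()
    if key in AUTOCORRECT:
        # Automagic fix: no need to bother a human.
        return AUTOCORRECT[key]
    if key not in KNOWN_WORDS and key not in IGNORED:
        # Assumption: only words absent from the word table are listed.
        typo_table.append((key, source))
    return key
```

Adding a word to `KNOWN_WORDS` (the "ok word" action on special:typo) would stop it being reported on later runs.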
function while we are here{
check and repair links // a sensible thing to do on a regular basis
any other check deemed useful
}
special:typo
list typos
user selects typo and jump to source and edit
or adds word as an ok word
or word added to vf add to word table
and auto remove from typo table
this option ("Spellcheck") could be at the bottom of every page
search.php
explode incoming words
while words in word count {
get a count ordered list of the search terms
}
if array size less than incoming query count then not all words were found > either exit or fall back to shorter list (partial success 'may' be useful)
select (lowest count word first) from index
select next term word 'and' previous article numbers, until success or fail
deliver 50 answers to user
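The search outline above might look roughly like this in Python. It is a sketch, not the actual search.php: `search` is a hypothetical name, `word_count` and `index` are passed in as plain dictionaries (word to count, word to set of article numbers), and the "exit" option is taken when a term is missing.

```python
def search(query, word_count, index, limit=50):
    """Return up to `limit` article numbers containing every query term."""
    terms = query.lower().split()
    # Keep only the terms that appear in the word count table.
    known = [t for t in terms if word_count.get(t, 0) > 0]
    if len(known) < len(terms):
        # Not all words were found; this sketch exits rather than
        # falling back to the shorter list.
        return []
    if not known:
        return []
    # Rarest word first, so the candidate set starts as small as possible.
    known.sort(key=lambda t: word_count[t])
    results = set(index[known[0]])
    for term in known[1:]:
        # 'and' the next term with the previous article numbers.
        results &= set(index[term])
        if not results:
            break
    return sorted(results)[:limit]
```

Starting from the lowest-count word means each later step only has to test a short list of candidate articles, which is where the word count table saves disk access.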
Notes
- The word table can be linked to Wiktionary (that will force the Wiktionary to be updated). We will need a full set of international Wiktionaries.
- The word count drastically reduces the disk access count during the search
- Doing it ourselves removes MySQL's dumb word list and character count restrictions.
- The stopword list and minimum character count are run-time configurable in MySQL 4.0.x. --Brion VIBBER 10:30, 19 Nov 2003 (UTC)
- MySQL has a habit: when a query covers >50% of a table, it reads the entire table and ignores the index! But if a limit term has been applied this is nonsense. I had an 11 million row table with an integer primary index
'select * from table limit 11million -30,11million' (was attempting to browse table)
- takes 7 minutes on a 2gig Athlon with nobody else on the box! Archivist 00:26, 20 Nov 2003 (UTC)
- There might be a scheduler problem in MySQL that causes the brain-dead periods.