Stop word list

From Meta, a Wikimedia project coordination wiki
A proposal to move this page was rejected.
This page may be useful on mediawikiwiki, since not everyone uses Lucene search. Don't forget subpages.

This is an organising page for the developers and others working on improving the stop word list used in MediaWiki.

Current status

The current list is the default stop word list in MySQL version 4.0.20 plus all single characters. The default minimum word length of four characters was reduced to 2. A single English-language stop word list is currently used for all projects. See the MySQL 4.0.20 stop word list.
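
As a rough illustration of the rules above, the following Python sketch combines a stop word list with all single characters and applies the reduced minimum length of 2. The names `MIN_WORD_LENGTH`, `mysql_stop_words` and `is_indexed` are hypothetical, the word set is a tiny excerpt rather than the real MySQL 4.0.20 list, and treating "all single characters" as ASCII letters and digits is an assumption.

```python
import string

# Minimum word length after the change described above (the default was 4).
MIN_WORD_LENGTH = 2

# Hypothetical excerpt; the real MySQL 4.0.20 list is much longer.
mysql_stop_words = {"about", "and", "of", "the"}

# Assumption: "all single characters" is approximated by ASCII letters and digits.
single_characters = set(string.ascii_lowercase + string.digits)

# The current list: the MySQL default list plus all single characters.
stop_words = mysql_stop_words | single_characters


def is_indexed(word):
    """Return True if the word is long enough and is not a stop word."""
    word = word.lower()
    return len(word) >= MIN_WORD_LENGTH and word not in stop_words


for candidate in ["wiki", "the", "a", "uk"]:
    print(candidate, is_indexed(candidate))
```

In this sketch a two-letter word such as "uk" is indexed because it meets the reduced length limit and is not on the list, while "the" and "a" are not.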

The initial fulltext index statistics for the English Wikipedia list the 5,000 most frequent words in that index.
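
A frequency table of this kind can be approximated with a simple word count. The sketch below is only an illustration: the real statistics came from the fulltext index itself, and the function name `most_frequent_words`, the `\w+` tokenisation and the sample texts are all assumptions.

```python
import re
from collections import Counter


def most_frequent_words(texts, limit=5000):
    """Count lower-cased words across all texts and return the top `limit`."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts.most_common(limit)


# Tiny illustrative sample, not real index data.
sample_pages = ["The cat and the dog", "The dog barked"]
top_words = most_frequent_words(sample_pages, limit=2)
print(top_words)
```

Words near the top of such a table are candidates for a stop word list, subject to the manual review described in the sections below.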

Technical words

These words are commonly used in MediaWiki pages and come from the English statistics:

15px 200px 250px 300px 3px ca category cd colspan da de disambig en eo es et fi fr he hu it ja mdash nbsp ndash nl no pl pt redirect ro ru sl sr stub sv user user_talk utc zh

The language codes bg and uk are excluded from the list of biggest wikis because they are also used with other meanings.

Hippietrail's word frequency lists

Consolidated words to remove list

The words in this list will be removed from the stop word list if present. They were included because of the way a particular stop word list was built, or because they are stop words in one language but have significant meaning in another:

Úrsula Amaranth Aureliano bg Bilbas Buendía Detonando die Duda Dudley dumbledore email era Gendalfas god Hagrid Harry Hermine Hermione Malefoy malfoy material meg Neville nun potter professeur Professor Rogue Ron Rony Semas Sirius store sue uk usa valeur valor www

Consolidated stop word list

The /consolidated stop word list will be in use on a trial basis. It is composed of all of the MySQL list, the technical list, all of the /Hippietrail consolidated stop word list and the /google stop word list, with the words in the consolidated words to remove list removed.
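
The composition described above amounts to a set union followed by a set difference. A minimal Python sketch, assuming each list is available as a whitespace-separated string: the function names `parse_words` and `build_consolidated` are hypothetical, the sample lists are tiny stand-ins rather than the real lists, and case-folding the words before comparison is an assumption.

```python
def parse_words(text):
    """Split a whitespace-separated word list into a case-folded set."""
    return {word.casefold() for word in text.split()}


def build_consolidated(source_lists, words_to_remove):
    """Union all source lists, then drop every word on the removal list."""
    combined = set()
    for words in source_lists:
        combined |= words
    return combined - words_to_remove


# Tiny illustrative samples, not the real lists.
mysql_sample = parse_words("the and of die")
technical_sample = parse_words("nbsp colspan redirect")
hippietrail_sample = parse_words("de la Harry")
google_sample = parse_words("www the")
remove_sample = parse_words("die Harry www")

consolidated = build_consolidated(
    [mysql_sample, technical_sample, hippietrail_sample, google_sample],
    remove_sample,
)
print(sorted(consolidated))
```

Applying the removal list last means a word is dropped regardless of which source list contributed it.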