Stop word list

From Meta, a Wikimedia project coordination wiki
This page is outdated, but if it were updated, it may still be useful. Please help by correcting, augmenting and revising the text into an up-to-date form.
A proposal to move this page to MediaWiki.org was rejected.
Because the Template:MoveToMediaWiki tag was on the page for a year without any MediaWiki.org importers seeing fit to transwiki it, the move proposal was regarded as rejected by the MediaWiki.org community.
This page may still be useful on mediawikiwiki, since perhaps not everyone uses Lucene search. Don't forget the subpages.

This is an organising page for the developers and others working on improving the stop word list used in MediaWiki.

Current status

The current list is the default stop word list in MySQL version 4.0.20 plus all single characters. The default minimum word length of four characters was reduced to 2. A single English-language stop word list is currently used for all projects. See: MySQL 4.0.20 stop word list
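As a rough illustration of how the current list is put together, here is a minimal Python sketch. The word set `MYSQL_EXCERPT` is only a small stand-in for the real MySQL 4.0.20 list, the function names are hypothetical, and "all single characters" is assumed here to mean ASCII letters and digits; none of this is MediaWiki's actual code.

```python
import string

# Assumption: a tiny stand-in for the MySQL 4.0.20 default stop word list.
MYSQL_EXCERPT = {"about", "and", "the"}

# Minimum indexed word length: reduced from MySQL's default of 4 to 2.
MIN_WORD_LENGTH = 2


def build_current_list(mysql_words):
    """Return the MySQL stop words plus every single character.

    Assumption: 'single characters' means ASCII letters and digits.
    """
    single_chars = set(string.ascii_lowercase) | set(string.digits)
    return set(mysql_words) | single_chars


def is_indexed(word, stop_words, min_len=MIN_WORD_LENGTH):
    """A word is indexed if it is long enough and not a stop word."""
    word = word.lower()
    return len(word) >= min_len and word not in stop_words


current_list = build_current_list(MYSQL_EXCERPT)
```

With this sketch, `is_indexed("uk", current_list)` is true, while `is_indexed("the", current_list)` and `is_indexed("a", current_list)` are both false.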

The initial fulltext index statistics for en Wikipedia provide the most frequent 5,000 words in the initial English Wikipedia index.

Technical words

These words are commonly used in MediaWiki pages and come from the English statistics:

15px 200px 250px 300px 3px ca category cd colspan da de disambig en eo es et fi fr he hu it ja mdash nbsp ndash nl no pl pt redirect ro ru sl sr stub sv user user_talk utc zh

The language codes bg and uk, although they belong to some of the biggest wikis, are left out of this list because they are also used with other meanings.

Hippietrail's word frequency lists

Consolidated words to remove list

The words in this list will be removed from the stop word list if present. They were included either because of the way a particular stop word list was built, or because they are stop words in one language but carry significant meaning in another:

Úrsula Amaranta Aureliano bg Bilbas Buendía Detonando die Duda Dudley dumbledore email era Gendalfas god Hagrid Harry Hermine Hermione Malefoy malfoy material meg Neville nun potter professeur Professor Rogue Ron Rony Semas Sirius store sue uk usa valeur valor www

Consolidated stop word list

The /consolidated stop word list will be in use on a trial basis. It is composed of the whole MySQL list, the technical list, all of the /Hippietrail consolidated stop word list and the /google stop word list, minus the words in the consolidated words to remove list.
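The composition described above amounts to a set union followed by a set difference. The following Python sketch shows the idea; the function name `consolidate` and the small sample sets are hypothetical, and matching the remove list case-insensitively is an assumption (that list mixes forms such as "Harry" and "dumbledore").

```python
def consolidate(mysql, technical, hippietrail, google, to_remove):
    """Union the four source lists, then drop the words to remove.

    Assumption: removal is case-insensitive.
    """
    combined = set().union(mysql, technical, hippietrail, google)
    remove = {word.lower() for word in to_remove}
    return {word for word in combined if word.lower() not in remove}


# Hypothetical sample data; the real lists are far longer.
mysql_sample = {"the", "and"}
technical_sample = {"nbsp", "redirect", "en"}
hippietrail_sample = {"die", "Harry", "de"}
google_sample = {"www", "com"}
remove_sample = {"die", "harry", "www", "bg"}

consolidated = consolidate(
    mysql_sample,
    technical_sample,
    hippietrail_sample,
    google_sample,
    remove_sample,
)
```

Here "die", "Harry" and "www" are dropped from the result, and "bg" has no effect because it is not present in any of the sample source lists.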

Resources
