Case sensitivity of page names

From Meta, a Wikimedia project coordination wiki

Article in need of improvement

Case sensitivity in MediaWiki is both a blessing and a curse. Sometimes case matters, and preserving case lets MediaWiki handle those few situations where it does. However, case sensitivity can also make a search fail. For example, if you look for the article BBQ by entering bbq, an unmodified MediaWiki will conclude that the page (bbq) does not exist and ask you to create a new page.


Wikimedia solution

One of the solutions used by Wikipedia is to keep the search keys in a separate table. If it works cleanly, it can be deployed without an expensive rebuild of the core tables, and dropped once Wikimedia gets a nicer backend through Lucene. On Wikipedia this is done by installing Extension:TitleKey, which usually solves most case-sensitivity problems that arise when making a search query.
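The idea of a separate key table can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module, not the code of Extension:TitleKey; the table and column names (page, titlekey, tk_page, tk_key) and the helper functions are assumptions made for the example.

```python
import sqlite3

def build_title_key(conn):
    """(Re)build a side table of case-folded keys without altering `page`."""
    conn.execute("DROP TABLE IF EXISTS titlekey")
    conn.execute("CREATE TABLE titlekey (tk_page INTEGER, tk_key TEXT)")
    conn.execute("CREATE INDEX tk_key_idx ON titlekey (tk_key)")
    rows = conn.execute("SELECT page_id, page_title FROM page").fetchall()
    conn.executemany(
        "INSERT INTO titlekey (tk_page, tk_key) VALUES (?, ?)",
        [(page_id, title.casefold()) for page_id, title in rows],
    )

def find_titles(conn, query):
    """Return every stored title whose folded key equals the folded query."""
    cur = conn.execute(
        "SELECT page_title FROM page JOIN titlekey ON tk_page = page_id "
        "WHERE tk_key = ? ORDER BY page_title",
        (query.casefold(),),
    )
    return [title for (title,) in cur]

# Hypothetical core table: titles are stored and compared case-sensitively.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT UNIQUE)"
)
conn.executemany(
    "INSERT INTO page (page_title) VALUES (?)", [("BBQ",), ("Main Page",)]
)
build_title_key(conn)
print(find_titles(conn, "bbq"))  # ['BBQ']
```

Because the folded keys live in their own table, that table can be rebuilt or dropped at any time while the core page table stays as it is.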

Auto redirect

Automatically redirect to a page that has the same spelling but different capitalization (that is, have the computer do the work of a disambiguation page when a title doesn't match an existing page). Extension:SaneCase works this way.

Negatives: Performance and possible search engine duplicate content penalties caused by MediaWiki's redirection mechanism.

Disambiguation page

Manually create disambiguation pages, then switch the wiki to case-preserving, case-insensitive titles.

Plus: performance might increase because there would be fewer pages, and perhaps a MySQL setting could be toggled to make searches faster too (because a case-insensitive search would find no duplicates).

Negatives: lots of manual labor

Possible Solution 3

  • Implementation would go as follows: an all-lowercase database table would hold each page name. If more than one capitalization of a page exists, the lookup should go to the page that matches exactly; if none matches, it should show a list of the pages with the same spelling and offer a way to create a new page with the requested spelling.

MySQL could probably handle the merge between the "multiple document table" and the unique/main table quickly.

Minuses: Hard to code?

Plusses: Effectively keeps case sensitivity.

There could be a bit of a speed up, because most pages would just have the lowercase lookup succeed...

Actually, this could allow for quicker searches, because the table would be unique, so once a match is found, the lookup doesn't have to look any further. If there are multiple pages, a flag could be set on the main table to indicate that there are multiple pages.
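The lookup described in this section could be sketched roughly as below. The TitleIndex class and its method names are hypothetical, and sending a request straight to the page when only one capitalization exists is an assumption read from "most pages would just have the lowercase lookup succeed".

```python
from collections import defaultdict

class TitleIndex:
    """In-memory stand-in for the proposed all-lowercase lookup table."""

    def __init__(self):
        # lowercased title -> list of titles exactly as they were created
        self._by_lower = defaultdict(list)

    def add(self, title):
        variants = self._by_lower[title.lower()]
        if title not in variants:
            variants.append(title)

    def has_multiple(self, title):
        """The 'multiple pages' flag: more than one capitalization exists."""
        return len(self._by_lower.get(title.lower(), [])) > 1

    def resolve(self, requested):
        variants = self._by_lower.get(requested.lower(), [])
        if requested in variants:
            return ("page", requested)          # exact capitalization exists
        if len(variants) == 1:
            return ("page", variants[0])        # only one spelling: go there
        if variants:
            return ("choose", sorted(variants)) # several: let the user pick
        return ("create", requested)            # nothing: offer to create

index = TitleIndex()
for title in ["BBQ", "Apple", "apple"]:
    index.add(title)

print(index.resolve("bbq"))    # ('page', 'BBQ')
print(index.resolve("APPLE"))  # ('choose', ['Apple', 'apple'])
print(index.resolve("Pear"))   # ('create', 'Pear')
```

Exact matches still win, so wikis that rely on case-distinct titles keep working, which is the sense in which this approach "effectively keeps case sensitivity".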

Possible Solution 4

  • A flag that says "title capitalization ok" could make the wiki renderer display the words of the title in title case.
  • Users could be nagged before creating a page whose title does not follow the standard convention for titles.
  • The redirect pages could be made..

Possible Solution 5 (Originally Solution 6)

Change the attributes of the table "cur" so that the field cur_title is not BINARY. Searches then return results regardless of how the actual page title is capitalized.
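The effect of a binary versus a non-binary title column can be illustrated with a small analogue. The sketch below does not use MySQL; it uses SQLite (through Python's sqlite3 module), where COLLATE BINARY and COLLATE NOCASE play a comparable role. The table cur_demo and the helper count_matches are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two copies of a title column: one compared byte for byte,
# one compared case-insensitively.
conn.execute(
    "CREATE TABLE cur_demo ("
    " title_binary TEXT COLLATE BINARY,"
    " title_nocase TEXT COLLATE NOCASE)"
)
conn.executemany(
    "INSERT INTO cur_demo VALUES (?, ?)",
    [("BBQ", "BBQ"), ("Ärger", "Ärger")],
)

def count_matches(column, value):
    """Count rows where the given title column equals `value`."""
    # The column name is interpolated only after checking it is one of
    # the two fixed names defined above.
    assert column in ("title_binary", "title_nocase")
    sql = f"SELECT COUNT(*) FROM cur_demo WHERE {column} = ?"
    return conn.execute(sql, (value,)).fetchone()[0]

print(count_matches("title_binary", "bbq"))    # 0: binary comparison
print(count_matches("title_nocase", "bbq"))    # 1: case is ignored
print(count_matches("title_nocase", "ärger"))  # 0: NOCASE folds ASCII only
```

The last line shows why this kind of change is language-sensitive: SQLite's built-in NOCASE collation only folds the ASCII letters, so which titles count as equal depends on the collation the database applies.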

Possible Solution 6 (Originally Solution 7)

You can alter the 'page' table so that the column for the page title is not case-sensitive. This solves the problem and is easy to implement, but it introduces several international caveats. This process is described in detail here:

Possible Solution 7 (Originally Solution 8)

Add a special page in the Maintenance section that lists pages whose titles vary only by capitalization, so that editors can deal with such cases as they occur. Sort the results on this special page by the minimum update date of the pages in each group, descending; this way editors can monitor the top of the list.
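A rough sketch of the report such a special page might produce is given below. The function name and the (title, last_updated) input format are assumptions, and the sort key reflects one reading of "the min of the date updated": each group is ranked by the oldest update among its pages, most recent first.

```python
from collections import defaultdict
from datetime import date

def case_variant_report(pages):
    """Group titles that differ only by capitalization.

    `pages` is an iterable of (title, last_updated) pairs. Groups are
    ordered so that the group whose oldest update is most recent comes first.
    """
    groups = defaultdict(list)
    for title, updated in pages:
        groups[title.casefold()].append((title, updated))
    clashes = [group for group in groups.values() if len(group) > 1]
    clashes.sort(
        key=lambda group: min(updated for _, updated in group),
        reverse=True,
    )
    return [sorted(title for title, _ in group) for group in clashes]

# Hypothetical page list for illustration.
pages = [
    ("BBQ", date(2005, 1, 10)),
    ("Bbq", date(2005, 3, 1)),
    ("Apple", date(2005, 4, 2)),
    ("apple", date(2005, 4, 20)),
    ("Main Page", date(2005, 5, 1)),
]
print(case_variant_report(pages))  # [['Apple', 'apple'], ['BBQ', 'Bbq']]
```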

Option for any solution: Per-site Preferences

Would it alleviate some of the pain of "global/forced" application if the case-sensitivity was enabled by default and disabled (on a per-site-install basis) per admin direction? Further, when presented with a "disable case sensitivity" interface (particularly if it were displayed from a "Preferences" page), the software could convey to the admin which languages the case-insensitivity option currently supports; and/or associated documentation could also speak to which languages were supported and which not.

(Sorry if I missed this already presented somewhere; also I'm new to MediaWiki, haven't found a site-wide "Preferences" place yet other than LocalSettings.php - MattEngland 22:04, 16 Apr 2005 (UTC))

The above post is really quite old, but I'm going to respond to it anyway. I think most people who set up MediaWiki don't care at all about case-sensitivity and [[page]] going to a different page than [[Page]]. Hence the default should be not case sensitive. I think that most people go for case sensitivity simply to be able to display the first letter in lowercase. --Romanski 21:29, 8 January 2007 (UTC)

Related code

LanguageUtf8.php and Utf8Case.php

There would need to be two tables for every 1, a lowercase one, and an uppercase one??

(actually, the database can already search case insensitive...) (there could be a problem with duplicate searches..)

Problems with the possible solutions

  • Have to be coded.
  • Non trivial
  • Language dependent
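The "language dependent" point can be made concrete with a few well-known casing pitfalls. The sketch below uses Python rather than MediaWiki's own code, and fold_title with its Turkish special case is a hypothetical helper written only for this illustration.

```python
def fold_title(title, language="en"):
    """Lowercase a title, with a hand-written special case for Turkish."""
    if language == "tr":
        # Turkish pairs I with dotless ı, and dotted İ with i.
        title = title.replace("I", "ı").replace("İ", "i")
    return title.lower()

print(fold_title("ISTANBUL"))        # istanbul
print(fold_title("ISTANBUL", "tr"))  # ıstanbul

# Case mapping is not always one character to one character:
print("ß".upper())                   # SS
print(len("İ".lower()))              # 2 (i followed by a combining dot)

# ...so a round trip through upper case does not always return the original.
print("straße".upper().lower())      # strasse
```

The same title can therefore fold to different keys depending on the language rules applied, which is part of why a simple lowercase comparison is non-trivial.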

Lazy IRC paste

From a conversation on IRC:

<MrDarkUser> TimStarling: I rooting for case-insensitive page lookup after case-sensitive lookup
<MrDarkUser> boy oh boy my spelling is bad.. s/I/I'm
<MrDarkUser> I don't like making tons of redirects, and I don't like miss capitalization of titles..
<MrDarkUser> ... Also... if page names were all stored in lower case.. , with the exception being case preserving.. there could be a performance increase
<MrDarkUser> because there are 1/2 as many characters to choose from when searching, and it would lower the number of pages in the wiki that are created just to deal with poor capitalization that have to get searched through...
<MrDarkUser> of course.. this all has to get coded...
<TimStarling> I don't think the performance issues would be significant
<TimStarling> actually...
<TimStarling> it would take longer to parse pages with lots of links, especially on UTF-8 wikis where there's no mb_string
<MrDarkUser> TimStarling: oh.. you are right that it would take longer to parse links... mb_string?
<MrDarkUser> I just assumed that the to_lower function was very fast
<TimStarling> it's kind of non-trivial, unfortunately
<TimStarling> and language-dependent
<MrDarkUser> How do other wiki's deal with it? (a question that I shall have to try to look up)
<MrDarkUser> mediawiki is the only wiki that I know of that is case sensitive.. and that almost kept me from using it
<TimStarling> see LanguageUtf8.php and Utf8Case.php
<MrDarkUser> language-dependent... hurm.. there are seporate wiki's for each language though? or at least namespaces..
<Kate-> they inherit LanguageUtf8