Development tasks/Archive

From Meta, a Wikimedia project coordination wiki

A number of things that really should get done. Somewhere between bugs and feature requests; these are things that work, but not quite right, or not quite well.

If you'd like to claim one, drop your name and date under it.

VERY OUT OF DATE

Tasks listed by type (beware, possibly outdated stuff)

Optimization

Watchlist

  • This should be faster if pages and talk pages are stored in their own rows (though still managed as pairs) and a timestamp is added, so very long watchlists (a couple thousand pages or more among some power users) don't have to be slurped in their entirety and sorted just to grab the last hundred off the top. Haven't yet tested this theory. Needs an index on title in order to make mass updates fast, too.
    • The forthcoming feature EmailNotification already makes use of such timestamps indicating the user's last visit (touch) to a page. --Nyxos 05:30, 6 Sep 2004 (UTC)
    • I also stopped the equal handling (bitwise masking etc.) of pages and their talk pages when these are watch-listed. The equal handling is a nice feature, but in my opinion it should not be done by bitwise masking but with two separate entries in the database (normal page and talk-page entry). In other words, I created a database entry for each namespace:page which is watch-listed. The changes in the code were extremely simple. The small change was necessary for a clean handling of notification timestamps. --Nyxos 21:42, 6 Sep 2004 (UTC)
    • md5(namespace:pagetitle) for watchlist and other entries, stored in a separate column of table watchlist, can be used as an index. A new table holds the correspondence between md5() and namespace:title. An additional use would be md5(namespace:pagetitle) . ";" . $rc_last_oldid (or other) as a version number, which can be used as an index to table cur. --Tom Gries (talk) [mail me] 22:38, 21 Oct 2004 (UTC)
      • An MD5 hash is too big for the proposed job. CRC32 or another 4-byte integer format will produce a much smaller index, and the increased cache hit rate (because more of it is in RAM) would compensate for any duplicates, particularly for sites with limited RAM or large databases. Using CRC32 for checking whether cur articles exist (for link coloring) would almost certainly be helpful and has been proposed by me for the change in cur in the next version of the MediaWiki software. It was also proposed by one of the MySQL performance experts. Jamesday 17:22, 26 Oct 2004 (UTC)
    • Splitting talk and article page notification will probably slow it down. To really speed up watchlist queries, one step has already been taken (removing a test which takes more time than the work it was trying to save). The following steps remain:
      • split text part of cur into a different table (kturner has already commenced working on this). This makes the size of cur records much smaller and helps most tasks involving cur.
      • completing testing of an optimised query and added index (I've done the development work on the query, but haven't fully benchmarked it). Jamesday 12:29, 1 Oct 2004 (UTC)
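
A minimal sketch of the per-row watchlist layout discussed above, using Python's sqlite3 as a stand-in for MySQL. The table, column and index names (watchlist, wl_user, wl_namespace, wl_title, wl_touched) are assumptions for illustration, not the actual MediaWiki schema, and the theory still needs benchmarking as noted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per watched page; a talk page gets its own row (hypothetical schema).
    CREATE TABLE watchlist (
        wl_user      INTEGER NOT NULL,
        wl_namespace INTEGER NOT NULL,
        wl_title     TEXT    NOT NULL,
        wl_touched   INTEGER NOT NULL,  -- timestamp of the last change to the page
        PRIMARY KEY (wl_user, wl_namespace, wl_title)
    );
    -- Lets the "latest N" query walk the index in order instead of sorting everything.
    CREATE INDEX wl_user_touched ON watchlist (wl_user, wl_touched);
    -- Index on title so a page edit can update all watchers' rows quickly.
    CREATE INDEX wl_page ON watchlist (wl_namespace, wl_title);
""")


def watch(user, title, touched=0):
    """Watch a page and its talk page as a pair of rows (namespaces 0 and 1)."""
    conn.executemany(
        "INSERT OR IGNORE INTO watchlist VALUES (?, ?, ?, ?)",
        [(user, 0, title, touched), (user, 1, title, touched)],
    )


def touch(namespace, title, timestamp):
    """Mass update: record a change to one page for all of its watchers."""
    conn.execute(
        "UPDATE watchlist SET wl_touched = ? "
        "WHERE wl_namespace = ? AND wl_title = ?",
        (timestamp, namespace, title),
    )


def latest(user, limit=100):
    """Return the most recently changed watched pages, newest first."""
    return conn.execute(
        "SELECT wl_namespace, wl_title, wl_touched FROM watchlist "
        "WHERE wl_user = ? ORDER BY wl_touched DESC LIMIT ?",
        (user, limit),
    ).fetchall()
```

The pair is still managed together in watch(), but each row carries its own timestamp, so a long watchlist is never read in full just to show the newest entries.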

Search

  • Boolean search mode may return results faster than non-boolean mode for multi-word searches.
    • We already use boolean search. Jamesday 12:29, 1 Oct 2004 (UTC)
  • Will it be more efficient if we remove the join to cur, and add the necessary dupe fields (redirect, namespace) for cutting down the search space in the searchindex table?
    • Doing both of these in temporary experimental code on en.wiki, but with a separate index database. Will need to test an integrated version where the fields are updated and it can all be in the same database, though still not joined.
    • This testing has already been done as part of my work on slow queries. Using only the existing searchindex table is far faster than using cur as well. Still to be done: integrating the query with the PHP code and arranging for load sharing to send searches to a different pool of database servers. Jamesday 12:29, 1 Oct 2004 (UTC)
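
A rough sketch of the join-free search described above, again with sqlite3 standing in for MySQL. The column names (si_namespace, si_is_redirect and so on) are hypothetical, and plain LIKE matching is swapped in for MySQL's full-text boolean mode purely to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE searchindex (
        si_page        INTEGER PRIMARY KEY,
        si_namespace   INTEGER NOT NULL,  -- duplicated from cur
        si_is_redirect INTEGER NOT NULL,  -- duplicated from cur
        si_title       TEXT NOT NULL,
        si_text        TEXT NOT NULL
    );
""")


def index_page(page_id, namespace, is_redirect, title, text):
    """Called on page save so the duplicated fields stay in step with cur."""
    conn.execute(
        "INSERT OR REPLACE INTO searchindex VALUES (?, ?, ?, ?, ?)",
        (page_id, namespace, int(is_redirect), title, text.lower()),
    )


def search(words, namespace=0):
    """Match all words (boolean AND) in non-redirect pages, with no join to cur."""
    if not words:
        return []
    clauses = " AND ".join("si_text LIKE ?" for _ in words)
    params = [namespace] + ["%" + word.lower() + "%" for word in words]
    sql = (
        "SELECT si_title FROM searchindex "
        "WHERE si_namespace = ? AND si_is_redirect = 0 AND " + clauses +
        " ORDER BY si_title"
    )
    return [row[0] for row in conn.execute(sql, params)]
```

Because namespace and redirect status live in searchindex itself, the search space is cut down without touching cur at all, which is what would let searches run on a separate pool of servers.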

Rendering

  • There is probably some nasty stuff called a billion times in tight loops in the replaceInternalLinks() function. It's dog slow.
  • Shared memory
    • There's a lot of setting of constants that's gonna get run on every darn request, even if we've got the PHP pre-parsed. It would likely be more efficient to grab things like interwiki lists, interface strings, and the UTF-8 conversion tables from per-wiki shared memory segments.
      • Also experimenting with memcached --Brion VIBBER
        • That went well and memcached is now in use for parser cache and logins. Jamesday 17:22, 26 Oct 2004 (UTC)

Profiling

  • There's some primitive profiling code already, but some better stats would help: total time spent in a function (across multiple calls) would be a useful thing. Also need more thorough coverage.
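
One possible shape for the "total time across multiple calls" statistic, sketched in Python rather than PHP. The decorator, the stats table and the sample function are all hypothetical; note that a recursive function would have its nested time counted more than once under this simple scheme.

```python
import functools
import time
from collections import defaultdict

# function name -> [number of calls, total seconds]
stats = defaultdict(lambda: [0, 0.0])


def profiled(func):
    """Accumulate call count and total wall-clock time for func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            entry = stats[func.__name__]
            entry[0] += 1
            entry[1] += time.perf_counter() - start
    return wrapper


def report():
    """Return (name, calls, total_seconds) rows, largest total first."""
    rows = [(name, calls, total) for name, (calls, total) in stats.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


@profiled
def replace_internal_links(text):
    """Hypothetical stand-in for a hot function worth measuring."""
    return text.replace("[[", "<a>").replace("]]", "</a>")
```

Sorting by total time rather than per-call time surfaces cheap functions that are called very often, which is the case a per-call timer tends to hide.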

Special pages

  • Orphans/Wantedpages
    • Create summary tables and update them in the page-save link update step (and other times when links are changed in ways we can more or less see). Alternatively, find a way to make the live queries usably fast.
      • The results are cached now. Generating them is still too slow - needs summary tables with optimised indexes. On my to do list is filling the cache table using optimised queries with summary and/or temporary tables. If that works well it'll prove the approach for the wiki PHP code. Jamesday 17:22, 26 Oct 2004 (UTC)
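
A sketch of the summary-table idea for Wantedpages, assuming a simplified schema invented for this example (page, links, link_count) and sqlite3 in place of MySQL. The counts are adjusted in the same transaction as the page save, so the special page only reads a small pre-aggregated table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page (title TEXT PRIMARY KEY);
    CREATE TABLE links (from_title TEXT NOT NULL, to_title TEXT NOT NULL,
                        PRIMARY KEY (from_title, to_title));
    -- Summary table: number of incoming links per target title.
    CREATE TABLE link_count (title TEXT PRIMARY KEY, links INTEGER NOT NULL);
    CREATE INDEX link_count_links ON link_count (links);
""")


def save_page(title, link_targets):
    """Save a page and apply the link update step to the summary table."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute("INSERT OR IGNORE INTO page VALUES (?)", (title,))
        old = {row[0] for row in conn.execute(
            "SELECT to_title FROM links WHERE from_title = ?", (title,))}
        new = set(link_targets)
        for target in new - old:  # links added by this edit
            conn.execute("INSERT INTO links VALUES (?, ?)", (title, target))
            conn.execute(
                "INSERT OR IGNORE INTO link_count VALUES (?, 0)", (target,))
            conn.execute(
                "UPDATE link_count SET links = links + 1 WHERE title = ?",
                (target,))
        for target in old - new:  # links removed by this edit
            conn.execute(
                "DELETE FROM links WHERE from_title = ? AND to_title = ?",
                (title, target))
            conn.execute(
                "UPDATE link_count SET links = links - 1 WHERE title = ?",
                (target,))


def wanted_pages(limit=50):
    """Most-linked titles that have no page yet, read from the summary table."""
    return conn.execute(
        "SELECT c.title, c.links FROM link_count AS c "
        "LEFT JOIN page AS p ON p.title = c.title "
        "WHERE p.title IS NULL AND c.links > 0 "
        "ORDER BY c.links DESC, c.title LIMIT ?",
        (limit,),
    ).fetchall()
```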

Reliability

  • There are many little race conditions, as we don't do all the locking and transaction checking we should.
    • Simultaneous edit bug
      • B and C are saved nearly simultaneously, A is the previous revision both are based on, and which is current when both make their edit conflict checks. Result is that while recentchanges shows A->B->C, old has two copies of A and cur holds C, so the contents of B are missing.
        • Tim Starling 13:06, 10 Jul 2003 (UTC). Sounds like fun. I'm thinking GET_LOCK()/RELEASE_LOCK().
        • A partial fix is in for simultaneous edits, which basically just duplicates the conflict check during the update, which itself is atomic. May leave some more exotic scenarios unchanged. --Brion VIBBER 05:31, 18 Aug 2003 (UTC)
  • Some conditions that should invalidate cached pages may not do so. Find and destroy!
    • Fixed: log pages not invalidated when updated
    • Fixed: image description pages not invalidated when new version uploaded
  • Declare war on user aborts! Whenever a number of subsequent DB/filesystem write operations take place, e.g. editing or moving, we need to prevent user aborts from stopping execution halfway. Possible approaches are by simply disabling them, or by putting in an abort function to roll back transactions. DO NOT begin a transaction without guarding it against user aborts.
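
A minimal sketch of repeating the conflict check inside the update itself, with the archive step and the update wrapped in one transaction so that an abort halfway rolls both back. The cur/old layout here is heavily simplified and the rev column is an assumption; sqlite3 stands in for MySQL, and GET_LOCK()/RELEASE_LOCK() would be a different approach to the same problem.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cur (id INTEGER PRIMARY KEY,
                      rev INTEGER NOT NULL, text TEXT NOT NULL);
    CREATE TABLE old (id INTEGER NOT NULL, rev INTEGER NOT NULL,
                      text TEXT NOT NULL, PRIMARY KEY (id, rev));
    INSERT INTO cur VALUES (1, 1, 'A');
""")


class EditConflict(Exception):
    """Raised when the base revision is no longer the current one."""


def save_edit(page_id, base_rev, new_text):
    """Save new_text only if base_rev is still current; return True on success."""
    try:
        with conn:  # commits on success, rolls back if anything raises
            # Archive the current text, but only if it is still the base revision.
            conn.execute(
                "INSERT INTO old (id, rev, text) "
                "SELECT id, rev, text FROM cur WHERE id = ? AND rev = ?",
                (page_id, base_rev),
            )
            # The WHERE clause repeats the conflict check as part of the update.
            cursor = conn.execute(
                "UPDATE cur SET rev = rev + 1, text = ? "
                "WHERE id = ? AND rev = ?",
                (new_text, page_id, base_rev),
            )
            if cursor.rowcount != 1:
                raise EditConflict(page_id)
    except EditConflict:
        return False
    return True
```

With B and C both based on revision A, whichever arrives second matches no row and is reported as a conflict, so old never receives a second copy of A.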

Usability

  • Fix login to test that cookies were set
  • Edit conflict prevention on page deletion
    • Has this been done? I forget
  • Use session cookie to store another cache epoch so we can redraw anon-style pages after user logs out
  • On the French Wikipedia, we have problems with articles whose names contain œ. In the PHP code, there is a comment that this sort of thing should be done some day ...
  • Sysops should be able to block logged in users. This has wide support, including from Jimbo.
  • Ability to view source of protected pages.
  • A large task: Case insensitivity for the "Go" and "Search" functions. This would require additional disambiguation pages to be created. See Case_insensitivity and Talk:Case_insensitivity and the more recent discussion at en Wiktionary (wikt:en:Wiktionary:Beer_parlour/case-sensitivity_vote), which reflects the current thinking of using case preservation for displayed titles and case ignoring for search, using a different article name field to remember one of the two versions. Jamesday 12:29, 1 Oct 2004 (UTC)

Code quality

  • Refactor duplicated code...
    • Special pages and the language files are particularly ugly in this respect.
  • Try to make some sense of the mess that is GlobalFunctions.php
  • Ensure that the test suite is working and up to date
  • Unit tests that test individual functions and classes in isolation may be good too

Documentation quality

  • Install needs a rewrite.
  • Currently the downloads are scattered across many places, out of date, and undocumented.

Request throttling

  • Request throttling of page views to protect against poorly designed, ill-intentioned, or malicious spiders draining server resources
    • Querybane now does this in a way, killing unusually slow queries only when the database servers are overloaded. Removing inefficiencies in the slow queries should eventually remove the problem. Jamesday 17:22, 26 Oct 2004 (UTC)
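
For per-client throttling of page views, a token bucket is one common technique. It is not what Querybane does, and the class name, parameters and client keys below are assumptions for illustration only.

```python
class Throttle:
    """Token bucket per client: `rate` requests per second, bursts up to `burst`."""

    def __init__(self, rate, burst):
        self.rate = float(rate)
        self.burst = float(burst)
        self.buckets = {}  # client -> (tokens remaining, time of last update)

    def allow(self, client, now):
        """Return True if the client may make a request at time `now` (seconds)."""
        tokens, last = self.buckets.get(client, (self.burst, now))
        # Refill in proportion to the elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.buckets[client] = (tokens - 1.0, now)
            return True
        self.buckets[client] = (tokens, now)
        return False
```

A request handler would call allow() with the client's IP address or user agent and the current time (for example time.monotonic()), and answer with an error page when it returns False. Ordinary readers stay well under the limit, while a spider fetching pages in a tight loop is cut off after its burst allowance.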

Feature requests

  • Namespaces that only those in a specific group can view or edit. For example, Admins: could be a namespace with info just for admins; you would need to be in a group that has access to the Admins namespace to view or edit articles in that namespace. --71.32.5.180 07:05, 9 Jun 2005 (UTC)
  • Article Milestone Version in Article History should be saved and extruded. 220.201.18.95 05:35, 2 May 2005 (UTC)
  • Does anyone have an idea how to implement an extension to support online creation and editing of Unified Modeling Language (UML) charts? I am thinking of something between an online bitmap editor and the existing Wikimedia extensions for math formulas (LaTeX) or hieroglyphs… (User:BjBlunck)
  • Another idea would be a real online bitmap collaboration tool for editing pictures. Is there any open source tool that could be implemented? (User:BjBlunck)
  • WYSIWYG editing and possibly Spell checker - WYSIWYG based on xhtml, source view and saving in wiki src
  • How about adding another bracket var 40,877,467 which would contain... well... the number of registered users? It could be implemented in an "off-line", "per-event" manner, just like the site stats: rather than executing a "SELECT COUNT(*) FROM USERS" SQL query every time the variable is requested, a counter would be updated on each user registration. Obviously a crontabbed script could take care of accuracy once per day, users who haven't contributed in more than six months could be counted out once in a while by setting a flag in the users table, etc. I'm advocating for this feature because showing an automated number of users could be helpful for localized Wikipedias where the progress in "number of registered users" is interesting to track publicly, within articles or even on the main page. I could implement the feature myself, I'm just not sure if you find it a good idea. --Gutza 00:51, 2 Sep 2003 (UTC)
  • In addition to a preview button at the edit stage, have a diff button, so you can see what you have already changed about the text. The code should already be there. Just add the button, eh? -- Cimon Avaro on a pogo stick 23:09, 9 Sep 2003 (UTC)
  • templates -- see Skins
  • Maybe it is already one of your goals, but why don't you use XML for saving articles? Interoperability and reusability are very important aspects of such a database. And separating the view from the contents enables lots of possibilities (administration, evolution, several views, etc.)
  • It'd be great to have a user-pref that enables an email to be sent to a user whenever any page in his/her watchlist is modified
    • See Enotif, which is part of 1.5 --MaPhi 10:56, 2005 May 4 (UTC)
  • How about adding ACLs for pages and/or namespaces? I'm working on a project that needs some kind of internal area, and we don't want to use two wikis.
  • I support the request for ACLs for pages. We use MediaWiki software in a company-internal wiki with great success. I could convince even more people if I had better control over write access to certain pages. Of course, projects like Wikipedia don't need this feature, but better acceptance of MediaWiki in companies would generate positive effects for other wiki projects.
  • I would like to second your requests for ACLs, as they would take MediaWiki to a new level of manageability. I've already implemented LDAP authentication on sign-in of a new user as a quick hack. Maybe a connection between LDAP and ACLs would be very pleasant. --Feffi 22:44, 5 Jan 2005 (UTC)
    • For a non-quick-hackish LDAP authentication plugin, check out my LDAP Authentication plugin. -- Ryan Lane
  • I'd love to see ACL support for pages. It could even be used on the Wikipedia sites as a solution for edit wars and for users who behave badly on some pages but are in general positive contributors. For instance, if an article is having a problem with edit wars between two or three editors, but has 5 or 6 other editors who are not causing problems, you could put a temporary ban on the three bad users for a set amount of time, still allowing the other editors to contribute. Also, if you have an editor who is in general a positive contributor, but just can't help himself vandalising the George Bush article, it would be possible to ban that user from editing the George Bush article, but let him/her edit elsewhere. -- Ryan Lane
  • ACLs would be a cool feature, better still if they can be controlled and configured from a browser
  • category flatten (--grin 21:26, 16 Feb 2005 (UTC))
  • Tools to rotate and stroke pictures (without reloading them)
  • Including a named section of a template or article (for horizontal views of changing articles -- useful for, say, taking the "Next Steps" sections out of specified project pages, or making an Upcoming Events page by including the Events section from several group pages.)
  • Enhanced search engine
    • with spell checker
    • with orthographic and possibly synonym suggestions
    • with ISO Latin accent filtering
  • I would like to edit the navigation toolbar and toolbox, and also add more toolbars, as is done in TWiki
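
The per-registration counter proposed above for the number of registered users could look roughly like this. The table and column names (users, site_stats, ss_users) and the function names are assumptions for illustration, with sqlite3 standing in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY,
                        user_name TEXT UNIQUE NOT NULL);
    -- Single-row statistics table holding the running total.
    CREATE TABLE site_stats (ss_row INTEGER PRIMARY KEY CHECK (ss_row = 1),
                             ss_users INTEGER NOT NULL);
    INSERT INTO site_stats VALUES (1, 0);
""")


def register_user(name):
    """Create the account and bump the counter in the same transaction."""
    with conn:
        conn.execute("INSERT INTO users (user_name) VALUES (?)", (name,))
        conn.execute("UPDATE site_stats SET ss_users = ss_users + 1")


def number_of_users():
    """Cheap read used when the variable is displayed; no COUNT(*) involved."""
    return conn.execute("SELECT ss_users FROM site_stats").fetchone()[0]


def recount():
    """Periodic correction (for example from cron) in case the counter drifted."""
    with conn:
        conn.execute(
            "UPDATE site_stats SET ss_users = (SELECT COUNT(*) FROM users)")
```

Page views only ever read the single counter row; the full count runs once per correction pass instead of once per request.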

Recent list of tasks, unsorted

Tasks which really ought to be done

  • parser speedup and performance (wiki syntax description): 3
  • database schema redesign to allow for faster queries and to have a consistent CUR and OLD table scheme, important for directly linking to specific revisions which currently is only possible in the OLD table : 4
  • systematic review of all database queries for performance and scalability : 1
  • db performance (undefined) : 3
  • full internationalization of mediawiki (i.e. multiple languages in one installation, both in terms of content and UI; completely redesign interlanguage links to avoid massive redundancy between wikis) : 1
  • per-user language setting (allowing access to en.wikipedia.org with a French interface, for example) : 1
  • Clean up the LanguageXX.php files so that all translatable text is stored in the MediaWiki: namespace; this is needed, for example, for a user-selectable UI language : 1
  • image sharing across projects : 1
  • single sign-on (Wikimedia Commons) : 2
  • redesign file uploading and image pages; current image page system is unintuitive and redundant (captions duplicated on pages and in text) : 1
  • To have (more or less) public rsync for the images and bring up new projects like wikicommons or Wikipedia for deaf : 1
  • instead of storing every revision of every page, store incremental diffs with interleaved complete revisions as CVS and other version control systems do in order to decrease storage space by an order of magnitude or two : 1
  • redesign discussion system from the ground up so that talk pages, mailing lists and forum can be synthesized into one system : 1
  • implement web of trust system in order to allow for user-based filtering of RC : 1
  • Switching all projects to UTF-8 : 1
  • real-time fundraising system for all Wikimedia sites : 1
  • Opening up WikiMedia source base by improving documentation (in source code, but also overall design spec) : 1
  • Easy mirroring worldwide for safer data : 1
  • Clean refactoring and rewrite : 1
  • Fixing bugs from sourceforge bug list : 1
  • structuring/prioritising feature requests : 1
  • The key database tasks include:
    • Splitting article titles from cur and eventually old, so that moves made by page-move vandals do not take 800 or more seconds (the actual time for Village Pump was 850-900 seconds on the fastest database server).
    • Splitting the article text from old so it can be stored in a different database (and on a different database server, with big, cheap disks). This also speeds up page history queries even with current code.
    • Adding article id from cur to an old table, for faster searching/joining.
    • Adding per-article revision number to old for quickly going back to the revision 10,000 revisions ago instead of having to read all 10,000.
    • Making (article id, article revision) the primary key for old. MySQL's InnoDB storage engine stores records clustered in primary key order. At the moment, going back 10,000 entries requires about 10,000 disk reads because every version is in a different database page. The changed primary key will group close entries for the same article together, dramatically reducing the number of reads required.
    • Removing article text from cur to make that record smaller and make more efficient use of server RAM (fitting more records in it). The text is not often used compared to the rest of the record.
    • Adding a key cur_namespace, cur_is_redirect, cur_title to cur, a combination which is used by some significant slow queries and which could replace the use of the current cur_namespace, cur_title key (by using is_redirect in (0,1)) to reduce the amount of index data to be cached. This should ideally be part of a review of every query using cur, to determine the optimum set of indexes.
    • Load sharing so that the search queries can all be directed to a different pool of database servers, optimised for that job (different memory allocation for search and other tasks). The key search speed factor is how much of the indexes is in RAM; normal servers can't allocate enough because they have to do dual duty.
    • Completing the work on the faster watchlist query.
    • Having someone who pays attention to database tuning review every query before production - some very inefficient queries have been written. Jamesday 12:29, 1 Oct 2004 (UTC)
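
A sketch of several of the database tasks above taken together: revision text split into its own table, a per-article revision number, and (article id, revision) as the primary key of old. SQLite's WITHOUT ROWID tables store rows in primary key order, which is used here as an analogue of InnoDB's clustering; all table and column names are assumptions, not the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Revision metadata, stored in (article, revision) order.
    CREATE TABLE old (
        old_article   INTEGER NOT NULL,
        old_rev       INTEGER NOT NULL,  -- per-article revision number
        old_timestamp INTEGER NOT NULL,
        old_text_id   INTEGER NOT NULL,
        PRIMARY KEY (old_article, old_rev)
    ) WITHOUT ROWID;
    -- Bulky text kept apart so it could live in another database or server.
    CREATE TABLE old_text (text_id INTEGER PRIMARY KEY, text TEXT NOT NULL);
""")


def add_revision(article, text, timestamp):
    """Store a revision and return its per-article revision number."""
    with conn:
        text_id = conn.execute(
            "INSERT INTO old_text (text) VALUES (?)", (text,)).lastrowid
        rev = conn.execute(
            "SELECT COALESCE(MAX(old_rev), 0) + 1 FROM old "
            "WHERE old_article = ?", (article,)).fetchone()[0]
        conn.execute("INSERT INTO old VALUES (?, ?, ?, ?)",
                     (article, rev, timestamp, text_id))
    return rev


def revision_back(article, steps):
    """Fetch the text `steps` revisions before the latest by key lookup."""
    row = conn.execute(
        "SELECT t.text FROM old AS o "
        "JOIN old_text AS t ON t.text_id = o.old_text_id "
        "WHERE o.old_article = ? AND o.old_rev = "
        "(SELECT MAX(old_rev) FROM old WHERE old_article = ?) - ?",
        (article, article, steps),
    ).fetchone()
    return row[0] if row else None


def history(article, limit=50):
    """Page history from metadata only; the text table is never touched."""
    return conn.execute(
        "SELECT old_rev, old_timestamp FROM old WHERE old_article = ? "
        "ORDER BY old_rev DESC LIMIT ?", (article, limit)).fetchall()
```

Going back N revisions becomes a single key lookup instead of a walk through N rows, and history listings read only the small metadata rows.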

Tasks in which people would like to get involved

  • Parser speedup : MMA
  • external editor / WYSIWYG support : EMO
  • make template syntax consistent and fix bugs : EMO
  • "watch newly created articles" : EMO
  • per page bans : EMO
  • cookie-based blocking : EMO
  • search in page histories : EMO
  • protected page redesign... : EMO
  • Database redesign : Timwi, Jamesday
  • New database schema : Has, Jamesday
  • Server administration : Dam, Jamesday
  • Bugzilla : Timwi
  • Finish geographical module : TAW


See also: Non-development tasks