Community Tech/Edit summary length for non-Latin languages

From Meta, a Wikimedia project coordination wiki

This page documents a project the Wikimedia Foundation's Community Tech team has worked on or declined in the past. Technical work on this project is complete.

We invite you to join the discussion on the talk page. You may track this project's progress on T6715.

The Edit summaries for non-Latin languages project aims to give contributors in non-Latin languages more available space for writing edit summaries. Edit summaries are measured in bytes, rather than characters, and in the UTF-8 encoding only the basic Latin (ASCII) characters, including the 26 letters of the English alphabet, fit in a single byte.

In non-Latin languages -- Cyrillic, Semitic, CJK, Indic and so on -- each character takes up more than one byte, which means that edit summaries have to be shorter. For example, English letters take up 1 byte, Latin letters with diacritics like ñ and Cyrillic letters like Ж take up 2 bytes, and CJK characters like 中 take up 3 bytes. On Russian Wikipedia, an edit summary therefore fits only about half as many characters as on English Wikipedia, and on CJK wikis about a third, which means contributors have to condense quite a bit, often abbreviating nearly every word.
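The per-character byte cost can be checked directly. The following is a minimal sketch, assuming summaries are stored as UTF-8 and using the 255-byte limit mentioned later on this page; the helper names `utf8_len` and `max_chars_in` are made up for illustration.

```python
OLD_LIMIT_BYTES = 255  # old edit summary limit, as stated in the status notes


def utf8_len(text: str) -> int:
    """Return the number of bytes `text` occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def max_chars_in(sample_char: str, limit_bytes: int = OLD_LIMIT_BYTES) -> int:
    """How many copies of `sample_char` fit into `limit_bytes` bytes."""
    return limit_bytes // utf8_len(sample_char)


if __name__ == "__main__":
    # One sample character each from several scripts.
    for ch in ["a", "ñ", "Ж", "中"]:
        print(ch, utf8_len(ch), "byte(s);", max_chars_in(ch), "fit in the old limit")
```

Running the script prints the byte cost of each sample character and how many of them fit under a byte-based limit.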

This project was the #2 request on the 2016 Community Wishlist Survey.

The feature has been released to all wikis in all languages and extends the edit summary length to 1,000 characters.

Status

March 7, 2018

The user interface changed on March 1, 2018 to allow for edit summaries of 1,000 characters on all wikis in all languages. This change was not widely announced, and some discussion occurred on wikis about whether it should be reverted, altered, or left as-is.

Given that this change has been released, I think the best discussion we can have is around "what effect are extended edit summaries having on wikis?" and "what, if anything, should we do to respond to the change?" Reverting this change will have adverse side effects (technical and likely social), and I'd like to avoid any unnecessary whiplash. As a product manager, I think reverting will do more harm than good, and we should not make any changes until we are sure they will benefit the majority of users.

Potential changes in the future:

  • Leave as-is at 1,000 characters (simplest & most preferred technical solution)
  • Limit the edit summary to 500 characters on all wikis — phab:T188798
  • Implement visual truncation for long edit summaries on recent changes, histories, and other log pages. — phab:T6717 (Product manager note: I think these are best handled as gadgets or user scripts.)

March 2, 2018

More backend cleanup work remains, related changes have been proposed, and some extensions may still need conversion to the new system. Even so, the request as stated here (increase the length of edit summaries) seems to be complete, deployed, and even complained about on enwiki.

November 28, 2017

This project was picked up by the MediaWiki Platform Team in February 2017, partially motivated by this long-standing request but pushed to the forefront by the need to reduce the size of the revision table to ease future schema changes. Code is written and is in the process of being deployed; the tracking task for the deployment is T166733. Current ETA for the schema change to be done everywhere is mid-February, after which we could enable use of the higher limit on all wikis. We'll likely enable it on testwiki and mediawiki.org for more testing in early or mid December, once the schema change is done for those wikis (ETA early December).

The code is already live for testing on the Beta Cluster. The new limit is currently set at 1,000 Unicode characters. Note that the web UI has not been updated with the new limits, so the edit summary field still supports only the old limit of 255 bytes, but edits via the API can use the full limit. Patches from developers familiar with the way these limits are applied in the web UI would be appreciated.
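The difference between the two kinds of limit can be illustrated with a short sketch. This is not the MediaWiki implementation; it only assumes UTF-8 storage, treats one Python string element as one Unicode character (code point), and uses hypothetical function names.

```python
OLD_LIMIT_BYTES = 255   # old limit, measured in bytes
NEW_LIMIT_CHARS = 1000  # new limit, measured in Unicode characters


def truncate_bytes(summary: str, limit: int = OLD_LIMIT_BYTES) -> str:
    """Cut `summary` so its UTF-8 encoding fits in `limit` bytes."""
    encoded = summary.encode("utf-8")
    if len(encoded) <= limit:
        return summary
    # errors="ignore" drops a trailing partial multi-byte sequence,
    # so the result never ends in half a character.
    return encoded[:limit].decode("utf-8", errors="ignore")


def truncate_chars(summary: str, limit: int = NEW_LIMIT_CHARS) -> str:
    """Cut `summary` to at most `limit` code points, whatever their byte size."""
    return summary[:limit]


if __name__ == "__main__":
    cyrillic = "Ж" * 200
    print(len(truncate_bytes(cyrillic)), "characters survive the byte limit")
    print(len(truncate_chars(cyrillic)), "characters survive the character limit")
```

Under the byte-based limit the number of surviving characters depends on the script; under the character-based limit it does not.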

March 8, 2017

Brion Vibber is planning to work on this, breaking out the comment field as part of a larger update to the revision tables. There are notes on mediawiki.org: Compacting the revision table round 2 and in the minutes of a Feb 2017 ArchCom meeting: RFC meeting.

March 1, 2017

Important question to ask Jaime: Can we expand the comments field to 760 bytes if we create a client-side limit of 256 characters for Latin languages? That would increase the load on that table, but not nearly as much as if we let everybody post 760 byte summaries.

Also, Daniel Kinzler says that Structured data/Multi-content revisions won't touch the comments table -- if that's the route we take, then either CommTech or Editing will have to do it. It would be complicated to implement, because any code that touches comments would have to be refactored to use the new DB schema. Tool Labs tools would need to be updated, and we'd have a migration period when we notify people. It would be much easier and less headachey to change the main revision table and add the client-side limit, as described above.

February 27, 2017

Increasing the length of the edit summary column in the revision table isn't practical; making the change (without taking down the site) would be exceedingly difficult, and once it was done, it would make a table that is already too big even bigger.

The solution that we'll use is to create a new table that's just for comments, and link each row in the existing revision table to its entry in the comments table.
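As a rough illustration of that layout, here is a minimal sketch using an in-memory SQLite database. The table and column names (`edit_comment`, `rev_comment_id`, and so on) are invented for this example and are not the actual MediaWiki schema; the point is only that the revision row stores a small ID while the variable-length text lives in its own table.

```python
import sqlite3
from typing import Optional


def build_demo() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical two-table layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE edit_comment (
            comment_id   INTEGER PRIMARY KEY,
            comment_text TEXT NOT NULL
        );
        CREATE TABLE revision (
            rev_id         INTEGER PRIMARY KEY,
            rev_comment_id INTEGER NOT NULL REFERENCES edit_comment(comment_id)
        );
        """
    )
    return conn


def add_revision(conn: sqlite3.Connection, rev_id: int, summary: str) -> None:
    """Store the summary text, then link the revision row to it by ID."""
    cur = conn.execute(
        "INSERT INTO edit_comment (comment_text) VALUES (?)", (summary,)
    )
    conn.execute(
        "INSERT INTO revision (rev_id, rev_comment_id) VALUES (?, ?)",
        (rev_id, cur.lastrowid),
    )


def get_summary(conn: sqlite3.Connection, rev_id: int) -> Optional[str]:
    """Look up a revision's summary by joining the two tables."""
    row = conn.execute(
        "SELECT c.comment_text FROM revision r "
        "JOIN edit_comment c ON c.comment_id = r.rev_comment_id "
        "WHERE r.rev_id = ?",
        (rev_id,),
    ).fetchone()
    return row[0] if row else None
```

With this shape, raising the maximum summary length only affects the separate comment table, while the large revision table keeps a fixed-size reference.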

This will happen as part of the Structured data on Commons project, led by Wikimedia Deutschland and the WMF Multimedia team. They're currently working on multi-content revisions, which will involve refactoring the revision table and creating the new comments table. This will be deployed on Commons first, to support Structured data, and then rolled out on other wikis. We can prioritize non-Latin languages in that rollout process, so the languages that need longer edit summaries will get them.

We don't want to encourage Latin languages to post 3x longer edit summaries, because edit summaries aren't intended to be a primary communication method. So we'll put a limit on the size -- probably 250 characters, rather than 250 bytes, which in Latin languages would mean no change at all. This will put non-Latin and Latin languages on par for edit summary length.

The Architecture Committee has met and agreed that this is the best way forward.

December 19, 2016

The relevant ticket is T6715. Bawolff wrote a patch in 2015 that raises the limit from 255 bytes to 767. This requires a major update to the main revision table, which is enormous and very hard to change. There isn't an easy way to do this -- it's the kind of thing where the plan includes "all big wikis are read-only for 2-5 days."

We were hoping that this could be done only for the affected languages -- Russian, Chinese, Hebrew, etc. -- but the DBA, Jaime Crespo, says this is a no-go. It would create a fork that we would have to keep updated; every change would have to take the difference into account.

It's possible that the way to solve this (and other issues) is to shard the database, i.e. split it into parts so that database updates can run much faster. The problem is much bigger than just this particular request; successfully sharding the database could solve a number of problems. Still, it's a big investment, and not easily done.

Community Tech's role in this project will probably be advocating for a solution, rather than actually doing technical work. This is the #2 request on the wishlist survey, which demonstrates how important this change is, and gives us a voice in these discussions. We'll make sure this isn't forgotten, and we'll be responsible for reporting back to the community on what's going on.

There will be discussions about this problem at the Wikimedia Developer Summit in early January; hopefully we'll have an update after the summit.