Community Tech/Edit summary length for non-Latin languages
The Edit summaries for non-Latin languages project aims to give contributors in non-Latin languages more available space for writing edit summaries. Edit summaries are measured in bytes rather than characters, and in the UTF-8 encoding only the basic Latin characters, such as the 26 letters of the English alphabet, fit in a single byte.
In non-Latin languages -- Cyrillic, Semitic, CJK, Indic and so on -- each character takes up more than one byte, which means that edit summaries have to be shorter. For example, English letters take up 1 byte, Latin letters with diacritics like ñ and Cyrillic letters like Ж take up 2 bytes, and CJK characters like 中 take up 3 bytes. On Russian Wikipedia, edit summaries are about half the size of English WP's, and on Chinese Wikipedia about 1/3, which means contributors have to condense quite a bit -- in this ex., the ed. has abbrev. ev. wd.
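A quick way to check the byte cost of a given character is to encode it as UTF-8 and count the bytes. The sketch below assumes summaries are stored as UTF-8 and uses the 255-byte limit mentioned elsewhere on this page; the helper names `utf8_len` and `max_repeats` are made up for illustration and are not part of MediaWiki.

```python
# Illustrative only: helper names are hypothetical, not MediaWiki code.
OLD_LIMIT_BYTES = 255  # byte limit for edit summaries cited on this page


def utf8_len(text: str) -> int:
    """Return the number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def max_repeats(ch: str, budget_bytes: int = OLD_LIMIT_BYTES) -> int:
    """Return how many copies of a single character fit in the byte budget."""
    return budget_bytes // utf8_len(ch)


if __name__ == "__main__":
    for ch in ["a", "ñ", "中"]:
        print(ch, utf8_len(ch), "byte(s) each;", max_repeats(ch), "fit in the limit")
```

Any other character can be passed to `utf8_len` in the same way to see how much of the byte budget it consumes.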
We can't change the way Unicode works, but it may be possible to increase the length of the edit summary. This will require some complicated database changes that could seriously impact performance while the change is being made. There are discussions going on with the database administrator about how to handle this. Current options include:
- Apply a patch increasing the size of the comment field, knowing it is a time bomb.
- Move comments to a separate table and patch the code to avoid querying comments wherever possible. This would be a quick patch rather than a proper fix, but at least it resolves the current ticket while other refactoring may be under way (avoiding analysis paralysis).
- Wait for a proper refactoring of the revision table, which may take more time and may even reach the same conclusion as the first option; but after all, if we have waited so many years, waiting some months is not that bad.
This project was the #2 request on the 2016 Community Wishlist Survey.
November 28, 2017
This project was picked up by the MediaWiki Platform Team in February 2017, partially motivated by this long-standing request but pushed to the forefront by the need to reduce the size of the revision table to ease future schema changes. Code is written and is in the process of being deployed; the tracking task for the deployment is T166733. Current ETA for the schema change to be done everywhere is mid-February, after which we could enable use of the higher limit on all wikis. We'll likely enable it on testwiki and mediawiki.org for more testing in early or mid December, once the schema change is done for those wikis (ETA early December).
The code is already live for testing on the Beta Cluster. The new limit is currently set at 1000 Unicode characters. Note that the web UI has not been updated with the new limits, so the edit summary field still enforces the old limit of 255 bytes, but edits via the API can use the full limit. Patches from developers familiar with the way these limits are applied in the web UI would be appreciated.
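The difference between the two limits can be sketched as follows. This is not the MediaWiki implementation; the function names are hypothetical, and the sketch assumes that "Unicode characters" means code points, which is what Python's `len()` counts for a `str`.

```python
# Illustrative only: not MediaWiki code. Limits are the figures quoted above.
OLD_LIMIT_BYTES = 255   # old limit, measured in UTF-8 bytes
NEW_LIMIT_CHARS = 1000  # new limit, measured in Unicode characters (code points)


def fits_old_limit(summary: str) -> bool:
    """True if the summary fits the old byte-based limit."""
    return len(summary.encode("utf-8")) <= OLD_LIMIT_BYTES


def fits_new_limit(summary: str) -> bool:
    """True if the summary fits the new character-based limit."""
    return len(summary) <= NEW_LIMIT_CHARS


if __name__ == "__main__":
    summary = "中" * 100  # 100 characters, 300 bytes in UTF-8
    print(fits_old_limit(summary), fits_new_limit(summary))
```

A 100-character CJK summary is rejected by the byte-based check but accepted by the character-based one, which is the situation the web UI and the API are currently in.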
March 8, 2017
Brion Vibber is planning to work on this, breaking out the comment field as part of a larger update to the revision tables. There are notes on mediawiki.org: Compacting the revision table round 2 and in the minutes of a Feb 2017 ArchCom meeting: RFC meeting.
March 1, 2017
Important question to ask Jaime: Can we expand the comments field to 760 bytes if we create a client-side limit of 256 characters for Latin languages? That would increase the load on that table, but not nearly as much as if we let everybody post 760 byte summaries.
Also, Daniel Kinzler says that Structured data/Multi-content revisions won't touch the comments table -- if that's the route we take, then either CommTech or Editing will have to do it. It would be complicated to implement, because any code that touches comments would have to be refactored to use the new DB schema. Tool Labs tools would need to be updated, and we'd have a migration period when we notify people. It would be much easier and less headachey to change the main revision table and add the client-side limit, as described above.
February 27, 2017
Increasing the length of the edit summary column in the revision table isn't practical; making the change (without taking down the site) would be exceedingly difficult, and once it was done, it would make a table that is already too big even bigger.
The solution that we'll use is to create a new table that's just for comments, and link the existing rev table to the comments table for each item.
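The idea of a separate comments table linked from the revision table can be sketched with an in-memory SQLite database. Everything here is an assumption made for illustration: the table and column names (`comment`, `comment_text`, `rev_comment_id`) and the choice to store each distinct summary only once are hypothetical and are not taken from the actual schema.

```python
import sqlite3

# Illustrative only: table and column names are hypothetical.
SCHEMA = """
CREATE TABLE comment (
    comment_id   INTEGER PRIMARY KEY,
    comment_text TEXT NOT NULL UNIQUE
);
CREATE TABLE revision (
    rev_id         INTEGER PRIMARY KEY,
    rev_comment_id INTEGER NOT NULL REFERENCES comment(comment_id)
);
"""


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with the two linked tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_revision(conn: sqlite3.Connection, summary: str) -> int:
    """Store the summary (once per distinct text) and link a new revision to it."""
    conn.execute(
        "INSERT OR IGNORE INTO comment (comment_text) VALUES (?)", (summary,)
    )
    (comment_id,) = conn.execute(
        "SELECT comment_id FROM comment WHERE comment_text = ?", (summary,)
    ).fetchone()
    cur = conn.execute(
        "INSERT INTO revision (rev_comment_id) VALUES (?)", (comment_id,)
    )
    return cur.lastrowid


def summary_for(conn: sqlite3.Connection, rev_id: int):
    """Look up a revision's summary by joining through the comment table."""
    row = conn.execute(
        "SELECT c.comment_text FROM revision r "
        "JOIN comment c ON c.comment_id = r.rev_comment_id "
        "WHERE r.rev_id = ?",
        (rev_id,),
    ).fetchone()
    return row[0] if row else None
```

With this layout the revision row only carries a small integer reference, so the length of the summary text no longer affects the size of the revision table itself.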
This will happen as part of the Structured data on Commons project, led by Wikimedia Deutschland and the WMF Multimedia team. They're currently working on multi-content revisions, which will involve refactoring the revision table and creating the new comments table. This will be deployed on Commons first, to support Structured data, and then rolled out on other wikis. We can prioritize non-Latin languages in that rollout process, so the languages that need longer edit summaries will get them.
We don't want to encourage Latin languages to post 3x longer edit summaries, because edit summaries aren't intended to be a primary communication method. So we'll put a limit on the size -- probably 250 characters, rather than 250 bytes, which in Latin languages would mean no change at all. This will put non-Latin and Latin languages on par for edit summary length.
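The effect of limiting by characters rather than bytes can be sketched like this. The 250 figure is the tentative number from the paragraph above; the function names are hypothetical, and the byte-based version assumes UTF-8 storage and drops any character that would be cut in half at the limit.

```python
# Illustrative only: function names are hypothetical, not MediaWiki code.
LIMIT = 250  # tentative figure from the text above


def truncate_bytes(text: str, max_bytes: int = LIMIT) -> str:
    """Cut the text to at most max_bytes of UTF-8 without splitting a character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" discards an incomplete multi-byte sequence at the cut point.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


def truncate_chars(text: str, max_chars: int = LIMIT) -> str:
    """Cut the text to at most max_chars Unicode characters (code points)."""
    return text[:max_chars]


if __name__ == "__main__":
    latin = "a" * 300
    cjk = "中" * 300
    print(len(truncate_bytes(latin)), len(truncate_chars(latin)))
    print(len(truncate_bytes(cjk)), len(truncate_chars(cjk)))
```

For text made only of basic Latin letters the two functions return the same result, while for multi-byte scripts the character-based limit keeps the full 250 characters.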
The Architecture Committee has met and agreed that this is the best way forward.
December 19, 2016
The relevant ticket is T6715. Bawolff wrote a patch in 2015 that raises the limit from 255 bytes to 767 bytes. This requires a major update to the main revision table, which is enormous and very hard to change. There isn't an easy way to do this -- it's the kind of thing where the plan includes "all big wikis are read-only for 2-5 days."
We were hoping that this could be done only for the affected languages -- Russian, Chinese, Hebrew, etc. -- but the DBA, Jaime Crespo, says this is a no-go. It would create a fork that we would have to keep updated; every change would have to take the difference into account.
It's possible that the way to solve this (and other issues) is to shard the database, i.e. split it into parts so that database updates can run much faster. The problem is much bigger than just this particular request; successfully sharding the database could solve a number of problems. Still, it's a big investment, and not easily done.
Community Tech's role in this project will probably be advocating for a solution, rather than actually doing technical work. This is the #2 request on the wishlist survey, which demonstrates how important this change is, and gives us a voice in these discussions. We'll make sure this isn't forgotten, and we'll be responsible for reporting back to the community on what's going on.
There will be discussions about this problem at the Wikimedia Developer Summit in early January; hopefully we'll have an update after the summit.