User:Krenair/Wikimania 2017


What I got up to at Wikimania 2017, and some important things I learnt – by Alex Monk

I attended Wikimania 2017 (including the hackathon pre-conference) on a scholarship from the Wikimedia Foundation. Note that while I have done some work for them as a contractor in the past, I am not one right now, so these are entirely my own personal views.

Hackathon

Gadget stuff - jQuery migration project – phab:T169385

Along with all the extension commits you see on that ticket, I went across all our public wikis and performed over 750 edits to pages in the MediaWiki: namespace, to try to ensure that the hundreds of gadgets found across public Wikimedia wikis continue to function after the next jQuery upgrade. To make these edits I used a tool named Tourbot, written by Timo Tijhof. Tourbot is configured with regular-expression patterns and their replacements, and is given a list of pages across public Wikimedia wikis to look at (generally provided by a server administrator, as they have access to a powerful global search tool known as mwgrep). It then goes through each page and presents a series of suggested changes to the user, who can approve a change, skip a particular change, or skip the page altogether. It’s generally only useful for people with the global editinterface flag. I made some additions to the patterns and also found and fixed several problems with the tool along the way.
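
To give a sense of how that works, here is a minimal sketch of the pattern/replacement idea. Tourbot itself is written in JavaScript; this Lua fragment is purely illustrative, and the pattern shown is just one example of a jQuery-migration-style rewrite (the .size() method was removed in jQuery 3 in favour of the .length property).

  -- Purely illustrative sketch; not Tourbot's actual code or configuration.
  -- Each rule pairs a search pattern with its replacement.
  local rules = {
      { find = '%.size%(%)', replace = '.length' },
  }

  local function suggestChanges( pageText )
      for _, rule in ipairs( rules ) do
          local newText, count = pageText:gsub( rule.find, rule.replace )
          if count > 0 then
              -- The real tool shows each proposed change to the user, who can
              -- approve it, skip that change, or skip the page altogether.
              pageText = newText
          end
      end
      return pageText
  end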

While making these edits, I stumbled across azbwiki, the South Azerbaijani Wikipedia (distinct from azwiki, the Azerbaijani Wikipedia). The reason this wiki stood out is that I didn’t seem to be able to make any edits – every single edit we made hit an AbuseFilter. It turned out that some faulty abuse filters were, seemingly unintentionally, blocking all actions by users with low local edit counts. One was detecting anything containing a particular character that looks like an English full stop (if the edit count was below 20), and another was using a regex alternation against the empty string (if the edit count was below 500) – an empty alternative matches any input, so that filter matched everything. Timo and I found someone at Wikimania we could trust to translate the relevant text for us, and then we fixed things to allow editing by non-locally-established users again.

Huggle

When I joined Wikimedia in 2012, one of the first activities I got involved in was anti-vandalism patrolling, using a tool named Huggle. It’s been rewritten a few times since then, luckily into a language I can make some basic contributions in (C++; it used to be Visual Basic). I tried to contribute a little earlier this year, and a bit while at Wikimania. I still feel I should do more for this project in future, but it has an unfortunate habit of falling down my priority list, especially as I now have a full-time job.

Here are a few of the things I worked on:

I hope to continue contributing to Huggle as time permits. Documenting what the different translatable texts do, so they can be more easily translated, is next on my list – currently, for most strings, all translators have to work with is the English text and the other translations, without the level of documentation that projects such as MediaWiki require and expect as a minimum standard before merging.

Other

  • I worked with Kunal Mehta, Derk-Jan Hartman, and Nick Wilson to gather and write down our largely unwritten expectations of new MediaWiki extensions.
  • I sat at the help desk for a while. We had a few people come and ask basic things like where to find a particular person, and I also provided some technical help to someone working on code that I believe was related to http://tools.wmflabs.org/etytree/
  • At the TechCom charter presentation, we discussed some interesting questions, especially the role of the Foundation’s CTO on the committee and how this will help direct resources to technical requests for comment that are approved (the short answer is that they expect that, when an RfC is made, the proposer already has the resources to implement it).

Main conference

I’m sure you can find plenty of information in other attendees’ notes about what was discussed in most of the big keynotes (other than the one that didn’t allow note-taking, photography, etc.), so I’m going to focus instead on the smaller sessions I went to, which far fewer people attended. I picked largely technical sessions, but hope to make this useful to non-developers. Not all sessions are included here – e.g. obviously the full details of the OTRS meetup aren't going to be published (but I think it's safe to say we mainly discussed internal issues that agents face).

This part of the report is based on my notes from the sessions, as well as some research of my own into the subjects afterwards.

The future of editing - Wikitext: upcoming changes, available tools, what you can do

This was a presentation by the Parsing team at the Wikimedia Foundation.

They want to move everything using MediaWiki’s default PHP-based parser (currently in use by the web view and iOS app) to Parsoid (written in JavaScript, currently in use by VE, the new VE-style wikitext editor, Flow, ContentTranslation, the Android app, the Linter extension, Kiwix and Google).

They also want to move from using Tidy (HTML4-compatible) to RemexHTML (HTML5-compatible). This may cause some pages to display differently (bugs such as the breaking of lists inside table captions are gone), so editors may need to change some things if they want pages to render in exactly the same way. They have set up ?action=parsermigration-edit, from the ParserMigration extension, to show a page rendered under both systems. I tried this against some major Wikipedias’ Main Pages and found it shows only some very minor differences. Unfortunately you need the ability to edit a page to use this action.

They’re hoping to make a series of changes to improve things, which may require action on the part of editors (fixing of edge cases and improving semantics), of gadget authors and others who rely on the HTML structure of our pages (due to the above-mentioned Tidy → RemexHTML migration), and of the developers of MediaWiki extensions (parser internals will no longer be exposed to extensions).

The Linter extension provides a Special:LintErrors page that shows various errors encountered when parsing pages. http://tools.wmflabs.org/wikitext-deprecation/ is a dashboard of these errors across all wikis, and the numbers are recorded weekly at mw:Parsing/Replacing_Tidy/Linter/Stats. They also have several technical RfCs open.

Wikidata & infoboxes panel

This was a combination of presentations from Hoo man, Theklan, and Eran about the use of Wikidata’s data in other Wikimedia projects.

This section expects basic familiarity with Wikidata.


Hoo man: Put data on it: Using Wikidata data in Wikimedia projects

The #statements parser function will display the value of any statement included in the item linked to the current page; you just need to specify the property you want to look up. You can do this either with the label of the property (like ‘country’) or the property number (like ‘P17’).

You can try this out by going to Special:ExpandTemplates, entering ‘Wikimania 2017’ as the title (that page is linked to Wikidata item Q21113296), and inputting “{{#statements:country}}”. The result is HTML code including ‘Canada’. If you’re developing Lua module code, you can use mw.wikibase.getEntity( 'Q64' ) to get the entity object for Berlin, and then call :formatPropertyValues( 'P17' ) on it to get the country. If there are multiple claims, you can add an extra parameter containing a list of acceptable ranks.
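
For Lua module authors, a minimal sketch of that usage might look like the following – the module and function names are invented for this example; only the mw.wikibase calls are part of the real API.

  -- Hypothetical module (e.g. Module:CountryDemo – the name is made up),
  -- which could be invoked from wikitext as {{#invoke:CountryDemo|country}}.
  local p = {}

  function p.country( frame )
      local entity = mw.wikibase.getEntity( 'Q64' ) -- Q64 is Berlin, as in the talk
      if not entity then
          return ''
      end
      -- Format all P17 (country) statement values for display.
      local formatted = entity:formatPropertyValues( 'P17' )
      return formatted and formatted.value or ''
  end

  return p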

The Capiunto extension provides basic infobox-building functionality for Lua modules – the Lua module is still responsible for providing the data contained within, however.
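
As far as I understand the extension's documentation (see Extension:Capiunto on mediawiki.org for the authoritative API), a Capiunto-based module looks roughly like this, with the values hard-coded purely for illustration:

  -- Rough sketch of Capiunto usage; the data would normally come from
  -- template parameters or Wikidata rather than being hard-coded.
  local capiunto = require 'capiunto'
  local p = {}

  function p.infobox( frame )
      return capiunto.create( {
          title = 'Wikimania 2017'
      } )
          :addRow( 'Country', 'Canada' )
          :addRow( 'City', 'Montreal' )
  end

  return p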

There was also talk of a proposed #infobox parser function. I haven’t been able to find more details about this.

See also: Wikidata:How to use data on Wikimedia projects

Eran: Integrating Wikidata to Infoboxes

This is the system used on the Hebrew Wikipedia. It also appears to be working on the English one.

The syntax used to get the value of the current page’s official website property from Wikidata is {{#invoke:Wikidata|getValue|P856|FETCH_WIKIDATA}}

This relies upon the Wikidata module written in Lua.

One problem that comes up while using these is that Wikidata claims will often reference another entity that does not have a label in the local language – you do not want to show readers a bare identifier, so the data may be discarded. Alternatively, you may opt to show the English label or another fallback language’s label.
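
A minimal sketch of that kind of fallback logic, using the standard mw.wikibase Lua functions rather than the actual Hebrew Wikipedia module, might be:

  -- Illustrative only: return a label for an item, falling back to English,
  -- and finally to nothing rather than showing a bare Q-identifier.
  local p = {}

  function p.labelOrFallback( itemId )
      -- Label in the wiki's content language (this already applies the
      -- language fallback chain configured for the wiki).
      local label = mw.wikibase.getLabel( itemId )
      if label then
          return label
      end
      -- Explicit English fallback.
      label = mw.wikibase.getLabelByLang( itemId, 'en' )
      if label then
          return label
      end
      -- Discard the value rather than show something like 'Q12345' to readers.
      return nil
  end

  return p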

Another problem with using Wikidata in infoboxes is gender – e.g. in some languages, occupations have different terms depending on gender. Hebrew Wikipedia has a tool for this.

There are also the parser functions provided directly by the Wikidata MediaWiki extensions - #property and #statements. #property provides only the labels without links, whereas #statements (only around since October) provides links. E.g., if you go to w:Special:ExpandTemplates, set the context title to ‘Devon’ and the input text to ‘{{#property:P17}}{{#statements:P17}}’, you will get just plain text from #property, but a wikitext link from #statements.

#statements will also handle images, so if you use P18 you will get an embedded file back. It doesn’t appear to handle coordinates in any special way.

You can even have a system that determines which infobox template to use – just by taking, for example, the value of the P31 (instance of) property from Wikidata and applying a mapping from such property values to templates.
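
As a rough sketch of that idea (the mapping and module name here are invented for the example; the mw.wikibase and frame calls are standard):

  -- Illustrative sketch: pick an infobox template based on the connected
  -- item's P31 (instance of) value. The mapping below is just an example.
  local p = {}

  local infoboxByInstanceOf = {
      Q5 = 'Infobox person',        -- Q5 = human
      Q515 = 'Infobox settlement',  -- Q515 = city
  }

  function p.chooseInfobox( frame )
      local itemId = mw.wikibase.getEntityIdForCurrentPage()
      if not itemId then
          return ''
      end
      -- Preferred-rank statements if any exist, otherwise normal-rank ones.
      local statements = mw.wikibase.getBestStatements( itemId, 'P31' )
      for _, statement in ipairs( statements ) do
          local snak = statement.mainsnak
          if snak.snaktype == 'value' then
              local template = infoboxByInstanceOf[ snak.datavalue.value.id ]
              if template then
                  return frame:expandTemplate{ title = template }
              end
          end
      end
      return ''
  end

  return p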

Theklan: Automatic infoboxes with Wikidata

This is the system used on the Catalan and Basque Wikipedias.

They also have a Wikidata Lua module, but it is not the same as the Hebrew Wikipedia one. It can only be called directly by infobox templates. The syntax they use is {{#invoke:Wikidata|claim|property=P31}}.

In the case of multiple claims for a property, setting list=false will give just the first item. You can also set it to firstrank to get the highest ranked claim.
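
The underlying idea – selecting either the first claim or only the best-ranked claims – can be expressed with the standard Lua API roughly as follows (this is not their module, just an illustration):

  -- Illustrative only: the sort of selection that list/firstrank-style
  -- options choose between.
  local p = {}

  function p.firstStatement( itemId, propertyId )
      -- Like list=false: just take the first statement for the property.
      local statements = mw.wikibase.getAllStatements( itemId, propertyId )
      return statements[1]
  end

  function p.bestStatements( itemId, propertyId )
      -- Like firstrank: preferred-rank statements if any exist, otherwise normal-rank.
      return mw.wikibase.getBestStatements( itemId, propertyId )
  end

  return p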

It seems that standardisation has become a bit of a problem – various wikis have different ways of accessing Wikidata’s data (e.g. different parameters for Module:Wikidata invocations).

Editing challenges in multi-script wikis

This was mainly a presentation by C. Scott Ananian of the Parsing Team from the Wikimedia Foundation.

In English and other western European languages, we write in the Latin script. Russian and Ukrainian, among others, are written in Cyrillic; Arabic has its own script; and Chinese, Korean, and Japanese each have their own script(s) too.

However, some languages can be written in multiple scripts. In particular, Serbian can be written in either Latin or Cyrillic, and the Serbian Wikipedia uses a feature of MediaWiki called LanguageConverter to achieve this. Take a look at sr: – there is an extra menu at the top left that lets you switch between the two (also note the URLs).

Chinese can also be written in different ways – their Wikipedia allows you to choose between zh-tw (Traditional using Taiwanese terms), zh-sg (Simplified using Singaporean and Malaysian terms), zh-mo (Traditional using Macau terms), zh-hk (Traditional using Hong Kong terms), and zh-cn (Simplified using mainland terms).

There are various other languages also written in different ways that we discussed, e.g. Hindi (Devanagari script, left-to-right, used in India) vs. Urdu (Arabic script, right-to-left, used in Pakistan).

Punjabi actually has two Wikipedias: Western Punjabi (Shahmukhi script) and Eastern Punjabi (Gurmukhi script). They are so small that they don’t have articles about lifts (an example he used that we’ll get back to).

You can find other examples at Wikipedias in multiple writing systems.


LanguageConverter is a part of MediaWiki core and is capable of converting between scripts (transliteration), between languages, or between words.

For example, on the English Wikipedia, you could theoretically have it handle ‘lift’ in en-GB vs. ‘elevator’ in en-US. Unfortunately that kind of thing doesn’t work in every case, as “lift” can have different meanings depending on the context – in English you can “lift” an object, which is valid in both British and American usage, but a “lift”, the device used for elevating or lowering goods in the UK, Australia, New Zealand, and South Africa, is known as an “elevator” in the US and Canada. So in some cases you’d want the word converted and in some cases you wouldn’t.

LanguageConverter lets you use syntax like -{lift}- to indicate that the inner text should not be converted. It helps readers by taking content that was written once, in one way, and providing it in the reader’s preferred way. Unfortunately it also makes editing harder, e.g. due to the extra syntax. Editing on wikis which employ LanguageConverter could be made easier, and work has recently been done to add support for it to Parsoid and to VisualEditor. As of writing, such wikis have a reduced VisualEditor deployment.

Wikipedia will soon speak to you – Wikispeech

This was a presentation and Q&A by some people from WMSE (Wikimedia Sverige).

I was drawn to this session mainly by the first two sentences of the abstract: "The development of Wikispeech, a new MediaWiki extension, has started. By the end of 2017 Wikipedia should be able to speak to you."

They have put up a demo of this project at wikispeech.wmflabs.org – it is basically a MediaWiki extension with a backend service.

What it does is put the text of wiki pages through a text-to-speech system and play the audio back to you.

Unfortunately, as this involves deploying code to Wikimedia production, there’s quite a bit of process to go through first – they’re going to need various reviews, which I brought up during the Q&A. From mw:Review queue#Preparing_for_deployment:

  • A security review: open a security review task and mark it as a subtask of the main deployment task.
  • A product review, if applicable.
  • A design review, if applicable.
  • A beta feature review, if your extension adds a beta feature.
  • If you have reasons to think that a database review is needed, create a request in Phabricator.
  • Any serious issues identified in these reviews must be addressed before deploying your code.

Apparently Brion Vibber is going to handle their security review; I haven’t been able to find a task about this. It’s possible the review found some security vulnerabilities, in which case the task may have been moved into the private area.