Research:Usage of correct diacritics in readers of Romanian Wikipedia

From Meta, a Wikimedia project coordination wiki
Duration:  2016-April – 2016-July
diacritics, ro.wp
This page documents a completed research project.


In 2010, the Romanian Wikipedia moved from the incorrect S-cedilla and T-cedilla letters to the correct S-comma and T-comma. The change was decided following an extensive poll which showed that approximately 10% of the viewers could not see the correct letters without help[1]. However, the results were plagued by considerable errors generated from false responses.[2]

In the last six years, a lot has changed in the way browsers render diacritics, and also in the way the site's JavaScript detects incorrect diacritics. In addition, several campaigns promoting the correct diacritics have probably increased their use. A new study showing how many users still have problems displaying the correct diacritics (they see squares, or letters in a different font) would help the community decide whether all the technical systems put in place to ease the transition are still needed. It would also help external stakeholders make informed decisions about migrating to and/or using the diacritics, since Wikipedia is visited by a large and very diverse population.


Methods

We will use two different methods to evaluate how well users can view and write the correct diacritics. All server code will be hosted on Tool Labs.

Viewing

For all users we will run a JavaScript script that was created by Cristian Adam in 2010[3] and adapted for use on the Romanian Wikipedia. This script can detect any inconsistency between the display of the old diacritics and the new ones. We will also set a cookie in the user's browser to avoid double voting.

The following data will be recorded on the server:

  • The city and country of the user (derived from the IP through geoiplookup.wikimedia.org)
  • The User agent used by the user.
  • Whether the user has issues seeing all new diacritics
  • Whether the user has partial issues seeing the new diacritics (fonts with s-comma but t-cedilla)
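
The two "issues" flags above could be derived roughly as follows. This is only a minimal sketch, not the cited script: it assumes that an inconsistency shows up as a difference in rendered width between a comma-below letter and its cedilla counterpart, and the names `PAIRS`, `detectDisplayIssues` and `domMeasure` are hypothetical. The measuring function is passed in so the classification logic can also be exercised outside a browser.

```javascript
// Pairs of (correct comma-below letter, legacy cedilla letter).
const PAIRS = [
  { name: "s", comma: "\u0219", cedilla: "\u015F" }, // ș vs ş
  { name: "t", comma: "\u021B", cedilla: "\u0163" }, // ț vs ţ
];

// Assumption: a width mismatch within a pair signals a display problem.
// Returns "none", "partial" (only one letter affected) or "all".
function detectDisplayIssues(measureWidth) {
  const broken = PAIRS.filter(
    (p) => measureWidth(p.comma) !== measureWidth(p.cedilla)
  );
  if (broken.length === 0) return "none";
  return broken.length === PAIRS.length ? "all" : "partial";
}

// Browser-only measurer (hypothetical helper): width of a hidden span.
function domMeasure(ch) {
  const span = document.createElement("span");
  span.style.visibility = "hidden";
  span.style.position = "absolute";
  span.textContent = ch;
  document.body.appendChild(span);
  const width = span.offsetWidth;
  document.body.removeChild(span);
  return width;
}
```

In a browser, `detectDisplayIssues(domMeasure)` would give the value to record. A width comparison can miss fonts whose fallback glyphs happen to have the same width, so a real detector may need a stronger signal.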

Expected results

The test should be able to tell us the percentage of people who still have issues (however minor) seeing the correct diacritics. It will not tell us exactly how big the issues are.

Editing

All logged-in users will be shown a short form in the sitenotice asking them to type a word containing both ș and ț[4], with diacritics, in an edit box, followed by a Submit button. No validation will occur, which will allow us to see how many false responses we get.

When submitting the form, the following data will be recorded on the server:

  • The text the user wrote in the edit box
  • The city and country of the user (derived from the IP through geoiplookup.wikimedia.org)
  • The User agent used by the user.
  • All the data gathered with the Viewing method, regardless of whether this has been collected before for this user

We will then display a thank-you message and set a cookie that records that the user has voted.
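
A minimal sketch of the "already voted" cookie is given below. The cookie name and the 30-day lifetime are assumptions made for illustration. Setting an explicit `Max-Age` makes the cookie persist across browser sessions, whereas a cookie without it lasts only for the current session.

```javascript
// Build a persistent cookie string marking that the user has voted.
// The name and lifetime are hypothetical; the study does not specify them.
function buildVoteCookie(name, maxAgeDays) {
  const maxAge = Math.round(maxAgeDays * 24 * 60 * 60); // lifetime in seconds
  return `${name}=1; Max-Age=${maxAge}; Path=/; SameSite=Lax`;
}

// Check a cookie string of the form "a=1; b=2" for the marker cookie.
function hasVoted(cookieString, name) {
  return cookieString
    .split(";")
    .map((part) => part.trim())
    .some((part) => part.split("=")[0] === name);
}
```

In a browser this would be used as `document.cookie = buildVoteCookie("roDiacriticsVoted", 30)` after a submission, and `hasVoted(document.cookie, "roDiacriticsVoted")` before showing the form.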

Expected results

We expect to be able to divide the users with valid votes in the following categories:

  • no issues - the user can see and write the correct diacritics as with any other letter
  • issues writing - the user can see the correct diacritics, but writes (on purpose or out of necessity) with the old ones
  • issues seeing - the user can write with the correct diacritics, but has issues seeing them (probably because the browser uses font replacement to display them)
  • both issues - the user has issues both seeing and writing the correct diacritics. We hope that this category will be limited to old browsers and OSes.

The results will be relevant if we can get at least 20 valid votes.
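
The four categories can be derived from the submitted text plus the display flag, for example as in the sketch below. The validity rule is an assumption for illustration: a response counts as valid if it is a single word containing both an s-type and a t-type diacritic in either form. The function names are hypothetical.

```javascript
const CEDILLA = /[\u015E\u015F\u0162\u0163]/; // Ş ş Ţ ţ (legacy forms)
const S_ANY = /[\u0218\u0219\u015E\u015F]/;   // Ș ș Ş ş
const T_ANY = /[\u021A\u021B\u0162\u0163]/;   // Ț ț Ţ ţ

// Returns "correct", "legacy" or "invalid" for one submitted text.
function classifyWriting(text) {
  // NFC folds "s" + combining comma/cedilla into the precomposed letters.
  const word = text.trim().normalize("NFC");
  if (/\s/.test(word) || !S_ANY.test(word) || !T_ANY.test(word)) {
    return "invalid"; // phrases, random text, or missing diacritics
  }
  return CEDILLA.test(word) ? "legacy" : "correct";
}

// Combine the writing result with the display flag from the Viewing method.
function categorize(writing, hasDisplayIssue) {
  if (writing === "invalid") return null; // excluded from the valid votes
  if (writing === "correct") {
    return hasDisplayIssue ? "issues seeing" : "no issues";
  }
  return hasDisplayIssue ? "both issues" : "issues writing";
}
```

For instance, `categorize(classifyWriting("arşiţă"), false)` gives "issues writing". The sketch cannot tell typed text from pasted text, so it would still count a copy-pasted correct word as "no issues".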

Timeline

  • April
    • create the user-side code for the visibility part and test it without any backend
    • start the design of the editing poll
  • May -> early June
    • create the server-side code
    • beta-testing for the visibility part
    • finalize the design of the editing poll
    • beta-testing the editing poll
  • early June
    • start both the poll and the visibility script and let them run for a week
  • June -> July
    • publish raw data
    • interpret the results and publish conclusions

Policy, Ethics and Human Subjects Research

The study involves running JavaScript code on all ro.wp users' machines without their explicit consent. However, the personal data gathered is the same as the data gathered by the WMF in the normal operation of the site. Also, we will not record IP addresses, usernames or any other data that could associate wiki contributions with the results.

Results

Raw results are available at Research:Usage_of_correct_diacritics_in_readers_of_Romanian_Wikipedia/Results. Thanks go to milimetric for writing the SQL queries needed and for his continuous support regarding EventLogging.

Viewing

Using EventLogging instead of a poll eliminated intentionally false responses, which still have a high incidence when users participate actively (see below). It also allowed us to gather a very large number of samples (over 1.1 million), which smoothed out the effect of the cookie problem we had on the first day (cookies were set for the session only, so security-conscious users were recorded again at each session). To be sure, we also extracted only the responses recorded from June 1st onward (after the issue was resolved); the difference is about 0.1%, so we decided to use the whole set of results.

Compared with the previous poll, the share of people who can see the correct diacritics without any help has increased tremendously, from ~40%[5] to over 91%. This is slightly lower than the figure suggested by current operating system statistics for visitors to Wikimedia websites, but it is in line with global statistics for Romania (80% of respondents were from Romania) and indicates that the display issue will go away in a few years, once all Windows XP machines are retired.

It is surprising that such a large number of visitors have issues with only one of the letters (0.37%). This might be worth investigating further, to clarify whether there is a problem in the code that identifies the differences in appearance.

Writing

The first thing we noticed is the high incidence of false responses (over 1/3 of the total). Based on the comments in these responses[6], we can also deduce that the number of "no issues" responses is somewhat overstated, as some respondents copy-pasted the word instead of typing it on the keyboard. This should be prevented in the code in future studies.

We reached a milestone in the sense that more than half of all the users now use correct diacritics (52.14%). Considering only valid responses, the percentage of people using the correct diacritics is almost 82%, still significantly below the percentage obtained for viewing diacritics. We are still a number of years away from being able to remove the script that ensures corrections on input.
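
The two percentages above are linked through the share of valid responses. The sketch below assumes that every response counted as using correct diacritics is also a valid response, and treats "almost 82%" as exactly 0.82, so the outputs are approximate. The function name is hypothetical.

```javascript
// share among all responses = share among valid responses × valid fraction,
// so the valid fraction can be recovered by division.
function impliedValidFraction(shareOfAll, shareOfValid) {
  return shareOfAll / shareOfValid;
}

const validFraction = impliedValidFraction(0.5214, 0.82);
const falseFraction = 1 - validFraction;

console.log((validFraction * 100).toFixed(1) + "% valid"); // 63.6% valid
console.log((falseFraction * 100).toFixed(1) + "% false"); // 36.4% false
```

The implied share of false responses, roughly 36%, is consistent with the "over 1/3 of the total" reported above.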

Conclusions

We need to continue investing resources in developing the code for correcting diacritics so it works correctly on all editors (Wikitext, Visual and CX). In order to get a more accurate estimation of the moment when display correction can be removed, we would need OS trends for ro.wp from the WMF Analytics team (specifically for Win XP, Win 95/98/Me, but also for old Android versions). Lacking that, we could continue repeating the polls every 1-2 years.

References

  1. http://www.moongate.ro/products/diacritice/sondaj/date.php#statistici
  2. Bogdan Stăncescu (2010-05-09), Factorii care influențează momentul optim de migrare la diacriticele corecte în limba română (PDF) 
  3. http://cristianadam.blogspot.ro/2010/10/ro-diacriticele-si-internetul.html
  4. The specific word is to be determined. Proposals include arșiță, șuț and țuști
  5. Concluziile sondajului din 2010
  6. Responses included phrases like "I copy-pasted" or bits of phrases from the text asking the users to write the word arșiță using their keyboard, but also full phrases from articles and random text