Research:Your Language, My Language, Our Language: Studying International Auxiliary Languages in Wikipedia with Data Science.

From Meta, a Wikimedia project coordination wiki
This page documents a completed research project.

In this study we will analyze the behavior of Wikipedia users that edit in International Auxiliary Language (IAL) projects with data science. We want to obtain a global view of how they interact with Wikipedia in their specific contexts by understanding their co-editing behavior with respect to other top wikis.

Introduction

An International Auxiliary Language (IAL) is a language meant to serve as a means of communication between people who do not share a common native language, and it is therefore a foreign language for them. It differs from a lingua franca in that it attempts to distance itself from the cultural bias that a lingua franca carries, and therefore aims to serve as a generalized way of communicating. This goal has been difficult to reach, and over the years it has been argued that it has not been accomplished, although the communities that support these languages dispute such claims.

It is worth noting that an IAL may have a rather incomplete vocabulary or grammar. Hence it may not be an ideal way of communicating complex ideas that may be easy to express in other, more mature languages.

Our goal is to understand how users of IAL projects behave. This requires studying them not only in their own context, but also in how they interact with wikis in some of the top languages, which we select from among the languages with the most articles.

IALs used in Wikipedia

The following is a comprehensive list of IALs currently in use in Wikipedia:

IAL               Wiki
Simple English    simple
Esperanto         eo
Ido               io
Interlingua       ia
Novial            nov
Interlingue       ie
Volapük           vo

Methods

In order to analyze the users' behavior we will obtain the necessary data from Quarry and then analyze the datasets using Python.

Datasets

We wrote several queries to retrieve the relevant data. In the following sections you can find links to the Quarry pages, where the query code is also available.

Language co-editions

How do users who edit in IAL projects co-edit in other IALs?

How do users who edit in some of the top Wikipedia projects co-edit in IALs?

Users' edit count in top projects

Articles per editors

Given the number of articles and editors in an IAL project, what is the ratio of articles to editors?

Results

We will now analyze the datasets in detail in order to draw conclusions from them.

Amount of editors in each IAL

IAL projects users in Wikipedia

Using this program we were able to use each co-editions dataset to extract the number of editors in each project. The graph in the thumbnail depicts the results.

We can observe that Simple English is the top IAL project by a large margin, with almost 70,000 editors; the next largest IAL wiki, Esperanto, has roughly 80% fewer editors. The project with the fewest editors is Novial, with just under a thousand, as the sorted graph shows.
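The counting step can be sketched as follows. This is only an illustration under stated assumptions, not the program used in the study: the row layout `(user_name, wiki_code)` and the sample rows are hypothetical, and the real Quarry output may be shaped differently.

```python
from collections import defaultdict

# Hypothetical rows of (user_name, wiki_code); the actual dataset columns may differ.
rows = [
    ("alice", "simple"), ("alice", "eo"), ("bob", "simple"),
    ("carol", "eo"), ("carol", "io"),
    ("alice", "simple"),  # duplicate row, must not be counted twice
]


def editors_per_project(rows):
    """Return {wiki_code: number of distinct editors}."""
    editors = defaultdict(set)
    for user, wiki in rows:
        editors[wiki].add(user)  # a set keeps each editor once per wiki
    return {wiki: len(users) for wiki, users in editors.items()}


counts = editors_per_project(rows)

# Sort by editor count, largest first, as in a sorted bar chart.
for wiki, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(wiki, n)
```

Collecting users in a set per wiki makes the count robust to repeated rows, which matters when the same user appears once per co-edited project.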

Users' co-editions in other languages

Using this program we were able to visualize the flow of users who edit in a particular language into IALs, i.e., out of all the users who edit in that language, how many of them also edit in each IAL. Notice that the proportions do not sum to one because a user who co-edits in several IALs is counted once for each of them. The following Sankey diagrams show these results:

Several interesting conclusions can be drawn from these.

Let us start by noting that IAL project editors form a community in the sense that they all co-edit among the different IALs, except for Simple English editors, who barely edit in the others. Simple English editors co-edit the most in Esperanto, but even there the number is small (around four thousand of them), and the least in Novial (around four hundred of them). Thus their co-editing behavior in other IALs is rather insignificant.

Editors of the other IALs co-edit in one another's projects fairly significantly. In the graphs, the wider a trunk, the larger the flow of editors into the other IALs. Volapük users co-edit in Ido about as much as Ido users co-edit in Volapük, relatively speaking. Editors of most IALs co-edit the most in Simple English, except for Interlingue and Novial editors, who co-edit the most in Esperanto. Novial is the IAL whose editors co-edit in other IALs the most, as shown by its graph's trunk being the widest. Nevertheless, it is also the IAL in which editors of all other IALs co-edit the least. This could be an indicator that users who edit in this IAL are not fond of communicating exclusively in this language and therefore also speak others.

With regard to the selected top languages, we can observe that Spanish editors co-edit the least in IALs, relatively speaking, while Portuguese editors co-edit the most. Editors of all of these languages co-edit the most in Simple English and the least in Novial, closely followed by Interlingue. This is similar to what we found for the co-editions among IALs described above. In general, however, we note that the percentage of editors who also edit in IALs is very small in all of these languages. Given the significance of these languages' editor base for Wikipedia, we can say that in general terms most users do not edit in IALs, which contrasts with the total numbers of editors discussed previously.
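The co-editing shares discussed in this section, and the reason they need not sum to one, can be illustrated with a small sketch. The function `coedit_shares`, the user identifiers and the set-per-wiki layout are all hypothetical; this is not the program used to draw the Sankey diagrams.

```python
def coedit_shares(editors_by_wiki, source, targets):
    """Fraction of `source` editors who also edit in each wiki listed in `targets`."""
    base = editors_by_wiki[source]
    return {
        target: len(base & editors_by_wiki[target]) / len(base)
        for target in targets
        if target != source
    }


# Hypothetical editor sets; real sets would come from the co-editions datasets.
editors_by_wiki = {
    "es": {"u1", "u2", "u3", "u4"},
    "simple": {"u1", "u2"},
    "eo": {"u2", "u3"},
    "nov": {"u2"},
}

shares = coedit_shares(editors_by_wiki, "es", ["simple", "eo", "nov"])
for wiki, share in shares.items():
    print(f"es -> {wiki}: {share:.2f}")

# u2 edits in all three IALs and is counted once per target,
# so the shares add up to more than one.
print("sum of shares:", sum(shares.values()))
```

Because each share is computed against the same base of source-language editors, the values are comparable across target IALs even though they overlap.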

Users' edits in IALs and top languages

Boxplots of users' edits in some of the top wiki languages and in all IALs, ordered by the 75th percentile.

Using this program we were able to analyze the global edits in each language. In the thumbnail graph we can see some relevant facts about the number of edits in each one using a Box Plot.

We can immediately observe a rather surprising trend: the median number of edits per user is two in all IALs and in most of the chosen top languages. Moreover, the 25th percentile in all of the languages we are considering is one.

The IALs whose vast majority of editors edit the least are Simple English and Ido, i.e. the 75th percentile of edits per user is the smallest compared to the other languages. This is despite the fact that the former is the IAL that receives the most edits, which shows that very few users are responsible for most of these edits (more on this in the next section). In contrast, Esperanto is the IAL whose vast majority of editors edit the most. This means that it is the most balanced IAL in the sense of collective collaboration.

It is interesting to note that these numbers do not differ much from the ones of the top wikis we selected despite them possessing a much larger editor base, which suggests a huge inequality or imbalance in edits per user globally.
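The quartiles that a boxplot summarizes can be computed as in the sketch below. The wiki names and edit counts are made up for illustration, and the interpolation method (`"inclusive"`) is an assumption; the plotting library used in the study may compute percentiles slightly differently.

```python
import statistics


def edit_quartiles(edits):
    """Return (25th percentile, median, 75th percentile) of edits per user."""
    q1, median, q3 = statistics.quantiles(edits, n=4, method="inclusive")
    return q1, median, q3


# Hypothetical edits-per-user samples for two wikis.
edits_by_wiki = {
    "wiki_a": [1, 1, 1, 1, 2, 2, 2, 3, 5, 40, 400],
    "wiki_b": [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 9],
}

summary = {wiki: edit_quartiles(edits) for wiki, edits in edits_by_wiki.items()}

# Order the wikis by their 75th percentile, as in the boxplot figure.
ordered = sorted(summary, key=lambda wiki: summary[wiki][2])

for wiki in ordered:
    q1, median, q3 = summary[wiki]
    mean = statistics.mean(edits_by_wiki[wiki])
    print(f"{wiki}: q1={q1} median={median} q3={q3} mean={mean:.1f}")
```

In the `wiki_a` sample the mean is far above the 75th percentile, which is the kind of imbalance in edits per user that the quartiles alone hide.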

Users' edits distribution as a Power Law

Using this program we fitted a Power Law to the IAL edits per user. The following graphs show the fitted model along with its two important coefficients: α, which represents the slope of the model, and σ, which represents the standard error of the fit.

PDF of power-law distribution of edits per user in IAL projects in Wikipedia.
CDF of power-law distribution of edits per user in IAL projects in Wikipedia.

Since we've had at our disposal complete data from the whole populations we've considered, and given the fact that the edit distributions amongst projects do follow a Power Law distribution, we can see in each graph that . Also notice that since α < 3 in all models, their variance tends to infinity[1], so it's pointless to analyze them individually.
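To make the two coefficients concrete, here is a minimal sketch of a power-law fit using the standard continuous maximum-likelihood estimator, α = 1 + n / Σ ln(x_i / x_min), with standard error σ = (α − 1) / √n. It is not the study's program and does not use the powerlaw package cited in [1]; the lower cutoff `xmin = 1` and the synthetic sample are assumptions, and real edit counts are discrete, so a discrete estimator would be more appropriate for them.

```python
import math
import random


def fit_power_law(samples, xmin=1.0):
    """Continuous maximum-likelihood fit of p(x) ~ x**(-alpha) for x >= xmin.

    Returns (alpha, sigma), where sigma is the asymptotic standard error of alpha.
    """
    tail = [x for x in samples if x >= xmin]
    n = len(tail)
    log_sum = sum(math.log(x / xmin) for x in tail)
    alpha = 1.0 + n / log_sum
    sigma = (alpha - 1.0) / math.sqrt(n)
    return alpha, sigma


def sample_power_law(alpha, xmin, size, rng):
    """Draw synthetic values by inverse-transform sampling."""
    return [
        xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
        for _ in range(size)
    ]


rng = random.Random(0)  # fixed seed so the run is repeatable
synthetic = sample_power_law(alpha=2.5, xmin=1.0, size=20000, rng=rng)

alpha_hat, sigma_hat = fit_power_law(synthetic)
print(f"alpha = {alpha_hat:.3f}, sigma = {sigma_hat:.4f}")
```

With a large sample the standard error is small, which is consistent with fitting on whole populations rather than on samples. For a continuous power law the variance is finite only when α > 3, so an exponent below that threshold implies an unbounded variance.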

We have provided views for both the Probability Density Function (PDF) and the Cumulative Distribution Function (CDF) because they express the same information in different, meaningful ways. The PDF allows us to clearly see the fitted model and its slope, while the CDF allows us to see how unbalanced the edit distribution is: steeper slopes mark the points at which the imbalance is strongest.

In general, all graphs show that most users make very few edits while very few users make most of the edits. This contrasts with the data in our Boxplots. The behavior appears in every graph, showing that it is a common trend.
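The imbalance described above can be quantified directly from a list of edits per user. The sketch below uses made-up counts and two hypothetical helpers, `ecdf` and `top_share`; it only illustrates the idea and is not taken from the study's code.

```python
def ecdf(edits, x):
    """Empirical CDF: fraction of users with at most x edits."""
    return sum(1 for e in edits if e <= x) / len(edits)


def top_share(edits, fraction):
    """Share of all edits made by the most active `fraction` of users."""
    ordered = sorted(edits, reverse=True)
    k = max(1, round(len(ordered) * fraction))  # at least one user
    return sum(ordered[:k]) / sum(ordered)


# Hypothetical population of 100 users.
edits = [1] * 50 + [2] * 30 + [5] * 15 + [100] * 4 + [1000]

print(f"users with at most 2 edits: {ecdf(edits, 2):.0%}")
print(f"edits made by the top 1% of users: {top_share(edits, 0.01):.0%}")
print(f"edits made by the top 5% of users: {top_share(edits, 0.05):.0%}")
```

Reading both numbers together shows the pattern: a large majority of users sit at the low end of the CDF, while a handful of users account for most of the total edits.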

Consider now the first graph in the thumbnail, where we combine all PDFs computed before so as to be able to compare them better.

We can observe that most fitted functions start in the same place, i.e. the probability of making fewer than 10 edits is the highest, and only Volapük and Simple English reach extreme values of edits per user, with the lowest probability. Among those that start at a low number of edits in the graph, both Novial and Interlingue spread the least, stopping at . We can also point out that Ido is the most "condensed" IAL in this sense, since the probability of users making less than and more than edits is undefined.

Now let us observe a comparison between the CDFs of all distributions in the second graph.

This contrasts with our previous observations. We can point out that between and the slope of all curves changes the most, which means that the highest number of edits, in general, is concentrated in that region.

Articles per editors

Ratio of articles per editor in IAL wiki projects

An interesting measurement we can take into account is the ratio of articles to editors in IAL projects. Using this program we were able to compute it.

A ratio of one would mean that there is an equal amount of both; a ratio larger than one would tell us how many articles per editor there are; a ratio smaller than one would indicate that there are more editors than articles. The graph in the thumbnail summarizes the results.

We have to analyze these numbers carefully since, as we saw in the last section, edits per user follow a power law, so we cannot assume that the articles have been edited in equal parts by all editors. Still, the ratio is interesting in general terms.
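The ratio and its three readings (above, equal to, or below one) can be sketched as follows. The project names and counts are invented for illustration and are not the study's figures.

```python
def articles_per_editor(articles, editors):
    """Ratio of articles to editors; values above 1 mean more articles than editors."""
    if editors <= 0:
        raise ValueError("editors must be positive")
    return articles / editors


# Hypothetical (articles, editors) pairs.
projects = {
    "wiki_a": (120000, 1500),
    "wiki_b": (1000, 1000),
    "wiki_c": (500, 1000),
}

ratios = {
    wiki: articles_per_editor(articles, editors)
    for wiki, (articles, editors) in projects.items()
}

for wiki, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    if ratio > 1:
        reading = "more articles than editors"
    elif ratio < 1:
        reading = "more editors than articles"
    else:
        reading = "as many articles as editors"
    print(f"{wiki}: {ratio:.2f} ({reading})")
```

The ratio is an average over all editors, so it says nothing about how the work is split among them.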

Conclusions

Our aim has been to obtain a global view of the behavior of users who edit in IAL projects in Wikipedia. We have been able to study the whole population of users by obtaining the data from Quarry. As a result, we have observed that:

  • Most IAL projects have fewer than editors, except for Esperanto with around and Simple English with around . To put this in perspective, Simple English, the largest IAL project, is approximately of what the English project is (first largest) and approximately what the Spanish project is (second largest).
  • Users of most IAL projects frequently co-edit in other such projects, thereby forming a community of editors who edit in multiple IALs. This doesn't happen with Simple English, however, since its editors edit the least in other IALs. Among the top languages we compared, Portuguese editors co-edit the most in IALs, while Spanish editors co-edit the least.
  • Two is the median number of edits per user in all IAL projects. The same holds for most of the top languages we analyzed, except for Italian and Japanese. This suggests most users are simply casual editors who make very few edits and never make any more.
  • Edits per user in IALs follow a Power Law distribution, i.e. most users make very few edits while very few users make most of them.
  • Volapük editors are the ones who have collectively worked the most, since they have the highest ratio of articles to editors at around . Novial, however, is the opposite, with a ratio of around .

References

  1. Jeff Alstott, Ed Bullmore, Dietmar Plenz (2014). powerlaw: a Python package for analysis of heavy-tailed distributions, p. 1. arXiv:1305.0215