Research:Incubator and language representation across Wikimedia projects

Created: 19:07, 27 June 2023 (UTC)
Collaborators: HGhani-WMF, ILooremeta-WMF
Duration: 2022-December – ??

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


The Wikimedia Foundation supports more languages than any other large online platform.[1] Currently, more than 320 languages have one or more editions of open-content Wikimedia projects.[2] Put another way, however, fewer than 5% of the world’s living languages currently have at least one edition of a Wikimedia project.[3] How can Wikimedians create new language editions? Via Wikimedia Incubator.

Goals of this project:

  • Develop metrics for the state of languages at Wikimedia
  • Develop metrics for better understanding Incubator
  • Develop knowledge gaps metrics for measuring language gaps

Progress on this project can be followed at T348246

Background

Why language?

Language representation is connected to the Foundation's mission and strategic goals as well as the Research team's Knowledge Gaps project.

The Foundation’s mission “is to empower and engage people around the world to collect and develop educational content under a free license or in the public domain, and to disseminate it effectively and globally”.[4] UNESCO estimates that as much as 40% of the global population does not have access to education in a language they speak or understand.[5] If effective educational content dissemination is central to the Foundation’s mission and receiving instruction in one’s home language has an across-the-board positive impact on learning,[5] then language representation is of central concern to the Foundation’s mission.

The Research team's Knowledge Gaps project draws from the Wikimedia movement's strategic goal of supporting “the knowledge and communities that have been left out by structures of power and privilege”.[6] In 2018-2019, the Research team began to advance knowledge equity with a research program to address knowledge gaps. The project aims "to deliver citable, peer-reviewed knowledge and new technology in order to generate baseline data on the diversity of the Wikimedia contributor population, understand reader needs across languages, remove barriers for contribution by underrepresented groups, and help contributors identify and expand missing content across languages and topics."[2] One of the representation gaps identified in the knowledge gaps taxonomy (for readers, content, and contributors) is language.

In order to understand language representation across Wikimedia projects, we would like to understand the following:

  • What is the state of languages at Wikimedia? What is our global coverage? (RQ1)
  • Are we reaching as many people as possible, by hosting content in the world’s biggest languages? (RQ2)
  • Are we serving the members of marginalized language communities? (RQ3)

Why Incubator?

The journey through Wikimedia Incubator is a key step for language communities to engage in new Wikimedia projects, and thereby new forms of knowledge in their languages. That is because Wikimedia Incubator is where new-language versions of Wikipedia, Wiktionary, Wikibooks, Wikivoyage, Wikiquote, and Wikinews are arranged, written, and tested before Wikimedia hosting. (Note: New-language versions of Wikiversity go to Beta Wikiversity, and new-language versions of Wikisource go to Multilingual Wikisource.) Once an Incubator project is deemed ready for Wikimedia hosting, it “graduates” from the Incubator and receives its own domain. The Language Committee determines whether a project is ready to graduate by assessing whether it meets the basic requirements and guidelines for Incubator projects. Upon graduation, the project’s content is exported to its new domain. This process is manual and can take multiple days, or sometimes weeks. (Note: Incubator was created in June 2006. Before then, test projects were launched and edited on Meta-Wiki.)

Obstacles related to Incubator have been identified by multiple stakeholders in the past, including Peter Gallert in his 2018 Wikimania talk, "Wikipedia for Indigenous Communities", and Amir Aharoni in his 2020 Celtic Knot talk, "How We Can Make the Incubator Better." These obstacles include: learning curve; prefixes; visual editor problems; incomplete Wikidata support; no content translation; no specific wikistats; lack of automation; and configuration inequity.

Given these challenges, what we would like to understand is:

  • What is the state of past and present projects in Incubator, and what implications does it have for language equity? (RQ4)
  • What obstacles prevent successful and/or timely graduation from the Incubator? Do these obstacles affect certain languages more than others? (RQ5)

Research questions and proposed variables

(RQ1) What is the state of languages at Wikimedia?

  • # and % of world's language speakers/readers with access to educational content via Wikimedia projects
  • # and % of languages with (0, 1, 2, 3, etc.) Wikimedia projects
  • # and % of languages with projects in Incubator
  • # and % of languages with projects that have been sent back to Incubator (i.e., closed)

(RQ2) Are we reaching as many people as possible, by hosting content in the world’s biggest languages?

  • # and % of the world’s top languages with at least (0, 1, 2, 3, etc.) Wikimedia projects
  • # and % of the world’s top languages with a Wikipedia, Wiktionary, etc.
  • # and % of the world’s top languages with projects in Incubator (incl. #, time in Incubator, etc.)

(RQ3) Are we serving the members of marginalized language communities?

  • # and % of the world’s minority languages with at least (0, 1, 2, 3, etc.) Wikimedia projects
    • Of those with zero, # and % with project(s) in Incubator (incl. time in Incubator)
  • # and % of the world’s threatened languages with a Wikipedia, Wiktionary, Wikisource
    • Of those with zero, # with project(s) in Incubator (incl. time in Incubator)

(RQ4) What is the state of past and present projects in Incubator, and what implications does it have for language equity?

  • Average length of time an Incubator project spends in Incubator
  • Average length of time between an Incubator project's creation and first meaningful edits
  • Average length of time between an Incubator project's last meaningful edits and Incubator graduation
  • Variables predictive of time spent in Incubator
  • Variables predictive of successful graduation from Incubator

(RQ5) What obstacles prevent successful and/or timely graduation from the Incubator? Do these obstacles affect certain languages more than others?

These questions are being addressed as part of the Wikimedia Language engineering/Incubator conversations.

Data sources

This project uses data from multiple sources:

Additional details about these sources are provided in the source data folder of this project's GitLab repo.

As the list above shows, data related to Incubator projects and the Incubator process live in multiple places. The table below outlines the Incubator process, with links to the referenced data sources.

Table. Incubator process with linked data sources

Step 1. Request for new language version of a Wikipedia, Wiktionary, Wikibooks, Wikiquote, Wikinews, or Wikivoyage.

Language must have an ISO code.

Step 2.a. Request approved

(Continue to Step 3.)

Step 2.b. Request denied
Step 3. Test wiki created in Incubator

These wikis don’t have their own domain (they live under incubator.wikimedia.org, with the URL formatting of incubator.wikimedia.org/W[a-z]/[a-z]), and they don’t have their own database name (they live within incubatorwiki).

Step 4. Test wiki is edited by volunteers
Step 5. Request for approval from Language Committee
Step 6.a. Language Committee deems criteria met for Wikimedia hosting, including finding an expert to validate content

(Continue to Step 7.)

Step 6.b. Language Committee deems criteria not met for Wikimedia hosting

(Go back to Step 4.)

Step 7. Test wiki graduates from Incubator

Its content is copied and migrated to its own domain.

Step 8. Wiki lives in the real world!

These wikis now have their own URL/domain and their own database name (e.g., the domain af.wikipedia.org and the database name afwiki).

(Step 9.a. No closure requested)

(Continue to Step 10.)

Step 9.b. Closure requested
Step 9.b.i. Closure request denied

(Continue to Step 10.)

Step 9.b.ii. Closure request approved

(Return to Step 3.)

Step 10. Wiki lives in the real world forever…
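
The steps in the table above can be read as a small state machine. The sketch below is one hypothetical way to encode it; the state names and the `TRANSITIONS` mapping are illustrative labels chosen here, not identifiers used by this project or by Incubator.

```python
# Hypothetical encoding of the Incubator process as allowed state transitions.
# State names are illustrative; they do not come from any Wikimedia codebase.
TRANSITIONS = {
    "requested": {"approved", "denied"},       # Step 1 -> Step 2.a / 2.b
    "approved": {"test"},                      # Step 2.a -> Step 3
    "denied": set(),                           # Step 2.b ends the process
    "test": {"approval_requested"},            # Steps 3-4 -> Step 5
    "approval_requested": {"hosted", "test"},  # Step 6.a -> 7-8, Step 6.b -> 4
    "hosted": {"closure_requested"},           # Step 9.b
    "closure_requested": {"hosted", "test"},   # Step 9.b.i / Step 9.b.ii
}


def is_valid_path(path):
    """Return True if every consecutive pair of states is an allowed transition."""
    return all(b in TRANSITIONS.get(a, set()) for a, b in zip(path, path[1:]))


# A project that graduates and is later closed and sent back to test status.
graduated_then_closed = [
    "requested", "approved", "test", "approval_requested",
    "hosted", "closure_requested", "test",
]
print(is_valid_path(graduated_then_closed))
print(is_valid_path(["requested", "hosted"]))
```

Encoding the loop from Step 9.b.ii back to Step 3 explicitly is what lets a single project appear more than once in "time in Incubator" calculations, which is worth keeping in mind when wrangling the data.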

Metrics

Below are proposed quantitative metrics related to the state of languages across Wikimedia projects. This list of metrics is in progress.

Regarding some nomenclature in the tables below:

  • CPSLEs refer to the eight content projects specialized by linguistic edition: Wikipedia, Wiktionary, Wikinews, Wikivoyage, Wikiquote, Wikiversity, Wikisource, and Wikibooks.
  • Hosted projects refer to Wikimedia projects that are hosted by the Wikimedia Foundation and have their own domain (e.g., fr.wikipedia.org).
  • Test projects refer to content project editions in the Wikimedia Incubator (for Wikibooks, Wikinews, Wikipedia, Wikiquote, Wikivoyage, and Wiktionary), Wikiversity Beta, and Multilingual Wikisource.
  • Closed projects refer to Wikimedia projects that were previously hosted by the Wikimedia Foundation (with their own domain) but which have since been closed. While closing projects can include closing and (in some situations) deleting projects, the closed projects in these data below refer to those closed and sent back to test status.
  • Top languages refer to most spoken or signed languages in the world; current metrics use the top 20 most-spoken languages as of the year 2023.
  • Threatened languages refer to "vulnerable", "definitely endangered", "severely endangered", and "critically endangered" languages, as defined by UNESCO.[7]
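
Because test projects in the Incubator share one domain and are told apart only by a page prefix (this page describes the pattern as W[a-z]/[a-z]), identifying a test project usually means parsing that prefix. The following is a minimal sketch under stated assumptions: the letter-to-project mapping and the shape of language codes reflect the Incubator naming convention as understood here and should be verified, and `parse_test_prefix` is a hypothetical helper name.

```python
import re

# Assumed mapping from Incubator prefix letters to projects (e.g. "Wp/af").
# Treat it as illustrative and verify against Incubator before relying on it.
PREFIX_PROJECTS = {
    "p": "Wikipedia", "t": "Wiktionary", "b": "Wikibooks",
    "n": "Wikinews", "q": "Wikiquote", "y": "Wikivoyage",
}

# "W", one project letter, "/", then a language code such as "af" or
# "zh-min-nan" (assumed shape), followed by "/" or the end of the title.
PREFIX_RE = re.compile(r"^W([a-z])/([a-z]{2,3}(?:-[a-z0-9]+)*)(?=/|$)")


def parse_test_prefix(title):
    """Split an Incubator page title into (project, language code), or None."""
    match = PREFIX_RE.match(title)
    if match is None or match.group(1) not in PREFIX_PROJECTS:
        return None
    return PREFIX_PROJECTS[match.group(1)], match.group(2)


print(parse_test_prefix("Wp/af/Main_Page"))
print(parse_test_prefix("Main_Page"))
```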

Metrics: state of languages

Facet | Which languages | Metric | Location(s)
Representation across CPSLEs: hosted | All | Number of languages with >1 hosted CPSLE; number of languages with # hosted CPSLEs | notebook (see "Metrics")
Representation across CPSLEs: hosted or test | All | Number of languages with >1 hosted or test CPSLE; number of languages with # hosted or test CPSLEs | notebook (see "Metrics")
Representation across multilingual projects | All | Number of languages with Wikidata lexemes; number of languages with captions/descriptions on Commons | tbd
Wikipedia representation | All | Number of languages with hosted Wikipedia | notebook (see "Metrics: project-level")
Wikisource representation | All | Number of languages with hosted Wikisource | notebook (see "Metrics: project-level")
Wiktionary/Wikidata lexeme representation | All | Number of languages with hosted Wiktionary or lexemes on Wikidata | tbd
Commons representation | All | Number of languages with representation (i.e., descriptions or captions) on Commons | tbd
MinT coverage | All | Number of languages with MinT translatability | see MinT.yaml for list; notebook tbd
Content translation coverage | All | Number of languages (or language editions) with content translation tool capabilities | tbd
Script coverage | All | Number of languages with script supported on Wikimedia projects | see langdb.yaml for list; notebook tbd
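
As an illustration of the "number of languages with # hosted CPSLEs" metric, the sketch below tallies hosted projects per language and then counts languages per tally. The records and the `(language, project, status)` layout are made up for this example and are not the project's actual data schema.

```python
from collections import Counter

# Made-up (language, project, status) records for illustration only.
records = [
    ("af", "Wikipedia", "hosted"),
    ("af", "Wiktionary", "hosted"),
    ("af", "Wikinews", "test"),
    ("xx", "Wikipedia", "test"),
    ("yy", "Wikipedia", "hosted"),
]


def hosted_counts(rows):
    """Map each language to its number of hosted projects (0 if none)."""
    counts = Counter({lang: 0 for lang, _, _ in rows})
    counts.update(lang for lang, _, status in rows if status == "hosted")
    return dict(counts)


def languages_per_count(rows):
    """Map 'number of hosted projects' to 'number of languages with that many'."""
    return dict(Counter(hosted_counts(rows).values()))


print(hosted_counts(records))
print(languages_per_count(records))
```

Seeding the counter with zeros keeps languages that only have test projects in the result, so they are counted under "0 hosted" rather than silently dropped.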

Metrics: state of top languages

Facet | Which languages | Metric | Location(s)
Representation across CPSLEs: hosted | Top | Number of top languages with # hosted CPSLEs; number of top languages with all 8 hosted CPSLEs | notebook (see "Metrics")
Representation across CPSLEs: hosted or test | Top | Number of top languages with # hosted or test CPSLEs; number of top languages with all 8 hosted or test CPSLEs | tbd
Representation across multilingual projects | Top | Number of top languages with Wikidata lexemes; number of top languages with captions/descriptions on Commons | tbd
Wikipedia representation | Top | Number of top languages with hosted Wikipedia | notebook (see "Metrics: project-level")
Wikisource representation | Top | Number of top languages with hosted Wikisource | notebook (see "Metrics: project-level")
Wiktionary/Wikidata lexeme representation | Top | Number of top languages with hosted Wiktionary or lexemes on Wikidata | tbd
Commons representation | Top | Number of top languages with representation (i.e., descriptions or captions) on Commons | tbd

Metrics: state of threatened languages

Facet | Which languages | Metric | Location(s)
Representation across CPSLEs: hosted | Threatened | Number of threatened languages with >1 hosted CPSLE; number of threatened languages with # hosted CPSLEs | tbd
Representation across CPSLEs: hosted or test | Threatened | Number of threatened languages with >1 hosted or test CPSLE; number of threatened languages with # hosted or test CPSLEs | tbd
Wikisource representation | Threatened | Number of threatened languages with hosted or test Wikisource | tbd
Wiktionary/Wikidata lexeme representation | Threatened | Number of threatened languages with hosted Wiktionary or lexemes on Wikidata | tbd
MinT representation | Threatened | Number of the world's threatened languages with MinT translatability | tbd

Metrics: language-specific

Facet | Which languages | Metric (i.e., for every language, we will calculate...) | Location(s)
Project coverage counts | Per language | Number of hosted CPSLEs the language has; number of test CPSLEs the language has; number of closed CPSLEs the language has | notebook
Project coverage statuses | Per language | Status of each of Wikibooks, Wikinews, Wikipedia, Wikiquote, Wikisource, Wiktionary, Wikiversity, and Wikivoyage for the language (e.g., hosted, test, closed, none) | dataset, visualization
Project coverage statuses | Per language | Status of Wikidata lexemes for the language (e.g., present, absent); status of Commons descriptions/captions for the language (e.g., present, absent) | tbd
Wikipedia | Per language | Ratio of active editors to speakers of the language | notebook
Wikipedia | Per language | Ratio of articles to speakers of the language | notebook
Wikipedia | Per language | Ratio of unique devices to speakers of the language | notebook
Wikipedia | Per language | Ratio of pageviews to speakers of the language | tbd
Wikisource | Per language | tbd | tbd
Wiktionary/Wikidata lexemes | Per language | tbd | tbd
Commons | Per language | tbd | tbd
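
The per-language Wikipedia ratios above all divide a project statistic by the number of speakers. The sketch below shows one way such a ratio could be computed. The figures are invented, the dictionary layout is hypothetical, and scaling to one million speakers is a presentation choice made here rather than something the project specifies.

```python
# Made-up figures for illustration; not actual Wikipedia or speaker statistics.
languages = {
    "lang_a": {"speakers": 2_000_000, "articles": 50_000},
    "lang_b": {"speakers": 80_000_000, "articles": 40_000},
    "lang_c": {"speakers": 0, "articles": 10},
}


def per_million_speakers(stats, field):
    """Return `field` per one million speakers, or None if speakers is zero or missing."""
    speakers = stats.get("speakers")
    if not speakers:
        return None
    return stats[field] * 1_000_000 / speakers


for name, stats in languages.items():
    print(name, per_million_speakers(stats, "articles"))
```

Returning `None` for a missing or zero speaker count avoids a division error and makes gaps in the speaker data visible instead of producing a misleading ratio.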

Metrics: Incubator

Coming soon

Insights (in progress)

Note: All insights in this section are in progress. Data presented in the visualizations are still undergoing QA; as such, visualizations should be interpreted and treated as non-finalized.

Overall

As of December 2023, 329 languages have at least one edition of a hosted Wikimedia content project (i.e., a Wikipedia, Wikibooks, Wikinews, Wikiquote, Wikisource, Wiktionary, Wikiversity, and/or Wikivoyage). "Hosted" means that the project has its own domain/URL (e.g., en.wikipedia.org) as opposed to living in the Wikimedia Incubator, Wikiversity Beta, or Multilingual Wikisource. Twelve languages have all 8 possible hosted content projects; those languages are Chinese, English, Finnish, French, German, Greek, Italian, Japanese, Portuguese, Russian, Spanish, and Swedish.



As of December 2023, an estimated 1,076 languages have at least one edition of a Wikimedia content project (i.e., a Wikipedia, Wikibooks, Wikinews, Wikiquote, Wikisource, Wiktionary, Wikiversity, and/or Wikivoyage), either hosted or in test. "Hosted" means that the project has its own domain/URL (e.g., en.wikipedia.org), while "test" refers to projects that live within Wikimedia Incubator, Wikiversity Beta, or Multilingual Wikisource. 45 languages are represented within all 8 possible content projects, including some combination of hosted and test projects.



As of December 2023, there were many more test projects than hosted projects. The Wikipedia, Wiktionary, Wikivoyage, Wikisource, Wikinews, and Wikiversity projects all had more editions in test than hosted. For instance, while there were 326 hosted Wikipedia editions, there were an estimated 698 Wikipedia editions in the Incubator (including 13 that were previously hosted and then closed). And while there were 168 hosted Wiktionary editions, there were an estimated 271 Wiktionary editions in the Incubator (including 23 that were previously hosted and then closed).


Wikipedia

Wikipedia sizes vary widely across the world's 20 most-spoken languages,[8] ranging from more than 6.7 million articles (English Wikipedia) to 1,200 articles (Nigerian Pidgin Wikipedia).

How do Wikipedia sizes compare to "language sizes" (i.e., the number of speakers of a language, including first-language speakers and speakers for whom the language is a second or other language)? Results of exploratory plotting suggest that Wikipedia’s coverage is largely unrepresentative of the world's language populations. For instance, while Indonesian has 33% more speakers than German, German Wikipedia has about 325% more articles than Indonesian Wikipedia (put differently, Indonesian Wikipedia has roughly a quarter as many articles).
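
A percentage difference depends on which side is taken as the baseline, and a quantity cannot be more than 100% smaller than another. The sketch below converts "A has p% more than B" into "B has q% fewer than A". It assumes that the 325% figure above takes Indonesian Wikipedia's article count as the baseline, which is an interpretation rather than something the source states; `percent_fewer` is a hypothetical helper name.

```python
def percent_fewer(percent_more):
    """If A has `percent_more`% more than B, return how many % fewer B has than A."""
    ratio = 1 + percent_more / 100  # A divided by B
    return (1 - 1 / ratio) * 100


# 325% more corresponds to a ratio of 4.25, i.e. about 76.5% fewer the other way.
print(round(percent_fewer(325), 1))  # 76.5
```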

It is worth noting that some low article counts can be attributed to a Wikipedia edition's "age". For instance, Nigerian Pidgin Wikipedia has the lowest article count; but it is also the "youngest" Wikipedia of these 20, having "graduated" from the Incubator in August 2022.[9] Future analyses and visualizations will control for project age.

Incubator

As of April 2024, a total of 217 projects have graduated from the Incubator. About 1,350 test projects live in the Incubator currently;[10] of these, about 735 are active or substantial.

Current projects

The number of days that current Incubator projects have been in the Incubator varies widely, from 16 days to more than 6,400 days (>17 years). The visualization below shows the number of days that current projects have been in the Incubator. Each dot represents a project.

  • Overall, the average number of days current projects have been in the Incubator is 2933.5 days (or about 8 years); the median is 2957 days (or about 8 years and 1 month).
  • The median number of days that current Wikipedia test projects have been in the Incubator is 3092.5 days (or about 8 years and 5 months).
  • Wikibooks test projects have the lowest median number of days spent in the Incubator (2362 days, or about 6 years and 5 months).
  • Wikinews test projects have the highest median (3588 days, or about 9 years and 10 months).
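
The year-and-month approximations in the bullets above can be derived from the day counts with a simple conversion. The sketch below assumes a 365-day year, a 30-day month, and rounding down; the page does not state which convention was actually used, so results for other values may differ from the page by a month.

```python
def days_to_years_months(days, days_per_year=365, days_per_month=30):
    """Convert a day count to whole (years, months), rounding down."""
    years, remainder = divmod(int(days), days_per_year)
    # Cap at 11 so a remainder of 360-364 days is not reported as 12 months.
    return years, min(remainder // days_per_month, 11)


for days in (2957, 3092.5, 2362, 3588):
    print(days, days_to_years_months(days))
```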

Time-to-graduation

The number of days that graduated Incubator projects have spent in the Incubator before graduating varies widely, from 8 days to more than 5,300 days (>14 years). The visualization below shows the number of days that Incubator graduates spent in the Incubator. Each dot represents a project.

  • Overall, the average number of days that Incubator graduates spent in the Incubator is 1614.5 days (or about 4 years and 5 months); the median is 836 days (or about 2 years and 3 months).
  • The median number of days that graduated Wikipedia projects spent in the Incubator is 975.5 days (or about 2 years and 8 months).
  • Graduated Wikivoyage projects have the lowest median number of days spent in the Incubator (428 days, or about 1 year and 2 months).
  • Graduated Wiktionary projects have the highest median (1238.5 days, or about 3 years and 5 months).

Is time-to-graduation improving? (I.e., how long did test projects started in earlier years take to graduate compared with projects started in more recent years?) The visualization below shows, for each start year, the percentage of projects that took 0-1 year to graduate, 1-2 years to graduate, etc. Darker green means fewer years to graduate, so the more dark green a bar has, the higher the percentage of projects started that year that graduated quickly.

  • The majority of graduated projects that started in 2007-2008, 2013-2014, and 2016-2018 graduated in less than 5 years.
  • The majority of graduated projects that started in 2009-2012 and 2015 took more than 5 years to graduate.
  • Graduated projects that started in 2007, 2016, and 2018 had the quickest time-to-graduation.
  • (Projects that started between 2019 and 2024 were excluded from the visualization below because they still have time to potentially graduate in less than 5 years.)
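
The per-year percentages described above amount to grouping graduated projects by start year and binning their time-to-graduation into whole years. The sketch below illustrates that grouping on invented data; the `(start_year, days_to_graduate)` layout and the 365-day year are assumptions made for this example.

```python
from collections import Counter, defaultdict

# Made-up (start_year, days_to_graduate) pairs for illustration only.
graduates = [(2007, 200), (2007, 500), (2009, 2100), (2009, 2500), (2009, 300)]


def year_bucket_shares(rows, days_per_year=365):
    """For each start year, return {whole years to graduate: % of that cohort}."""
    by_year = defaultdict(Counter)
    for start_year, days in rows:
        by_year[start_year][days // days_per_year] += 1
    return {
        year: {b: 100 * n / sum(c.values()) for b, n in sorted(c.items())}
        for year, c in sorted(by_year.items())
    }


shares = year_bucket_shares(graduates)
print(shares[2007])
```

Here bucket 0 corresponds to "0-1 year", bucket 1 to "1-2 years", and so on.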


Because the visualization above only looks at graduated projects, it is important to consider these data within the context of projects that could have graduated but have not yet done so. The visualization below presents the same data as above, but adds in projects that haven't graduated yet and combines them (in white) with projects that didn't graduate in less than 5 years.

  • Test projects starting in 2008 had the highest % of projects graduating in 5 years or fewer.
  • Test projects starting in 2007, 2009, and 2018 also had a high % of projects graduating in 5 years or fewer, compared to other years.
  • Test projects starting in 2010-2011 and 2015-2017 had the lowest % of projects graduating in 5 years or fewer.
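
Including projects that have not graduated changes the denominator: the share is computed over all projects started in a year, not only over graduates. A minimal sketch on invented data follows; the record layout and the use of `None` for "not yet graduated" are assumptions made for this example.

```python
# Made-up records: (start_year, days_to_graduate, or None if not yet graduated).
projects = [(2008, 400), (2008, 900), (2008, None), (2010, 2600), (2010, None)]


def share_within(rows, max_days=5 * 365):
    """Per start year, % of all started projects that graduated within max_days."""
    totals, quick = {}, {}
    for year, days in rows:
        totals[year] = totals.get(year, 0) + 1
        if days is not None and days <= max_days:
            quick[year] = quick.get(year, 0) + 1
    return {year: 100 * quick.get(year, 0) / n for year, n in totals.items()}


print(share_within(projects))
```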

Graduation rates

Does having a preceding project that has already graduated affect the likelihood of Incubator graduation? (E.g., is Pa'O Wiktionary more likely to graduate because the prior project in that language, Pa'O Wikipedia, already graduated in 2022?) In comparing graduation rates for projects with and without a preceding project that has already graduated, the following trend was shown for Wikibooks, Wikinews, Wikiquote, Wikivoyage, and Wiktionary (shown in the two visualizations below): a higher percentage of test projects graduated when their preceding test project had already graduated. Additional research could shed light on whether this trend indicates that prior experience with successful projects has a positive effect on the graduation rates of subsequent test projects.

Why is Wikipedia different? (I.e., why are percentages roughly equal for Wikipedias with and without a preceding project that has already graduated?) It could be because Wikipedias are usually the first test project for a language. Of all the Wikipedia projects that have gone through (or are still in) the Incubator, 756 are the first project for a language and 20 are the second project; there are no Wikipedias that are the 3rd, 4th, etc. It could also be that newer projects are more likely to be first-time new language projects and, because they are newer, they are also less likely to have graduated yet.
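
The comparison above boils down to a graduation rate per group, where a group is a project type combined with whether a preceding project in the same language had already graduated. The sketch below shows that calculation on invented records; the tuple layout is hypothetical and not the project's data schema.

```python
# Made-up records: (project, has_graduated_predecessor, graduated).
rows = [
    ("Wiktionary", True, True),
    ("Wiktionary", True, False),
    ("Wiktionary", False, False),
    ("Wiktionary", False, False),
    ("Wikibooks", True, True),
]


def graduation_rates(records):
    """Return {(project, has_predecessor): % graduated} for each group."""
    totals, grads = {}, {}
    for project, has_pred, graduated in records:
        key = (project, has_pred)
        totals[key] = totals.get(key, 0) + 1
        grads[key] = grads.get(key, 0) + int(graduated)
    return {key: 100 * grads[key] / totals[key] for key in totals}


rates = graduation_rates(rows)
print(rates)
```

A rate computed this way is descriptive only; as noted above, project age and ordering effects would need to be controlled for before drawing causal conclusions.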

References

  1. "Pillar 1: How We Are Working Toward Knowledge Equity". 2021-2022 Annual Report. Wikimedia Foundation.
  2. Per canonical-data/wiki/wikis.tsv 30 May 2023
  3. Percentage based on current estimates of about 7,000 living languages provided by Ethnologue and Linguistic Society of America.
  4. https://meta.wikimedia.org/wiki/Mission
  5. Global Education Monitoring Report Team. "If you don't understand, how can you learn?" February 2016. Policy Paper 24. UNESCO.
  6. Wikimedia Movement Strategy: Direction. https://meta.wikimedia.org/wiki/Strategy/Wikimedia_movement/2017/Direction
  7. "Atlas of the World's Languages in Danger". unesdoc.unesco.org. UNESCO. 2011. p. 6. Retrieved 2024-03-28. Vulnerable - Most children speak the language, but it may be restricted to certain domains (e.g., home); Definitely endangered - Children no longer learn the language as mother tongue in the home; Severely endangered - Language is spoken by grandparents and older generations; while the parent generation may understand it, they do not speak it to children or among themselves; Critically endangered - The youngest speakers are grandparents and older, and they speak the language partially and infrequently 
  8. Eberhard, David M., Gary F. Simons, and Charles D. Fennig (eds.). 2023. Ethnologue: Languages of the World. Twenty-sixth edition. Dallas, Texas: SIL International. Online version: http://www.ethnologue.com.
  9. 2021-2022 Annual Report. Wikimedia Foundation.
  10. "03_wrangled_data/projects.tsv". GitLab. 2024-04-09. Retrieved 2024-04-11.