Research:Incubator and language representation across Wikimedia projects
This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.
The Wikimedia Foundation supports more languages than any other large online platform.[1] Currently, more than 320 languages have one or more editions of open-content Wikimedia projects.[2] However, another way of looking at that statistic is: currently, fewer than 5% of the world’s living languages have at least one edition of a Wikimedia project.[3] How can Wikimedians create new language editions? Via Wikimedia Incubator.
Goals of this project:
- Develop metrics for the state of languages at Wikimedia
- Develop metrics for better understanding Incubator
- Develop knowledge gaps metrics for measuring language gaps
Progress on this project can be followed at T348246
Background
Why language?
Language representation is connected to the Foundation's mission and strategic goals as well as the Research team's Knowledge Gaps project.
The Foundation’s mission “is to empower and engage people around the world to collect and develop educational content under a free license or in the public domain, and to disseminate it effectively and globally”.[4] UNESCO estimates that as much as 40% of the global population does not have access to education in a language they speak or understand.[5] If effective educational content dissemination is central to the Foundation’s mission and receiving instruction in one’s home language has an across-the-board positive impact on learning,[5] then language representation is of central concern to the Foundation’s mission.
The Research team's Knowledge Gaps project draws from the Wikimedia movement's strategic goal of supporting “the knowledge and communities that have been left out by structures of power and privilege”.[6] In 2018-2019, the Research team began to advance knowledge equity with a research program to address knowledge gaps. The project aims "to deliver citable, peer-reviewed knowledge and new technology in order to generate baseline data on the diversity of the Wikimedia contributor population, understand reader needs across languages, remove barriers for contribution by underrepresented groups, and help contributors identify and expand missing content across languages and topics."[7] One of the representation gaps identified in the knowledge gaps taxonomy (for readers, content, and contributors) is language.
In order to understand language representation across Wikimedia projects, we would like to understand the following:
- What is the state of languages at Wikimedia? What is our global coverage? (RQ1)
- Are we reaching as many people as possible, by hosting content in the world’s biggest languages? (RQ2)
- Are we serving the members of marginalized language communities? (RQ3)
Why Incubator?
The journey through Wikimedia Incubator is a key step for language communities to engage in new Wikimedia projects, and thereby new forms of knowledge in their languages. That is because Wikimedia Incubator is where new-language versions of Wikipedia, Wiktionary, Wikibooks, Wikivoyage, Wikiquote, and Wikinews are arranged, written, and tested before Wikimedia hosting. (Note: New-language versions of Wikiversity go to Beta Wikiversity, and new-language versions of Wikisource go to Multilingual Wikisource.) Once an Incubator project is deemed ready for Wikimedia hosting, it “graduates” from the Incubator and receives its own domain. The Language Committee determines whether the project is ready to graduate by assessing whether it meets the basic requirements and guidelines for Incubator projects. Upon graduating, the project’s content is exported to its new domain. This process occurs manually and can take multiple days, or sometimes weeks. (Note: Incubator was created in June 2006. Before then, test projects were launched and edited on Meta-Wiki.)
Obstacles related to Incubator have been identified by multiple stakeholders in the past, including Peter Gallert in his 2018 Wikimania talk, "Wikipedia for Indigenous Communities", and Amir Aharoni in his 2020 Celtic Knot talk, "How We Can Make the Incubator Better." These obstacles include: learning curve; prefixes; visual editor problems; incomplete Wikidata support; no content translation; no specific wikistats; lack of automation; and configuration inequity.
Given these challenges, what we would like to understand is:
- What is the state of past and present projects in Incubator, and what implications does it have for language equity? (RQ4)
- What obstacles prevent successful and/or timely graduation from the Incubator? Do these obstacles affect certain languages more than others? (RQ5)
Research questions
(RQ1) What is the state of languages at Wikimedia?
- which of the world's language speakers/readers have access to educational content via Wikimedia projects
- which languages have (0,1,2,3,etc.) Wikimedia projects
- which languages have test projects in Incubator, Wikiversity Beta, or Multilingual Wikisource
(RQ2) Are we reaching as many people as possible, by hosting content in the world’s biggest languages?
- which of the world’s top languages have (0,1,2,3,etc.) Wikimedia projects
- which of the world’s top languages have a Wikipedia, Wiktionary, etc.
- which of the world’s top languages have test projects in Incubator, Wikiversity Beta, or Multilingual Wikisource
(RQ3) Are we serving the members of marginalized language communities?
- which of the world’s minority and/or threatened languages have (0,1,2,3,etc.) Wikimedia projects
- which of the world’s minority and/or threatened languages have a Wikipedia, Wiktionary/Wikidata lexemes, Wikisource
(RQ4) What is the state of past and present projects in Incubator, and what implications does it have for language equity?
- Length of time an Incubator project spends in Incubator
- Length of time between an Incubator project's creation and first meaningful edits
- Length of time between an Incubator project's last meaningful edits and Incubator graduation
- Variables predictive of time spent in Incubator; and/or predictive of successful graduation from Incubator
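The duration measures listed under RQ4 reduce to date arithmetic once each project has a creation date and, where applicable, a graduation date. The following is a minimal sketch, not the project's actual code; the function name, its parameters, and the dates are hypothetical placeholders.

```python
from datetime import date
from typing import Optional


def days_in_incubator(created: date, graduated: Optional[date], as_of: date) -> int:
    """Days between creation and graduation, or up to `as_of` if still in test."""
    end = graduated if graduated is not None else as_of
    if end < created:
        raise ValueError("end date precedes creation date")
    return (end - created).days


# Hypothetical example dates, for illustration only.
graduated_project = days_in_incubator(date(2020, 1, 1), date(2022, 1, 1), date(2024, 4, 1))
current_project = days_in_incubator(date(2023, 4, 1), None, date(2024, 4, 1))
```

The same function covers both graduated projects (time-to-graduation) and current projects (time spent so far), which keeps the two measures comparable.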
(RQ5) What obstacles prevent successful and/or timely graduation from the Incubator? Do these obstacles affect certain languages more than others?
These questions are being addressed as part of the Wikimedia Language engineering/Incubator conversations.
Data sources
This project uses data from multiple sources:
- canonical_data.wikis
- Incubator:Site_creation_log
- Incubator:Test_wikis/status/rejected
- meta:Proposals_for_closing_projects/Closed_proposals
- meta:Proposals_for_closing_projects/Archive
- ISO 639-3 Code Set
- Wikisource:Languages
- Wikiversity Beta:States of Wikiversities
- wmf.mediawiki_history
Additional details about these sources are provided in the source data folder of this project's GitLab repo.
As the list above shows, data related to Incubator projects and the Incubator process live in multiple places. In the table below, the Incubator process is outlined with links to the referenced data sources.
Table. Incubator process with linked data sources
Step | Notes |
---|---|
Step 1. Request for new language version of a Wikipedia, Wiktionary, Wikibooks, Wikiquote, Wikinews, or Wikivoyage | Language must have an ISO code. |
Step 2.a. Request approved | Continue to Step 3. |
Step 2.b. Request denied | |
Step 3. Test wiki created in Incubator | These wikis don’t have their own domain (they live under incubator.wikimedia.org, with the URL formatting of incubator.wikimedia.org/W[a-z]/[a-z]) and they don’t have their own database name (they live within incubatorwiki). |
Step 4. Test wiki is edited by volunteers | |
Step 5. Request for approval from Language Committee | |
Step 6.a. Language Committee deems criteria met for Wikimedia hosting, including finding an expert to validate content | Continue to Step 7. |
Step 6.b. Language Committee deems criteria not met for Wikimedia hosting | Go back to Step 4. |
Step 7. Test wiki graduates from Incubator | Its content is copied and migrated to its own domain. |
Step 8. Wiki lives in the real world! | These wikis now have their own URL/domain and their own database name (e.g., af.wikipedia.org and afwiki). |
Step 9.a. No closure requested | Continue to Step 10. |
Step 9.b. Closure requested | |
Step 9.b.i. Closure request denied | Continue to Step 10. |
Step 9.b.ii. Closure request approved | Return to Step 3. |
Step 10. Wiki lives in the real world forever… | |
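Because test wikis share one domain and one database, the URL format noted in Step 3 is what identifies a page's project and language. A minimal sketch of parsing that prefix follows. The prefix-to-project mapping and the shape of the language code are assumptions for illustration, and `parse_incubator_title` is a hypothetical helper, not part of any Wikimedia library.

```python
import re
from typing import Optional, Tuple

# Assumed mapping from Incubator prefix to project family.
PREFIX_TO_PROJECT = {
    "Wp": "Wikipedia",
    "Wt": "Wiktionary",
    "Wb": "Wikibooks",
    "Wq": "Wikiquote",
    "Wn": "Wikinews",
    "Wy": "Wikivoyage",
}

# Prefix, then a language code (2-3 letters, optionally with hyphenated
# subtags), then an optional page name.
TITLE_PATTERN = re.compile(r"^(W[a-z])/([a-z]{2,3}(?:-[a-z0-9]+)*)(?:/(.*))?$")


def parse_incubator_title(title: str) -> Optional[Tuple[str, str]]:
    """Return (project family, language code), or None if the title has no known prefix."""
    match = TITLE_PATTERN.match(title)
    if match is None:
        return None
    project = PREFIX_TO_PROJECT.get(match.group(1))
    if project is None:
        return None
    return project, match.group(2)


parsed_page = parse_incubator_title("Wp/pcm/Main_Page")
parsed_root = parse_incubator_title("Wt/blk")
unprefixed = parse_incubator_title("Main_Page")
```

Titles without a recognized prefix return `None` rather than raising, since Incubator also hosts pages that belong to no test wiki.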
Metrics
Below are proposed quantitative metrics related to the state of languages across Wikimedia projects. This list of metrics is in progress.
Regarding some nomenclature in the tables below:
- CPSLEs refer to the eight content projects specialized by linguistic edition: Wikipedia, Wiktionary, Wikinews, Wikivoyage, Wikiquote, Wikiversity, Wikisource, and Wikibooks.
- Hosted projects refer to Wikimedia projects that are hosted by the Wikimedia Foundation and have their own domain (e.g., fr.wikipedia.org).
- Test projects refer to content project editions in the Wikimedia Incubator (for Wikibooks, Wikinews, Wikipedia, Wikiquote, Wikivoyage, and Wiktionary), Wikiversity Beta, and Multilingual Wikisource.
- Closed projects refer to Wikimedia projects that were previously hosted by the Wikimedia Foundation (with their own domain) but which have since been closed. While closing projects can include closing and (in some situations) deleting projects, the closed projects in these data below refer to those closed and sent back to test status.
- Top languages refer to most spoken or signed languages in the world; current metrics use the top 20 most-spoken languages as of the year 2023.
- Threatened languages refer to "vulnerable", "definitely endangered", "severely endangered", and "critically endangered" languages, as defined by UNESCO.[8]
Metrics: state of languages
Facet | Which languages | Metric | Location(s) |
---|---|---|---|
Representation across CPSLEs: hosted | All | Number of languages with >1 hosted CPSLE | notebook (see "Metrics") |
Representation across CPSLEs: hosted | All | Number of languages with # hosted CPSLEs | notebook (see "Metrics") |
Representation across CPSLEs: hosted or test | All | Number of languages with >1 hosted or test CPSLE | notebook (see "Metrics") |
Representation across CPSLEs: hosted or test | All | Number of languages with # hosted or test CPSLEs | notebook (see "Metrics") |
Representation across multilingual projects | All | Number of languages with Wikidata lexemes | tbd |
Representation across multilingual projects | All | Number of languages with captions/descriptions on Commons | tbd |
Wikipedia representation | All | Number of languages with hosted Wikipedia | notebook (see "Metrics: project-level") |
Wikisource representation | All | Number of languages with hosted Wikisource | notebook (see "Metrics: project-level") |
Wiktionary/Wikidata lexeme representation | All | Number of languages with hosted Wiktionary or lexemes on Wikidata | tbd |
Commons representation | All | Number of languages with representation (i.e., descriptions or captions) on Commons | tbd |
MinT coverage | All | Number of languages with MinT translatability | see MinT.yaml for list; notebook tbd |
Content translation coverage | All | Number of languages (or language editions) with content translation tool capabilities | tbd |
Script coverage | All | Number of languages with script supported on Wikimedia projects | see langdb.yaml for list; notebook tbd |
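The "Number of languages with # hosted CPSLEs" metric is a distribution: for each possible count of hosted projects, how many languages have exactly that many. A minimal sketch of that tally follows, using hypothetical toy records and placeholder language codes rather than the project's data.

```python
from collections import Counter, defaultdict

# Hypothetical toy records: (language code, project family, status).
records = [
    ("xxa", "Wikipedia", "hosted"),
    ("xxa", "Wiktionary", "hosted"),
    ("xxb", "Wikipedia", "hosted"),
    ("xxb", "Wikibooks", "test"),
    ("xxc", "Wikipedia", "test"),
]

hosted_by_language = defaultdict(set)
all_languages = set()
for language, project, status in records:
    all_languages.add(language)
    if status == "hosted":
        hosted_by_language[language].add(project)

# Number of hosted CPSLEs per language, including languages with none hosted.
hosted_counts = {lang: len(hosted_by_language.get(lang, set())) for lang in all_languages}

# How many languages have 0, 1, 2, ... hosted CPSLEs.
distribution = Counter(hosted_counts.values())

# Languages with at least one hosted CPSLE.
languages_with_hosted = sum(1 for n in hosted_counts.values() if n >= 1)
```

Storing projects in a set per language guards against counting the same project twice if a source lists it more than once.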
Metrics: state of top languages
Facet | Which languages | Metric | Location(s) |
---|---|---|---|
Representation across CPSLEs: hosted | Top 20, Top 200 | Number of top languages with # hosted CPSLEs | Top 20: notebook (see "Metrics"); Top 200: tbd |
Representation across CPSLEs: hosted | Top 20, Top 200 | Number of top languages with all 8 hosted CPSLEs | Top 20: notebook (see "Metrics"); Top 200: tbd |
Representation across CPSLEs: hosted or test | Top 20, Top 200 | Number of top languages with # hosted or test CPSLEs | tbd |
Representation across CPSLEs: hosted or test | Top 20, Top 200 | Number of top languages with all 8 hosted or test CPSLEs | tbd |
Representation across multilingual projects | Top 20, Top 200 | Number of top languages with Wikidata lexemes | tbd |
Representation across multilingual projects | Top 20, Top 200 | Number of top languages with captions/descriptions on Commons | tbd |
Wikipedia representation | Top 20, Top 200 | Number of top languages with hosted Wikipedia | Top 20: notebook (see "Metrics: project-level"); Top 200: tbd |
Wikisource representation | Top 20, Top 200 | Number of top languages with hosted Wikisource | Top 20: notebook (see "Metrics: project-level"); Top 200: tbd |
Wiktionary/Wikidata lexeme representation | Top 20, Top 200 | Number of top languages with hosted Wiktionary or lexemes on Wikidata | tbd |
Commons representation | Top 20, Top 200 | Number of top languages with representation (i.e., descriptions or captions) on Commons | tbd |
Metrics: state of threatened languages
Facet | Which languages | Metric | Location(s) |
---|---|---|---|
Representation across CPSLEs: hosted | Threatened | Number of threatened languages with >1 hosted CPSLE | tbd |
Representation across CPSLEs: hosted | Threatened | Number of threatened languages with # hosted CPSLEs | tbd |
Representation across CPSLEs: hosted or test | Threatened | Number of threatened languages with >1 hosted or test CPSLE | tbd |
Representation across CPSLEs: hosted or test | Threatened | Number of threatened languages with # hosted or test CPSLEs | tbd |
Wikisource representation | Threatened | Number of threatened languages with hosted or test Wikisource | tbd |
Wiktionary/Wikidata lexeme representation | Threatened | Number of threatened languages with hosted Wiktionary or lexemes on Wikidata | tbd |
MinT representation | Threatened | Number of world's threatened languages with MinT translatability | tbd |
Metrics: language-specific
Facet | Which languages | Metric (i.e., for every language, we will calculate...) | Location(s) |
---|---|---|---|
Project coverage counts | Per language | Number of hosted CPSLEs the language has | notebook |
Project coverage counts | Per language | Number of test CPSLEs the language has | notebook |
Project coverage counts | Per language | Number of closed CPSLEs the language has | notebook |
Project coverage statuses | Per language | Status of Wikibooks for the language (e.g., hosted, test, closed, none) | dataset, visualization |
Project coverage statuses | Per language | Status of Wikinews for the language (e.g., hosted, test, closed, none) | dataset, visualization |
Project coverage statuses | Per language | Status of Wikipedia for the language (e.g., hosted, test, closed, none) | dataset, visualization |
Project coverage statuses | Per language | Status of Wikiquote for the language (e.g., hosted, test, closed, none) | dataset, visualization |
Project coverage statuses | Per language | Status of Wikisource for the language (e.g., hosted, test, closed, none) | dataset, visualization |
Project coverage statuses | Per language | Status of Wiktionary for the language (e.g., hosted, test, closed, none) | dataset, visualization |
Project coverage statuses | Per language | Status of Wikiversity for the language (e.g., hosted, test, closed, none) | dataset, visualization |
Project coverage statuses | Per language | Status of Wikivoyage for the language (e.g., hosted, test, closed, none) | dataset, visualization |
Project coverage statuses | Per language | Status of Wikidata lexemes for the language (e.g., present, absent) | tbd |
Project coverage statuses | Per language | Status of Commons descriptions/captions for the language (e.g., present, absent) | tbd |
Wikipedia | Per language | Ratio of active editors to speakers of the language | notebook |
Wikipedia | Per language | Ratio of articles to speakers of the language | notebook |
Wikipedia | Per language | Ratio of unique devices to speakers of the language | notebook |
Wikipedia | Per language | Ratio of pageviews to speakers of the language | tbd |
Wikisource | Per language | tbd | tbd |
Wiktionary/Wikidata lexemes | Per language | tbd | tbd |
Commons | Per language | tbd | tbd |
Metrics: Incubator
Coming soon
Insights (in progress)
Note: All insights in this section are in progress. Data presented in the visualizations are still undergoing QA; as such, visualizations should be interpreted and treated as non-finalized.
Overall
As of April 2024, 332 languages have at least one edition of a hosted Wikimedia content project (i.e., a Wikipedia, Wikibooks, Wikinews, Wikiquote, Wikisource, Wiktionary, Wikiversity, and/or Wikivoyage). "Hosted" means that the project has its own domain/URL (e.g., en.wikipedia.org) as opposed to living in the Wikimedia Incubator, Wikiversity Beta, or Multilingual Wikisource. Twelve languages have all 8 possible hosted content projects; those languages are Chinese, English, Finnish, French, German, Greek, Italian, Japanese, Portuguese, Russian, Spanish, and Swedish.
As of April 2024, an estimated 1,099 languages have at least one edition of a Wikimedia content project (i.e., a Wikipedia, Wikibooks, Wikinews, Wikiquote, Wikisource, Wiktionary, Wikiversity, and/or Wikivoyage), either hosted or in test. "Hosted" means that the project has its own domain/URL (e.g., en.wikipedia.org), while "test" refers to pre-hosted projects located in the Wikimedia Incubator, Wikiversity Beta, or Multilingual Wikisource. 45 languages are represented within all 8 possible content projects, including some combination of hosted and test projects.
As of April 2024, there were many more test projects than hosted projects. The Wikipedia, Wiktionary, Wikivoyage, Wikisource, Wikinews, and Wikiversity projects all had more editions in test than were hosted. For instance, while there were 329 hosted Wikipedia editions, there were more than 700 Wikipedia editions in the Incubator (including 13 that were previously hosted and then closed). And while there were 169 hosted Wiktionary editions, there were an estimated 280 Wiktionary editions in the Incubator (including 24 that were previously hosted and then closed).
Wikipedia
Wikipedia sizes vary widely among the world's 20 most spoken languages,[9] from approximately 1,400 articles (Nigerian Pidgin Wikipedia) to more than 6.8 million articles (English Wikipedia).
How do Wikipedia sizes compare to "language sizes" (i.e., the number of speakers of a language, including first-language speakers and speakers for whom the language is a second or other language)? Exploratory plotting suggests that Wikipedia’s coverage is largely unrepresentative of the world's language populations. For instance, while Indonesian has 33% more speakers than German, German Wikipedia has roughly 325% more articles than Indonesian Wikipedia.
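A "percent fewer" figure can never exceed 100%, so a gap as large as 325% only makes sense as "percent more", measured against the smaller value. The sketch below shows the two conventions side by side. It assumes the 325% figure expresses how much larger German Wikipedia is than Indonesian Wikipedia, which is an interpretation rather than something the source states. The function names are illustrative.

```python
def percent_more(larger: float, smaller: float) -> float:
    """How much bigger `larger` is, as a percentage of `smaller`."""
    return (larger - smaller) / smaller * 100


def percent_fewer(larger: float, smaller: float) -> float:
    """How much smaller `smaller` is, as a percentage of `larger` (at most 100)."""
    return (larger - smaller) / larger * 100


def more_to_fewer(pct_more: float) -> float:
    """Convert 'A is pct_more% larger than B' into 'B is x% smaller than A'."""
    ratio = 1 + pct_more / 100
    return (1 - 1 / ratio) * 100


# Under the stated assumption, "325% more" corresponds to this "percent fewer".
equivalent_fewer = round(more_to_fewer(325), 1)
```

Under that assumption, "325% more" means about 4.25 times as many articles, which is the same gap as roughly 76.5% fewer.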
It is worth noting that some low article counts can be attributed to a Wikipedia edition's "age". For instance, Nigerian Pidgin Wikipedia has the lowest article count; but it is also the "youngest" Wikipedia of these 20, having "graduated" from the Incubator in August 2022.[10] Future analyses and visualizations will control for project age.
Incubator
As of April 2024, a total of 217 projects have graduated from the Incubator. About 1,350 test projects live in the Incubator currently;[11] of these, about 735 are active or substantial.
Current projects
The number of days that current Incubator projects have been in the Incubator varies widely, from 16 days to more than 6,400 days (>17 years). The visualization below shows the number of days that current projects have been in the Incubator. Each dot represents a project.
- Overall, the average number of days current projects have been in the Incubator is 2933.5 days (or about 8 years); the median is 2957 days (or about 8 years and 1 month).
- The median number of days that current Wikipedia test projects have been in the Incubator is 3092.5 days (or about 8 years and 5 months).
- Wikibooks test projects have the lowest median number of days spent in the Incubator (2362 days, or about 6 years and 5 months).
- Wikinews test projects have the highest median (3588 days, or about 9 years and 10 months).
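The year-and-month equivalents above can be checked with a small conversion. This is a sketch under an assumed convention (365-day years, 30-day months, rounding down), not necessarily the one used to produce the figures; `days_to_years_months` is a hypothetical helper.

```python
from typing import Tuple


def days_to_years_months(days: float) -> Tuple[int, int]:
    """Split a day count into whole years and whole months.

    Assumes a 365-day year and a 30-day month, rounding down.
    """
    whole_days = int(days)
    years, remainder = divmod(whole_days, 365)
    # Cap at 11 so that 360-364 leftover days do not read as 12 months.
    months = min(remainder // 30, 11)
    return years, months


overall_median = days_to_years_months(2957)
wikipedia_median = days_to_years_months(3092.5)
wikibooks_median = days_to_years_months(2362)
wikinews_median = days_to_years_months(3588)
```

Under this convention the four medians come out as 8 years 1 month, 8 years 5 months, 6 years 5 months, and 9 years 10 months, matching the figures in the list.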
Time-to-graduation
The number of days that graduated Incubator projects have spent in the Incubator before graduating varies widely, from 8 days to more than 5,300 days (>14 years). The visualization below shows the number of days that Incubator graduates spent in the Incubator. Each dot represents a project.
- Overall, the average number of days that Incubator graduates spent in the Incubator is 1614.5 days (or 4 years and 5 months); the median is 836 days (or 2 years and 3 months).
- The median number of days that graduated Wikipedia projects spent in the Incubator is 975.5 days (or about 2 years and 8 months).
- Graduated Wikivoyage projects have the lowest median number of days spent in the Incubator (428 days, or about 1 year and 2 months).
- Graduated Wiktionary projects have the highest median (1238.5 days, or about 3 years and 5 months).
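The overall average (1614.5 days) is nearly double the overall median (836 days), which is what a right-skewed distribution looks like: most projects graduate relatively quickly, while a few very long stays pull the mean upward. A minimal sketch of that effect, using hypothetical durations rather than the project's data:

```python
from statistics import mean, median

# Hypothetical durations in days: most projects graduate fairly quickly,
# and one takes very long.
durations = [200, 400, 800, 900, 5000]

average_days = mean(durations)   # pulled upward by the 5000-day project
median_days = median(durations)  # middle value, insensitive to the outlier's size
```

For skewed data like this, the median is usually the better summary of a typical project's experience.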
Is time-to-graduation improving? (i.e., how long did it take test projects started in earlier years to graduate vs. projects started in more recent years?) The visualization below shows percentages of projects started each year that took 0-1 year to graduate, 1-2 years to graduate, etc. Darker green means fewer years to graduate; so the more dark green a bar has, the higher the percentage of projects started that year that graduated quickly.
- The majority of graduated projects that started in 2007-2008, 2013-2014, and 2016-2018 graduated in less than 5 years
- The majority of graduated projects that started in 2009-2012 and 2015 took more than 5 years to graduate
- Graduated projects that started in 2007, 2016, and 2018 had the quickest time-to-graduation
- (Projects that started between 2019-2024 were excluded from the visualization below because they still have time to potentially graduate in less than 5 years)
Because the visualization above only looks at graduated projects, it is important to consider these data within the context of projects that could have graduated but have not yet done so. The visualization below presents the same data as above, but adds in projects that haven't graduated yet and combines them (in white) with projects that didn't graduate in less than 5 years.
- Test projects that started in 2008 had the highest % of projects graduating in 5 years or fewer.
- Test projects starting in 2007, 2009, and 2018 also had a high % of projects graduating in 5 years or fewer, compared to other years.
- Test projects starting in 2010-2011 and 2015-2017 had the lowest % of projects graduating in 5 years or fewer.
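The cohort percentages above use every project started in a given year as the denominator, so projects that have not graduated count against the share. A minimal sketch of that calculation follows. The records are hypothetical, and the five-year cutoff is expressed in 365-day years as an assumption.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

FIVE_YEARS_IN_DAYS = 5 * 365  # assumption: 365-day years

# Hypothetical records: (start year, days to graduation or None if not graduated).
projects: List[Tuple[int, Optional[int]]] = [
    (2008, 300),
    (2008, 1200),
    (2008, None),
    (2010, 2500),
    (2010, None),
    (2010, 900),
    (2010, None),
]


def share_graduating_within_five_years(
    rows: List[Tuple[int, Optional[int]]]
) -> Dict[int, float]:
    """Percent of each start-year cohort that graduated in five years or fewer."""
    totals: Dict[int, int] = defaultdict(int)
    quick: Dict[int, int] = defaultdict(int)
    for start_year, days_to_graduate in rows:
        totals[start_year] += 1
        if days_to_graduate is not None and days_to_graduate <= FIVE_YEARS_IN_DAYS:
            quick[start_year] += 1
    return {year: 100 * quick[year] / totals[year] for year in totals}


shares = share_graduating_within_five_years(projects)
```

Restricting the denominator to graduated projects instead would reproduce the earlier, graduates-only view, which is why the two visualizations can rank years differently.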
Graduation rates
Does having a preceding project that's already graduated affect likelihood of Incubator graduation? (E.g., is Pa'O Wiktionary more likely to graduate because the prior project in that language, Pa'O Wikipedia, already graduated in 2022?) In comparing graduation rates for projects with and without a preceding project that's already graduated, the following trend was shown for Wikibooks, Wikinews, Wikiquote, Wikivoyage, and Wiktionary (shown in the two visualizations below): a higher percentage of test projects graduated when their preceding test project had already graduated. Additional research could shed light on whether this trend is indicative of prior experience with successful projects having a positive effect on graduation rates of subsequent test projects.
Why is Wikipedia different? (i.e., why are percentages roughly equal for Wikipedias with and without a preceding project that's already graduated?) It could be because Wikipedias are usually the first test project for a language. Of all the Wikipedia projects that have gone through (or are still in) the Incubator, 756 are the first project for a language and 20 are the second project; no Wikipedias are the 3rd, 4th, etc. It could also be that newer projects are more likely to be first-time new language projects, and because they are newer, they are also less likely to have graduated yet.
References
- ↑ "Pillar 1: How We Are Working Toward Knowledge Equity". Wikimedia Foundation. Retrieved 2024-06-14.
- ↑ "canonical-data/wiki/wikis.tsv at master · wikimedia-research/canonical-data". GitHub. Retrieved 2024-06-14.
- ↑ Percentage based on current estimates of about 7,000 living languages provided by Ethnologue and Linguistic Society of America.
- ↑ "Mission". meta.wikimedia.org. Retrieved 2024-06-14.
- ↑ a b "Global Education Monitoring Report: If you don't understand, how can you learn?". UNESCO. February 2016. Retrieved 2024-06-14.
- ↑ "Wikimedia Movement Strategy: Direction". meta.wikimedia.org. Retrieved 2024-06-14.
- ↑ Leila Zia, Isaac Johnson, Bahodir Mansurov, Jonathan Morgan, Miriam Redi, Diego Saez-Trumper, and Dario Taraborelli. 2019. Knowledge Gaps – Wikimedia Research 2030. doi.org/10.6084/m9.figshare.7698245 [CC BY 4.0]
- ↑ "Atlas of the World's Languages in Danger". unesdoc.unesco.org. UNESCO. 2011. p. 6. Retrieved 2024-03-28.
Vulnerable - Most children speak the language, but it may be restricted to certain domains (e.g., home); Definitely endangered - Children no longer learn the language as mother tongue in the home; Severely endangered - Language is spoken by grandparents and older generations; while the parent generation may understand it, they do not speak it to children or among themselves; Critically endangered - The youngest speakers are grandparents and older, and they speak the language partially and infrequently
- ↑ Eberhard, David M., Gary F. Simons, and Charles D. Fennig (eds.). 2023. Ethnologue: Languages of the World. Twenty-sixth edition. Dallas, Texas: SIL International. Online version: http://www.ethnologue.com.
- ↑ 2021-2022 Annual Report. Wikimedia Foundation.
- ↑ "03_wrangled_data/projects.tsv". GitLab. 2024-04-09. Retrieved 2024-04-11.