Research:Develop a model for text simplification to improve readability of Wikipedia articles/Prioritization for simplification
In this work, we would like to explore options for prioritizing articles for simplification. The rationale is that some articles might benefit more than others. We would like to put together a list of articles where simplified versions would be most useful (say 100-1000 articles).
Method
For the analysis below, we considered all articles in English Wikipedia from the snapshot 2025-05.
We considered the following characteristics:
* Readability score: Flesch-Kincaid grade level (FKGL) of the text of the lead section.
* Length: Number of characters in the text of the lead section.
* SEW: Article exists in Simple English Wikipedia
* Pageviews: Number of pageviews (during January 2025)
* Edits: Number of edits (during January 2025)
* Topic: Topic label of the article from the article-topic model
* Templates: Whether it contains one of the maintenance templates: Technical, Confusing
Code to create dataset: https://gitlab.wikimedia.org/repos/research/simple-summaries/-/blob/main/simplification-prioritization_create-data.ipynb
Code to analyze dataset: https://gitlab.wikimedia.org/repos/research/simple-summaries/-/blob/main/simplification-prioritization_analyze-data.ipynb
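For orientation, the readability score used throughout is the standard Flesch-Kincaid grade level: 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The following is a minimal sketch of that formula and not the code from the notebooks linked above. The sentence splitter, the word tokenizer, and the vowel-group syllable heuristic (`count_syllables`) are simplifying assumptions made for illustration, so scores will differ somewhat from those of a dedicated readability library.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate (an assumption): count groups of consecutive vowels."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing "e" as silent ("made"), except in "-le" endings ("table").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def fkgl(text: str) -> float:
    """Flesch-Kincaid grade level of a piece of text, e.g. a lead section."""
    # Naive segmentation: split sentences on ., ! and ?; words are runs of letters.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


simple = "The cat sat on the mat."
technical = "Quantum electrodynamics describes electromagnetic interactions relativistically."
print(round(fkgl(simple), 2), round(fkgl(technical), 2))
```

Higher values correspond to text that is harder to read; the score is meant to approximate the US school grade needed to understand the text.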
Option 1: Topics
Some topics are generally more difficult to read/understand than others. In fact, I found that the readability of articles varies substantially across topics when calculating the FKGL for all articles and averaging over each of the 64 topics from the article-topic model. This suggests prioritizing articles from topics that are more difficult to read.
- the average readability score for all articles (FKGL): 11.4
- highest readability scores (difficult to read): mostly STEM (e.g. Physics, Mathematics, Medicine & Health), along with Culture.Linguistics (FKGL>14)
- lowest readability scores (easy to read): mostly Culture and Geography (e.g. Sports, Music, Geographical) (FKGL<11)
- see the table below for the average FKGL of each topic
| | Topic | FKGL |
|---|---|---|
| 0 | Culture.Linguistics | 14.735653 |
| 1 | STEM.Physics | 14.590325 |
| 2 | STEM.Mathematics | 14.327468 |
| 3 | STEM.Medicine_&_Health | 14.180763 |
| 4 | STEM.Libraries_&_Information | 13.888417 |
| 5 | STEM.Computing | 13.616421 |
| 6 | History_and_Society.Society | 13.420960 |
| 7 | STEM.Chemistry | 13.203550 |
| 8 | STEM.Technology | 13.098213 |
| 9 | History_and_Society.Politics_and_government | 13.093833 |
| 10 | History_and_Society.Business_and_economics | 13.045401 |
| 11 | Culture.Media.Software | 13.029624 |
| 12 | History_and_Society.Education | 12.421469 |
| 13 | STEM.Earth_and_environment | 12.319744 |
| 14 | History_and_Society.Military_and_warfare | 12.317690 |
| 15 | Culture.Philosophy_and_religion | 12.155918 |
| 16 | Geography.Regions.Europe.Southern_Europe | 12.019233 |
| 17 | Culture.Performing_arts | 11.941233 |
| 18 | Geography.Regions.Americas.South_America | 11.915420 |
| 19 | STEM.Engineering | 11.869008 |
| 20 | Culture.Internet_culture | 11.839834 |
| 21 | Culture.Media.Television | 11.837642 |
| 22 | STEM.STEM* | 11.834587 |
| 23 | Culture.Media.Entertainment | 11.792396 |
| 24 | Culture.Visual_arts.Fashion | 11.787822 |
| 25 | Culture.Literature | 11.786184 |
| 26 | Geography.Regions.Africa.Eastern_Africa | 11.745605 |
| 27 | Geography.Regions.Asia.Southeast_Asia | 11.719719 |
| 28 | Geography.Regions.Asia.North_Asia | 11.705829 |
| 29 | Geography.Regions.Americas.North_America | 11.674879 |
| 30 | Geography.Regions.Americas.Central_America | 11.662474 |
| 31 | Culture.Visual_arts.Comics_and_Anime | 11.656834 |
| 32 | Culture.Media.Video_games | 11.609778 |
| 33 | Culture.Biography.Women | 11.603536 |
| 34 | Culture.Biography.Biography* | 11.600981 |
| 35 | Geography.Regions.Africa.Southern_Africa | 11.595270 |
| 36 | STEM.Space | 11.591517 |
| 37 | Geography.Regions.Europe.Eastern_Europe | 11.587795 |
| 38 | Geography.Regions.Africa.Africa* | 11.568307 |
| 39 | Geography.Regions.Africa.Western_Africa | 11.456297 |
| 40 | History_and_Society.History | 11.418016 |
| 41 | Geography.Regions.Africa.Northern_Africa | 11.360570 |
| 42 | Culture.Visual_arts.Visual_arts* | 11.307989 |
| 43 | Geography.Regions.Oceania | 11.296372 |
| 44 | Culture.Media.Books | 11.216284 |
| 45 | Geography.Regions.Asia.Asia* | 11.201274 |
| 46 | Geography.Regions.Europe.Europe* | 11.156389 |
| 47 | Geography.Regions.Asia.South_Asia | 11.143322 |
| 48 | Culture.Food_and_drink | 11.135446 |
| 49 | Geography.Regions.Africa.Central_Africa | 11.130081 |
| 50 | Culture.Media.Media* | 11.095900 |
| 51 | Geography.Regions.Asia.West_Asia | 11.056867 |
| 52 | History_and_Society.Transportation | 11.005122 |
| 53 | Culture.Visual_arts.Architecture | 10.997538 |
| 54 | Culture.Media.Radio | 10.934781 |
| 55 | Geography.Regions.Asia.East_Asia | 10.897058 |
| 56 | Culture.Media.Films | 10.859632 |
| 57 | Geography.Regions.Europe.Northern_Europe | 10.821992 |
| 58 | Geography.Regions.Asia.Central_Asia | 10.761797 |
| 59 | Culture.Media.Music | 10.693294 |
| 60 | STEM.Biology | 10.639295 |
| 61 | Geography.Regions.Europe.Western_Europe | 10.635665 |
| 62 | Culture.Sports | 10.628404 |
| 63 | Geography.Geographical | 10.277326 |
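A per-topic table like the one above can be produced by grouping article scores by topic label and averaging. The sketch below assumes the input is a list of `(topic, fkgl)` pairs, one per article-topic label; how the original analysis handles articles that carry several topic labels is not stated here, so that input format is an assumption, as is the function name `mean_fkgl_by_topic`.

```python
from collections import defaultdict


def mean_fkgl_by_topic(rows):
    """Average FKGL per topic, sorted from hardest to easiest to read.

    rows: iterable of (topic, fkgl) pairs (assumed input format).
    """
    totals = defaultdict(lambda: [0.0, 0])  # topic -> [sum of scores, count]
    for topic, score in rows:
        totals[topic][0] += score
        totals[topic][1] += 1
    means = {topic: total / n for topic, (total, n) in totals.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


# Toy data for illustration only, not values from the dataset.
rows = [
    ("STEM.Physics", 15.0),
    ("STEM.Physics", 14.0),
    ("Culture.Sports", 10.0),
]
for topic, mean in mean_fkgl_by_topic(rows):
    print(f"{topic}\t{mean:.2f}")
```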
Option 2: Popular and no edits
On an article level, we could identify articles for which we might anticipate a need for a simpler version: those that i) are difficult to read, ii) have many pageviews, and iii) have little or no editing activity (i.e. don't seem controversial).
Specifically, we first filter articles according to the following criteria:
- in the top 5% of articles by pageviews (i.e. the 5% most popular articles)
- no edits (during January 2025)
- a reasonable length (i.e. between 1,000 and 100,000 characters)
- no existing article in Simple English Wikipedia
This yields 21,221 articles overall. From these, one can select different subsets of articles that can be considered difficult to read:
- at least slightly difficult to read (FKGL>12): 13,890 articles
- at least difficult to read (FKGL>15): 4,751 articles
- at least very difficult to read (FKGL>18): 1,058 articles
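The filtering and thresholding steps above can be sketched as follows. This is an illustration under stated assumptions, not the code from the linked notebooks: the `Article` record, its field names, and the helper names `pageview_cutoff` and `select_candidates` are hypothetical, and the pageview cutoff is taken as the lowest pageview count among the top 5% of articles when ranked by pageviews.

```python
from dataclasses import dataclass


@dataclass
class Article:
    # Hypothetical record mirroring the characteristics listed under Method.
    title: str
    fkgl: float      # readability of the lead section
    length: int      # number of characters
    in_sew: bool     # exists in Simple English Wikipedia
    pageviews: int   # pageviews in the observation month
    edits: int       # edits in the observation month


def pageview_cutoff(articles, top_percent=5):
    """Lowest pageview count among the `top_percent`% most viewed articles."""
    ranked = sorted((a.pageviews for a in articles), reverse=True)
    keep = max(1, len(ranked) * top_percent // 100)
    return ranked[keep - 1]


def select_candidates(articles, min_fkgl=12.0):
    """Popular, unedited, reasonably long articles without a SEW version and FKGL > min_fkgl."""
    cutoff = pageview_cutoff(articles)
    return [
        a for a in articles
        if a.pageviews >= cutoff            # ties at the cutoff are all kept
        and a.edits == 0
        and 1_000 <= a.length <= 100_000
        and not a.in_sew
        and a.fkgl > min_fkgl
    ]


# Toy data for illustration only: 100 articles with pageviews 1..100.
articles = [Article(f"A{i}", 16.0, 5_000, False, i, 0) for i in range(1, 101)]
print([a.title for a in select_candidates(articles, min_fkgl=15.0)])
```

Raising `min_fkgl` from 12 to 15 or 18 narrows the selection in the same way as the three subsets listed above.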
Option 3: Maintenance templates
Another option is to consider articles which are marked with relevant templates. Specifically, in enwiki I found two templates which could indicate a need for simplification:
- {{Confusing}}: "This article may be confusing or unclear to readers."
- {{Technical}}: "This article may be too technical for most readers to understand."
This yields 3,702 articles:
- the average readability (FKGL) of those articles is 14.0. This is substantially higher than the average readability of all articles (11.4)
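One way to flag such articles is to look for the two template names in an article's wikitext. The sketch below assumes that raw wikitext is available and matches only the exact names `Confusing` and `Technical` (case-insensitive, with or without parameters); it does not resolve template redirects or aliases, and the original dataset may have been built differently, for example from a template-links table. The names `TEMPLATE_PATTERN` and `simplification_templates` are hypothetical.

```python
import re

# Matches "{{Confusing}}", "{{technical|date=...}}", etc., but not
# other templates that merely start with these words.
TEMPLATE_PATTERN = re.compile(
    r"\{\{\s*(confusing|technical)\s*(?:\||\}\})", re.IGNORECASE
)


def simplification_templates(wikitext: str) -> set:
    """Return the subset of {'Confusing', 'Technical'} found in the wikitext."""
    return {m.group(1).capitalize() for m in TEMPLATE_PATTERN.finditer(wikitext)}


sample = "{{Technical|date=May 2024}}\nIn mathematics, a sheaf is a tool for ..."
print(simplification_templates(sample))
```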
Examples
A list of 1,000 example articles from each of the 3 options can be found in this spreadsheet.
Notes
This prioritization focuses only on the content of articles, e.g. their readability. However, factors beyond the content, such as the target audience, likely also affect whether a simplified version is suitable. For example, a survey about the perception of simplified versions of Wikipedia articles indicates that what is considered easy to read is very subjective and depends on, e.g., the level of education of the reader.