Research:Develop a model for text simplification to improve readability of Wikipedia articles/Prioritization for simplification

From Meta, a Wikimedia project coordination wiki

In this work, we would like to explore options for prioritizing articles for simplification. The rationale is that some articles might benefit more than others. We would like to put together a list of articles where simplified versions would be most useful (say 100-1000 articles).

Method

For the analysis below, we considered all articles in English Wikipedia from the snapshot 2025-05.

We considered the following characteristics:

* Readability score: Flesch-Kincaid grade level (FKGL) of the text of the lead section.

* Length: Number of characters in the text of the lead section.

* SEW: Article exists in Simple English Wikipedia

* Pageviews: Number of pageviews (during January 2025)

* Edits: Number of edits (during January 2025)

* Topic: Topic label of the article from the article-topic model

* Templates: Whether it contains one of the maintenance templates: Technical, Confusing
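
To make the readability score concrete, here is a minimal sketch of the standard Flesch-Kincaid grade level formula, FKGL = 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59. The sentence splitter and the vowel-group syllable counter are simplifying assumptions made for illustration; the dataset notebook linked below may tokenize text and count syllables differently, so scores from this sketch need not match the numbers reported on this page.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate (assumption): count groups of consecutive vowels."""
    lowered = word.lower()
    count = len(re.findall(r"[aeiouy]+", lowered))
    # Treat a trailing "e" as silent ("made" -> 1), except for "-le" and "-ee".
    if lowered.endswith("e") and count > 1 and not lowered.endswith(("le", "ee")):
        count -= 1
    return max(count, 1)


def fkgl(text: str) -> float:
    """Flesch-Kincaid grade level of `text` using naive tokenization."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


if __name__ == "__main__":
    easy = "The cat sat on the mat."
    hard = "Quantum chromodynamics describes interactions between fundamental particles."
    print(round(fkgl(easy), 2), round(fkgl(hard), 2))
```

Longer sentences and longer words both push the score up, which is why dense technical lead sections tend to score higher than short descriptive ones.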

Code to create dataset: https://gitlab.wikimedia.org/repos/research/simple-summaries/-/blob/main/simplification-prioritization_create-data.ipynb

Code to analyze dataset: https://gitlab.wikimedia.org/repos/research/simple-summaries/-/blob/main/simplification-prioritization_analyze-data.ipynb

Option 1: Topics

Some topics are generally more difficult to read and understand than others. In fact, I found that the readability of articles varies substantially across topics when calculating the FKGL for all articles and averaging it within each of the 64 topics from the article-topic model. This suggests prioritizing articles from topics that are more difficult to read.

  • average FKGL across all articles: 11.4
  • highest FKGL (most difficult to read): mostly STEM topics (e.g. Physics, Mathematics, Medicine & Health), with FKGL > 14
  • lowest FKGL (easiest to read): Culture and Geography topics (e.g. Sports, Music, Geographical), with FKGL < 11
  • the average FKGL for each topic is listed below

Average FKGL score for each topic

 #  Topic                                         FKGL
 0  Culture.Linguistics                           14.735653
 1  STEM.Physics                                  14.590325
 2  STEM.Mathematics                              14.327468
 3  STEM.Medicine_&_Health                        14.180763
 4  STEM.Libraries_&_Information                  13.888417
 5  STEM.Computing                                13.616421
 6  History_and_Society.Society                   13.420960
 7  STEM.Chemistry                                13.203550
 8  STEM.Technology                               13.098213
 9  History_and_Society.Politics_and_government   13.093833
10  History_and_Society.Business_and_economics    13.045401
11  Culture.Media.Software                        13.029624
12  History_and_Society.Education                 12.421469
13  STEM.Earth_and_environment                    12.319744
14  History_and_Society.Military_and_warfare      12.317690
15  Culture.Philosophy_and_religion               12.155918
16  Geography.Regions.Europe.Southern_Europe      12.019233
17  Culture.Performing_arts                       11.941233
18  Geography.Regions.Americas.South_America      11.915420
19  STEM.Engineering                              11.869008
20  Culture.Internet_culture                      11.839834
21  Culture.Media.Television                      11.837642
22  STEM.STEM*                                    11.834587
23  Culture.Media.Entertainment                   11.792396
24  Culture.Visual_arts.Fashion                   11.787822
25  Culture.Literature                            11.786184
26  Geography.Regions.Africa.Eastern_Africa       11.745605
27  Geography.Regions.Asia.Southeast_Asia         11.719719
28  Geography.Regions.Asia.North_Asia             11.705829
29  Geography.Regions.Americas.North_America      11.674879
30  Geography.Regions.Americas.Central_America    11.662474
31  Culture.Visual_arts.Comics_and_Anime          11.656834
32  Culture.Media.Video_games                     11.609778
33  Culture.Biography.Women                       11.603536
34  Culture.Biography.Biography*                  11.600981
35  Geography.Regions.Africa.Southern_Africa      11.595270
36  STEM.Space                                    11.591517
37  Geography.Regions.Europe.Eastern_Europe       11.587795
38  Geography.Regions.Africa.Africa*              11.568307
39  Geography.Regions.Africa.Western_Africa       11.456297
40  History_and_Society.History                   11.418016
41  Geography.Regions.Africa.Northern_Africa      11.360570
42  Culture.Visual_arts.Visual_arts*              11.307989
43  Geography.Regions.Oceania                     11.296372
44  Culture.Media.Books                           11.216284
45  Geography.Regions.Asia.Asia*                  11.201274
46  Geography.Regions.Europe.Europe*              11.156389
47  Geography.Regions.Asia.South_Asia             11.143322
48  Culture.Food_and_drink                        11.135446
49  Geography.Regions.Africa.Central_Africa       11.130081
50  Culture.Media.Media*                          11.095900
51  Geography.Regions.Asia.West_Asia              11.056867
52  History_and_Society.Transportation            11.005122
53  Culture.Visual_arts.Architecture              10.997538
54  Culture.Media.Radio                           10.934781
55  Geography.Regions.Asia.East_Asia              10.897058
56  Culture.Media.Films                           10.859632
57  Geography.Regions.Europe.Northern_Europe      10.821992
58  Geography.Regions.Asia.Central_Asia           10.761797
59  Culture.Media.Music                           10.693294
60  STEM.Biology                                  10.639295
61  Geography.Regions.Europe.Western_Europe       10.635665
62  Culture.Sports                                10.628404
63  Geography.Geographical                        10.277326
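
The per-topic averages above can be reproduced, in spirit, with a simple group-and-average step. The sketch below assumes each article is represented as a pair of its FKGL score and a list of topic labels, and that an article with several labels counts toward each of them; both the data layout and the function name `mean_fkgl_by_topic` are hypothetical, and the analysis notebook may handle multi-label articles differently.

```python
from collections import defaultdict


def mean_fkgl_by_topic(articles):
    """Average FKGL per topic, sorted from most to least difficult.

    `articles` is an iterable of (fkgl, topics) pairs (assumed layout).
    An article with several topic labels contributes to each label.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for score, topics in articles:
        for topic in topics:
            totals[topic] += score
            counts[topic] += 1
    means = {topic: totals[topic] / counts[topic] for topic in totals}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up scores for illustration only; not taken from the dataset.
    sample = [
        (14.0, ["STEM.Physics"]),
        (16.0, ["STEM.Physics", "STEM.Mathematics"]),
        (10.0, ["Culture.Sports"]),
    ]
    for topic, mean in mean_fkgl_by_topic(sample):
        print(f"{topic}: {mean:.2f}")
```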

Option 2: Article-level metrics

At the article level, we could identify articles for which we anticipate a need for a simpler version, namely articles that i) are difficult to read, ii) have many pageviews, and iii) have little or no editing activity (i.e. they do not seem controversial). Specifically, I consider the following metrics for prioritization.

We first filter articles according to the following criteria:

  • top 5% of articles by pageviews (the 5% most popular articles)
  • no edits (during January 2025)
  • a reasonable length (i.e. between 1,000 and 100,000 characters)
  • no existing article in Simple English Wikipedia

This yields 21,221 articles overall. From these, one can select different subsets of articles that can be considered difficult to read:

  • at least slightly difficult to read (FKGL>12): 13,890 articles
  • at least difficult to read (FKGL>15): 4,751 articles
  • at least very difficult to read (FKGL>18): 1,058 articles
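
The filtering steps and FKGL thresholds above can be sketched as follows. The `Article` record, its field names, and the helper functions are hypothetical and chosen only for illustration; the sketch also assumes that ties at the pageview cutoff are kept, which may select slightly more than 5% of articles. It is not the code from the analysis notebook.

```python
from dataclasses import dataclass


@dataclass
class Article:
    """Hypothetical record holding the characteristics listed under Method."""
    title: str
    fkgl: float
    length: int           # number of characters
    pageviews: int        # pageviews in the observation month
    edits: int            # edits in the observation month
    in_simple_wiki: bool  # whether a Simple English Wikipedia article exists


def pageview_cutoff(articles, top_percent=5):
    """Smallest pageview count that still ranks within the top `top_percent`."""
    ranked = sorted((a.pageviews for a in articles), reverse=True)
    k = max(1, len(ranked) * top_percent // 100)
    return ranked[k - 1]


def candidates(articles, min_fkgl=12.0):
    """Apply the four filters, then keep articles with FKGL above `min_fkgl`."""
    cutoff = pageview_cutoff(articles)
    return [
        a for a in articles
        if a.pageviews >= cutoff             # most popular articles
        and a.edits == 0                     # no edits
        and 1_000 <= a.length <= 100_000     # reasonable length
        and not a.in_simple_wiki             # no Simple English version yet
        and a.fkgl > min_fkgl                # difficult to read
    ]
```

Calling `candidates(articles, min_fkgl=15.0)` or `candidates(articles, min_fkgl=18.0)` would correspond to the two stricter subsets listed above.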

Option 3: Maintenance templates

Another option is to consider articles which are marked with relevant templates. Specifically, in enwiki I found two templates which could indicate a need for simplification:

  • {{Confusing}}: "This article may be confusing or unclear to readers."
  • {{Technical}}: "This article may be too technical for most readers to understand."

This yields 3,702 articles:

  • the average FKGL of those articles is 14.0, which is substantially higher than the average FKGL of all articles (11.4).
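
One simple way to detect these templates is to search an article's wikitext for their invocations. The sketch below is an assumption about how such a check could look, not a description of how the dataset was built: the function name `maintenance_templates` is hypothetical, and the pattern only matches the exact template names Technical and Confusing (case-insensitively, with or without parameters), so template redirects or section-level variants are not caught.

```python
import re

# Matches "{{Technical}}", "{{ confusing |date=...}}", etc., but not
# templates whose name merely starts with one of the two words.
TEMPLATE_RE = re.compile(r"\{\{\s*(technical|confusing)\s*(\||\}\})", re.IGNORECASE)


def maintenance_templates(wikitext: str) -> set[str]:
    """Return the lower-cased names of the matching templates found in `wikitext`."""
    return {match.group(1).lower() for match in TEMPLATE_RE.finditer(wikitext)}


if __name__ == "__main__":
    # Made-up wikitext for illustration only.
    text = "{{Technical|date=May 2024}}\nSome lead section text."
    print(maintenance_templates(text))
```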

Examples

A list of 1,000 example articles from each of the 3 options can be found in this spreadsheet.

Notes

This prioritization focuses only on the content of articles, e.g. their readability. However, factors beyond the content, such as the target audience, likely also affect whether a simplified version is suitable. For example, a survey about the perception of simplified versions of Wikipedia articles indicates that what is considered easy to read is very subjective and depends on, e.g., the reader's level of education.