Research:Develop a model for text simplification to improve readability of Wikipedia articles/Background literature review

From Meta, a Wikimedia project coordination wiki

This document contains a (probably incomplete) summary of my attempt to review the state of the art in text simplification. It tries to capture academic literature as well as available models. The main focus is on text simplification at the document level (including the paragraph or section level), i.e., beyond approaches that focus on individual sentences.

Main points that were interesting for me:

  • Only very recently have a few works started to approach document-level simplification
  • Existing works have only considered English models and data
  • Fine-tuned BART models show good performance on document-level simplification
  • From related tasks, the recently published mLongT5 model family seems a promising way to develop a domain-specific model via fine-tuning because i) the mT5 model has been successfully used for multilingual sentence-level simplification tasks; ii) the LongT5 model has been successfully used for summarization tasks
  • Prompt-based general LLMs such as ChatGPT don't seem to be better than the domain-specific models in automatic evaluation (though not much research has been done on this yet)
  • Our data in 10+ languages from children/simple encyclopedias provides a new multilingual dataset to approach document-level simplification beyond English


What is simplification

Text simplification aims to automatically modify a piece of text in order to make it easier to read and understand, while retaining the original information and meaning. Text simplification consists of different aspects, which are often grouped into 3 main classes (Gooding 2022[1], Stajner 2021[2]):

  • Lexical simplification aims to reduce the complexity of words; e.g. complex words are replaced with simpler alternatives
  • Syntactic simplification aims to reduce the grammatical complexity by simplifying syntactic structures; e.g., converting passive to active voice.
  • Conceptual simplification aims to simplify ideas or concepts within a text; e.g. providing explanations of terms.

Why do we need text simplification

The main goal of text simplification is to make information more accessible to readers. There are different populations who could benefit from text simplification (Siddharthan 2014[3]):

  • Readers without specialized knowledge, e.g., novice readers in technical domains
  • Readers with low literacy. For example, using OECD adult literacy reports, (Stajner 2021[2]) estimates that “approximately 16.7% of population needs lexical simplification of everyday texts, 50% of population needs syntactic simplification, and 89.4% of population needs conceptual simplification.”
  • Children
  • Non-native speakers
  • Readers with reading disabilities or disabilities more generally, such as deafness, aphasia, or dyslexia

Simplification is also considered important in the medical domain (Stajner 2021[2]). This is especially relevant in the context of Wikipedia when considering that “Wikipedia’s health content is the most frequently visited resource for health information on the internet.” (Smith 2020[4]). For Wikipedia, readability of articles in English Wikipedia has been found to be generally low and, thus, insufficient for its target audience (Lucassen et al. 2012[5]).

Manual simplification is slow and expensive, which motivates the need for (semi-)automatic text simplification systems.

Ethical considerations

Ethical considerations in the research on text simplification are covered in detail in (Stajner 2021[2], Gooding 2022[1]), which describe limitations as well as risks and harms. These include:

  • Although most works mention “text simplification”, they actually refer to sentence simplification.
  • The state-of-the-art automatic text simplification (ATS) systems published at top-tier NLP/CL/AI conferences are not directed towards any particular simplification transformation or target population.
  • Meaning distortion is a problem in any language generation task. It is crucial to ensure that the meaning of the text is preserved, especially in the context of, e.g., health information. They also mention that subtleties of meaning intended by the authors may be diluted or lost.
  • Paternalism: the decision of what needs simplifying might not involve the actual user, which reduces their autonomy. “The relationship between paternalism and assistive technology is widely acknowledged, as design decisions made on behalf of a user can be problematic if they override the autonomy of the individual” (Gooding 2022[1])

Relation to other tasks such as summarization

  • (Alva-Manchego et al. 2020b[6]) discuss in Section 1.3 tasks that are related to simplification: summarization, sentence compression, split-and-rephrase.
  • (Aumiller&Gertz 2022[7]) argue that the inclusion of summarization into the broader context of Text Simplification is a necessary step towards end-to-end solutions for longer input texts
  • (Blinova et al. 2023[8]) propose a two-stage framework combining summarization and simplification.
  • (Sun et al. 2023a[9]) explore the correlation between summarization and simplification. The main application is to potentially generate more training data for simplification derived from summarization data.

Task and evaluation


The typical approach to text simplification is to frame it as a sequence-to-sequence problem, similar to machine translation. That is, the input is the original text, and the output is the simplified version of the text.

Text simplification can be divided into two main themes depending on the level on which it is approached.

Sentence-level simplification

In practice, most work on text simplification actually addresses sentence simplification. That is, the text is first split into sentences, and then each sentence is simplified one at a time.
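The split-then-simplify pipeline described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the sentence splitter is a naive regular expression, and `simplify_sentence` is a hypothetical stand-in (a tiny word-substitution lexicon) for what would be a trained sentence-level model in a real system.

```python
import re

# Hypothetical toy lexicon; a real system would use a trained model instead.
SIMPLER = {"utilize": "use", "commence": "start", "approximately": "about"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def simplify_sentence(sentence):
    """Stand-in for a sentence-level model: word-by-word lexical substitution."""
    def swap(match):
        word = match.group(0)
        return SIMPLER.get(word.lower(), word)
    return re.sub(r"[A-Za-z]+", swap, sentence)


def simplify_document(text):
    """Simplify every sentence independently, then join the results again."""
    return " ".join(simplify_sentence(s) for s in split_sentences(text))


print(simplify_document("We utilize a model. It will commence soon."))
```

Because each sentence is processed in isolation, nothing in this pipeline can reorder, merge, or drop content across sentence boundaries.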

However, processing each sentence separately might not be meaningful when the goal is to simplify a full article (Sun et al. 2021[10]); for example, many simplification edits have been shown to require context beyond the level of a single sentence (Laban et al. 2023[13]).

Therefore, we will not cover works on sentence simplification in detail here since our use-case is specifically to simplify (parts of) Wikipedia articles (e.g. on the level of paragraphs or sections). Good overviews on the state-of-the-art in sentence-level simplification can be found in (Alva-Manchego et al. 2020b[6], Ruder 2019[11]). In fact, the recent review from (Alva-Manchego et al. 2020b) mentions that: “It can be argued that “true” TS [text simplification] (i.e., document-level) cannot be achieved by simplifying sentences one at a time, and we make a call in Section 6 for the field to move in that direction. However, because the goal of this article is to review what has been done in TS so far, our survey is limited to Sentence Simplification (SS).”

Document-level simplification

Only recently have some works started to approach text simplification beyond the single-sentence level:

  • (Aumiller&Gertz 2022[7]) argue that the inclusion of summarization into the broader context of Text Simplification is a necessary step towards end-to-end solutions for longer input texts. However, they don't actually apply any simplification models to the dataset.
  • (Blinova et al. 2023[8]) approaches document-level simplification with a two-stage framework that combines summarization and simplification
  • (Laban et al. 2021[12]) approaches simplification of paragraphs
  • (Laban et al. 2023[13]) introduces SWiPE, a high-quality and large-scale document-level simplification dataset showing that at least 43% of edits require document-level context
  • (Sun et al. 2021[10]) introduces D-Wikipedia, a document-level simplification dataset.
  • (Sun et al. 2023b[14]) applies recent LLMs such as ChatGPT and Llama on the document-level simplification task.


A good overview on evaluation of text simplification is provided by (Alva-Manchego et al. 2020b[6]).

The evaluation of text simplification is done by comparing the original with the simplified version of the text. There are 3 main criteria to evaluate the output of text simplification:

  • Simplicity: measures how much simpler (or easier to understand) the simplified version is.
  • Grammaticality (fluency): assesses whether the simplified version remains grammatical and understandable.
  • Meaning preservation (adequacy): assesses whether the simplified version has the same meaning as the original.

Human evaluation: asking human raters is considered to be the most reliable method to evaluate simplification. In practice, participants are asked to rate each of the three dimensions on a Likert scale. However, this approach does not scale.

Automatic evaluation: there are different automatic measures that have been proposed as proxies to evaluate the quality of the simplification. Among them, the most common metrics are:

  • SARI: compares the predicted simplified sentences against the reference and the source sentences (system output against references and against the input sentence), focusing on lexical simplification
    • SARI = (F1_add + F1_keep + P_del) / 3
      • F1_add is the n-gram F1 score for add operations
      • F1_keep is the n-gram F1 score for keep operations
      • P_del is the n-gram precision score for delete operations
    • Introduced by (Xu et al. 2016[15]) for sentences claiming “Our SARI metric has highest correlation with human judgments of simplicity”
    • (Laban et al. 2023[13]) uses the standard SARI score for documents (n-gram based). D-SARI is an explicit extension of the SARI score to documents (Sun et al. 2021[10]). Package to calculate:
    • It is the main metric used for evaluating text simplification
  • BLEU (bilingual evaluation understudy): a metric borrowed from machine translation, where it judges the quality of translated text. Unlike SARI, it compares the output only against the reference(s) and does not take the source into account.
    • (Xu et al. 2016[15]) report that BLEU exhibits higher correlations on grammaticality and meaning preservation.
    • Also commonly used complementing the SARI metric
  • Other metrics
    • FKGL: measures the simplicity of a text using the Flesch-Kincaid Grade Level. Using this metric to evaluate simplification is discouraged because ungrammatical output can still obtain very good scores, so the values might not be meaningful. It is nevertheless often reported as additional information.
    • SAMSA (Sulem et al. 2018b[16]): a metric designed to measure structural simplicity (i.e., sentence splitting). It validates that each simple sentence resulting from splitting a complex one is correctly formed (i.e., it corresponds to a single Scene with all its Participants). It does not require a reference sentence. However, it is not widely adopted.
    • BERTScore: used ad hoc for simplification in (Alva-Manchego et al. 2021[17]). It shows good correlation with human judgement, but I haven't seen it used in other simplification works.
    • Mixtures, see, e.g., (Xu et al. 2016[15]), such as iBLEU (an extension of BLEU), FKBLEU (geometric mean of iBLEU and FK), etc.
  • EASSE is a Python package that provides access to many of these metrics.
  • A rather critical overview is provided by (Alva-Manchego et al. 2021[17]) highlighting the limitations of commonly used operation-specific metrics: “Overall, we suggest using multiple metrics, and mainly BERTScore Precision, for reference-based evaluation. SARI could be used when the simplification system only executed lexical paraphrasing, and SAMSA may be useful when it is guaranteed that splitting was performed.”
  • The main works in document simplification report these metrics: SARI, D-SARI, BLEU, FKGL
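To make the SARI formula listed above more concrete, here is a deliberately simplified sketch. It is not the official implementation: it assumes a single reference, works on sets of lowercased unigrams instead of n-grams up to length 4, and scores any ratio with an empty denominator as 0 (a convention chosen here; implementations differ). The name `toy_sari` is hypothetical. For real evaluations, a maintained package such as EASSE is the safer choice.

```python
def _ratio(numerator, denominator):
    """Safe division; an empty denominator counts as 0 (convention chosen here)."""
    return numerator / denominator if denominator else 0.0


def _f1(precision, recall):
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def toy_sari(source, output, reference):
    """Unigram, single-reference approximation of (F1_add + F1_keep + P_del) / 3."""
    src, out, ref = (set(t.lower().split()) for t in (source, output, reference))

    # Add: words introduced by the system vs. words introduced by the reference.
    added_sys, added_ref = out - src, ref - src
    good_add = added_sys & added_ref
    f1_add = _f1(_ratio(len(good_add), len(added_sys)),
                 _ratio(len(good_add), len(added_ref)))

    # Keep: source words retained by the system vs. retained by the reference.
    kept_sys, kept_ref = out & src, ref & src
    good_keep = kept_sys & kept_ref
    f1_keep = _f1(_ratio(len(good_keep), len(kept_sys)),
                  _ratio(len(good_keep), len(kept_ref)))

    # Delete: only precision is used, as in the formula above.
    del_sys, del_ref = src - out, src - ref
    p_del = _ratio(len(del_sys & del_ref), len(del_sys))

    return (f1_add + f1_keep + p_del) / 3


source = "the cat perched upon the mat"
reference = "the cat sat on the mat"
print(toy_sari(source, "the cat sat on the mat", reference))  # output equals the reference
print(toy_sari(source, source, reference))                    # output is the unchanged source
```

In this sketch, copying the source unchanged earns credit only for the kept words, which illustrates why SARI, unlike metrics that only compare against a reference, penalizes systems that do not edit the input.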


Most benchmark datasets for text simplification are for sentence-level simplification. This means they contain a set of aligned sentences (original vs simplified). The most commonly used datasets are:

  • Sentences extracted/aligned from articles of Simple and English Wikipedia. This comes in different flavors and has been expanded in different works over the years: PWKP (Zhu et al. 2010[18]), SEW (Coster&Kauchak 2011[19]), WikiLarge (Zhang&Lapata 2017[20]), Wiki-Auto (Jiang et al. 2020[21]), which contain between 100K and 600K aligned sentences. This data has been criticized by (Xu et al. 2015[22]), who find that the alignment is not good and that the supposedly simpler sentence is not necessarily simpler. This work suggested the Newsela dataset as an alternative.
  • Newsela (Xu et al. 2015[22]) contains 1,130 articles in 5 different readability levels. Each article was rewritten 4 times by editors for children at different grade levels. It is considered to be of very high quality. It was introduced as an alternative to the sentence simplification datasets based on data from Simple and English Wikipedia (see above). However, the main limitation is that it is not available under an open license.
  • Manually created simplification datasets: TurkCorpus (Xu et al. 2016[15]), HSplit (Sulem et al. 2018a[23]), ASSET (Alva-Manchego et al. 2020a[24]). For a few thousand sentences, simplified versions were generated via crowdsourcing (such as Amazon Mechanical Turk).

Many of these datasets (among others) can be accessed via the MultiSim Benchmark (Ryan et al. 2023[25]). A more detailed overview of other datasets (also in other languages) can be found in Sec. 2 of (Alva-Manchego et al. 2020b[6]); we don't list them here since they often don't consist of encyclopedic articles. Overall, though, there is little data available in languages other than English.

For document-level simplification, there exist the following resources:

  • D-Wikipedia (Sun et al. 2021[10]). Consists of 143,546 pairs of aligned articles from Simple and English Wikipedia.
    • 2020-08-20 snapshot; preprocessing of text using wikitextextractor; only the lead section; pairs where the original or simplified text is longer than 1000 words were removed
    • (Blinova et al. 2023[8]) mentions some problems: in 30% of pairs the simpler text is longer than the original, and there are also some misaligned articles
  • SWiPE (Laban et al. 2023[13]) consists of 145,161 article pairs from simplewiki and enwiki
    • The main innovation with respect to the D-Wikipedia dataset is that they do not only match the latest revisions of the articles but also look for the best-matching revisions of each article. This is claimed to lead to a higher-quality dataset, as the original and simplified versions are better aligned with each other.
  • Klexikon (Aumiller&Gertz 2022[7]) consists of 2,898 pairs of aligned articles from German Wikipedia and Klexikon (as the simplified version).
    • They only consider Wikipedia articles with a minimum length of 15 paragraphs, which results in a clear contrast in overall article length between source and simplified texts (e.g., Wikipedia documents have on average 8.94 times more sentences than their Klexikon counterparts).
    • Text extraction from the HTML: <p> tags and header elements h1-h5
  • Vikidia is a children encyclopedia available in more than 10 languages.
    • (Madrazo Azpiazu&Pera 2020[26]) introduce the VikiWiki dataset of 448 articles in two levels (Wikipedia and Vikidia) and 6 languages (English, Spanish, French, Italian, Catalan, and Basque). However, the publicly available version does not provide alignment of individual article pairs between Wikipedia and Vikidia.
    • (Lee&Vajjala 2022[27]) introduce the Vikidia en/fr bilingual dataset, which consists of 6165 articles available in two readability levels (Wikipedia and Vikidia) for English as well as French. In the paper, the data is only used for readability assessment and not for simplification. The authors don't provide many details on how they matched and processed the articles, though.
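Two of the checks mentioned for D-Wikipedia above (dropping pairs where either text exceeds 1000 words, and measuring how often the simplified text is longer than the original) are easy to reproduce on any document-aligned dataset. The sketch below assumes pairs are given as `(original, simplified)` tuples and counts words by whitespace splitting; the function names are hypothetical, and the original papers may tokenize differently.

```python
def word_count(text):
    """Whitespace-based word count (an assumption; papers may tokenize differently)."""
    return len(text.split())


def filter_pairs(pairs, max_words=1000):
    """Keep (original, simplified) pairs where both texts have at most max_words words."""
    return [
        (original, simplified)
        for original, simplified in pairs
        if word_count(original) <= max_words and word_count(simplified) <= max_words
    ]


def share_simple_longer(pairs):
    """Fraction of pairs whose simplified text has more words than the original."""
    if not pairs:
        return 0.0
    longer = sum(1 for original, simplified in pairs
                 if word_count(simplified) > word_count(original))
    return longer / len(pairs)


pairs = [
    ("a long and complex original text", "a short text"),
    ("short original", "a simplified text that got longer"),
    ("word " * 1001, "too long to keep"),
]
kept = filter_pairs(pairs)
print(len(kept), share_simple_longer(kept))
```

A high share of pairs with a longer simplified text is a useful warning sign for alignment or quality problems before any model is trained.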

Inspired by the list of wikis for children, in the project on measuring readability we collected document-aligned datasets that pair Wikipedia articles with their simplified versions from children's/simplified encyclopedias. These are not yet published, so they haven't been used in any other works. They include the following data (number of pairs of aligned articles):

  • Simple Wikipedia:
    • English (en): 109,152
  • Vikidia:
    • English (en): 1,994
    • Catalan (ca): 244
    • German (de): 273
    • Greek (el): 41
    • Spanish (es): 2,450
    • Basque (eu): 1,059
    • French (fr): 12,675
    • Armenian (hy): 550
    • Italian (it): 1,704
    • Occitan (oc): 7
    • Portuguese (pt): 598
    • Russian (ru): 104
  • Txikipedia
    • Basque (eu): 2,649
  • Wikikids
    • Dutch (nl): 11,319


Which models have been evaluated on some of the benchmark datasets? What is their performance?

(Blinova et al. 2023[8])

  • Data: D-Wikipedia (they also use Wiki-Doc though that is an older dataset from 2013)
  • Metrics: SARI, D-SARI, FKGL
  • Models
  • Results (Table 3)
    • The SimSum (T5) with keyword prompt achieves the top performance in all metrics
      • SARI: 49.44
      • D-SARI: 39.77
      • FKGL: 6.04
  • Notes:
    • It seems each model was fine-tuned with WikiLarge (sentence simplification)

(Laban et al. 2023[13])

  • Data: SWiPE
  • Metrics: SARI, FKGL
  • Models
    • BART-SWiPE(-C). The models proposed by the authors, fine-tuned from a pretrained BART-large model.
    • ACCESS. A sentence-level simplification model trained on WikiLarge.
    • Keep-it-simple. An unsupervised paragraph-level model.
    • BART-WikiLarge. A BART-large model trained on WikiLarge.
    • GPT3-davinci-003. Prompting GPT3 (without training) “simplify the document …”
  • Results (Figure 5):
    • BART-SWiPE shows best performance
      • SARI: 47
      • FKGL: 7.7
  • Notes
    • Models were trained on the training set of D-Wikipedia
    • All models were trained on the training set of annotations, hyperparameters were selected with validation set, results on in-domain test set.

(Sun et al. 2021[10])

  • Data: D-Wikipedia
  • Metrics: SARI, D-SARI, BLEU, FKGL
  • Models
    • Transformer. Seq2seq model, not further explained.
    • SUC. A previous sentence-based model from the authors that uses contextual information.
    • BertSumextabs. BERT-base model from text summarization, referring to Liu and Lapata 2019.
    • BART.
  • Results (Table 7):
    • BART shows the highest values for
      • SARI: 48.34
      • BLEU: 31.77
    • BertSumextabs shows highest values for
      • D-SARI: 39.88
  • Notes
    • The FKGL metric is unclear. Some models yield 59, which would be very high for FKGL. Maybe they calculated Flesch Reading Ease instead?
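The suspicion in the note above can be illustrated with the standard formulas for the Flesch-Kincaid Grade Level and the Flesch Reading Ease, which use the same two ingredients (words per sentence and syllables per word) but live on very different scales. The sketch below takes raw counts as input, since syllable counting itself requires a heuristic; the function names and the example counts are made up for illustration and do not reproduce what the paper computed.

```python
def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: roughly a US school grade, lower is simpler."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher values mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


# Hypothetical text statistics: 100 words, 5 sentences, 150 syllables.
counts = dict(words=100, sentences=5, syllables=150)
print(round(flesch_kincaid_grade(**counts), 2))  # grade-level scale
print(round(flesch_reading_ease(**counts), 2))   # reading-ease scale
```

For these made-up counts the grade level is about 9.9, while the reading ease is about 59.6, so a value around 59 is far more plausible on the reading-ease scale than as a school grade.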

(Sun et al. 2023b[14])

  • Data: D-Wikipedia
  • metrics: D-SARI
  • models
    • SimpleBART. BART-Large with pretraining on simplification task (training data)
    • Baselines from previous paper in 2021
      • BertSumextabs
      • BART
      • BART-CP
    • LLMs
      • GPT-3.5. GPT-3.5-Turbo-0301
      • FLAN-T5. FLAN-T5-XL
      • LLaMA. LLaMA-7B
      • FLAN-T5 (fine-tuned). FLAN-T5-base with fine-tuning as another baseline
  • Results:
    • SimpleBART beats other baselines (Table 3)
      • D-SARI: 41.64
    • Comparison with prompt-based model (Table 8)
      • D-SARI scores from the LLMs are much lower (26.68, 26.77, 33.22)
  • Notes
    • For the 3 LLMs, the authors use zero-shot generation
  • Summary/Observations
    • Document-level simplification has only been evaluated on English data.
    • Best model: fine-tuned BART(-large), across datasets and metrics.
    • Prompt-based LLMs such as GPT-3.5 don't seem to be competitive with fine-tuned BART models (or similar)

Which other models seem promising but haven't been explicitly evaluated on document-level simplification (e.g., models for sentence-level simplification or summarization)?

  • BLOOM.
    • (Ryan et al. 2023[25]) use few-shot prompting for sentence-level simplification showing they outperform fine-tuned models in most languages
  • mLongT5 (Uthus et al. 2023[28]). An extension of the mT5 family for longer input sequences. The mT5 family has been shown to work well in multilingual contexts for sentence-level simplification.
    • (Uthus et al. 2023[28]) do not evaluate on simplification but on summarization: “As our report shows, the model is able to perform well on a variety of summarization and question-answering tasks.”
    • (Ryan et al. 2023[25]) evaluate the mT5-Base with fine-tuning for sentence simplification across languages (including zero-shot cross-lingual transfer learning)

Some related existing tools (not necessarily academic works)

  • Wikiwand. Wikiwand provides a different layout for Wikipedia articles. For some articles it can generate summaries (a few bullet points). Those are generated via Wordtune.
  • At the Wikimedia Hackathon, one project built a tool to generate automatic summaries of sections or talk pages of articles; those are generated by prompting ChatGPT.
  • WikiChatbot is a user-script that provides editors with a tool to generate simplified versions of articles (among other options such as copyediting etc). Those are generated by prompting GPT-3.5.


  1. Gooding, S. (2022). On the Ethical Considerations of Text Simplification. Ninth Workshop on Speech and Language Processing for Assistive Technologies (SLPAT-2022), 50–57.
  2. Stajner, S. (2021). Automatic Text Simplification for Social Good: Progress and Challenges. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2637–2652.
  3. Siddharthan, A. (2014). A survey of research on text simplification. ITL - International Journal of Applied Linguistics, 165(2), 259–298.
  4. Smith, D. A. (2020). Situating Wikipedia as a health information resource in various contexts: A scoping review. PloS One, 15(2), e0228786.
  5. Lucassen, T., Dijkstra, R., & Schraagen, J. M. (2012). Readability of Wikipedia. First Monday.
  6. Alva-Manchego, F., Scarton, C., & Specia, L. (2020b). Data-driven sentence Simplification: Survey and benchmark. Computational Linguistics (Association for Computational Linguistics), 46(1), 135–187.
  7. Aumiller, D., & Gertz, M. (2022). Klexikon: A German Dataset for Joint Summarization and Simplification. In arXiv [cs.CL]. arXiv.
  8. Blinova, S., Zhou, X., Jaggi, M., Eickhoff, C., & Bahrainian, S. A. (2023). SIMSUM: Document-level Text Simplification via Simultaneous Summarization. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 9927–9944.
  9. Sun, R., Yang, Z., & Wan, X. (2023a). Exploiting Summarization Data to Help Text Simplification. Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 39–51.
  10. Sun, R., Jin, H., & Wan, X. (2021). Document-Level Text Simplification: Dataset, Criteria and Baseline. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 7997–8013.
  11. Ruder, S. (2019). Simplification. NLP-Progress.
  12. Laban, P., Schnabel, T., Bennett, P., & Hearst, M. A. (2021). Keep It Simple: Unsupervised Simplification of Multi-Paragraph Text. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 6365–6378.
  13. Laban, P., Vig, J., Kryscinski, W., Joty, S., Xiong, C., & Wu, C.-S. (2023). SWiPE: A Dataset for Document-Level Simplification of Wikipedia Pages. In arXiv [cs.CL]. arXiv.
  14. Sun, R., Xu, W., & Wan, X. (2023b). Teaching the Pre-trained Model to Generate Simple Texts for Text Simplification. Findings of the Association for Computational Linguistics: ACL 2023, 9345–9355.
  15. Xu, W., Napoles, C., Pavlick, E., Chen, Q., & Callison-Burch, C. (2016). Optimizing statistical machine translation for text simplification. Transactions of the Association for Computational Linguistics, 4, 401–415.
  16. Sulem, E., Abend, O., & Rappoport, A. (2018b). Semantic Structural Evaluation for Text Simplification. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 685–696.
  17. Alva-Manchego, F., Scarton, C., & Specia, L. (2021). The (Un)suitability of automatic evaluation metrics for Text Simplification. Computational Linguistics (Association for Computational Linguistics), 47(4), 861–889.
  18. Zhu, Z., Bernhard, D., & Gurevych, I. (2010). A Monolingual Tree-based Translation Model for Sentence Simplification. Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), 1353–1361.
  19. Coster, W., & Kauchak, D. (2011). Learning to Simplify Sentences Using Wikipedia. Proceedings of the Workshop on Monolingual Text-To-Text Generation, 1–9.
  20. Zhang, X., & Lapata, M. (2017). Sentence Simplification with Deep Reinforcement Learning. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 584–594.
  21. Jiang, C., Maddela, M., Lan, W., Zhong, Y., & Xu, W. (2020). Neural CRF Model for Sentence Alignment in Text Simplification. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 7943–7960.
  22. Xu, W., Callison-Burch, C., & Napoles, C. (2015). Problems in current text simplification research: New data can help. Transactions of the Association for Computational Linguistics, 3, 283–297.
  23. Sulem, E., Abend, O., & Rappoport, A. (2018a). BLEU is Not Suitable for the Evaluation of Text Simplification. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 738–744.
  24. Alva-Manchego, F., Martin, L., Bordes, A., Scarton, C., Sagot, B., & Specia, L. (2020a). ASSET: A Dataset for Tuning and Evaluation of Sentence Simplification Models with Multiple Rewriting Transformations. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 4668–4679.
  25. Ryan, M., Naous, T., & Xu, W. (2023). Revisiting non-English Text Simplification: A Unified Multilingual Benchmark. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 4898–4927.
  26. Madrazo Azpiazu, I., & Pera, M. S. (2020). An Analysis of Transfer Learning Methods for Multilingual Readability Assessment. Adjunct Publication of the 28th ACM Conference on User Modeling, Adaptation and Personalization, 95–100.
  27. Lee, J., & Vajjala, S. (2022). A Neural Pairwise Ranking Model for Readability Assessment. Findings of the Association for Computational Linguistics: ACL 2022, 3802–3813.
  28. Uthus, D., Ontañón, S., Ainslie, J., & Guo, M. (2023). mLongT5: A Multilingual and Efficient Text-To-Text Transformer for Longer Sequences. In arXiv [cs.CL]. arXiv.