Research:Develop a model for text simplification to improve readability of Wikipedia articles/FY24-25 WE.3.1.3 content simplification


This page captures the work on hypothesis WE.3.1.3 as part of Product & Tech’s Annual Plan for Fiscal Year 2024–25:

If we develop models for remixing content such as content simplification or summarization that can be hosted and served via our infrastructure (e.g. LiftWing), we will establish the technical direction for work focused on increasing reader retention through new content discovery features.

Current status


2024-07-04: set up page

2024-07-15: Identified model requirements for respective tasks

2024-08-05: Testing candidate models

Up next: Evaluation of candidate models. This includes identifying easily interpretable metrics that indicate whether model output is suitable for the task.

Background


One of the objectives in the Annual Plan concerns the Reader experience (WE3): A new generation of consumers arrives at Wikipedia to discover a preferred destination for discovering, engaging, and building a lasting connection with encyclopedic content. The goals are to:

  • Retain existing and new generations of consumers and donors.
  • Increase relevance to existing and new generations of consumers by making our content easier to discover and interact with.
  • Work across platforms to adapt our experiences and existing content, so that encyclopedic content can be explored and curated by and for a new generation of consumers and donors.

As part of the Key Result WE.3.1 towards this goal, we want to explore opportunities for readers to more easily discover and learn from content they are interested in. In this project, we focus on models for simplifying the existing content on Wikipedia.

The Readability Gap:

  • We have shown in previous work that content in Wikipedia is generally very difficult to read.[1] This means that much of the existing content might not be very accessible to the larger population in terms of readability (the ease with which a reader can understand a written text), considering average reading ability (even among adults).
  • There are some Wikipedias with articles using a decidedly simpler language, such as Simple English Wikipedia, as well as children’s encyclopedias (Vikidia, Txikipedia, Klexikon, Wikikids). However, these exist in only a few languages (compared to the more than 300 languages in Wikipedia) and cover a much smaller number of articles (for example, as of July 2024, Simple English Wikipedia contains around 250K articles vs 6.8M in English Wikipedia).

Automatic Simplification:

  • In order to improve the readability of an article, we would thus like to develop a tool that can automatically generate a simplified version of the text that could be surfaced to the reader. With the recent improvements in performance and availability of Large Language Models (LLMs), it seems feasible to develop a model for this task.
  • In previous exploratory work, we showed that it is possible to automatically generate simplified versions of text with some success, even in languages beyond English.

Goals

  • [Done] Identify requirements (infrastructure, performance, quality, languages, context, etc)
  • Review candidate models compatible with requirements
  • Implement one or more candidate models

Defining model requirements


In order to decide which model to use for the corresponding tasks, I identified the following requirements:

  • Multilingual: The model should support at least some languages other than English; ideally, as many languages as possible from the more than 300 languages in Wikipedia.
  • Openness: The model should be open so we can deploy it as a production service in our own infrastructure (LiftWing). The exact definition of "open" still needs to be determined.
  • Resources: We need to be able to host the model in our infrastructure in LiftWing. This sets a limit on the model size (e.g. number of parameters). Additional constraints come from performance, e.g., the time to return results should be limited.
  • Use-case: Does the model have a chance to be effective for the respective task (based on Research and what we know)? Has the model been used for this task or similar tasks before?
  • Quality: The output of the model needs to be useful, e.g. the quality should pass some threshold. This requires some evaluation of the model output (automated and/or manual).

Candidate models


Disclaimer: This is work in progress and still in an early testing stage. Most importantly, we have not systematically evaluated the candidate models.

As a first step, we identified two candidate tasks: text simplification and section gists.

Simplification


Text simplification aims to rephrase the text to make it easier to read and easier to understand while retaining the content (and meaning) of the original text.

Motivation. We have shown in previous work that content in Wikipedia is generally very difficult to read. This means that much of the existing content might not be very accessible to the larger population in terms of readability (the ease with which a reader can understand a written text), considering average reading ability (even among adults). In order to improve the readability of an article, we would thus like to develop a tool that can automatically generate a simplified version of the text (i.e. the same text but using simpler language, such as simple English) that could be surfaced to readers. With the recent improvements in performance and availability of Large Language Models (LLMs), it seems feasible to develop a model for this task.

Implementation. We train a sequence-to-sequence language model following recent approaches in document-level text simplification[2].

As training data, we use an annotated reference dataset (WikiReaD, see below), which contains pairs of articles (original and a simplified version) obtained by matching Wikipedia with a simplified or children's encyclopedia across 14 languages. We then fine-tune a pre-trained language model using the pairs of articles as samples for the model’s input (original) and output (simplified). Specifically, we fine-tune two recent models: Flan-T5 (large) and mt0 (base). We chose these models based on an evaluation of the requirements defined above:

  • Multilingual: These models are multilingual, supporting many languages beyond English, based on the documentation in the respective model cards.
  • Openness: The models are available under an open license (Apache 2.0).
  • Resources: We are able to train (i.e., fine-tune) the models inside our own infrastructure, specifically on the analytics clients (stat-boxes), which have a (single) GPU. It is possible to host the trained models in the current LiftWing infrastructure. If, in the future, our infrastructure improves to allow for training/hosting larger models, we can easily adapt by using larger variants of the same model family.
  • Use-case: The model families Flan-T5 and mt0 (based on the mT5 family) have been previously used for training multilingual text simplification models[3].
  • Quality: Previous works have reported good performance of these models on text simplification (though mostly at the sentence level). An exact evaluation still needs to be done.

More details about the implementation will be added.
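In the meantime, the following is a minimal sketch of the fine-tuning setup, assuming the Hugging Face transformers and datasets libraries and a WikiReaD-style file of (original, simplified) article pairs. The file name, column names, and hyperparameters are illustrative assumptions, not the exact configuration used.

import torch
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Illustrative choices: either of the two candidate model families works here.
model_name = "google/flan-t5-large"  # or "bigscience/mt0-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Hypothetical CSV with one (original, simplified) article pair per row.
dataset = load_dataset("csv", data_files={"train": "wikiread_pairs.csv"})

def preprocess(batch):
    # The original article text is the model input; the simplified
    # version is the target sequence.
    inputs = tokenizer(batch["original"], truncation=True, max_length=1024)
    targets = tokenizer(text_target=batch["simplified"], truncation=True, max_length=1024)
    inputs["labels"] = targets["input_ids"]
    return inputs

tokenized = dataset["train"].map(
    preprocess, batched=True, remove_columns=dataset["train"].column_names
)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="simplification-model",
        per_device_train_batch_size=2,   # small batch for a single GPU
        gradient_accumulation_steps=8,
        learning_rate=3e-5,
        num_train_epochs=3,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()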

Example. Let's take the first paragraph of the lead section of the article on Hypatia:

Hypatia (born c. 350–370; died 415 AD) was a Neoplatonist philosopher, astronomer, and mathematician who lived in Alexandria, Egypt, then part of the Eastern Roman Empire. She was a prominent thinker in Alexandria where she taught philosophy and astronomy. Although preceded by Pandrosion, another Alexandrian female mathematician, she is the first female mathematician whose life is reasonably well recorded. Hypatia was renowned in her own lifetime as a great teacher and a wise counselor. She wrote a commentary on Diophantus's thirteen-volume Arithmetica, which may survive in part, having been interpolated into Diophantus's original text, and another commentary on Apollonius of Perga's treatise on conic sections, which has not survived. Many modern scholars also believe that Hypatia may have edited the surviving text of Ptolemy's Almagest, based on the title of her father Theon's commentary on Book III of the Almagest.

The resulting simplified version from the model is:

Hypatia (c. 350–370; died 415 AD) was a Neoplatonist philosopher, astronomer, and mathematician. She lived in Alexandria, Egypt, then part of the Eastern Roman Empire. She was a prominent thinker in Alexandria where she taught philosophy and astronomy. She is the first female mathematician whose life is fairly well recorded. Hypatia was known in her own lifetime as a great teacher and a wise counselor. She wrote a commentary on Diophantus' thirteen-volume Arithmetica, which may survive in part, having been interpolated into Diophantus' original text. She also wrote a commentary on Apollonius of Perga's treatise on conic sections, which has not survived. Many modern scholars also believe that Hypatia may have edited the surviving text of Ptolemy's Almagest

The Flesch-Kincaid grade level of the simplified version is 10.2 (vs 12.3 for the original), showing that it is easier to read according to this readability score.
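Readability scores like this can be computed with off-the-shelf tools. Below is a small sketch using the textstat package (which applies to English text); the truncated strings stand in for the full paragraphs above.

# Compare the readability of the original and simplified paragraphs using
# the Flesch-Kincaid grade level:
#   0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
# Assumes the textstat package is installed (pip install textstat).
import textstat

original = "Hypatia (born c. 350–370; died 415 AD) was a Neoplatonist philosopher, ..."  # full paragraph above
simplified = "Hypatia (c. 350–370; died 415 AD) was a Neoplatonist philosopher, ..."     # model output above

for label, text in [("original", original), ("simplified", simplified)]:
    print(f"{label}: Flesch-Kincaid grade level = {textstat.flesch_kincaid_grade(text):.1f}")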

Evaluation. t.b.a.

Section gists


Section gists are plain language summaries of sections of articles. They thus combine simplification with summarization of content.

Motivation. The idea of section gists is taken from the paper Paper Plain: Making Medical Research Papers Approachable to Healthcare Consumers with Natural Language Processing[4], which aims to improve access to medical papers for readers (outside of Wikipedia). Based on interviews with readers about barriers to interacting with content, and on usability testing, the authors identify section gists as one of the most valuable and most frequently used features for non-expert readers. Specifically, they generate section gists automatically by prompting an LLM to create "a summary for a 5th-grader" (i.e. combining summarization and simplification).

Here, we adapt the same framework to Wikipedia articles. Based on initial discussions with folks in the Web Team, section gists could align very well with some of the ideas the team is considering exploring as experiments with readers on Wikipedia.

Implementation. We use the Aya 23 model to generate section gists. This model is a good candidate for the following reasons:

  • Multilingual: The main advantage is that it supports 23 languages, reportedly covering half the world's population in terms of speakers (more than any comparable LLM that I am aware of).
  • Openness: It is an open-weight model with a CC-BY-NC license. It can be used via Hugging Face.
  • Resources: We will likely be able to host the model in our own infrastructure based on recent experiments with similarly-sized models (T369055).
  • Use-case: Previous works (such as the Paper Plain paper mentioned above) generated section gists by prompting similar LLMs for a summary at a certain grade level. Thus, the Aya model seems suitable for the task at hand.
  • Quality: The technical report shows that the model outperforms both previous massively multilingual models like Aya 101 for the languages it covers, and widely used models like Gemma, Mistral and Mixtral, on an extensive range of discriminative and generative tasks. Specifically, it is shown to perform well on summarization tasks (Sec. 5.3). With our task operationalized as a variant of summarization, we can expect the model to, in principle, yield good results, though in practice this is difficult to evaluate automatically.

We can run the model on the text of individual sections using the following prompt:

## Instructions
Summarize the text below for a 7th grader in {LANGUAGE}. Just return the summary.

## Input text
{TEXT}
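A minimal sketch of sending this prompt to the model via Cohere's API (as in the tutorial notebook mentioned below) could look as follows. The model identifier and default parameter values are assumptions; the open weights can alternatively be run locally via Hugging Face.

# Sketch: generate a section gist by prompting Aya 23 through Cohere's API.
# Assumptions: the cohere Python SDK is installed, an API key is set in
# COHERE_API_KEY, and "c4ai-aya-23" is the model identifier (illustrative).
import os
import cohere

co = cohere.Client(os.environ["COHERE_API_KEY"])

def section_gist(text: str, language: str = "English", max_tokens: int = 256) -> str:
    prompt = (
        "## Instructions\n"
        f"Summarize the text below for a 7th grader in {language}. Just return the summary.\n\n"
        "## Input text\n"
        f"{text}"
    )
    response = co.chat(model="c4ai-aya-23", message=prompt, max_tokens=max_tokens)
    return response.text

# Example: gist for a section's plain text (truncated here for brevity).
print(section_gist("Hypatia (born c. 350–370; died 415 AD) was a ...", language="English"))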

There are different options for adapting the section gist, for example:

  • Length (specify the maximum number of tokens)
  • Readability level (e.g. specify a different grade level for the summary in the prompt)

Example. Let's take the lead section of the article on Hypatia:

Hypatia (born c. 350–370; died 415 AD) was a Neoplatonist philosopher, astronomer, and mathematician who lived in Alexandria, Egypt, then part of the Eastern Roman Empire. She was a prominent thinker in Alexandria where she taught philosophy and astronomy. Although preceded by Pandrosion, another Alexandrian female mathematician, she is the first female mathematician whose life is reasonably well recorded. Hypatia was renowned in her own lifetime as a great teacher and a wise counselor. She wrote a commentary on Diophantus's thirteen-volume Arithmetica, which may survive in part, having been interpolated into Diophantus's original text, and another commentary on Apollonius of Perga's treatise on conic sections, which has not survived. Many modern scholars also believe that Hypatia may have edited the surviving text of Ptolemy's Almagest, based on the title of her father Theon's commentary on Book III of the Almagest.

Hypatia constructed astrolabes and hydrometers, but did not invent either of these, which were both in use long before she was born. She was tolerant toward Christians and taught many Christian students, including Synesius, the future bishop of Ptolemais. Ancient sources record that Hypatia was widely beloved by pagans and Christians alike and that she established great influence with the political elite in Alexandria. Toward the end of her life, Hypatia advised Orestes, the Roman prefect of Alexandria, who was in the midst of a political feud with Cyril, the bishop of Alexandria. Rumors spread accusing her of preventing Orestes from reconciling with Cyril and, in March 415 AD, she was murdered by a mob of Christians led by a lector named Peter.

Hypatia's murder shocked the empire and transformed her into a "martyr for philosophy", leading future Neoplatonists such as the historian Damascius (c. 458 – c. 538) to become increasingly fervent in their opposition to Christianity. During the Middle Ages, Hypatia was co-opted as a symbol of Christian virtue and scholars believe she was part of the basis for the legend of Saint Catherine of Alexandria. During the Age of Enlightenment, she became a symbol of opposition to Catholicism. In the nineteenth century, European literature, especially Charles Kingsley's 1853 novel Hypatia, romanticized her as "the last of the Hellenes". In the twentieth century, Hypatia became seen as an icon for women's rights and a precursor to the feminist movement. Since the late twentieth century, some portrayals have associated Hypatia's death with the destruction of the Library of Alexandria, despite the historical fact that the library no longer existed during Hypatia's lifetime.

We would get the following section gist:

Hypatia was a smart and brave woman who lived a long time ago in a place called Alexandria, Egypt. She was a teacher and a philosopher, and she knew a lot about math and the stars. She was liked by many people, both Christians and non-Christians. She taught Christian students and was friends with a man named Orestes, who was in charge of Alexandria. Some people were jealous of her, and they spread rumors about her. One day, a group of people attacked her, and she was killed. Her death made a lot of people very sad, and she became famous for standing up for what she believed in. People still remember her today as a brave woman who fought for her beliefs and helped others learn and understand the world.

You can use the tutorial notebook to run the model yourself on PAWS. Currently, the model is accessed via Cohere’s API, for which you need to create a free trial key. The notebook contains some examples; you can adapt it to get section gists for different articles and in different languages.

Evaluation. t.b.a.

Resources

  • WikiReaD: This dataset contains pairs of encyclopedic articles in 14 languages. Each pair includes the same article at two levels of readability (easy/hard). The pairs are obtained by matching Wikipedia articles (hard) with corresponding versions from different simplified or children's encyclopedias (easy).

References

  1. Trokhymovych, Mykola; Sen, Indira; Gerlach, Martin (2024). "An Open Multilingual System for Scoring Readability of Wikipedia". arXiv preprint arXiv:2406.01835. https://arxiv.org/abs/2406.01835
  2. Sun, Renliang; Jin, Hanqi; Wan, Xiaojun (2021). "Document-Level Text Simplification: Dataset, Criteria and Baseline". In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. pp. 7997–8013. doi:10.18653/v1/2021.emnlp-main.630.
  3. Joseph, Sebastian; Kazanas, Kathryn; Reina, Keziah; Ramanathan, Vishnesh; Xu, Wei; Wallace, Byron; Li, Junyi (2023). "Multilingual Simplification of Medical Texts". In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. pp. 16662–16692. doi:10.18653/v1/2023.emnlp-main.1037.
  4. August, Tal; Wang, Lucy Lu; Bragg, Jonathan; Hearst, Marti A.; Head, Andrew; Lo, Kyle (2023). "Paper Plain: Making Medical Research Papers Approachable to Healthcare Consumers with Natural Language Processing". ACM Transactions on Computer-Human Interaction 30 (5): 1–38. ISSN 1073-0516. doi:10.1145/3589955.