
Research:Develop a model for text simplification to improve readability of Wikipedia articles/FY24-25 WE.3.1.3 content simplification

From Meta, a Wikimedia project coordination wiki

This page captures the work on hypothesis WE.3.1.3 as part of Product & Tech’s Annual Plan for Fiscal year 24-25:

If we develop models for remixing content, such as content simplification or summarization, that can be hosted and served via our infrastructure (e.g. LiftWing), we will establish the technical direction for work focused on increasing reader retention through new content discovery features.

Summary


The hypothesis was confirmed.

  • We implemented an LLM in our infrastructure to generate simple summaries of sections of Wikipedia articles.

Main deliverables:

  • Implementation of a model (Aya-expanse-32b) generating simple summaries on the ML-Lab servers, with a test deployment on LiftWing
  • Code repository: https://gitlab.wikimedia.org/repos/research/simple-summaries

Major lessons

  • Our new ML-Lab servers support state-of-the-art multilingual models for text generation. However, additional work is needed to optimize these models to reduce latency and memory footprint.
  • Evaluation of the model output is challenging. The lack of simple metrics to judge the quality of the simple summaries (or any generated text) makes it difficult to iteratively improve the model via offline experiments (i.e., without asking human raters).

Next steps

  • The crucial next step is to optimize the model latency (memory footprint and inference time) via quantization or other suitable approaches.

Current status


2024-07-04: set up page

2024-07-15: Identified model requirements for respective tasks

2024-08-05: Testing candidate models

2024-09: Decision for use-case to generate simple summaries based on feedback from Web Team's experiments

2024-10: Identification of suitable metrics for evaluating simple summaries via a set of guardrail metrics

2024-11: Implementing model for simple summaries on ML-Lab servers and LiftWing

2024-12: Documentation/closing

Background


One of the objectives in the Annual Plan concerns the Reader experience (WE3): a new generation of consumers arrives at Wikipedia to discover a preferred destination for discovering, engaging, and building a lasting connection with encyclopedic content. The goals are to:

  • Retain existing and new generations of consumers and donors.
  • Increase relevance to existing and new generations of consumers by making our content easier to discover and interact with.
  • Work across platforms to adapt our experiences and existing content, so that encyclopedic content can be explored and curated by and to a new generation of consumers and donors.

As part of the Key Result WE.3.1 towards this goal, we want to explore opportunities for readers to more easily discover and learn from content they are interested in. In this project, we focus on models for simplifying the existing content on Wikipedia.

The Readability Gap:

  • We have shown in previous work that content in Wikipedia is generally very difficult to read.[1] This means that much of the existing content might not be very accessible to the larger population in terms of readability (the ease with which a reader can understand a written text), considering the average reading ability (even among adults).
  • There are some Wikipedias with articles using decidedly simpler language, such as Simple English Wikipedia, or children's encyclopedias (Vikidia, Txikipedia, Klexikon, Wikikids). However, they exist in only a few languages (compared to the more than 300 languages in Wikipedia) and cover a much smaller number of articles (for example, as of July 2024, Simple English Wikipedia contains around 250K articles vs. 6.8M in English Wikipedia).

Automatic Simplification:

  • In order to improve the readability of an article, we would thus like to develop a tool that can automatically generate a simplified version of the text that could be surfaced to the reader. With the recent improvements in performance and availability of Large Language Models (LLMs), it seems feasible to develop a model for this task.
  • In previous exploratory work, we showed that it is possible to automatically generate simplified versions of text with some success, even across languages beyond English.

Goals

  • [Done] Identify requirements (infrastructure, performance, quality, languages, context, etc)
  • [Done] Review candidate models compatible with requirements
  • [Done] Implement one or more candidate models

Defining model requirements


In order to decide which model to use for the corresponding tasks, I identified the following requirements:

  • Multilingual: The model should support at least some languages other than English; ideally, as many languages as possible from the more than 300 languages in Wikipedia.
  • Openness: The model should be open so we can deploy it as a production service in our own infrastructure (LiftWing). Which definition of open needs to be determined.
  • Resources: We need to be able to host the model in our infrastructure in LiftWing. This sets a limit on the model size (e.g. number of parameters). Additional constraints come from performance, e.g., the time to return results should be limited.
  • Use-case: Does the model have a chance to be effective for the respective task (based on Research and what we know)? Has the model been used for this task or similar tasks before?
  • Quality: The output of the model needs to be useful, e.g. the quality should pass some threshold. This requires some evaluation of the model output (automated and/or manual etc.).

Candidate models


In the first step, we identified two potential candidate models: text simplification and section gists.

Simplification


Text simplification aims to rephrase the text to make it easier to read and easier to understand while retaining the content (and meaning) of the original text.

Motivation. We have shown in previous work that content in Wikipedia is generally very difficult to read. This means that much of the existing content might not be very accessible to the larger population in terms of readability (the ease with which a reader can understand a written text), considering the average reading ability (even among adults). In order to improve the readability of an article, we would thus like to develop a tool that can automatically generate a simplified version of the text (i.e. the same text but using simpler language, such as simple English) that could be surfaced to readers. With the recent improvements in performance and availability of Large Language Models (LLMs), it seems feasible to develop a model for this task.

Implementation. We train a sequence-to-sequence language model following recent approaches in document-level text simplification[2].

As training data, we use an annotated reference dataset (WikiReaD, see below) which contains pairs of articles (original and a simplified version), obtained by matching Wikipedia with a simplified or children's encyclopedia across 14 languages. We then fine-tune a pre-trained language model using the pairs of articles as samples for the model's input (original) and output (simplified). Specifically, we fine-tune two recent models: Flan-T5 (large) and mt0 (base). We chose these models based on an evaluation of the requirements defined above:

  • Multilingual: These models are multilingual, supporting many languages beyond English, based on the documentation in the respective model cards.
  • Openness: The models are available under an open license (Apache 2.0).
  • Resources: We are able to train (i.e., fine-tune) the models inside our own infrastructure, specifically on the analytics clients (stat-boxes), which have a (single) GPU. It is possible to host the trained models in the current infrastructure in LiftWing. If, in the future, our infrastructure improves to allow for training/hosting larger models, we can easily adapt these models using larger variants of the same model family.
  • Use-case: The model families Flan-T5 and mt0 (based on the mT5 family) have been previously used for training multilingual text simplification models[3].
  • Quality: Previous works have reported good performance of these models in text simplification (though mostly on sentences). Exact evaluation still needs to be done.

Code for implementing the model can be found in this repository: https://gitlab.wikimedia.org/repos/research/text-simplification/-/tree/main
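For illustration, a minimal sketch of this fine-tuning setup with the Hugging Face transformers library is shown below; the data loading here is a placeholder, and the complete training code (with all details) is in the repository above.

# Minimal fine-tuning sketch (data loading is a placeholder; the actual
# training code lives in the repository linked above)
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_id = "google/flan-t5-large"  # or "bigscience/mt0-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Placeholder: pairs of (original, simplified) articles from WikiReaD
pairs = [{"original": "Original article text ...", "simplified": "Simplified article text ..."}]
dataset = Dataset.from_list(pairs)

def preprocess(batch):
    # original article as model input, simplified version as target
    model_inputs = tokenizer(batch["original"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["simplified"], max_length=1024, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="simplification-model", per_device_train_batch_size=4),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()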

Example. As an example, let's take the first paragraph of the lead section of the article on Hypatia:

Hypatia (born c. 350–370; died 415 AD) was a Neoplatonist philosopher, astronomer, and mathematician who lived in Alexandria, Egypt, then part of the Eastern Roman Empire. She was a prominent thinker in Alexandria where she taught philosophy and astronomy. Although preceded by Pandrosion, another Alexandrian female mathematician, she is the first female mathematician whose life is reasonably well recorded. Hypatia was renowned in her own lifetime as a great teacher and a wise counselor. She wrote a commentary on Diophantus's thirteen-volume Arithmetica, which may survive in part, having been interpolated into Diophantus's original text, and another commentary on Apollonius of Perga's treatise on conic sections, which has not survived. Many modern scholars also believe that Hypatia may have edited the surviving text of Ptolemy's Almagest, based on the title of her father Theon's commentary on Book III of the Almagest.

The resulting simplified version from the model is:

Hypatia (c. 350–370; died 415 AD) was a Neoplatonist philosopher, astronomer, and mathematician. She lived in Alexandria, Egypt, then part of the Eastern Roman Empire. She was a prominent thinker in Alexandria where she taught philosophy and astronomy. She is the first female mathematician whose life is fairly well recorded. Hypatia was known in her own lifetime as a great teacher and a wise counselor. She wrote a commentary on Diophantus' thirteen-volume Arithmetica, which may survive in part, having been interpolated into Diophantus' original text. She also wrote a commentary on Apollonius of Perga's treatise on conic sections, which has not survived. Many modern scholars also believe that Hypatia may have edited the surviving text of Ptolemy's Almagest

The Flesch-Kincaid grade level of the latter is 10.2 (vs. 12.3 for the original), showing that it is easier to read according to the readability score.
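Such scores can be reproduced with standard readability tooling; for example, a small sketch using the textstat package (an assumption for illustration; any readability library works):

# Compare Flesch-Kincaid grade levels of the original and simplified text
# (sketch using the textstat package; not part of the project code)
import textstat

original_text = "Hypatia (born c. 350-370; died 415 AD) was a Neoplatonist philosopher ..."
simplified_text = "Hypatia (c. 350-370; died 415 AD) was a Neoplatonist philosopher ..."

print(textstat.flesch_kincaid_grade(original_text))    # e.g. ~12.3 for the full paragraph
print(textstat.flesch_kincaid_grade(simplified_text))  # e.g. ~10.2 for the full paragraph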

Section gists


Section gists are plain language summaries of sections of articles. They thus combine simplification with summarization of content.

Motivation. The idea of section gists is taken from the paper Paper Plain: Making Medical Research Papers Approachable to Healthcare Consumers with Natural Language Processing[4], which aims to improve access to medical papers for readers (outside of Wikipedia). Based on interviews with readers about barriers to interacting with content, and on usability testing, the authors identify section gists as a valuable and frequently used feature for non-expert readers. Specifically, they generate section gists automatically by prompting an LLM to create "a summary for a 5th-grader" (i.e. combining summarization and simplification).

Here, we adapt the same framework to Wikipedia articles. Based on initial discussions with folks in the Web Team, section gists could align very well with some of the ideas the team is considering exploring as experiments with readers in Wikipedia.

Implementation. We use the Aya 23 model to generate section gists. This model is a good candidate because of the following aspects:

  • Multilingual: The main advantage is that it supports 23 languages, supposedly covering half the world's population in terms of speakers (more than any comparable LLM that I am aware of).
  • Openness: It is an open-weight model with a CC-BY-NC license. It can be used via Hugging Face.
  • Resources: We will likely be able to host the model in our own infrastructure based on recent experiments with similarly-sized models (T369055).
  • Use-case: Previous works (such as the Paper Plain paper mentioned above) generated section gists using similar LLMs via text-generation prompts asking for a summary at a certain grade level. Thus, the Aya model seems suitable for the task at hand.
  • Quality: The technical report shows that the model outperforms previous massively multilingual models like Aya 101 for the languages it covers, as well as widely used models like Gemma, Mistral and Mixtral, on an extensive range of discriminative and generative tasks. Specifically, it is shown to perform well on summarization tasks (Sec. 5.3). With our task operationalized as a variant of summarization, we can expect the model, in principle, to yield good results; in practice, though, this is difficult to evaluate automatically.

We can run the model on the text of individual sections by prompting the model in the following way:

## Instructions
Summarize the text below for a 7-th grader in {LANGUAGE}. Just return the summary.

## Input text
{TEXT}

There are different options to adapt the section gist, in terms of:

  • Length (specify the maximum number of tokens)
  • Readability level (e.g. specify a different grade level for the summary)
  • Etc.

Example. As an example, let's take the lead section of the article on Hypatia:

Hypatia (born c. 350–370; died 415 AD) was a Neoplatonist philosopher, astronomer, and mathematician who lived in Alexandria, Egypt, then part of the Eastern Roman Empire. She was a prominent thinker in Alexandria where she taught philosophy and astronomy. Although preceded by Pandrosion, another Alexandrian female mathematician, she is the first female mathematician whose life is reasonably well recorded. Hypatia was renowned in her own lifetime as a great teacher and a wise counselor. She wrote a commentary on Diophantus's thirteen-volume Arithmetica, which may survive in part, having been interpolated into Diophantus's original text, and another commentary on Apollonius of Perga's treatise on conic sections, which has not survived. Many modern scholars also believe that Hypatia may have edited the surviving text of Ptolemy's Almagest, based on the title of her father Theon's commentary on Book III of the Almagest.

Hypatia constructed astrolabes and hydrometers, but did not invent either of these, which were both in use long before she was born. She was tolerant toward Christians and taught many Christian students, including Synesius, the future bishop of Ptolemais. Ancient sources record that Hypatia was widely beloved by pagans and Christians alike and that she established great influence with the political elite in Alexandria. Toward the end of her life, Hypatia advised Orestes, the Roman prefect of Alexandria, who was in the midst of a political feud with Cyril, the bishop of Alexandria. Rumors spread accusing her of preventing Orestes from reconciling with Cyril and, in March 415 AD, she was murdered by a mob of Christians led by a lector named Peter.

Hypatia's murder shocked the empire and transformed her into a "martyr for philosophy", leading future Neoplatonists such as the historian Damascius (c. 458 – c. 538) to become increasingly fervent in their opposition to Christianity. During the Middle Ages, Hypatia was co-opted as a symbol of Christian virtue and scholars believe she was part of the basis for the legend of Saint Catherine of Alexandria. During the Age of Enlightenment, she became a symbol of opposition to Catholicism. In the nineteenth century, European literature, especially Charles Kingsley's 1853 novel Hypatia, romanticized her as "the last of the Hellenes". In the twentieth century, Hypatia became seen as an icon for women's rights and a precursor to the feminist movement. Since the late twentieth century, some portrayals have associated Hypatia's death with the destruction of the Library of Alexandria, despite the historical fact that the library no longer existed during Hypatia's lifetime.

We would get the following section gist:

Hypatia was a smart and brave woman who lived a long time ago in a place called Alexandria, Egypt. She was a teacher and a philosopher, and she knew a lot about math and the stars. She was liked by many people, both Christians and non-Christians. She taught Christian students and was friends with a man named Orestes, who was in charge of Alexandria. Some people were jealous of her, and they spread rumors about her. One day, a group of people attacked her, and she was killed. Her death made a lot of people very sad, and she became famous for standing up for what she believed in. People still remember her today as a brave woman who fought for her beliefs and helped others learn and understand the world.

You can use the tutorial notebook to run the model yourself on PAWS. Currently, the model is used via Cohere's API, for which you need to create a free trial key. The notebook contains some examples. You can adapt the notebook to get section gists for different articles and in different languages.
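In essence, the notebook does something like the following (a sketch; the exact model name and SDK calls are assumptions, so check the tutorial notebook for the actual setup):

# Sketch of generating a section gist via Cohere's API (model id is an
# assumption; see the tutorial notebook for the exact setup)
import cohere

co = cohere.Client("YOUR_TRIAL_API_KEY")  # free trial key from Cohere

section_text = "..."  # placeholder: text of the article section

prompt = """## Instructions
Summarize the text below for a 7-th grader in {language}. Just return the summary.

## Input text
{text}""".format(language="English", text=section_text)

response = co.chat(model="c4ai-aya-23", message=prompt)
print(response.text)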

Simple Summaries


Our main goal for the hypothesis is to implement a model that can generate simple summaries of sections of articles. This is very similar to the concept of section gists discussed above. The main reason to focus on this model is that the Web Team has identified this as a relevant use-case as part of their experiments.

Task


Given the text of a section of an article, a simple summary has the following features:

  • Summary: It is substantially shorter than the original section while still capturing the main information.
  • Simplicity: It is substantially easier to read (e.g. it improves the readability score).
  • Meaning preservation: Its content is factually consistent with the information contained in the text of the article.

Implementation


We use the Aya-expanse model, specifically Aya-expanse-32b. The model is an improvement over the Aya-23 model that we considered in earlier exploratory research (see above). It is an open-weights model and is the state of the art for multilingual AI, supporting 23 languages. The comparably moderate size of the model (32B parameters) allows us to implement and host it in our own infrastructure.

ML-Lab


We have implemented the model on the ML-Lab servers using the transformers library. Note that we use a different datatype (float16 instead of the default float32) in order to reduce the memory footprint.

# Loading the model
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu') # check if GPU is available

model_id = "CohereForAI/aya-expanse-32b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to(device)


Using a prompt (and optionally a preamble), we can generate summaries in the following way:

def generate_aya(
    model, 
    tokenizer, 
    data_in,
    temperature=0.3,
    top_p=1.0,
    top_k=0,
    max_new_tokens=256,
    do_sample=True
):
    # format the input
    preamble = data_in["preamble"]
    prompt = data_in["prompt"]
    messages = [
        {"role": "system", "content": preamble},
        {"role": "user", "content": prompt}
    ]
    input_ids = tokenizer.apply_chat_template(
        messages, 
        tokenize=True, 
        add_generation_prompt=True, 
        return_tensors="pt"
    ).to(model.device)
    
    # generate summary
    gen_tokens = model.generate(
        input_ids, 
        top_p=top_p,
        top_k=top_k,
        max_new_tokens=max_new_tokens,
        temperature=temperature,
        do_sample=do_sample, 
    )

    # format the output
    gen_text = tokenizer.decode(gen_tokens[0], skip_special_tokens=False)
    return gen_text


An example preamble and prompt template could look like this (where language is the language in which the article is written, and input_text is the text of the original article section):

preamble = """You are writing encyclopedic articles in the style of Wikipedia using simplified language and a neutral tone."""

prompt = """## Instructions
Summarize the text below at a 7-th grade reading level, in {language}, using 100 words or less. Return only the summary.

## Input text
{input_text}"""


With this setup, the model's memory footprint is 60GB and thus fits into the memory of a single GPU. Generating a simple summary for a single section of an article takes around 10s. These numbers can be further reduced through additional optimization (see Open problems and next steps below).
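As a back-of-the-envelope check on the footprint: with float16, each of the model's roughly 32 billion parameters takes 2 bytes, i.e. on the order of 64GB for the weights alone, about half of what the default float32 (4 bytes per parameter) would require. This rough estimate matches the order of magnitude of the observed footprint.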

Example notebook: https://gitlab.wikimedia.org/repos/research/simple-summaries/-/blob/main/simple-summary_aya_example.ipynb

LiftWing


We successfully built a test deployment of the model in a staging environment (only accessible internally, e.g., from the stat-machines).

Example query to the (smaller) Aya-expanse-8b model:

$ curl "https://inference-staging.svc.codfw.wmnet:30443/openai/v1/completions" -H "Host: aya.experimental.wikimedia.org" -H "Content-Type: application/json" -X POST -d '{"model": "aya-expanse-8B", "prompt": ".", "max_tokens": 100}'


Implementation details: This is deployed using the Hugging Face runtime available in KServe, which has an OpenAI API integrated ("openai/v1/completions").
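The same query could also be sent from Python (e.g., from a stat-machine), assuming the requests library:

# Same query as the curl example above, from Python
import requests

response = requests.post(
    "https://inference-staging.svc.codfw.wmnet:30443/openai/v1/completions",
    headers={
        "Host": "aya.experimental.wikimedia.org",
        "Content-Type": "application/json",
    },
    json={"model": "aya-expanse-8B", "prompt": ".", "max_tokens": 100},
)
print(response.json())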

Evaluation


In order to assess whether the model works in practice and to iteratively improve it, it is crucial to evaluate the performance using some evaluation metric.

With recent Large Language Models, the evaluation of natural language generation (NLG) or text generation is a difficult and unsolved task[5]. There are existing and commonly used automatic metrics for tasks around summarization (e.g. ROUGE) or simplification (e.g. BLEU, SARI, etc.). However, these metrics suffer from many drawbacks:[6]

  • The metrics typically require a ground-truth or reference dataset. In practice, for many tasks such data is not easily available, especially when looking at languages beyond English. For example, we do not have readily available and verified simple summaries of Wikipedia articles.
  • The metrics have been shown to correlate poorly with human judgement. That is, while they are convenient to calculate, they do not necessarily align with how humans would rate the quality of the generated text. For example, for some text simplification metrics it has been shown that “low scores” indicate “bad quality”; whereas, in contrast, “high scores” do not necessarily imply “good quality” of the simplification.
  • They are often not easily interpretable. For example, the SARI score is an average of two F1-scores (for add and keep operations) and a precision score (for delete operations). As a result, it is not clear what value of SARI should be considered acceptable or good enough.

Overall, this renders the common summarization/simplification metrics not very useful for deciding whether, or which, model to deploy in practice. As an alternative approach, we can use a set of simpler, easy-to-interpret guardrail metrics to assess specific aspects of the quality of the generated simple summaries. We focus on three aspects that are typically considered when asking human judges to rate simplifications[7], and define an automatic metric as a proxy for each:

  • Simplicity captures the readability of the generated text (i.e. how easy it is to read). We calculate a readability score, such as the Flesch-Kincaid grade level (for English) or the multilingual readability score (beyond English)[1]. Ideally, the grade level of the simple summary is lower than that of the original article.
  • Fluency captures the degree to which the generated text is grammatical. We calculate the number of grammar and spelling errors, e.g., using LanguageTool. Ideally, the simple summary does not have any grammatical or spelling errors.
  • Meaning preservation captures whether the generated text is factually consistent with the original text. We calculate the score (probability) that the generated text is entailed by the original text using the SummaC model [8]. This score is between 0 (no entailment/inconsistent) and 1 (high entailment/consistent). Ideally, the score is above some threshold (say 0.4) to make sure that information in the simple summary is consistent with the original article.

These three guardrail metrics provide interpretable information about specific aspects of the quality of the generated simple summaries. They are meant to help identify potential issues with individual simple summaries so they can be checked and filtered (if needed) during post-processing. For example, simple summaries with low scores on meaning preservation (say, below 0.25) are likely to contain information that is not contained in the original text (e.g. hallucinations). Naturally, there are other aspects of quality, such as neutral point of view or tone, for which we do not have readily available metrics (see Open problems below). A sketch of computing the three metrics is shown below.
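For a single (original, summary) pair, the guardrail metrics can be computed roughly as follows (a sketch; the package choices for fluency and meaning preservation are assumptions, and the full setup is in the example notebook linked below):

# Guardrail metrics for one (original, summary) pair (sketch)
import textstat
import language_tool_python
from summac.model_summac import SummaCZS

original = "..."  # placeholder: text of the original section
summary = "..."   # placeholder: generated simple summary

# Simplicity: Flesch-Kincaid grade level (English); lower means easier to read
simplicity = textstat.flesch_kincaid_grade(summary)

# Fluency: number of grammar/spelling issues flagged by LanguageTool
tool = language_tool_python.LanguageTool("en-US")
fluency_errors = len(tool.check(summary))

# Meaning preservation: entailment score in [0, 1] from the SummaC model
summac_model = SummaCZS(granularity="sentence", model_name="vitc", device="cpu")
meaning = summac_model.score([original], [summary])["scores"][0]

print(simplicity, fluency_errors, meaning)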


Example notebook: https://gitlab.wikimedia.org/repos/research/simple-summaries/-/blob/main/simple-summary_eval-guardrail_example.ipynb

Open problems and next steps


In this hypothesis, we demonstrated the feasibility of running/hosting a model to generate simple summaries in our own infrastructure. If the model is considered useful, the next step would be to scale the model for deployment. Specifically, we would need to optimize how we run the model in order to reduce memory footprint and inference time. This work is beyond the scope of the current task and deserves a dedicated task. It includes (but is not limited to):

  • Inference optimization on GPU: This requires a systematic investigation of the different available options and if/how they work in our infrastructure (e.g. many approaches are not supported on ROCm GPUs). In turn, we also need to better understand the trade-off with model quality in order to make sure that the output is still acceptable for the task at hand. Although using the prebuilt Hugging Face runtime provides a simpler way to deploy models, it does not facilitate the level of customization we want at this point in order to explore. This involves testing the different options on ML-Lab and then deploying them to Lift Wing; deployment on Lift Wing involves building the packages from source for the required GPU architecture in a reproducible way, so that we can iterate and update them when needed.
  • Batch inference: The current implementation generates one sample at a time. However, it is possible to run the model in batches, which can reduce the time per sample (though at an additional memory cost); for example, see here, and the sketch below.
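As an illustration, a batched version of the generation step could look like the following (a sketch under the assumption that the chat template is applied as plain strings and left-padding is used; this has not been tested in our infrastructure):

# Batched generation sketch (model and tokenizer as loaded above)
prompts = ["...", "..."]  # placeholder: one filled-in prompt per section

texts = [
    tokenizer.apply_chat_template(
        [{"role": "system", "content": preamble},
         {"role": "user", "content": p}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for p in prompts
]

tokenizer.padding_side = "left"  # left-padding for decoder-only generation
batch = tokenizer(texts, return_tensors="pt", padding=True).to(model.device)

gen_tokens = model.generate(
    **batch,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.3,
)
summaries = tokenizer.batch_decode(gen_tokens, skip_special_tokens=True)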

In addition, we identified open problems which could be fixed in future iterations of the model:

  • Improve the evaluation of the model quality via additional guardrail metrics to also capture important aspects that we are currently ignoring, such as NPOV or tone. These automatic metrics are crucial for tracking overall performance when iterating on details of the model. Similar evaluation challenges will likely occur in other tasks that depend on text generation.
  • Improve the prompt to balance different aspects of the simple summaries (e.g. improve simplicity while preserving an encyclopedic tone as well as factual consistency). As we scale the model to more articles/languages, we will likely discover new issues with the generated simple summaries, which will require further improvements to the prompts.

Resources

  • WikiReaD: The dataset contains pairs of encyclopedic articles in 14 languages. Each pair includes the same article at two levels of readability (easy/hard). The pairs are obtained by matching Wikipedia articles (hard) with the corresponding versions from different simplified or children's encyclopedias (easy).
  • Repository for running model to generate simple summaries: https://gitlab.wikimedia.org/repos/research/simple-summaries

References

  1. Trokhymovych, Mykola; Sen, Indira; Gerlach, Martin (2024). "An Open Multilingual System for Scoring Readability of Wikipedia". Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 6296–6311. Bangkok, Thailand: Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.acl-long.342.
  2. Sun, Renliang; Jin, Hanqi; Wan, Xiaojun (2021). "Document-Level Text Simplification: Dataset, Criteria and Baseline". Association for Computational Linguistics, pp. 7997–8013. https://doi.org/10.18653/v1/2021.emnlp-main.630.
  3. Joseph, Sebastian; Kazanas, Kathryn; Reina, Keziah; Ramanathan, Vishnesh; Xu, Wei; Wallace, Byron; Li, Junyi (2023). "Multilingual Simplification of Medical Texts". Association for Computational Linguistics, pp. 16662–16692. https://doi.org/10.18653/v1/2023.emnlp-main.1037.
  4. August, Tal; Wang, Lucy Lu; Bragg, Jonathan; Hearst, Marti A.; Head, Andrew; Lo, Kyle (2023). "Paper Plain: Making Medical Research Papers Approachable to Healthcare Consumers with Natural Language Processing". ACM Transactions on Computer-Human Interaction 30 (5): 1–38. ISSN 1073-0516. https://doi.org/10.1145/3589955.
  5. Gehrmann, Sebastian; Clark, Elizabeth; Sellam, Thibault (2023). "Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text". Journal of Artificial Intelligence Research 77: 103–166. https://doi.org/10.1613/jair.1.13715.
  6. Alva-Manchego, Fernando; Scarton, Carolina; Specia, Lucia (2021). "The (Un)Suitability of Automatic Evaluation Metrics for Text Simplification". Computational Linguistics 47 (4): 861–889. https://doi.org/10.1162/coli_a_00418.
  7. Alva-Manchego, Fernando; Scarton, Carolina; Specia, Lucia (2020). "Data-Driven Sentence Simplification: Survey and Benchmark". Computational Linguistics 46 (1): 135–187. https://doi.org/10.1162/coli_a_00370.
  8. Laban, Philippe; Schnabel, Tobias; Bennett, Paul N.; Hearst, Marti A. (2022). "SummaC: Re-Visiting NLI-Based Models for Inconsistency Detection in Summarization". Transactions of the Association for Computational Linguistics 10: 163–177. https://doi.org/10.1162/tacl_a_00453.