Grants:Project/MSIG/Wiki Term Base/Report
Accepted

Implementation Plan
Throughout this project, we planned the development of an innovative tool for terminology management and standardization on Arabic Wikipedia. The tool aims to significantly reduce the time needed for article translation and to increase content consistency and readability. We identified three phases for the project, of which phase one has already been successfully implemented.
Phase 1: Term Base Tool Prototype
June 2024 - July 2025 (completed)
We successfully developed a functional prototype of a term management tool. Our tool has a database of 950K terms in Arabic, English and French. This phase involved the following steps:
- Initial user research and needs assessment: We conducted extended qualitative interviews to confirm pain points involved in translation and refine the innovative technical solutions we're proposing.
- Data collection: We prepared a list of the available dictionaries that we aim to include in our term base. We used an existing collection of over 1,000 dictionaries currently used by Arabic Wikipedians.
- Dictionary digitization: We digitized dozens of dictionaries, enlisted proofreaders to double-check the OCR-generated text, and reviewers to ensure page number accuracy.
- Database design: We designed a SQL relational database that can retrieve and aggregate terms based on morphological characteristics that we defined in advance.
- Testing: We ran numerous queries to validate our data and to check the robustness of our database design and term aggregation process.
- Front-end design: Our developer put substantial effort into the tool's front-end design, which includes a standalone Toolforge website as well as a Wikipedia gadget that is directly accessible to editors from within the encyclopedia.
- Release and feedback analysis: Since release in March 2025, we have been extensively popularizing the tool, collecting feedback, analyzing results, and fixing bugs and small issues while deferring some to future phases.
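The relational design described above can be sketched as follows. This is a minimal illustration using SQLite; the table and column names (`dictionary`, `term`, `root`, etc.) are assumptions for demonstration, not the project's actual schema.

```python
import sqlite3

# Illustrative schema only; the real WikiTermBase schema is not shown here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dictionary (id INTEGER PRIMARY KEY, title TEXT, year INTEGER);
CREATE TABLE term (
    id INTEGER PRIMARY KEY,
    lemma TEXT NOT NULL,        -- surface form of the term
    lang  TEXT NOT NULL,        -- 'ar', 'en', or 'fr'
    root  TEXT,                 -- morphological root, used for aggregation
    dictionary_id INTEGER REFERENCES dictionary(id),
    page INTEGER                -- page number in the source dictionary
);
""")
conn.executemany(
    "INSERT INTO dictionary (id, title, year) VALUES (?, ?, ?)",
    [(1, "Technical Dictionary A", 1999), (2, "Technical Dictionary B", 2010)],
)
conn.executemany(
    "INSERT INTO term (lemma, lang, root, dictionary_id, page) VALUES (?, ?, ?, ?, ?)",
    [
        ("computer", "en", None, 1, 88),
        ("حاسوب", "ar", "ح-س-ب", 1, 88),
        ("حاسب", "ar", "ح-س-ب", 2, 41),
    ],
)

# Aggregate Arabic candidate terms that share a morphological root.
rows = conn.execute(
    """SELECT root, COUNT(*) AS variants, GROUP_CONCAT(lemma, ' / ')
       FROM term WHERE lang = 'ar' GROUP BY root"""
).fetchall()
for root, count, lemmas in rows:
    print(root, count, lemmas)
```

Grouping by a precomputed morphological key is what lets a relational design aggregate variant spellings of the same term across dictionaries.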
Phase 2: Default Gadget with Syntactic Parsing
July 2025 - June 2026
We concluded the first phase of our project by completing our planned goal: releasing a functional prototype of a term management tool, which has already been widely adopted by the Arabic Wikipedia community. After detailed user research and team reflection, we decided to focus on three objectives for the second phase of our project (for which we are currently seeking further grant funding):
- Making WikiTermBase a default gadget: We aim to make WikiTermBase as widely adopted as possible, with the ultimate objective of becoming a default Arabic Wikipedia gadget. Reaching this goal requires meticulous testing and technical troubleshooting, which calls for extensive software development resources.
- Expand dictionary data: Our database currently includes 50 dictionaries with nearly 1 million terms, but some domains (e.g. AI) remain underrepresented even in this sizable collection. We aim to digitize more dictionaries and find additional ways to collect data.
- Morphological analysis and parsing: A morphological parser has been a key element in our planned architecture from the start. Arabic is morphologically rich, and standardizing translation terminology requires morphological parsing to determine related words and context. We are working with Arabic linguists to determine the best parsers.
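To illustrate why morphological handling matters for term lookup, here is a deliberately naive sketch that strips common Arabic clitic and article prefixes so that inflected forms match the same base entry. The prefix list and threshold are illustrative assumptions; real morphological parsers (the kind evaluated with the project's linguist advisors) are far more sophisticated.

```python
# Naive illustration only: real Arabic morphological analysis handles
# suffixes, templatic roots, and ambiguity, not just prefix stripping.
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "و", "ب", "ل")

def strip_prefixes(word: str) -> str:
    """Remove one common clitic/article prefix so inflected forms of a
    term can be matched against the same base dictionary entry."""
    for p in PREFIXES:
        # Keep at least 3 letters so we don't over-strip short words.
        if word.startswith(p) and len(word) - len(p) >= 3:
            return word[len(p):]
    return word

# 'الحاسوب' (the computer) and 'بالحاسوب' (with the computer)
# both normalize to the base form 'حاسوب' (computer).
print(strip_prefixes("الحاسوب"))   # حاسوب
print(strip_prefixes("بالحاسوب"))  # حاسوب
```

Even this toy version shows the payoff: without normalization, each inflected surface form would need its own database entry.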
Phase 3: Ontology-based, In-context Retrieval
July 2026 - mid 2027
In the third and final phase of development, we want to create a context-aware, ontology-driven tool that not only helps standardize term translations, but also represents a map of the rich semantic relationships across the language. Our initial testing with large language models showed that a SQL architecture alone does not enable semantic matching of terms beyond 75–85% accuracy, which is far from sufficient for production-level applications. Therefore, the final product of the project would be an encompassing Arabic language ontology with a graph architecture and graph neural network integration for advanced retrieval.
- Graph database development: We will restructure our relational SQL database into a graph database that reflects the rich semantic and morphological relationships between Arabic language terms. This would create a map of interconnections reflecting how entirely different words may be used in semantically similar contexts.
- Graph neural network training: We will train a graph neural network (GNN) optimized to retrieve the most accurate terms for a given context. This architecture will enable us to match words with identical meanings regardless of their morphology.
- In-context highlighting and translation: Instead of searching a database, Wikipedia editors would be able to highlight words and receive a dynamic translation based on the word's use in context. This context-aware approach would make translation considerably more efficient.
- Research publication: The results of our work could be an invaluable addition to the research body on ontologies and GNNs, a growing, exploratory area that uses language models with graph databases to optimize retrieval.
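The graph retrieval idea above can be sketched with a toy term graph. Nodes are terms and edges are labeled relations (synonym, translation); a traversal within a few hops retrieves morphologically unrelated synonyms that a purely lexical SQL lookup would miss. The edge data and relation labels are invented for illustration; the planned system would layer GNN-based scoring on top of such a structure.

```python
from collections import deque

# Toy synonym/translation graph standing in for the planned graph database.
EDGES = {
    "computer": [("translation", "حاسوب")],
    "حاسوب": [("synonym", "حاسب"), ("translation", "computer")],
    "حاسب": [("synonym", "حاسوب")],
}

def related_terms(start: str, max_hops: int = 2) -> set[str]:
    """Collect every term reachable within max_hops relation edges,
    so morphologically unrelated synonyms are still retrieved."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # don't expand past the hop limit
        for _relation, neighbor in EDGES.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return seen - {start}

# Two hops from "computer" reach both Arabic variants, even though
# they share no characters with the English query.
print(related_terms("computer"))
```

A graph database query language would express this traversal declaratively, but the breadth-first expansion is the underlying operation in either case.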
Tool Results
Original Objectives
- Develop a Wikipedia term management tool: Develop an extension or tool to help Arabic Wikipedia editors access lexicographic data while editing.
- Empower Arabic Wikipedians to standardize multilingual terms: Allow editors to choose more standard and consistent translations of foreign terms, especially in technical articles.
- Save translation time: Reduce the time needed for translating Wikipedia content by speeding up the term research and decision process.
- Boost Arabic Wikipedia's readability: By making terminology use more consistent, reliable, and universal across articles.
Deliverables and results

- Developed a full Wikipedia gadget for term search and standardization: Released a fully functional gadget for Arabic Wikipedia editors in March 2025, providing access to the term base from within Wikipedia.
- Collected a dictionary and lexicographic database: Built an open source database with about 50 dictionaries and over 900K terms in Arabic, English and French.
- 3% Adoption rate by Arabic Wikipedia users: Over 100 people, or 3% of all active Arabic Wikipedia users, are currently using the tool.
- Strongly positive feedback: Surveyed 20% of the tool's users as of April 2025 and received 90–95% positive ratings for usability, user experience, and impact on translation quality and time.
- Hundreds of hours saved: Users self-estimated an average 15% reduction in the time needed to translate articles from foreign languages.
Stakeholders Engaged
- Arabic Wikipedia users: We spoke with Arabic Wikipedia editors and translators at every step of the project. During ideation, we held extended qualitative interviews to confirm pain points in translation and refine the technical solutions we proposed. During development, we connected with the community regularly for updates. After our release in March 2025, we received 50 responses in a lively discussion on the Village Pump, where users organically tested the tool, praised its impact, identified bugs, and even enthusiastically worked with us to get the tool officially added as a Wikipedia gadget and make the code open source.
- Wikimedia movement: We presented an early version of our work at WikiConference North America in Indianapolis in October 2024, inspiring communities halfway across the world about the importance of integrating advanced technology into wiki translation tools.
- NLP and linguistics experts: We reached out to the Columbia NLP group, a community with a strong historical interest in the Arabic language, for expert guidance on our work. We also engaged the founders of the Arabic Ontology project (at Birzeit University), linguists working on the Doha Historical Dictionary, and members of the reputable Damascus Language Academy as advisors throughout the project. These advisors helped us identify state-of-the-art tools for Arabic NLP, including syntactic and morphological analysis tools.
- Wikimedia research community: Our work was accepted and presented at the Wiki Workshop event in May 2025, where we shared it with Wikimedia researchers. The paper is now available online as a way to communicate our methodology and findings to the academic community, and to inspire future work that builds on what we accomplished.
- Wikimedia Foundation teams: We had conversations with various Wikimedia Foundation members across ideation and development. Language Team members were specifically engaged at various stages, including Niklas and Amir. Both provided valuable advice about the project's relevance to the Content Translation tool as well as the importance of linking it to Wikidata, and potentially the Lexicographic Data extension in the future.
Outcomes
Please respond to the following questions:
Where have you published your draft plan? Share the link to it here:
- Created an Arabic Wikipedia page detailing our tool features and outcomes
- Published a research paper on the WikiTermBase project (doi:10.48550/arXiv.2505.20369)
What Movement Strategy initiative is this draft plan supporting?
- 43. Continuous experimentation, technology, and partnerships for content, formats, and devices: This project experimented with applying language technology to resolve an open problem in standardizing vocabulary. An agile design methodology was used to support this experimentation goal, leading to the development of a new tool that contributed towards recommendation 9: Innovate in Free Knowledge.
What activities have you completed to produce this draft plan?
- Tool launch: Launched an innovative tool prototype that has been adopted by 100+ Arabic Wikipedia users (nearly 3% of the entire active user base).
- Outcomes survey – 15% reduction in translation time: Conducted post-launch research revealing a 15% self-reported reduction in translation time thanks to the WikiTermBase tool, and 90–95% positive reviews of the tool's various aspects.
- Published research: Presented a research abstract at WikiWorkshop 2025 and published a paper on the topic.
- Dictionary digitization: Digitized and added nearly 50 dictionaries along with detailed bibliographic data.
- NLP expert partners: Partnered with NLP experts to ensure sound methodology in database building, term semantic mapping, syntactic analysis, etc.
- User research and product requirements gathering: Spent one year conducting user research, putting together product requirements, collecting data, and developing the front-end tool.
In which community channels have you announced your draft plan?
- Arabic Wikipedia's Technology Village Pump
- Arabic Wikipedia's Facebook community
- WikiWorkshop 2025 abstract presentation
- Published a research paper on the WikiTermBase project (doi:10.48550/arXiv.2505.20369)
Finances
Grant funds spent
Please describe how much grant money you spent for approved expenses, and tell us what you spent it on.
Below is the table for the funding expenditure:
| Item | Original Budget (USD) | Actual Budget (USD) | Additional Comment(s) |
|---|---|---|---|
| Software development | 4,000 | 5,000 | Worked 50+ hours above the originally targeted 150 hours |
| OCR proofreading | 3,000 | 2,000 | |
| Research | 1,500 | | |
| Facilitation | 1,500 | | |
| Documentation | 1,250 | | Includes $100 in graphic design |
| NLP consultants | 1,000 | 400 | Human data labeling |
| Dictionaries | 500 | 450 | |
| Translation | 250 | 200 | |
| Online tools | 250 | 0 | |
| Total | 13,000 | 10,000 | The approved budget was $3K short of the requested amount, as communicated at the beginning. The exact cuts were not itemized in advance, since we needed flexibility in deciding where to reduce costs. |
Remaining funds
Do you have any remaining grant funds?
- No remaining funds
Anything else
Anything else you want to share about your project?