
Grants:Programs/Wikimedia Community Fund/Rapid Fund/Wikidata Reference Validator:A Tool to Check and Flag Dead External Sources (ID: 23538111)

From Meta, a Wikimedia project coordination wiki
status: Funded
Wikidata Reference Validator: A Tool to Check and Flag Dead External Sources
request or grant ID: G-RF-2508-19770
proposed start date: 2025-09-30
proposed end date: 2026-02-10
requested budget (local currency): 7,666,850 NGN
requested budget (USD): 5,000 USD
amount funded (USD): 5,000
amount funded (local currency): 7,339,850 NGN
grant type: Individual
funding region: SSA
decision fiscal year: 2025-26
applicant: JosefAnthony
organization (if applicable): N/A
Review Final Report

Applicant details

Main Wikimedia username. (required)

JosefAnthony

Organization

N/A

If you are a group or organization leader, board member, president, executive director, or staff member at any Wikimedia group, affiliate, or Wikimedia Foundation, you are required to self-identify and present all roles. (required)

N/A

Describe all relevant roles with the name of the group or organization and description of the role. (required)

Main proposal

1. State the title of your proposal. This will also be the Meta-Wiki page title.

Wikidata Reference Validator: A Tool to Check and Flag Dead External Sources

2. and 3. Proposed start and end dates for the proposal.

2025-09-30 - 2026-02-10

4. What is your tech project about, and how do you plan to build the product?

Include the following points in your answer:

  • Project goal and problem you solve
  • Product strategy or project roadmap
  • Technical approach (infrastructure, tech stack, key tools and services)
  • Integrations or dependencies (if any)

Project goal and problem you solve

This project will develop the Wikidata Reference Validator, a tool to automatically check whether references on Wikidata are still accessible, valid, and properly formatted. A significant challenge on Wikidata is that many external references become outdated or broken over time, which reduces trust in the knowledge base. By identifying dead or problematic links, the tool will help editors improve data quality, reliability, and sustainability of Wikidata content.

Product strategy / roadmap

  • Phase 1 (MVP): Build a backend service that fetches references from a given Wikidata item (e.g., Q42) and tests whether the links are live or broken.
  • Phase 2: Create a simple web interface where users can input item IDs or lists of items and view flagged references.
  • Phase 3: Add advanced features such as filtering references by type of error, batch validation across multiple items, and downloadable reports.
  • Phase 4: Collect community feedback, refine the interface, and explore integration with Wikidata workflows (e.g., as a Toolforge service or possible gadget).

Technical approach

  • Backend: Python (Flask or FastAPI) to handle API requests and link validation.
  • Frontend: Lightweight React interface for usability, deployed on Wikimedia Toolforge.
  • Infrastructure: Hosted on Wikimedia Toolforge (Cloud VPS), connecting to the Wikidata API and performing reference validation.
  • Tools and services: Wikidata API for item data, Python libraries for link checking, and the Internet Archive Wayback Machine API to suggest archived versions of dead links.
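
As a rough illustration of this approach, the Phase 1 logic (fetch an item's references, then test each link) might look like the sketch below. It uses only the Python standard library for portability; the actual tool would likely use the <code>requests</code> library with Flask/FastAPI as described above, and the function names are illustrative rather than part of the proposal:

```python
import json
import urllib.error
import urllib.parse
import urllib.request

API = "https://www.wikidata.org/w/api.php"

def extract_reference_urls(claims):
    """Pull reference URLs (P854 snaks) out of a wbgetclaims 'claims' dict."""
    urls = []
    for statements in claims.values():
        for statement in statements:
            for ref in statement.get("references", []):
                for snak in ref.get("snaks", {}).get("P854", []):
                    urls.append(snak["datavalue"]["value"])
    return urls

def fetch_reference_urls(qid):
    """Fetch an item's statements (e.g. for Q42) and return its reference URLs."""
    query = urllib.parse.urlencode(
        {"action": "wbgetclaims", "entity": qid, "format": "json"})
    with urllib.request.urlopen(f"{API}?{query}", timeout=10) as resp:
        data = json.load(resp)
    return extract_reference_urls(data.get("claims", {}))

def link_status(url):
    """HEAD-check a URL: return its HTTP status code, or 'dead' on failure."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code  # e.g. 404 for a broken but responding server
    except (urllib.error.URLError, ValueError):
        return "dead"  # DNS failure, timeout, malformed URL, etc.
```

A production version would also need redirect handling, retries, and polite rate limiting, but the shape of the validation step is the same.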

Integrations / dependencies

  • Wikidata Query Service and API (for retrieving references).
  • Wikimedia Toolforge (for hosting the tool).
  • Internet Archive’s Wayback Machine API (for checking and retrieving archived versions of dead links).
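
Since the Wayback Machine integration is central to suggesting replacements for dead links, here is a minimal sketch of querying the Internet Archive's availability endpoint (function names are illustrative, not part of the proposal):

```python
import json
import urllib.parse
import urllib.request

def pick_snapshot(data):
    """Extract the closest available snapshot URL from an availability response."""
    snap = data.get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap and snap.get("available") else None

def closest_snapshot(url):
    """Ask the Wayback Machine availability API for an archived copy of `url`."""
    query = urllib.parse.urlencode({"url": url})
    endpoint = f"https://archive.org/wayback/available?{query}"
    with urllib.request.urlopen(endpoint, timeout=10) as resp:
        return pick_snapshot(json.load(resp))
```

When a reference is flagged as dead, the tool could call <code>closest_snapshot</code> and offer the archived URL to the editor as a suggested replacement.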
5. What is the expected impact of your project, and how will you measure success?

Include the following points in your answer:

  • Milestones and progress tracking
  • Project impact and success metrics

Project impact

The Wikidata Reference Validator will directly improve the quality and reliability of Wikidata by helping editors identify and fix broken or unreliable references. This supports the Wikimedia movement’s mission of providing verifiable, trustworthy, and open knowledge. The tool will save editors time, encourage systematic cleanup of references, and strengthen trust in Wikidata data reused across Wikipedia, libraries, and external platforms.

Milestones and progress tracking

  • Month 1: Backend prototype (fetch Wikidata references + check link status).
  • Month 2: Web interface deployed on Toolforge for single-item validation.
  • Month 3: Batch validation, filtering options, and report export features added.
  • Month 4: Community testing and feedback, refinement, and documentation.

Progress will be tracked through GitHub and GitLab commits, published demo versions on Toolforge, and community feedback sessions on-wiki and via mailing lists.

Success metrics

  • Technical success:
    • Tool can check references for at least 1,000 Wikidata items reliably.
    • Integration with the Wikibase API.
  • Community use:
    • At least 50 Wikidata editors test the tool during the pilot phase.
    • Positive feedback from community channels (measured via talk pages, surveys, or Phabricator tickets).
  • Content improvement:
    • At least 5,000 broken or outdated references flagged within the first six months.
    • Demonstrated use of tool outputs in improving Wikidata items (tracked via edit summaries or reports).

6. Who is your target audience, and how have you confirmed there is demand for this project? How did you engage with the Wikimedia community?

Include the following points in your answer:

  • Project demand and target audience description
  • Links to interaction(s) with Wikimedia community
  • Evidence from community consultation such as the [Community Wishlist]

Project demand and target audience

The primary target audience for the Wikidata Reference Validator is Wikidata editors and patrollers who maintain data quality. These contributors often face challenges with references that are dead, inaccessible, or poorly formatted. Secondary audiences include Wikimedians working on Wikipedia and sister projects who rely on Wikidata content, as well as external reusers of Wikidata data (e.g., libraries, researchers, GLAM institutions) who need reliable references for their applications.

The demand for this project is clear:

  • Broken references are a well-known issue across Wikimedia projects. On Wikipedia, there are long-standing maintenance categories for “dead links” and templates like [dead link]. However, Wikidata currently lacks a dedicated tool to systematically check and flag invalid references.
  • This gap has been repeatedly noted in community discussions around data quality and verifiability, making it a high-impact area for improvement.

Community engagement and evidence

  • The Community Wishlist Survey has consistently shown demand for tools that improve references and citations. For example:
    • Wishlist items requesting better citation tools, dead-link detection, and integration with the Internet Archive show that editors want support for maintaining reliable references.
  • I have engaged with the Wikimedia community through:
    • Discussions on Wikidata Project Chat and relevant talk pages about the importance of reference checking.
    • Telegram and mailing list conversations (e.g., in Wikidata and GLAM-Wiki groups), where broken references and verifiability are recurring pain points.
    • My own experience as a Wikimedian with contributions to Wikidata and MediaWiki, where I have seen firsthand how editors spend significant time manually checking references.

This feedback and evidence confirm that there is a clear demand for a reference validation tool. The Wikidata Reference Validator directly responds to this need by filling a current gap in the ecosystem, empowering contributors to work more efficiently and strengthening Wikidata’s reliability.

7. How will your team predict and manage potential user security and privacy risks, and what risks do you currently see?

Include the following points in your answer:

  • The level of in-house or consulted security and privacy expertise you will have available to you during delivery of this project
  • How your development, testing, and deployment processes mitigate the introduction of unnecessary security or privacy risks

Available expertise

The project will be developed with awareness of Wikimedia’s existing security and privacy standards. While my team does not have a dedicated security engineer in-house, we will:

  • Follow best practices from the Wikimedia Developer Guidelines and Toolforge documentation.
  • Consult with Wikimedia technical mentors and community developers where needed, especially regarding secure handling of API requests and deployment.
  • Make use of existing open-source libraries and services with well-established security reputations (e.g., Flask/Django or Node.js with security middleware).

Identified risks and mitigations

  1. Risk: Exposure of user data (e.g., usernames, IPs).
    • Mitigation: The tool will not store or process personal data. All requests are limited to publicly available Wikidata content. Any logs will exclude personal identifiers.
  2. Risk: Insecure API interactions.
    • Mitigation:
      • Use only the official Wikidata/Wikimedia APIs over HTTPS.
      • Sanitize all inputs and outputs to prevent injection vulnerabilities.
      • Apply rate limiting and error handling to avoid abuse or unexpected load.
  3. Risk: Dependency vulnerabilities.
    • Mitigation:
      • Perform regular dependency checks (e.g. <code>pip-audit</code> for Python).
      • Keep the tool updated with security patches.
  4. Risk: Toolforge deployment issues.
    • Mitigation:
      • Follow Toolforge’s deployment best practices, including containerized environments where possible.
      • Restrict file permissions and ensure least-privilege access.
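
As an illustration of the input-sanitization and rate-limiting mitigations above, a minimal sketch follows; the names, regex, and threshold are my own assumptions, not part of the proposal:

```python
import re
import time

# A well-formed Wikidata item ID: "Q" followed by digits, no leading zero.
QID_RE = re.compile(r"^Q[1-9]\d*$")

def validate_qid(raw):
    """Accept only well-formed item IDs, rejecting anything injectable."""
    candidate = raw.strip()
    if not QID_RE.fullmatch(candidate):
        raise ValueError(f"not a valid Wikidata item ID: {raw!r}")
    return candidate

class RateLimiter:
    """Minimal limiter: at most one outgoing request per `interval` seconds."""
    def __init__(self, interval=1.0):
        self.interval = interval
        self._last = 0.0

    def wait(self):
        # Sleep just long enough to respect the interval since the last call.
        delay = self.interval - (time.monotonic() - self._last)
        if delay > 0:
            time.sleep(delay)
        self._last = time.monotonic()
```

Every user-supplied item ID would pass through <code>validate_qid</code> before reaching any API call, and all outgoing link checks would go through a shared <code>RateLimiter</code>.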

Development & testing safeguards

  • Code reviews: All code will be reviewed before merging to ensure secure practices.
  • Automated testing: Unit tests will include validation for handling malformed inputs or unexpected data.
  • Staging environment: The tool will be tested in a sandbox environment on Toolforge before public release.
  • Community testing: Invite early Wikidata editors to test and provide feedback, catching potential risks before wide release.

By focusing on publicly available data only, and by adhering strictly to Wikimedia’s technical policies, this project minimizes privacy exposure while maintaining a strong commitment to security.

8. Who is on your team, and what is your experience?

Include the following points in your answer:

  • Your experience as a developer, relevant past projects
  • Wikimedia SUL (developer), Gerrit, Github, Gitlab or other relevant public account handles
  • Other team members, their roles and expertise
Team Lead / Developer – Josef Anthony
[1]
I am an active contributor in the Wikimedia technical and Wikidata communities. My experience combines both software development and Wikimedia project work.

Developer experience

  • Node.js and Python developer with backend + frontend experience.
  • Familiar with Flask, Angular, React.
  • Worked with APIs, Postman, and databases.

Wikimedia / Free Knowledge projects

  • MediaWiki & Phabricator contributions: [2]
  • GLAM-Wiki projects:
    • Mapping Nigeria’s Broadcast Heritage (2024) [3]
    • Nigeria's Indigenous Musical Instruments on Wikidata (2025): https://outreachdashboard.wmflabs.org/courses/Igbo_Wikimedia_User_Group/Nigeria's_indigenous_musical_instruments_on_Wikidata
    • Revitalizing UK Diverse Historical Figures (2025) [4][5]

  • Wikidata Weekly Summary team [6]
  • Wikifunctions Functioneer [7]

Technical accounts

  • Wikimedia SUL: josefanthony
  • Gerrit: josefanthony (example contribution: [8])
  • GitHub: [9]
  • GitLab, Paulina project: [10][11]

Team composition

Currently, I am the primary developer and maintainer. I will handle development, testing, and deployment, while engaging with the Wikimedia developer community for review and mentorship.

I am also a Wikimedia Hackathon 2025 scholarship recipient and a Wikimania 2025 scholarship recipient — recognitions of my active role in the Wikimedia technical community.

9. How will the project be maintained long-term?

Include the long-term maintenance plan with maintainer(s) in your answer. If you expect the long-term maintenance to incur expenses, please list those and the plan for long-term expense coverage.

Long-term Maintenance Plan

The project will be maintained under my leadership as the primary maintainer (Wikimedia SUL: josefanthony). My experience in developing and deploying Wikimedia-related tools ensures that I can continue to fix bugs, update dependencies, and adapt the tool to evolving Wikidata community needs.

Maintenance Approach

  • The project codebase will be hosted openly on GitHub and mirrored to Wikimedia’s GitLab for transparency and collaborative contributions.
  • Community members and contributors will be encouraged to submit issues and pull requests.
  • Bug reports and feature requests will be tracked on Wikimedia Phabricator for long-term visibility.
  • Regular updates will be aligned with Wikidata/Wikibase software changes and community workflows.

Sustainability & Handover

  • Documentation (developer setup, deployment guide, and user instructions) will be provided to make onboarding new contributors easier.
  • If I am unable to continue, the tool will be structured so another Wikimedia developer can take over with minimal effort.
  • The project will follow Toolforge/cloud best practices to ensure smooth continuity.

Expense Planning

  • As the tool will be hosted on Wikimedia Toolforge, there will be no direct hosting expenses for long-term maintenance.
  • The main expenses could arise from:
    • Occasional domain registration (if a vanity URL is desired).
    • Developer time for ongoing improvements.
  • These costs will either be covered by myself, through small Wikimedia grants, or via community-driven support (e.g., mentorship or microgrants).

Community Integration

  • Since the project is designed to serve the Wikidata community, I will actively seek ongoing feedback from editors via the Wikidata Project Chat, Telegram groups, and mailing lists.
  • This ensures the tool remains relevant, maintained, and continuously improved as part of Wikimedia’s collaborative ecosystem.
10. Under what license will your code be released, and how will you ensure the product is well documented?

Include the following points in your answer:

  • Code license and compatibility with Wikimedia projects
  • Documentation plan

License & Documentation

The project code will be released under the MIT License, which is permissive, widely adopted, and fully compatible with Wikimedia projects.

To ensure long-term usability, I will maintain:

  • A comprehensive README with setup, deployment, and usage instructions.
  • Inline code documentation and clear commit messages for developers.
  • A user guide hosted on-wiki (Meta-Wiki or Wikitech) to help Wikimedians adopt and test the tool.
  • Phabricator tasks for tracking bugs, feature requests, and community feedback.

This approach guarantees that the project is both technically open and easily maintainable by the wider community.

11. Will your project depend on or contribute to third-party tools or services?

Dependencies & Contributions

The project will primarily run on Wikimedia’s Toolforge infrastructure, ensuring alignment with Wikimedia standards. It will depend on:

  • Wikidata Query Service (SPARQL endpoint) for real-time data validation.
  • Wikimedia APIs (Wikidata, MediaWiki) for entity lookups and updates.
  • Standard open-source Python libraries (e.g., <code>requests</code>, <code>pandas</code>, <code>flask</code>) for backend logic and interface.
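
One way the Wikidata Query Service dependency could be exercised is the sketch below, which retrieves the reference URLs (P854) attached to one item's statements via SPARQL. The query shape, User-Agent string, and helper names are illustrative assumptions:

```python
import json
import urllib.parse
import urllib.request

WDQS = "https://query.wikidata.org/sparql"

# Reference URLs (P854) attached to statements of a single item (here Q42).
QUERY = """
SELECT ?statement ?refURL WHERE {
  wd:Q42 ?p ?statement .
  ?statement prov:wasDerivedFrom ?refNode .
  ?refNode pr:P854 ?refURL .
}
LIMIT 50
"""

def binding_values(bindings, var):
    """Flatten SPARQL JSON result bindings to plain values for one variable."""
    return [b[var]["value"] for b in bindings if var in b]

def run_query(query):
    """POST-free GET against WDQS, returning the raw result bindings."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    req = urllib.request.Request(
        f"{WDQS}?{params}",
        headers={"User-Agent": "wikidata-reference-validator-sketch/0.1"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)["results"]["bindings"]
```

For batch validation across many items, a SPARQL query like this is far cheaper than fetching each item individually through the action API.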

Where possible, improvements (e.g., bug fixes, query optimizations) will be contributed back to open-source libraries or documented on Wikimedia technical channels, so the wider community benefits.

12. Is there anything else you’d like to share about your project? (optional)

This project is designed with the Wikimedia community at its core. Beyond the technical outcomes, it will:

  • Strengthen data quality on Wikidata by making validation more accessible.
  • Provide a sustainable open-source tool hosted on Toolforge, with clear documentation for contributors.
  • Leverage my active participation in the Wikimedia technical ecosystem (MediaWiki contributions, Wikidata Weekly Summary, Wikifunctions Functioneer role) to ensure strong community alignment.

I also bring recognition as a 2025 Wikimedia Hackathon and Wikimania 2025 scholarship recipient, which reflects both the community’s trust in my work and the opportunity to showcase this project at international Wikimedia events.

Budget

13. Upload your budget for this proposal or indicate the link to it. (required)
14. and 15. What is the amount you are requesting for this proposal? Please provide the amount in your local currency. (required)

7,666,850 NGN

16. Convert the amount requested into USD using the Oanda converter. This is done only to help you assess the USD equivalent of the requested amount. Your request should be between 500 - 5,000 USD.

5000 USD

We/I have read the Application Privacy Statement, WMF Friendly Space Policy and Universal Code of Conduct.

Yes

Endorsements and Feedback


Please add endorsements and feedback to the grant discussion page only. Endorsements added here will be removed automatically.

Community members are invited to share meaningful feedback on the proposal and include reasons why they endorse the proposal. Consider the following:

  • Stating why the proposal is important for the communities involved and why they think the strategies chosen will achieve the results that are expected.
  • Highlighting any aspects they think are particularly well developed: for instance, the strategies and activities proposed, the levels of community engagement, outreach to underrepresented groups, addressing knowledge gaps, partnerships, the overall budget and learning and evaluation section of the proposal, etc.
  • Highlighting if the proposal focuses on any interesting research, learning or innovation, etc. Also if it builds on learning from past proposals developed by the individual or organization, or other Wikimedia communities.
  • Analyzing if the proposal is going to contribute in any way to important developments around specific Wikimedia projects or Movement Strategy.
  • Analysing if the proposal is coherent in terms of the objectives, strategies, budget, and expected results (metrics).

Endorse


This is an automatically generated Meta-Wiki page. The page was copied from Fluxx, the web service of Wikimedia Foundation Funds, where the user has submitted their application. Please do not make any changes to this page because all changes will be removed after the next update. Use the discussion page for your feedback. The page was created by CR-FluxxBot.