Grants:Programs/Wikimedia Research Fund/Adapting Wikidata to support clinical practice using Data Science, Semantic Web and Machine Learning

Adapting Wikidata to support clinical practice using Data Science, Semantic Web and Machine Learning
start date: 1 August 2022
end date: 31 July 2023
budget (USD): 49,867.21 USD
applicant(s): Csisc



Applicant's Wikimedia username. If one is not provided, then the applicant's name will be provided for community review.


Project title

Adapting Wikidata to support clinical practice using Data Science, Semantic Web and Machine Learning

Entity Receiving Funds

Provide the name of the individual or organization that would receive the funds.

Data Engineering and Semantics Research Unit, University of Sfax, Tunisia (DES-Unit)

Research proposal


Description of the proposed project, including aims and approach. Be sure to clearly state the problem, why it is important, why previous approaches (if any) have been insufficient, and your methods to address it.

Semantic resources have proven effective at driving computer applications in a variety of domains, particularly healthcare[1]. They provide detailed knowledge about various aspects of medicine, including diseases, drugs, genes, and proteins[2], and can consequently be used to retrieve, process, and represent clinical information, including electronic health records[3] and scholarly publications[4]. These resources enable biomedical relation extraction[4], biomedical relation classification[5], biomedical data validation[6], biomedical data augmentation[7], and biomedical decision support[8]. However, the implementation of such biomedical resources in the Global South, particularly Africa, is still limited by the lack of consistent funding and human capacity[9]. Here, open knowledge graphs, particularly Wikidata, can help reduce the financial and technical burden of developing digital health in developing countries[2]. As a free, collaborative, large-scale multilingual knowledge graph, Wikidata has become an established database able to represent multiple kinds of clinical information, particularly in the context of COVID-19[10]. Its representation in the RDF format enables the flexible enrichment of biomedical information using computer programs and crowdsourcing, the intrinsic and extrinsic validation of clinical knowledge, and the extraction of features from medical data for decision making and for human and machine learning[10]. Yet Wikidata still lacks a full representation of several facets of biomedical informatics[2], and its data suffers from critical inconsistencies[11]. For instance, Wikidata items about genes, proteins, and drugs have an average of more than 10 statements per item, while anatomical structures have an average of fewer than 4.6 statements per item[2].
Furthermore, more than 90% of the Wikidata statements about human genes and proteins are supported by references, whereas fewer than 50% of the statements about anatomical structures have references[2]. Moreover, the linguistic representation of biomedical entities in Wikidata is dominated by German and English, while other natural languages are only partially or rarely covered[10].
In this research project, we propose to:
  • Turn Wikidata into a large-scale biomedical semantic resource that substantially covers most aspects of clinical practice: We will develop bots and tools to mass-import clinical information from external resources already aligned with Wikidata, and create machine learning algorithms to extract clinical information from the full texts and bibliographic information of scholarly publications indexed in PubMed, a large-scale bibliographic database hosted by the National Center for Biotechnology Information and the National Institutes of Health. This involves enriching the facets of biomedical knowledge already represented in Wikidata and supporting new kinds of clinical information that Wikidata has not covered in recent years.
  • Validate the biomedical information freely available in Wikidata: Validation relies on comparison with external resources through semantic alignments between Wikidata items and external biomedical resources, and on intrinsic validation that uses SPARQL to identify mismatches between statements, together with shape-based methods such as ShEx and SHACL and property constraints to verify the formatting and data modelling of the clinical knowledge in Wikidata. These methods are coupled with the development of Wikidata Game-like tools for the human validation of medical information in Wikidata.
  • Promote the biomedical use of Wikidata in the Global South: We will hold online capacity-building events about Wikidata and its tools for the biomedical community in Africa, and publish surveys and position papers on biomedical applications of Wikidata in highly regarded research journals. The integration of Wikidata into Fast Healthcare Interoperability Resources (FHIR), a semantic system for driving Electronic Health Records, is also envisioned to enable the use of human-validated subsets of Wikidata information for clinical reasoning in the Global South.
This project only uses public data and text, and never touches private, restricted, or personal data or health information.
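
As a minimal sketch of the intrinsic validation described above, the following Python builds a SPARQL query that lists statements lacking an expected inverse statement, which is one kind of mismatch between statements. The function names, the choice of inverse-statement checking, and the property IDs in the usage note are illustrative assumptions, not the project's actual method. The result parser assumes the standard SPARQL 1.1 JSON results layout.

```python
import re
from string import Template

# Assumption: the query runs on an endpoint where the `wdt:` prefix is
# predefined, as it is on the Wikidata Query Service.
_QUERY = Template(
    "SELECT ?subject ?object WHERE {\n"
    "  ?subject wdt:$forward ?object .\n"
    "  FILTER NOT EXISTS { ?object wdt:$inverse ?subject . }\n"
    "}\n"
    "LIMIT $limit\n"
)

# Wikidata property IDs look like "P" followed by a positive integer.
_PID = re.compile(r"P[1-9][0-9]*")


def build_inverse_mismatch_query(forward, inverse, limit=100):
    """Return SPARQL listing forward statements with no matching inverse."""
    for pid in (forward, inverse):
        if not _PID.fullmatch(pid):
            raise ValueError(f"not a Wikidata property ID: {pid!r}")
    return _QUERY.substitute(forward=forward, inverse=inverse, limit=int(limit))


def extract_pairs(results):
    """Turn SPARQL JSON results into (subject, object) entity-ID pairs."""
    pairs = []
    for row in results["results"]["bindings"]:
        # Entity URIs end in the entity ID, e.g. ".../entity/Q42".
        subject = row["subject"]["value"].rsplit("/", 1)[-1]
        obj = row["object"]["value"].rsplit("/", 1)[-1]
        pairs.append((subject, obj))
    return pairs
```

For example, `build_inverse_mismatch_query("P2175", "P2176")` would look for "medical condition treated" statements with no corresponding "drug or therapy used for treatment" statement. The pairs returned by `extract_pairs` are candidates for human review rather than confirmed errors, since a missing inverse may be intentional.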


Approximate amount requested in USD.


Budget Description

Briefly describe what you expect to spend money on (specific budgets and details are not necessary at this time).

We request an amount of 49,867.21 USD for this research project.

By Time

Month Expenses
Month 0 17,696.50 USD
Month 6 12,912.09 USD
Month 9 8,837.72 USD
Month 12 10,420.90 USD
Overall 49,867.21 USD


25,970.71 USD is requested for the salaries of seven individuals. This covers writing research papers on our scholarly work, developing bots and tools for enriching and validating medical content in Wikidata, networking with the data-driven biomedical informatics R&D industry across the African continent, and preparing and organizing online events to disseminate medical applications of Wikidata.

Scientist Month 0 Month 6 Month 9 Month 12 Overall
Mohamed Ben Aouicha 650.00 USD 1,991.17 USD 995.58 USD 995.58 USD 4,632.33 USD
Mohamed Ali Hadj Taieb 650.00 USD 1,991.17 USD 995.58 USD 995.58 USD 4,632.33 USD
Houcemeddine Turki 700.00 USD 1,991.17 USD 995.58 USD 995.58 USD 4,682.33 USD
Khalil Chebil - 495.58 USD 495.58 USD 991.17 USD 1,982.33 USD
Lane Rasberry - 1,327.45 USD 1,327.44 USD 1,327.44 USD 3,982.33 USD
Daniel Mietchen - 1,015.55 USD 2,031.09 USD 1,015.55 USD 4,062.19 USD
Anastassios Pouris - - 1,996.87 USD - 1,996.87 USD
Salaries 2,000 USD 8,812.09 USD 8,837.72 USD 6,320.90 USD 25,970.71 USD

Development Equipment

We request 15,066.50 USD to purchase development equipment. This equipment is needed to implement machine-learning algorithms for enriching and validating Wikidata. Validating all of the medical content in Wikidata by hand is time-consuming and can be ineffective, whereas automatically recognizing the Wikidata statements to be added, removed or updated can save time. To ensure the accuracy of the output provided by the algorithms, human experts will validate the retrieved data before it is applied to Wikidata.

Device Month 0 Month 6 Month 9 Month 12 Overall
GPU NVIDIA A100 9,955.83 USD - - - 9,955.83 USD
SSD SAS Hard Drive 2 TB * 4 1,659.31 USD - - - 1,659.31 USD
2 Laptops 3,451.36 USD - - - 3,451.36 USD
Equipment 15,066.50 USD - - - 15,066.50 USD

OA and Software Expenses

We request 8,830 USD for OA and software expenses. These expenses cover the article processing fees for our upcoming research publications (2-6 papers) in highly regarded journals so that they can be published as open access outputs, giving communities in the Global South easy access to our research results for educational and reproducibility purposes. They also cover registration for highly regarded scholarly conferences, where we will demonstrate and discuss our work with the scientific community. Grammarly Premium (300 USD) is required to check the grammar of our research publications. The primary institution applying for the grant is located in an African country where English is less widely used than Arabic and French. Grammarly will save proofreading time, freeing more time for research. Overleaf Collaborator (180 USD) is required so that the seven contributors can work together on a paper in LaTeX. Overleaf will let us work directly on templates without worrying about formatting. Zoom Pro (150 USD) is needed to organize online events to disseminate our work and to hold the online meetings of the consortium developing the project.

Service Month 0 Month 6 Month 9 Month 12 Overall
Zoom Pro 150 USD - - - 150 USD
Grammarly Premium 300 USD - - - 300 USD
Overleaf Collaborator 180 USD - - - 180 USD
Open Access Publishing - 3,500 USD - 3,500 USD 7,000 USD
Top-Tier Scholarly Conference Registration - 600 USD - 600 USD 1,200 USD
OA and Software Expenses 630 USD 4,100 USD - 4,100 USD 8,830 USD
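
The figures in the tables above can be cross-checked with a few lines of Python. This is only a sketch of the arithmetic: the amounts are copied from the total rows of the three category tables, and the variable names are illustrative.

```python
from decimal import Decimal as D

periods = ["Month 0", "Month 6", "Month 9", "Month 12"]

# Total rows of the three category tables, in USD, one value per period.
salaries = [D("2000.00"), D("8812.09"), D("8837.72"), D("6320.90")]
equipment = [D("15066.50"), D("0"), D("0"), D("0")]
oa_software = [D("630"), D("4100"), D("0"), D("4100")]

# Expenses per period are the sum of the three categories for that period.
per_period = [s + e + o for s, e, o in zip(salaries, equipment, oa_software)]
overall = sum(per_period, D("0"))

for label, amount in zip(periods, per_period):
    print(f"{label}: {amount:,.2f} USD")
print(f"Overall: {overall:,.2f} USD")
```

The per-period sums reproduce the "By Time" table, and the overall sum matches the requested 49,867.21 USD.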


Address the impact and relevance to the Wikimedia projects, including the degree to which the research will address the 2030 Wikimedia Strategic Direction and/or support the work of Wikimedia user groups, affiliates, and developer communities. If your work relates to knowledge gaps, please directly relate it to the knowledge gaps taxonomy.

This project aligns with three points of the 2030 Wikimedia Strategic Direction: “Coordinate Across Stakeholders”, “Increase the Sustainability of Our Movement” and “Identify Topics for Impact”. Developing a framework to update, enrich and validate biomedical knowledge in Wikidata will improve the data quality of Wikidata in the healthcare context. Such quality in a freely available resource will increase the trustworthiness of Wikidata as a reference for physicians, pharmacists, and other medical professionals, which in turn will support better patient management and health education in the Global South. It will help fill representation gaps in medical content for content, contributors and readers, as defined in the knowledge gaps taxonomy. Extending the use of Wikidata to clinical practice will allow the creation of knowledge-based medical systems at a low cost. This will contribute to three UN Sustainable Development Goals: “Good Health and Well-Being” (SDG 3), “Quality Education” (SDG 4), and “Sustainable Cities and Communities” (SDG 11). From the perspective of the Wikimedia Movement, the project will serve as a reference for Wikimedia affiliates and communities from Africa, particularly Wikimedia Tunisia and the African Wikimedia Developers Project, if they would like to continue working on the medical content of Wikidata, create projects about biomedical applications of Wikidata, or formulate a research project and apply for future editions of the Wikimedia Research Fund.


Plans for dissemination.

Past Contributions

Prior contributions to related academic and/or research projects and/or the Wikimedia and free culture communities. If you do not have prior experience, please explain your planned contributions.

  • Three applicants are active long-term contributors to Wikimedia Projects, particularly Wikidata. All of them are Wikimedia researchers and open science advocates with scholarly outputs in major research venues and presentations at the main Wikimedia conferences (e.g., Wikimania, WikidataCon) and at open science events (e.g., Mozilla Festival, Creative Commons Summit). They have also been members of Wikimedia Medicine for several years:
    • Houcemeddine Turki (User Statistics): A Wikimedian since 2009, he served as a Programme Committee Member for WikiIndaba Conference (2018, 2019, and 2021) and WikiConvention Francophone (2021). He was also a member of the Wiki Indaba Steering Committee (2018-2021), a member of the Wikimedia and Libraries User Group Steering Committee (2019-2021), and the Secretary of the Affiliations Committee (2022). He is currently a member of the Affiliations Committee, a member of the Wikimania 2022 Core Organizing Team, and the Vice-Chair of Wikimedia Tunisia. As a contributor to Wikimedia Projects, he contributed for ten years to French and English Wikipedia before shifting his interest to Wikidata and Wikifunctions. His main areas of interest are reference support, Tunisia-related topics, library and information science, biomedical informatics, and applied linguistics. Furthermore, he is a co-founding member of the Data Engineering and Semantics Research Unit, the first research structure in Tunisia to specialize in Wikimedia Research and Open Science.
    • Daniel Mietchen (User Statistics): He is a biophysicist interested in integrating open research and education workflows with the web, particularly through open licensing, open standards, open collaboration, public version histories and forkability. With research activities spanning from the subcellular to the organismic level, from fossils to developing embryos, from biodiversity informatics to data science more broadly and how this all fits with sustainable development, he experienced multiple shades of the research cycle and a variety of approaches to collaboration, sharing and reproducibility in research contexts. He has also been contributing to Wikipedia and its sister projects for about two decades and is actively engaged in increasing the interactions between the Wikimedia and research communities, particularly around Wikidata. All of this informs his current activities as a researcher at the Fraunhofer Institute for Biomedical Engineering, at the Leibniz Institute of Freshwater Ecology and Inland Fisheries, and at the Ronin Institute.
    • Lane Rasberry (User Statistics): He is Wikimedian-in-residence at the School of Data Science at the University of Virginia. His professional interests include popular science, access to health information, clinical research, the Open Movement, data science, and Wikimedia projects. Rasberry earned a B.Sc. in Chemistry from the University of Washington in 2006. He is among the first people to propose that Wikipedia articles should be cited to ensure better scholarly recognition of Wikipedia as a source for added-value research outputs. He has also worked on promoting the use of Wikipedia for clinical purposes such as physician-patient communication, medical education, and research. Currently, he is working on the use of Wikidata for representing and integrating clinical trials.
  • The Data Engineering and Semantics Research Unit is the first research structure in Tunisia to specialize in Wikimedia Research. It organized two Wikimedia-funded events for the dissemination of Wikidata (AICCSA 2017 Wikidata Presentation and SPARQL: Be connected to Wikidata) and developed the first Tunisian Wikidata user script, Toolforge tool, and bot in collaboration with Wikimedia Tunisia and with the support of the WikiCred Grant Initiative. In 2021, the unit established an advanced research collaboration with Sisonkebiotik, an open African community for Biomedical Machine Learning. They are jointly working on a project entitled "Semantic Applications for Biomedical Data Science" that aims to develop applications of open biomedical knowledge graphs to support clinical efforts. The two main co-founders of the Data Engineering and Semantics Research Unit have been publishing real-life applications of the Wikimedia Projects since 2012. Their research outputs have appeared in highly regarded scholarly journals, particularly Engineering Applications of Artificial Intelligence, Knowledge-Based Systems, Journal of Biomedical Informatics, Applied Intelligence, and Neurocomputing:
    • Mohamed Ali Hadj Taieb is a senior researcher at the Data Engineering and Semantics Research Unit. He is an assistant professor of computer science at the Faculty of Sciences of Sfax, University of Sfax, Tunisia. He earned a Ph.D. in Computer Science from the University of Sfax, Tunisia in 2014. His main fields of interest are Semantic Technologies, Scientometrics, Biomedical Informatics, Big Data, Social Networks and Data Science.
    • Mohamed Ben Aouicha is the head of the Data Engineering and Semantics Research Unit. He is an associate professor of computer science at the Faculty of Sciences of Sfax, University of Sfax, Tunisia. He earned a Ph.D. in Computer Science from the University of Sfax, Tunisia and Paul Sabatier University of Toulouse, France in 2009 and a higher doctorate in Computer Systems Engineering from the University of Sfax, Tunisia in 2016. His main fields of interest are Semantic Technologies, Scientometrics, Information Retrieval, Big Data, Social Networks and Data Science.
  • Anastassios Pouris (University of Pretoria, South Africa) and Khalil Chebil (Data Engineering and Semantics Research Unit, Tunisia) have many years of experience in research collaborations with industry and know first-hand the data needs of the clinical industry. Consequently, they have the skills required to define the facets of biomedical knowledge that need to be prioritized for consistent revision in Wikidata and to find third-party individuals to review the output of this research project for the human validation of clinical information in Wikidata.

I agree to license the information I entered in this form, excluding the pronouns, countries of residence, and email addresses, under the terms of Creative Commons Attribution-ShareAlike 4.0. I understand that the decision to fund this Research Fund application, the application itself, and all the information entered by me in this form, excluding the pronouns, countries of residence, and email addresses of the personnel, will be published on Wikimedia Foundation Funds pages on Meta-Wiki and will be made available to the public in perpetuity. To make the results of my research actionable and reusable by the Wikimedia volunteer communities, affiliates and Foundation, I agree that any output of my research will comply with the WMF Open Access Policy. I also confirm that I have read the privacy statement and agree to abide by the WMF Friendly Space Policy and Universal Code of Conduct.



  1. Callahan, T. J., Tripodi, I. J., Pielke-Lombardo, H., & Hunter, L. E. (2020). Knowledge-based biomedical data science. Annual Review of Biomedical Data Science, 3, 23-41. PMC:8095730.
  2. Turki, H., Shafee, T., Hadj Taieb, M. A., Ben Aouicha, M., Vrandečić, D., Das, D., & Hamdi, H. (2019). Wikidata: A large-scale collaborative ontological medical database. Journal of Biomedical Informatics, 99, 103292. doi:10.1016/j.jbi.2019.103292.
  3. Sun, H., Depraetere, K., De Roo, J., Mels, G., De Vloed, B., Twagirumukiza, M., & Colaert, D. (2015). Semantic processing of EHR data for clinical research. Journal of Biomedical Informatics, 58, 247-259. doi:10.1016/j.jbi.2015.10.009.
  4. Kang, N., Singh, B., Bui, C., Afzal, Z., van Mulligen, E. M., & Kors, J. A. (2014). Knowledge-based extraction of adverse drug events from biomedical text. BMC Bioinformatics, 15(1), 64:1-64:8. doi:10.1186/1471-2105-15-64.
  5. Hong, G., Kim, Y., Choi, Y., & Song, M. (2021). BioPREP: Deep learning-based predicate classification with SemMedDB. Journal of Biomedical Informatics, 122, 103888. doi:10.1016/j.jbi.2021.103888.
  6. Nicholson, N. C., Giusti, F., Bettio, M., Negrao Carvalho, R., Dimitrova, N., Dyba, T., et al. (2021). An ontology-based approach for developing a harmonised data-validation tool for European cancer registration. Journal of Biomedical Semantics, 12(1), 1:1-1:15. doi:10.1186/s13326-020-00233-x.
  7. Slater, L. T., Bradlow, W., Ball, S., Hoehndorf, R., & Gkoutos, G. V. (2021). Improved characterisation of clinical text through ontology-based vocabulary expansion. Journal of Biomedical Semantics, 12(1), 7:1-7:9. doi:10.1186/s13326-021-00241-5.
  8. Tehrani, F. T., & Roum, J. H. (2008). Intelligent decision support systems for mechanical ventilation. Artificial Intelligence in Medicine, 44(3), 171-182. doi:10.1016/j.artmed.2008.07.006.
  9. Odekunle, F. F., Odekunle, R. O., & Shankar, S. (2017). Why sub-Saharan Africa lags in electronic health record adoption and possible strategies to increase its adoption in this region. International Journal of Health Sciences, 11(4), 59. PMC:5654179.
  10. Turki, H., Hadj Taieb, M. A., Shafee, T., Lubiana, T., Jemielniak, D., Ben Aouicha, M., et al. (2021). Representing COVID-19 information in collaborative knowledge graphs: the case of Wikidata. Semantic Web, 13(2), 233-264. doi:10.3233/SW-210444.
  11. Turki, H., Jemielniak, D., Hadj Taieb, M. A., Labra Gayo, J. E., Ben Aouicha, M., Banat, M., et al. (2022). Using logical constraints to validate statistical information about COVID-19 in collaborative knowledge graphs: the case of Wikidata. PeerJ Computer Science (forthcoming).