Research:Adapting Wikidata to support clinical practice using Data Science, Semantic Web and Machine Learning
This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.
Semantic resources have proven efficient for driving computer applications in a variety of domains, particularly healthcare. They provide detailed knowledge about many aspects of medicine, including diseases, drugs, genes, and proteins, and can consequently be used to retrieve, process, and represent clinical information such as electronic health records and scholarly publications. These databases enable biomedical relation extraction, relation classification, data validation, data augmentation, and decision support. However, the adoption of such biomedical resources in the Global South, particularly in Africa, is still limited by the lack of consistent funding and human capacity. Here, open knowledge graphs, particularly Wikidata, can be valuable in reducing the financial and technical burden of developing digital health in developing countries. As a free, collaborative, large-scale multilingual knowledge graph, Wikidata has become an established database for representing many kinds of clinical information, particularly in the context of COVID-19. Its representation in the Resource Description Framework (RDF) format enables the flexible enrichment of biomedical information through computer programs and crowdsourcing, the intrinsic and extrinsic validation of clinical knowledge, and the extraction of features from medical data for decision making and for human and machine learning. Yet, Wikidata still lacks a full representation of several facets of biomedical informatics, and its data suffers from critical inconsistencies. For instance, Wikidata items about genes, proteins, and drugs average more than 10 statements per item, while items about anatomical structures average fewer than 4.6 statements per item.
Furthermore, more than 90% of the Wikidata statements about human genes and proteins are supported by references, whereas fewer than 50% of the statements about anatomical structures are referenced. Moreover, the linguistic representation of biomedical entities in Wikidata is dominated by German and English, while other natural languages are only partially or rarely covered.
To address these concerns, this project aims to:
- Turn Wikidata into a large-scale biomedical semantic resource covering most aspects of clinical practice in a significant way (S1): This will be achieved by developing bots and tools to mass-import clinical information from external resources already aligned with Wikidata, and by creating machine learning algorithms to extract clinical information from the full texts and bibliographic metadata of scholarly publications indexed in PubMed, a large-scale bibliographic database hosted by the National Center for Biotechnology Information at the National Institutes of Health. This implies enriching the facets of biomedical knowledge already represented in Wikidata and supporting new kinds of clinical information that Wikidata has not covered in recent years.
- Validate the biomedical information freely available in Wikidata (S2): This will be achieved through comparison with external resources, using semantic alignments between Wikidata items and external biomedical resources, and through intrinsic validation, using SPARQL to identify mismatches between statements and shape-based methods such as ShEx and SHACL, as well as property constraints, to verify the formatting and data modeling of clinical knowledge in Wikidata. These methods will be coupled with the development of Wikidata Game-like tools for the human validation of medical information in Wikidata.
- Promote the biomedical use of Wikidata in the Global South (S3): This will be achieved through online capacity-building events about Wikidata and its tools for the biomedical community in Africa, and through the publication of surveys and position papers on biomedical applications of Wikidata in recognized research journals. The integration of Wikidata into Fast Healthcare Interoperability Resources (FHIR), a semantic system for driving electronic health records, is also envisioned to enable the use of human-validated subsets of Wikidata for clinical reasoning in the Global South.
This project only uses public data and text, and never touches private, restricted, or personal data or health information. These three tasks will allow not only the significant improvement of Wikidata as a secondary knowledge base for medical information but also the development of a framework for curating other widely used medical knowledge graphs such as the Disease Ontology. The reproducibility of the project will allow the development of solutions for enriching Wikimedia projects with knowledge about other research areas, such as social science and computer science, from open resources, particularly knowledge graphs and bibliographic databases.
Using SPARQL and APIs for biomedical data validation and enrichment
As a query language for RDF knowledge graphs, SPARQL is a very useful tool for retrieving particular pieces of information, including inconsistencies. Likewise, APIs are computer-friendly interfaces that allow computer programs to interact with open knowledge resources. As an open knowledge graph in the RDF format, Wikidata has a SPARQL endpoint that allows users to extract semantic data from Wikidata and represent it in a variety of layouts, including tables, plots, graphs, and maps. This query service also supports federated queries with a number of external knowledge databases that have SPARQL endpoints. Quest is a tool that generates QuickStatements instructions to add, modify, or delete Wikidata statements based on the output of a Wikidata SPARQL query. We can consequently use Quest to enrich and adjust Wikidata based on logical constraints implemented in SPARQL, and possibly on query federation with other knowledge resources, particularly the OBO Foundry. Wikidata also includes an API that can be accessed using a Python library called Wikibase Integrator. This library can automatically read and edit Wikidata and can consequently be used jointly with other Python libraries, such as Biopython, to enrich and adjust Wikidata through comparison with other knowledge resources such as MeSH. In this research project, we will apply several SPARQL queries to Quest to automatically enrich and adjust Wikidata based on logical constraints, and we will use Wikibase Integrator with other Python libraries to add new Wikidata statements and to add references to existing ones through comparison with other knowledge resources.
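The query-to-edit pipeline described above can be sketched as follows. This is a minimal illustration, not the Quest implementation: it converts result bindings in the JSON shape returned by the Wikidata Query Service into tab-separated QuickStatements (V1) commands. The query, the sample bindings, and the item identifiers are illustrative; a real run would send the query to the Wikidata SPARQL endpoint instead of using hardcoded data.

```python
# Sketch of a Quest-style pipeline: turn the JSON bindings of a Wikidata
# SPARQL query into QuickStatements (V1) commands. A real run would POST
# the query to https://query.wikidata.org/sparql and parse the response.

QUERY = """
SELECT ?disease ?symptom WHERE {
  ?disease wdt:P31 wd:Q12136 .   # instance of: disease
  ?disease wdt:P780 ?symptom .   # symptoms and signs
}
LIMIT 10
"""

def to_quickstatements(bindings, prop="P780"):
    """Convert WDQS-style result bindings into tab-separated
    QuickStatements commands (item <TAB> property <TAB> value)."""
    commands = []
    for row in bindings:
        item = row["disease"]["value"].rsplit("/", 1)[-1]
        value = row["symptom"]["value"].rsplit("/", 1)[-1]
        commands.append(f"{item}\t{prop}\t{value}")
    return commands

# Hardcoded sample in the shape returned by the Wikidata Query Service
# (the entity IDs are placeholders for illustration):
sample = [
    {"disease": {"value": "http://www.wikidata.org/entity/Q12078"},
     "symptom": {"value": "http://www.wikidata.org/entity/Q86"}},
]

print(to_quickstatements(sample))  # ['Q12078\tP780\tQ86']
```

The resulting commands can then be pasted into QuickStatements for batch execution, with human review of the batch before it runs.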
Using ShEx for shape-based biomedical data validation
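Wikidata EntitySchemas written in Shape Expressions (ShEx) describe the statements an item of a given class is expected to carry, and items can then be checked for conformance against the schema. As a minimal illustrative sketch (the shape and the chosen properties are simplified for exposition, not a finished EntitySchema), a disease item could be described as:

```shex
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

start = @<#disease>

# Simplified shape for disease items: must be an instance of
# disease (Q12136), and may list symptoms and treatments.
<#disease> {
  wdt:P31 [ wd:Q12136 ] ;   # instance of: disease
  wdt:P780 IRI * ;          # symptoms and signs
  wdt:P2176 IRI *           # drug or therapy used for treatment
}
```

Schemas like this one make the expected data model explicit, so that items missing required statements or using unexpected values can be flagged for curation.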
Mapping biomedical informatics and Wikidata research and development
For years, biomedical informatics research has evolved to cover multiple aspects of clinical practice, using state-of-the-art techniques such as machine learning, information retrieval, image and signal processing, big data, and pre-trained language models. Similarly, Wikidata research and development has been growing since 2012, following the changes in the multilingual and multidisciplinary coverage of the knowledge graph. Here, bibliometrics can be very useful for assessing research outputs about biomedical informatics and Wikidata research, as it uses the bibliographic metadata of scholarly publications to provide insights into the publishing patterns of a research community. Likewise, empirical software engineering can apply empirical research methods to the characteristics of a set of code repositories, including source code, pull requests, discussions, and issues, to study and evaluate software engineering behaviors related to the development of tools on a given topic. In this research project, we will extract bibliographic metadata for scholarly publications related to two research fields (machine learning for healthcare in Africa and Wikidata research) from Scopus, a curated large-scale bibliographic database maintained by Elsevier. Then, we will analyze them using four techniques:
- Publishing Patterns and Time-Aware Analysis: We quantitatively analyze the most common values of every type of bibliographic metadata, including author information, source information, titles, abstracts, research fields, and open-access status. We then repeat this analysis restricted to several time periods so that we can assess the evolution of research production in the considered area.
- Network Analysis: We consider four types of bibliographic associations: citation, co-citation, co-authorship, and bibliographic coupling. For every kind of association, we construct networks for authors, sources, countries, and documents over multiple periods to assess how the field has been structured. We use total link strength weighting to better visualize the nodes that contributed most to the establishment of the bibliometric networks. We use VOSviewer, an open-source tool for generating bibliometric networks from the data of bibliographic databases, to produce our visualizations.
- Keyword Analysis: As author-generated and Scopus-generated keywords do not cover all aspects of the analyzed scholarly publications, we augment the data by extracting MeSH keywords from PubMed, where applicable, using Biopython. As nearly all the research papers are written in English, we also use spaCy pre-trained models to extract noun phrases from titles and abstracts and add them to the list of keywords. Once this is done, we align the keywords to their corresponding Wikidata items using OpenRefine, an open-source tool for tabular data cleaning and reconciliation. After that, we generate the list of the most common keywords by type and period, and we construct keyword association networks for the field to study how each research topic interacts with others.
After finishing this part, we will use the keywords generated for the research publications about machine learning for healthcare in Africa to classify the considered papers according to their research topics, and then invite a number of African machine learning enthusiasts to write an overview of the research works on every topic, thereby developing a survey of the field. Beyond this, we will extract detailed information about Wikidata- and Wikipedia-related repositories on GitHub to assess how the Wikimedia technical community uses these two Wikimedia projects in tools, and to find out what we should do to strengthen the Wikimedia technical community, particularly in Africa.
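The keyword association networks described above can be sketched with plain co-occurrence counting. This is an illustrative toy example (the paper keyword lists are hypothetical): nodes are keywords, an edge weight counts how many papers mention both keywords, and a node's total link strength, in the VOSviewer sense, is the sum of the weights of its edges.

```python
from collections import Counter
from itertools import combinations

# Hypothetical keyword lists, one per paper:
papers = [
    ["wikidata", "knowledge graph", "sparql"],
    ["wikidata", "knowledge graph", "machine learning"],
    ["machine learning", "healthcare", "africa"],
]

# Edge weight = number of papers in which both keywords co-occur.
edges = Counter()
for keywords in papers:
    for a, b in combinations(sorted(set(keywords)), 2):
        edges[(a, b)] += 1

def total_link_strength(node):
    """Sum of the weights of all edges incident to a node."""
    return sum(w for (a, b), w in edges.items() if node in (a, b))

print(edges[("knowledge graph", "wikidata")])  # 2
print(total_link_strength("wikidata"))         # 4
```

In practice, the same counts are computed by VOSviewer from the exported bibliographic records; this sketch only shows the underlying arithmetic.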
MeSH Keywords for enriching clinical information in Wikidata
Currently, nearly all machine learning algorithms use the full texts of scholarly publications to extract biomedical relations. However, the bibliographic metadata of research publications is easier to parse and more structured than full texts, and it provides significant insights into a paper's research findings. Recently, a new field, Bibliometric-Enhanced Information Retrieval, has emerged to enable the extraction of scientific knowledge from the bibliographic metadata of scholarly publications based on information retrieval, the semantic web, and machine learning. In this research project, we are mainly interested in the bibliographic metadata of biomedical scholarly publications indexed in PubMed, a bibliographic database of biomedical research publications maintained by the NCBI. In particular, the MeSH keywords of PubMed scholarly publications are interesting bibliographic data that can be used to enrich clinical information in Wikidata. These keywords are controlled (derived from the Medical Subject Headings) and have a particular layout (Heading/Qualifier), enabling their semantic alignment to Wikidata items using the MeSH descriptor ID property and their easy processing thanks to their data structure. Such an interaction between MeSH keywords and Wikidata is enabled by two Python libraries: Biopython and Wikibase Integrator. In preliminary work, we developed an approach that predicts the type of semantic relation between two MeSH keywords based on the association of their qualifiers in PubMed. The classification algorithm returns a Wikidata property (195 relation types) as well as a first-order semantic relation type metaclass (5 superclasses) corresponding to the analyzed relation. We achieved an accuracy of 70.78% for the class-based classification and 83.09% for the superclass-based classification.
In this research project, we envision studying the mechanism behind our biomedical relation classification approach using a generalization-based accuracy analysis as well as Integrated Gradients as model explainability methods. We will use the results to make our proposed approach, named MeSH2Matrix, more accurate and to drive the search for references in PubMed for unsupported biomedical relations in Wikidata. Moreover, we will try to combine corpus-based semantic similarity measures with the MeSH2Matrix approach to extract semantic relations from MeSH keywords and add them to Wikidata using a Wikidata Game-like Toolforge tool.
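The feature construction underlying this approach can be sketched as follows. This is a simplified illustration, not the MeSH2Matrix implementation: a pair of MeSH descriptors is represented by a qualifier-association matrix in which cell (i, j) counts the papers where the first descriptor carries qualifier i and the second carries qualifier j. The qualifier subset and the paper records are hypothetical; the real approach uses the full MeSH qualifier set and PubMed records fetched via Biopython.

```python
# Tiny subset of MeSH qualifiers, for illustration only:
QUALIFIERS = ["drug therapy", "etiology", "therapeutic use"]
INDEX = {q: i for i, q in enumerate(QUALIFIERS)}

def association_matrix(papers, d1, d2):
    """Build a qualifier-association matrix for a descriptor pair.

    papers: list of dicts mapping a MeSH descriptor to the list of
    qualifiers it carries in that paper's MeSH annotations.
    """
    n = len(QUALIFIERS)
    m = [[0] * n for _ in range(n)]
    for paper in papers:
        for q1 in paper.get(d1, []):
            for q2 in paper.get(d2, []):
                m[INDEX[q1]][INDEX[q2]] += 1
    return m

# Hypothetical MeSH annotations of two papers:
papers = [
    {"Diabetes Mellitus": ["drug therapy"], "Metformin": ["therapeutic use"]},
    {"Diabetes Mellitus": ["etiology"], "Metformin": ["therapeutic use"]},
]

m = association_matrix(papers, "Diabetes Mellitus", "Metformin")
print(m)  # [[0, 0, 1], [0, 0, 1], [0, 0, 0]]
```

The matrix then serves as the input features of a classifier that predicts the Wikidata property (or relation superclass) linking the two descriptors.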
Expanding the coverage and real-world applications of biomedical knowledge in Wikidata
Currently, Wikidata is mainly used to host semantic information and develop educational dashboards about genomes, diseases, drugs, and proteins. Yet, biomedical knowledge is broader than this and needs better representation in Wikidata so that this open ontological database can be used in various contexts of clinical practice:
- Despite the significant representation of biomedical knowledge in Wikidata, many classes and relation types are still not well covered in this open knowledge graph. These include symptoms, syndromes, clinical trials, disease outbreaks, classifications, and surgeries, among other types of important biomedical items. In this research project, we will try to add new Wikidata properties related to biomedicine, as we did for risk factor and medical indication. We will also define new classes for unsupported types of medical entities, along with data models for describing them in Wikidata, as we did for clinical trials.
- Due to their structured format, knowledge graphs can easily be processed to extract features about a topic, using SPARQL as a query language or APIs as computer-friendly interfaces. In particular, the Wikidata API can be used via Wikibase Integrator to drive knowledge-based systems. Similarly, the outputs of SPARQL queries can be embedded in HTML pages to create real-time dashboards for various applications. In this research project, we will explain how computer scientists and medical specialists can use Wikidata to improve their work through a series of opinion and implementation papers. The applications we will deal with include the use of Wikidata to:
- Support clinical decisions, health research, and medical education by driving FHIR RDF-based structured electronic health records.
- Create real-time bibliometric studies to evaluate health research and predict award-winning work.
- Augment keyword analysis for bibliometric analyses and literature reviews.
- Support biomedical ontology engineering.
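The dashboard idea mentioned above can be sketched in a few lines. This is an illustrative snippet, not a production tool: it renders result bindings in the JSON shape returned by the Wikidata Query Service as an HTML table fragment that a dashboard page could embed; the sample bindings and figures are hardcoded placeholders.

```python
from html import escape

def bindings_to_html(columns, bindings):
    """Render WDQS-style JSON result bindings as an HTML table fragment."""
    header = "".join(f"<th>{escape(c)}</th>" for c in columns)
    rows = []
    for row in bindings:
        cells = "".join(
            f"<td>{escape(row[c]['value'])}</td>" for c in columns)
        rows.append(f"<tr>{cells}</tr>")
    return f"<table><tr>{header}</tr>{''.join(rows)}</table>"

# Hardcoded illustrative sample of query results:
sample = [{"diseaseLabel": {"value": "malaria"},
           "cases": {"value": "247000000"}}]

print(bindings_to_html(["diseaseLabel", "cases"], sample))
```

A real dashboard would periodically rerun the SPARQL query against the endpoint and regenerate the fragment, so the page always reflects the current state of the knowledge graph.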
This project aligns with three points of the 2030 Wikimedia Strategic Direction: “Coordinate Across Stakeholders”, “Increase the Sustainability of Our Movement”, and “Identify Topics for Impact”. Developing a framework to update, enrich, and validate biomedical knowledge in Wikidata will ensure better data quality for Wikidata in the healthcare context. Such quality in a freely available resource will increase the trustworthiness of Wikidata as a reference for physicians, pharmacists, and other medical professionals, allowing better patient management and health education in the Global South. It will also address representation gaps related to medical content across content, contributors, and readers, as defined in the knowledge gaps taxonomy. Extending the use of Wikidata for clinical practice will allow the creation of knowledge-based medical systems at low cost, contributing to three UN Sustainable Development Goals: “Good Health and Well-Being” (SDG 3), “Quality Education” (SDG 4), and “Sustainable Cities and Communities” (SDG 11). From the perspective of the Wikimedia movement, the project will serve as a reference for Wikimedia affiliates and communities in Africa, particularly Wikimedia Tunisia and the African Wikimedia Developers project, if they would like to continue working on the medical output of Wikidata and create projects about biomedical applications of Wikidata, or if they would like to formulate a research project and apply to future editions of the Wikimedia Research Fund.
To measure the success of our research project, several objective metrics can be used to evaluate the reach and productivity of our upcoming work:
- Number of scholarly publications in Scimago Q1 computer science and medical research journals: 3+
- Number of proceedings papers in the main tracks of CORE A or A* scholarly conferences: 1+
- Number of proceedings papers in the Workshops of CORE A or A* scholarly conferences: 2+
- Number of office hours: 6+
- Number of presentations in Wikimedia conferences: 3+
- Number of attendees to office hours: 30+ per session
For the dissemination of this project, we envision publishing most of our research results in recognized scholarly journals as open-access publications. We look forward to presenting our efforts at Wikimedia research venues such as Wiki Workshop, Wikidata Workshop, and Wiki-M3L, as well as at premier scholarly conferences for knowledge engineering and machine learning (CORE A*), particularly SIGIR and WWW. We will publish our source code on GitHub under the MIT License for reproducibility purposes. We will participate in Wikimedia conferences (e.g., WikiArabia, WikiIndaba, Wikimania, and WikidataCon) to disseminate the outcomes of our work to the Wikimedia community. We will organize regular office hours, broadcast live on YouTube and Zoom, where we will demonstrate our tools to the information retrieval, semantic web, biomedical informatics, and clinical medicine communities. All this work will be done in collaboration with SisonkeBiotik, a community working on machine learning for healthcare in Africa.
- S1.A1 (Months 1-12): Enriching Wikidata with biomedical knowledge available in external resources
- S1.A2 (Months 1-12): Developing machine learning algorithms to extract clinical information from scholarly publications indexed in PubMed
- S2.A1 (Months 4-10): Developing bots and tools for the cross-validation of Wikidata biomedical information against external resources
- S2.A2 (Months 4-10): Developing SPARQL-based approaches for the intrinsic validation of Wikidata clinical knowledge
- S2.A3 (Months 4-10): Developing EntitySchemas in ShEx for validating the shape and representation of medical concepts in Wikidata
- S3.A1 (Months 1-12): Organizing office hours to demonstrate Wikidata and its medical outputs
- S3.A2 (Months 1-12): Publishing scholarly articles about Wikidata-driven biomedical applications and about the management of biomedical information in Wikidata
- 30 June 2022: "Wikidata as a resource for enriching Medical Wikipedia" (in Arabic, 18 attendees), Arabic Wikipedia Day 2022
- 12 July 2022: "Bibliometric-Enhanced Information Retrieval as a tool for enriching and validating Wikidata" (in English, 65 attendees), 2022 LD4 Conference on Linked Data
- 16 and 23 July 2022: "Introduction to Wikidata, User Scripts, Wikidata Query Service, and OpenRefine" (in French), Wiki Wake Up Afrique
- 13 August 2022: "Let us play with PubMed to enrich Wikidata with medical information" (in English, 30 online and 8 in-person attendees), 2022 Wikimania Hackathon, Wikimania 2022 in Tunisia
- 25-26 August 2022: "Growing AI for Healthcare in Africa: Telling our story" (in English, 118 attendees), SisonkeBiotik: Africa Machine Learning and Health Workshop
- 23 October 2022: "Let us solve the mysteries behind Wikidata" (in French, 16 participants: 8 attendees and 10 tutorial contributors), Wikidata Tenth Birthday
- 16 November 2022: Letter to the Editor: "FHIR RDF - Why the world needs structured electronic health records", Journal of Biomedical Informatics
- 09 March 2023: "Empowering Biomedical Informatics with Open Resources: Unleashing the Potential of Data and Tools" (in English, 8 attendees), The Stanford MedAI Group Exchange Sessions
- Callahan, T. J., Tripodi, I. J., Pielke-Lombardo, H., & Hunter, L. E. (2020). Knowledge-based biomedical data science. Annual Review of Biomedical Data Science, 3, 23-41. PMC:8095730.
- Turki, H., Shafee, T., Hadj Taieb, M. A., Ben Aouicha, M., Vrandečić, D., Das, D., & Hamdi, H. (2019). Wikidata: A large-scale collaborative ontological medical database. Journal of Biomedical Informatics, 99, 103292. doi:10.1016/j.jbi.2019.103292.
- Sun, H., Depraetere, K., De Roo, J., Mels, G., De Vloed, B., Twagirumukiza, M., & Colaert, D. (2015). Semantic processing of EHR data for clinical research. Journal of Biomedical Informatics, 58, 247-259. doi:10.1016/j.jbi.2015.10.009.
- Kang, N., Singh, B., Bui, C., Afzal, Z., van Mulligen, E. M., & Kors, J. A. (2014). Knowledge-based extraction of adverse drug events from biomedical text. BMC Bioinformatics, 15(1), 64:1-64:8. doi:10.1186/1471-2105-15-64.
- Hong, G., Kim, Y., Choi, Y., & Song, M. (2021). BioPREP: Deep learning-based predicate classification with SemMedDB. Journal of Biomedical Informatics, 122, 103888. doi:10.1016/j.jbi.2021.103888.
- Nicholson, N. C., Giusti, F., Bettio, M., Negrao Carvalho, R., Dimitrova, N., Dyba, T., et al. (2021). An ontology-based approach for developing a harmonised data-validation tool for European cancer registration. Journal of Biomedical Semantics, 12(1), 1:1-1:15. doi:10.1186/s13326-020-00233-x.
- Slater, L. T., Bradlow, W., Ball, S., Hoehndorf, R., & Gkoutos, G. V. (2021). Improved characterisation of clinical text through ontology-based vocabulary expansion. Journal of Biomedical Semantics, 12(1), 7:1-7:9. doi:10.1186/s13326-021-00241-5.
- Tehrani, F. T., & Roum, J. H. (2008). Intelligent decision support systems for mechanical ventilation. Artificial Intelligence in Medicine, 44(3), 171-182. doi:10.1016/j.artmed.2008.07.006.
- Odekunle, F. F., Odekunle, R. O., & Shankar, S. (2017). Why sub-Saharan Africa lags in electronic health record adoption and possible strategies to increase its adoption in this region. International Journal of Health Sciences, 11(4), 59. PMC:5654179.
- Turki, H., Hadj Taieb, M. A., Shafee, T., Lubiana, T., Jemielniak, D., Ben Aouicha, M., ... & Mietchen, D. (2021). Representing COVID-19 information in collaborative knowledge graphs: the case of Wikidata. Semantic Web, 13(2), 233-264. doi:10.3233/SW-210444.
- Turki, H., Jemielniak, D., Hadj Taieb, M. A., Labra Gayo, J. E., Ben Aouicha, M., Banat, M., ... & Mietchen, D. (2022). Using logical constraints to validate statistical information about COVID-19 in collaborative knowledge graphs: the case of Wikidata. PeerJ Computer Science, 8, e1085. doi:10.7717/peerj-cs.1085.
- Labra Gayo, J. E. (2022). WShEx: A language to describe and validate Wikibase entities. In Proceedings of the 3rd Wikidata Workshop 2022 (Wikidata 2022) (pp. 4:1-4:12). Hangzhou, China: CEUR-WS.org. doi:10.48550/arXiv.2208.02697.
- Tran, B. X., Vu, G. T., Ha, G. H., Vuong, Q. H., Ho, M. T., Vuong, T. T., et al. (2019). Global evolution of research in artificial intelligence in health and medicine: a bibliometric study. Journal of Clinical Medicine, 8(3), 360. doi:10.3390/jcm8030360.
- Mora-Cantallops, M., Sánchez-Alonso, S., & García-Barriocanal, E. (2019). A systematic literature review on Wikidata. Data Technologies and Applications, 53(3), 250-268. doi:10.1108/DTA-12-2018-0110.
- Prana, G. A. A., Treude, C., Thung, F., Atapattu, T., & Lo, D. (2019). Categorizing the content of GitHub README files. Empirical Software Engineering, 24(3), 1296-1327. doi:10.1007/s10664-018-9660-3.
- Van Eck, N., & Waltman, L. (2010). Software survey: VOSviewer, a computer program for bibliometric mapping. Scientometrics, 84(2), 523-538. doi:10.1007/s11192-009-0146-3.
- Turki, H., Dossou, B. F. P., Emezue, C. C., Hadj Taieb, M. A., Ben Aouicha, M., Ben Hassen, H., & Masmoudi, A. (2022). MeSH2Matrix: Machine learning-driven biomedical relation classification based on the MeSH keywords of PubMed scholarly publications. In BIR 2022: 12th International Workshop on Bibliometric-enhanced Information Retrieval at ECIR 2022 (pp. 45-60). Stavanger, Norway: CEUR-WS.org. https://ceur-ws.org/Vol-3230/paper-07.pdf.
- Vasiliev, Y. (2020). Natural Language Processing with Python and spaCy: A Practical Introduction. No Starch Press.
- Delpeuch, A. (2020). A survey of OpenRefine reconciliation services. In Proceedings of the 15th International Workshop on Ontology Matching co-located with the 19th International Semantic Web Conference (ISWC 2020) (pp. 82-86). Athens, Greece: CEUR-WS.org. https://ceur-ws.org/Vol-2788/om2020_STpaper3.pdf.
- Turki, H., Hadj Taieb, M. A., & Ben Aouicha, M. (2022). How knowledge-driven class generalization affects classical machine learning algorithms for mono-label supervised classification. In International Conference on Intelligent Systems Design and Applications (pp. 637-646). Springer, Cham. doi:10.1007/978-3-030-96308-8_59.
- Sundararajan, M., Taly, A., & Yan, Q. (2017). Axiomatic attribution for deep networks. In International Conference on Machine Learning (pp. 3319-3328). PMLR. http://proceedings.mlr.press/v70/sundararajan17a.html.
- Rasberry, L., Tibbs, S., Hoos, W., Westermann, A., Keefer, J., Baskauf, S. J., ... & Mietchen, D. (2022). WikiProject Clinical Trials for Wikidata. medRxiv. doi:10.1101/2022.04.01.22273328.