Research:Adapting Wikidata to support clinical practice using Data Science, Semantic Web and Machine Learning

Created: 20:05, 30 May 2022 (UTC)
Duration: 2022-08 – 2024-09
Keywords: Wikidata, Knowledge Graph Validation, Biomedical Relation Classification, Biomedical Relation Extraction
Grant ID: G-RS-2204-08368

This page documents a completed research project.


Description


Semantic resources have proven efficient at driving computer applications in a variety of domains, particularly healthcare[1]. They provide detailed knowledge about many aspects of medicine, including diseases, drugs, genes, and proteins[2], and can consequently be used to retrieve, process, and represent clinical information such as electronic health records[3] and scholarly publications[4]. These databases enable biomedical relation extraction[4], biomedical relation classification[5], biomedical data validation[6], biomedical data augmentation[7], and biomedical decision support[8]. However, the implementation of such biomedical resources in the Global South, particularly in Africa, is still limited by the lack of consistent funding and human capacity[9]. Here, open knowledge graphs, particularly Wikidata, can be valuable in reducing the financial and technical burden of developing digital health in developing countries[2]. As a free and collaborative large-scale multilingual knowledge graph, Wikidata has become an established database that can represent multiple kinds of clinical information, particularly in the context of COVID-19[10]. Its representation in the Resource Description Framework (RDF) format enables the flexible enrichment of biomedical information through computer programs and crowdsourcing, the intrinsic and extrinsic validation of clinical knowledge, and the extraction of features from medical data for decision making and for human and machine learning[10]. Yet, Wikidata still lacks a full representation of several facets of biomedical informatics[2], and its data suffers from critical inconsistencies[11]. For instance, Wikidata items about genes, proteins, and drugs average more than ten statements per item, while anatomical structures average fewer than 4.6 statements per item[2]. Furthermore, more than 90% of the Wikidata statements about human genes and proteins are supported by references, whereas fewer than 50% of the statements about anatomical structures are assigned references[2]. Moreover, the linguistic representation of biomedical entities in Wikidata is dominated by German and English, while other natural languages are only partially or rarely covered[10].

To solve these concerns, this project aims to:

  • Turn Wikidata into a large-scale biomedical semantic resource covering most aspects of clinical practice in a significant way (S1): This is achieved through the development of bots and tools to mass import clinical information from external resources already aligned with Wikidata, and through the creation of machine learning algorithms to extract clinical information from the full texts and bibliographic information of scholarly publications indexed in PubMed, a large-scale bibliographic database hosted by the National Center for Biotechnology Information at the National Institutes of Health. This implies enriching the facets of biomedical knowledge represented in Wikidata and supporting new kinds of clinical information that Wikidata has not covered in recent years.
  • Validate the biomedical information freely available in Wikidata (S2): This is enabled by comparison with external resources through semantic alignments between Wikidata items and external biomedical resources, and by intrinsic validation using SPARQL to identify mismatches between statements and using shape-based methods such as ShEx and SHACL, as well as property constraints, to verify the formatting and data modeling of clinical knowledge in Wikidata. These methods are coupled with the development of Wikidata Game-like tools for the human validation of medical information in Wikidata.
  • Promote the biomedical use of Wikidata in the Global South (S3): This is pursued through online capacity-building events for the biomedical community in Africa about Wikidata and its tools, and through the publication of surveys and position papers on biomedical applications of Wikidata in recognized research journals. The integration of Wikidata into Fast Healthcare Interoperability Resources (FHIR), a semantic standard for driving electronic health records, is also envisioned to enable the use of human-validated subsets of Wikidata for clinical reasoning in the Global South.

This project only uses public data and text, and never touches private, restricted, or personal data or health information. Carrying out these three tasks will not only significantly improve Wikidata as a secondary knowledge base for medical information but will also yield a framework for curating other prominent medical knowledge graphs such as the Disease Ontology. The reproducibility of the project will allow the development of solutions for enriching Wikimedia projects with knowledge about other research areas, such as social science and computer science, from open resources, particularly knowledge graphs and bibliographic databases.

Methods



Principles


Research


Using SPARQL and APIs for biomedical data validation and enrichment

As a query language for RDF knowledge graphs, SPARQL is a very useful tool for retrieving particular pieces of information, including inconsistencies[11]. Likewise, APIs are computer-friendly interfaces that allow computer programs to interact with open knowledge resources[2]. As an open knowledge graph in the RDF format, Wikidata has a SPARQL endpoint that allows users to extract semantic data from Wikidata and represent it in a variety of layouts, including tables, plots, graphs, and maps. This query service also supports federated queries with a number of external knowledge databases that have SPARQL endpoints. Quest is a tool for generating QuickStatements instructions that add, modify, or delete Wikidata statements based on the output of a Wikidata SPARQL query. We can consequently use Quest to enrich and adjust Wikidata based on logical constraints implemented in SPARQL and, potentially, on query federation with other knowledge resources, particularly the OBO Foundry. In addition, Wikidata offers an API that can be accessed through a Python library called Wikibase Integrator. This library can automatically read and edit Wikidata and can therefore be used jointly with other Python libraries, such as Biopython, to enrich and adjust Wikidata through comparison with other knowledge resources like MeSH. In this research project, we will apply several SPARQL queries to Quest to automatically enrich and adjust Wikidata based on logical constraints, and we will use Wikibase Integrator with other Python libraries to add new Wikidata statements and to add references to existing ones through comparison with other knowledge resources.
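
To make this concrete, the following minimal sketch queries the Wikidata SPARQL endpoint for diseases lacking a Disease Ontology ID and then reads a single item with Wikibase Integrator. It is a read-only sketch, not our production pipeline: it assumes a WikibaseIntegrator 0.12-style API, and actual editing would additionally require authentication.

```python
# Minimal sketch (assumptions noted in the text): list diseases lacking a
# Disease Ontology ID (P699) via SPARQL, then read one item with
# WikibaseIntegrator. Read-only; edits would require a login object.
from wikibaseintegrator import WikibaseIntegrator
from wikibaseintegrator import wbi_helpers
from wikibaseintegrator.wbi_config import config

config['USER_AGENT'] = 'ClinicalWikidataSketch/0.1 (research prototype)'  # placeholder agent

QUERY = """
SELECT ?disease ?diseaseLabel WHERE {
  ?disease wdt:P31 wd:Q12136 .                      # instance of: disease
  FILTER NOT EXISTS { ?disease wdt:P699 ?doid . }   # no Disease Ontology ID
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""

results = wbi_helpers.execute_sparql_query(QUERY)
for row in results['results']['bindings']:
    qid = row['disease']['value'].rsplit('/', 1)[-1]
    print(qid, row.get('diseaseLabel', {}).get('value', ''))

wbi = WikibaseIntegrator()
item = wbi.item.get('Q12204')  # example item (tuberculosis)
print(item.labels.get('en'))
```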

Using ShEx for shape-based biomedical data validation

Shape Expressions (ShEx) is a semantic web language for describing RDF graph structures[12]. It has been extended to validate Wikidata statements based on shape-based constraints[12]. Currently, there is a database of ShEx-based EntitySchemas used to validate the data model of a particular category of Wikidata items, as well as a JavaScript tool, named CheckShEx, for validating a Wikidata item against an EntitySchema. Despite these significant advances, most of the Wikidata classes related to biomedicine are not supported by ShEx-based EntitySchemas. A list of currently available EntitySchemas can be found here. WikiProject COVID-19 tried to bridge this gap through the development of a series of EntitySchemas related to medicine; the list of COVID-19-related EntitySchemas is accessible here. However, a large number of biomedical classes still lack EntitySchemas, particularly those not linked to COVID-19 and infectious diseases. Furthermore, there is no automated way to link a Wikidata item to its respective EntitySchemas for validation purposes. A Wikidata property has been proposed to solve this problem, but the property proposal is still on hold; further information can be found at the proposal page. The problem with such an approach is that it cannot be scaled to support groups of Wikidata items that are defined by conditions beyond instance of relations. Here, we propose to manually add EntitySchemas to support all kinds of biomedical Wikidata entities. We also aim to develop two tools to enhance how EntitySchemas are defined and reused across Wikidata: the first creates new EntitySchemas from existing ones using embeddings, and the second infers the EntitySchemas corresponding to a Wikidata item for automatic validation of the knowledge graph.
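
For illustration, the sketch below validates a toy RDF description of a disease item against a simplified disease shape. The schema and data are illustrative rather than an actual Wikidata EntitySchema, and the use of the PyShEx package's ShExEvaluator interface is an assumption about that library, not part of our tooling.

```python
# Hedged sketch: validating a toy item against a simplified ShEx shape.
# Assumes the PyShEx package (pip install pyshex); schema and data are illustrative.
from pyshex import ShExEvaluator

SHEX_SCHEMA = """
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

start = @<http://example.org/shapes#disease>

<http://example.org/shapes#disease> {
  wdt:P31 [ wd:Q12136 ] ;         # instance of: disease (required)
  wdt:P699 xsd:string ?           # Disease Ontology ID (optional string)
}
"""

RDF_DATA = """
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

wd:Q12204 wdt:P31 wd:Q12136 ;
          wdt:P699 "DOID:399" .
"""

results = ShExEvaluator(rdf=RDF_DATA, schema=SHEX_SCHEMA,
                        focus="http://www.wikidata.org/entity/Q12204").evaluate()
for r in results:
    print(r.focus, "conforms" if r.result else f"fails: {r.reason}")
```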

Mapping biomedical informatics and Wikidata research and development

For years, biomedical informatics research has evolved to cover multiple aspects of clinical practice using state-of-the-art techniques such as machine learning, information retrieval, image and signal processing, big data, and pre-trained language models[13]. Similarly, Wikidata research and development has been growing since 2012 to cover the changes in the multilingual and multidisciplinary coverage of the knowledge graph[14]. Here, bibliometrics can be very useful for assessing research outputs about biomedical informatics and Wikidata research, as it uses the bibliographic metadata of scholarly publications to provide insights into the publishing patterns of a community[13]. Likewise, empirical software engineering can apply empirical research methods to the characteristics of a set of code repositories, including source code, pull requests, discussions, and issues, to study and evaluate software engineering behaviors related to the development of software tools on a given topic[15]. In this research project, we will extract bibliographic metadata for scholarly publications related to two research fields (Machine Learning for Healthcare in Africa and Wikidata Research) from Scopus, a controlled large-scale bibliographic database maintained by Elsevier. Then, we will analyze them using four techniques:

  • Publishing Patterns and Time-Aware Analysis: We quantitatively analyze the most common values of every type of bibliographic metadata, including author information, source information, titles, abstracts, research fields, and open-access status. We then repeat the same analysis restricted to several periods so that we can assess the evolution of research production in the considered area.
  • Network Analysis: We consider four types of bibliographic associations: citation, co-citation, co-authorship, and bibliographic coupling. For every kind of association, we construct networks for authors, sources, countries, and documents over multiple periods to assess how the field has been structured. We use Total Link Strength weighting to better visualize the nodes that contributed most to the establishment of the bibliometric networks. We use VOSviewer, an open-source software package for generating bibliometric networks from the data of bibliographic databases, to generate our data visualizations[16].
  • Keyword Analysis: As author-generated and Scopus-generated keywords do not cover all aspects of the analyzed scholarly publications, we augment the data by extracting MeSH keywords from PubMed, where applicable, using Biopython[17]. As almost all the research papers are written in English, we also use spaCy pre-trained models to extract noun phrases from titles and abstracts and add them to the list of keywords[18] (see the sketch after this list). When this is done, we align the keywords to their corresponding Wikidata items using OpenRefine[19], an open-source software package for tabular data cleaning and reconciliation. After that, we generate the list of the most common keywords by type and period and construct keyword association networks for the field to study how a research topic interacts with others.
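
The following minimal sketch illustrates the keyword-augmentation step under stated assumptions: the PubMed ID, contact email, and spaCy model are placeholders, and the real pipeline adds deduplication and OpenRefine-based reconciliation afterwards.

```python
# Minimal sketch of the keyword-augmentation step: fetch MeSH keywords for a
# PubMed article with Biopython, and extract noun phrases with spaCy.
from Bio import Entrez, Medline
import spacy

Entrez.email = "researcher@example.org"  # placeholder; NCBI requires a contact email

# 1) MeSH keywords from PubMed (MEDLINE format, 'MH' field). The PMID is a placeholder.
handle = Entrez.efetch(db="pubmed", id="32555059", rettype="medline", retmode="text")
record = next(Medline.parse(handle))
mesh_keywords = record.get("MH", [])
handle.close()

# 2) Noun phrases from the title ('TI') and abstract ('AB') with a spaCy model.
nlp = spacy.load("en_core_web_sm")
text = record.get("TI", "") + " " + record.get("AB", "")
noun_phrases = [chunk.text.lower() for chunk in nlp(text).noun_chunks]

keywords = set(k.lower() for k in mesh_keywords) | set(noun_phrases)
print(sorted(keywords)[:20])
```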

After finishing this part, we will use the generated keywords for the research publications about machine learning for healthcare in Africa to classify the considered papers according to their research topics, and then invite a number of African machine learning enthusiasts to write an overview of the research works on every topic to develop a survey of the research field. Beyond this, we will extract detailed information about Wikidata- and Wikipedia-related repositories on GitHub to assess how the Wikimedia technical community uses these two Wikimedia projects in tools and to find out what we should do to strengthen the Wikimedia technical community, particularly in Africa.

MeSH Keywords for enriching clinical information in Wikidata

Currently, nearly all machine learning algorithms use the full texts of scholarly publications to extract biomedical relations[17]. However, the bibliographic metadata of research publications is easier to parse and more structured than full texts, and provides significant insights into a paper's research findings[17]. Recently, a new field, called Bibliometric-Enhanced Information Retrieval, has emerged to allow the extraction of scientific knowledge from the bibliographic metadata of scholarly publications based on information retrieval, semantic web, and machine learning[17]. In this research project, we are mainly interested in the bibliographic metadata of biomedical scholarly publications indexed in PubMed, a bibliographic database of biomedical research publications maintained by NCBI[17]. In particular, the MeSH keywords of PubMed scholarly publications are interesting bibliographic data that can be used to enrich clinical information in Wikidata[17]. In fact, these keywords are controlled (derived from Medical Subject Headings) and have a particular layout (Heading/Qualifier), enabling their semantic alignment to Wikidata items using the MeSH descriptor ID property and their easy processing due to their data structure. This interaction between MeSH keywords and Wikidata is enabled by two Python libraries: Biopython and Wikibase Integrator. In preliminary work, we predicted the types of semantic relations between two MeSH keywords based on the association of their qualifiers in PubMed[17]. The classification algorithm returns a Wikidata property (195 relation types) as well as a first-order semantic relation type metaclass (5 superclasses) corresponding to the analyzed relation[17]. We achieved an accuracy of 70.78% for the class-based classification and of 83.09% for the superclass-based classification[17]. In this research project, we plan to study the mechanism behind our biomedical relation classification approach using a generalization-based accuracy analysis[20] as well as Integrated Gradients[21] as model explainability methods. We will use the results to make our proposed approach, named MeSH2Matrix, more accurate and to drive the search for references from PubMed for unsupported biomedical relations in Wikidata. Moreover, we will try to combine corpus-based semantic similarity measures with the MeSH2Matrix approach to extract semantic relations from MeSH keywords and add them to Wikidata using a Wikidata Game-like Toolforge tool.
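
The core data structure of MeSH2Matrix can be sketched as follows: for a pair of MeSH descriptors, we count how often each (subject qualifier, object qualifier) combination co-occurs across PubMed records, and the resulting matrix is the classifier input. The qualifier subset and records below are toy values for illustration, not the actual training data.

```python
# Illustrative sketch of the MeSH2Matrix input representation: a qualifier-by-
# qualifier co-occurrence matrix for one (subject, object) descriptor pair.
import numpy as np

QUALIFIERS = ["drug therapy", "therapeutic use", "etiology", "genetics"]  # toy subset (full list: 76)
INDEX = {q: i for i, q in enumerate(QUALIFIERS)}

# Toy PubMed records: each lists (descriptor, qualifier) MeSH keywords.
records = [
    [("Hepatitis C", "drug therapy"), ("Interferons", "therapeutic use")],
    [("Hepatitis C", "drug therapy"), ("Interferons", "therapeutic use")],
    [("Hepatitis C", "etiology"), ("Interferons", "genetics")],
]

def qualifier_matrix(records, subject, obj):
    """Count co-occurrences of subject/object qualifiers across records."""
    m = np.zeros((len(QUALIFIERS), len(QUALIFIERS)))
    for rec in records:
        subj_quals = [q for d, q in rec if d == subject]
        obj_quals = [q for d, q in rec if d == obj]
        for sq in subj_quals:
            for oq in obj_quals:
                m[INDEX[sq], INDEX[oq]] += 1
    return m

m = qualifier_matrix(records, "Hepatitis C", "Interferons")
print(m)  # flattened, this matrix is the feature vector fed to the classifier
```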

Expanding the coverage and real-world applications of biomedical knowledge in Wikidata

Currently, Wikidata is mainly used to support semantic information and develop educational dashboards about genomes, diseases, drugs, and proteins[2]. Yet, biomedical knowledge is broader than this and needs a less skewed representation in Wikidata so that this open ontological database can be used in various contexts of clinical practice[2]:

  • Despite the significant representation of biomedical knowledge in Wikidata, many classes and relation types are still not well covered in this open knowledge graph[2]. This includes symptoms, syndromes, clinical trials, disease outbreaks, classifications, and surgeries, among other important types of biomedical items[2]. In this research project, we will try to add new Wikidata properties related to biomedicine, as we did for risk factor and medical indication. We will also define new classes for unsupported types of medical entities and define data models for describing them in Wikidata, as we did for clinical trials[22].
  • Due to their structured format, knowledge graphs can be easily processed to extract features about a topic. This is enabled by SPARQL as a query language and by APIs as computer-friendly interfaces. In particular, the Wikidata API can be accessed through Wikibase Integrator to drive knowledge-based systems[10]. Similarly, the outputs of SPARQL queries can be embedded into HTML pages to create real-time dashboards for various applications[10]. In this research project, we will explain how computer scientists and medical specialists can use Wikidata to improve their work through a series of opinion papers and implementation papers. The applications we will be dealing with include the use of Wikidata to:
    • Support clinical decisions, health research, and medical education through driving FHIR RDF-based structured electronic health records
    • Create real-time bibliometric studies to evaluate health research and predict award-winning works
    • Augment keyword analysis for bibliometric analyses and literature reviews
    • Support biomedical ontology engineering

Dissemination


This project aligns with three points of the 2030 Wikimedia Strategic Direction: “Coordinate Across Stakeholders”, “Increase the Sustainability of Our Movement”, and “Identify Topics for Impact”. Developing a framework to update, enrich, and validate biomedical knowledge in Wikidata will ensure better data quality for Wikidata in the healthcare context. Such quality in a freely available resource will increase the trustworthiness of Wikidata as a reference for physicians, pharmacists, and other medical professionals, allowing better patient management and health education in the Global South. This will address representation gaps related to medical content for content, contributors, and readers, as defined in the knowledge gaps taxonomy. Extending the use of Wikidata for clinical practice will allow the creation of knowledge-based medical systems at a low cost, contributing to three UN Sustainable Development Goals: “Good Health and Well-Being” (SDG 3), “Quality Education” (SDG 4), and “Sustainable Cities and Communities” (SDG 11). From the perspective of the Wikimedia Movement, the project will serve as a reference for Wikimedia affiliates and communities from Africa, particularly Wikimedia Tunisia and the African Wikimedia Developers project, if they would like to continue working on the medical output of Wikidata and create projects about biomedical applications of Wikidata, or if they would like to formulate a research project and apply for future editions of the Wikimedia Research Fund.

To measure the success of our research project, several objective metrics can be used to evaluate the reach and productivity of our upcoming work:

  • Number of scholarly publications in SCImago Q1 computer science and medical research journals: 3+ Not done; however, we published three scholarly publications in journals ranked Q2 or better.
  • Number of proceedings papers in the main tracks of CORE A or A* scholarly conferences: 1+ Not done; however, we published one CORE B conference paper (LREC-COLING 2024).
  • Number of proceedings papers in the Workshops of CORE A or A* scholarly conferences: 2+ Done
  • Number of office hours: 6+ Done
  • Number of presentations in Wikimedia conferences: 3+ Done
  • Number of attendees to office hours: 30+ per session Done

For the dissemination of this project, we envision publishing most of our research results in recognized scholarly journals as Open Access publications. We look forward to presenting our efforts in Wikimedia research venues such as Wiki Workshop, Wikidata Workshop, and Wiki-M3L, as well as in premier scholarly conferences for knowledge engineering and machine learning (CORE A*), particularly SIGIR and WWW. We will publish our source code on GitHub under the MIT License for reproducibility purposes. We will participate in Wikimedia conferences (e.g., WikiArabia, WikiIndaba, Wikimania, and WikidataCon) to disseminate the outcomes of our work to the Wikimedia community. We will organize regular office hours where we demonstrate our tools live on YouTube and Zoom to the information retrieval, semantic web, biomedical informatics, and clinical medicine communities. All this work will be done in collaboration with SisonkeBiotik, an African community for machine learning in healthcare.

Timeline



Aim Task Description
S1 S1.A1 Month 1 - Month 12: Enriching Wikidata with biomedical knowledge available in external resources
  • Developing Wikidata bots and tools that use semantic alignments (e.g., Disease Ontology ID) between Wikidata items and their equivalents in external resources to extract and validate semantic relations between Wikidata items from these resources and mass import them into the Wikidata knowledge graph.
  • Applying machine learning models, semantic similarity, and natural language processing techniques to bibliographic metadata available in PubMed, mainly the MeSH keywords, to extract and classify biomedical relations between Wikidata items. MeSH2Matrix has been developed by SisonkeBiotik and the Data Engineering and Semantics Research Unit as an approach for the MeSH keyword-based classification of Wikidata relations[17]. MeSH2Matrix will be used as a pillar for applying Bibliometric-Enhanced Information Retrieval to automatically enrich and validate biomedical relations in Wikidata.
S1.A2 Month 1 - Month 12: Expanding the coverage of the biomedical knowledge in Wikidata
  • Adding new properties and classes based on our experience in this context. We have already proposed new Wikidata properties (e.g., risk factor and medical indication) and we have also added support for new Wikidata classes from scratch (e.g., COVID-19 app). We can reproduce this experience when needed. An example is the inclusion of knowledge about clinical trials in the Wikidata knowledge graph within the framework of WikiProject Clinical Trials[22].
S2 S2.A1 Month 4 - Month 10: Developing bots and tools for the cross-validation of Wikidata biomedical information from external resources
  • This work will build on the efforts of Wikimedia Deutschland in this context, particularly the Reference Island project, which assigns a reference to a Wikidata statement when it exists in an external knowledge resource. The bots will verify the availability of Wikidata statements in external resources (e.g., Disease Ontology) based on semantic alignments (e.g., Disease Ontology ID) in Wikidata. Toolforge tools will be based on the analysis of the bibliographic metadata of scholarly publications, coupled with human validation, to decide whether a biomedical relation in Wikidata is accurate. RefB, funded by the WikiCred Grant Initiative, is an example of preliminary work on using PubMed data mining to validate and add reference support to Wikidata biomedical relations.
S2.A2 Month 4 - Month 10: Developing SPARQL-based approaches for the intrinsic validation of Wikidata clinical knowledge
  • Despite the usefulness of shape-based methods (ShEx and SHACL), they do not allow the verification of Wikidata statements through the comparison of their values. Here, SPARQL can be used to identify mismatches between the values of two statements. In the context of the COVID-19 pandemic, we developed a SPARQL-based method for the validation of the epidemiological data about the disease[11]. We look forward to extending our work to cover other biomedical use cases of SPARQL-based validation that cannot be fulfilled by shape-based methods.
S2.A3 Month 4 - Month 10: Developing EntitySchemas in ShEx for validating the shape and representation of medical concepts in Wikidata
  • We will build upon the efforts of WikiProject COVID-19 to develop data models for supporting biomedical entities related to the ongoing pandemic. We will reuse the data modeling output of WikiProject COVID-19 and extend it to cover other aspects of clinical practice.
S3 S3.A1 Month 1 - Month 12: Organize office hours to demonstrate Wikidata and its medical outputs
  • We will build upon the success of our previous presentations at Wikimedia conferences (e.g., Wikimania 2019 and WikiArabia 2021) on the matter and customize our materials to make them better adapted to the healthcare industry and the computer science community in Africa.
  • We will deal with the technical side of reusing Wikidata in intelligent systems for clinical practice that was not covered in our previous presentations and that was used in our research on the topic (e.g., Wikibase Integrator, MediaWiki API, and SPARQL). We will also show examples of several clinical applications where Wikidata can be very useful, based on what we have presented at Wikimedia conferences and peer-reviewed research venues.
S3.A2 Month 1 - Month 12: Publishing scholarly publications about Wikidata-driven biomedical applications and about the management of biomedical information in Wikidata
  • Developing a roadmap for integrating Wikidata with Fast Healthcare Interoperability Resources and other open knowledge resources to enable the reuse of Wikidata in knowledge-based biomedical systems
  • Writing research works about how to practically use Wikidata in the clinical context and publishing them in indexed scholarly journals
  • Writing research works about how to manage clinical information in Wikidata and publishing them in indexed scholarly journals

Results



Enriching Wikidata with biomedical knowledge available in external resources (S1.A1)


In the context of this research work, we compared the external identifiers of Wikidata items with the entities in Open Biomedical Ontologies and Medical Subject Headings, and we found that a significant portion of these external identifiers is either inaccurate or missing[23]. The rate of missing items in Wikidata can exceed 90% for ontologies such as the Human Phenotype Ontology and the Vaccine Ontology[23]. The rate of inaccurate IDs in Wikidata can exceed 90% for ontologies like the Symptom Ontology, the Cell Line Ontology, and the Cell Ontology[23]. A tool that generates a list of inaccurate IDs and missing items in Wikidata has been made available at https://github.com/SisonkeBiotik-Africa/MeSH2Wikidata[23]. Furthermore, we used the semantic alignment between Wikidata and Medical Subject Headings (MeSH) to retrieve semantic relations from the MeSH keywords of PubMed scholarly publications[23]. We used Pointwise Mutual Information (PMI), a corpus-based semantic similarity measure, to identify relevant associations between MeSH keywords[23], setting the PMI threshold to 2 based on the analysis of the PMI values for Wikidata semantic relations[23]. We consequently identified 835,111 new relations between the 5,000 most common MeSH terms. Then, we used the MeSH qualifiers, i.e., the attributes of the MeSH keywords involved in the associations (e.g., drug therapy in hepatitis C/drug therapy), to classify the identified relations[24]. This is enabled by creating a matrix of the correspondence between subject qualifiers and object qualifiers for every Wikidata relation and using it to train a dense model to classify semantic relations[24]. Based on the explainability analysis of our machine learning algorithm driven by Integrated Gradients, we reduced the list of considered MeSH qualifiers from 76 to 30 and eliminated non-biomedical relations from the training set[23]. We found an accuracy of 75.32% and an F1-score of 73.51% for relation type-based classification. We also reproduced the same classification for the superclasses of the Wikidata relation types (i.e., Symmetric, Non-Symmetric, and Taxonomic) and obtained an accuracy of 89.40% and an F1-score of 89.43%. We found that the lack of agreement between relation type-based classification and superclass-based classification can help solve 93.1% of the inconsistencies in the classification algorithm. That is why we decided to classify the new relations based on both relation types and superclasses and to adopt the output of the classification only when the assigned relation type and superclass agree. When this was applied to the new relations, we identified 255,699 new classified relations supported by PubMed references. Among these semantic relations, 93,197 (36.4%) are subclass of [P279] relations, 26,820 (10.5%) are instance of [P31] relations, 24,212 (9.5%) are cell component [P681] relations, 16,655 (6.5%) are biological process [P682] relations, 16,624 (6.5%) are found in taxon [P703] relations, 10,378 (4.1%) are health specialty [P1995] relations, 9,834 (3.8%) are anatomical location [P927] relations, 7,729 (3.0%) are medical condition treated [P2175] relations, 6,953 (2.7%) are drug or therapy used for treatment [P2176] relations, and 6,432 (2.5%) are significant drug interaction [P769] relations. It is up to the Wikimedia community to verify these relations and add them to Wikidata.
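
For reference, the PMI filter underlying this step can be computed as in the short sketch below; the counts are toy values, while the real computation runs over keyword co-occurrences in the full PubMed corpus.

```python
# Toy sketch of the PMI filter: PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ),
# computed from co-occurrence counts over PubMed MeSH keyword pairs.
from math import log2

N = 1_000_000            # total number of keyword pair observations (toy value)
count_x = 5_000          # papers tagged with keyword x (toy value)
count_y = 8_000          # papers tagged with keyword y (toy value)
count_xy = 600           # papers tagged with both x and y (toy value)

p_x, p_y, p_xy = count_x / N, count_y / N, count_xy / N
pmi = log2(p_xy / (p_x * p_y))
print(f"PMI = {pmi:.2f}")  # associations with PMI >= 2 were kept in our study
```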

Expanding the coverage of the biomedical knowledge in Wikidata (S1.A2)


The first part of this work defines a set of guidelines for adding support for a new type of medical information in Wikidata through the development of the coverage of clinical trials in the open knowledge graph[22]. This involves practical steps for collaboratively creating a practical data model for the items through property proposals and community discussions, and for reusing available data models from Wikidata or other resources to save time in data modeling[22]. This part also covers the basics of managing a community (i.e., a WikiProject) for adding and validating support for the new type of information and creating bots to mass import public domain information from external resources[22]. Finally, this part proposes a framework for monitoring the success of the WikiProject through a set of SPARQL queries quantifying the status and evolution of clinical trial knowledge in Wikidata[22]. This part builds upon preliminary work on developing a WikiProject for Clinical Trials and is currently available here. As of August 26, 2023, there are 391,916 clinical trials (including ones that have been published in scholarly journals) and 20 clinical trial registries in Wikidata, proving the efficiency of this approach. The second part of the work filters the associations between the 5,000 most common MeSH keywords that exceed the threshold for pointwise mutual information (PMI) to identify the relations that cannot be assigned existing relation types from Wikidata[23]. These relations probably represent missing types of biomedical knowledge in Wikidata[23]. When applying this approach to the PMI-generated new associations, we found that 579,412 (69.3%) semantic relations are missing relation types. Of these, 528,005 (91.1% of unclassified relations) are not classified because they cannot have matrices of MeSH qualifiers, while 51,407 (8.9% of unclassified relations) are not assigned relation types due to conflicting outputs of the superclass-based classification and the relation type-based classification. The analysis of the MeSH tree code [P672] statements of the subjects and objects of these relations reveals that most of them link entities from the same class, particularly Chemicals and Drugs (D, 56,846 associations), Diseases (C, 18,167 associations), Health Care (N, 13,860 associations), and Biological Sciences (G, 11,996 associations). These relations can be taxonomic ones. Several associations between different classes also exist, mostly between Biological Sciences (G) and Chemicals and Drugs (D), between Anatomy (A) and Chemicals and Drugs (D), between Analytical, Diagnostic and Therapeutic Techniques and Equipment (E) and Chemicals and Drugs (D), and between Diseases (C) and Chemicals and Drugs (D). It is up to the community to see whether these relations correspond to unused Wikidata properties or whether new Wikidata properties need to be created to support them. To make the work of the community easier, we used lightweight LLMs (particularly Llama and Phi) to go through the relations and eliminate those that are not plausible to the LLMs (probability of the TRUE token < 0.8). The final refined list is available at https://github.com/csisc/ValiRel/tree/main/phi.

Developing bots and tools for the cross-validation of Wikidata biomedical information from external resources (S2.A1)


The first facet of this work is to verify the accuracy of external identifiers to Open Biomedical Ontologies[23]. We found a significant deficiency in the formatting of the external identifiers of several OBO ontologies in Wikidata, particularly the Symptom Ontology, Uberon, the Cell Line Ontology, the Cell Ontology, and the Vaccine Ontology. The definition of the Wikidata properties linking Wikidata and OBO should be significantly revised[23]. We also assessed the completeness of the semantic alignment between OBO ontologies and Wikidata and found that most OBO ontology concepts are not linked to Wikidata items[23]. We developed a Python-based user interface to help the community solve this important problem[23]. The second facet of this work is to verify unsupported statements in Wikidata by finding references about them in PubMed based on the association of the MeSH keywords corresponding to the subjects and objects of the considered semantic relations[23]. We restricted our search to recent reviews, as these publications have the highest level of evidence in clinical practice[23]. Notably, we found that 45.8% of these relations lack support in PubMed, potentially due either to Wikidata's collaborative editing leading to inconsistencies or to the novelty/specificity of certain biomedical facts[23]. This novelty/specificity barrier can be easily bypassed by expanding the list of considered research types to include biomedical articles. Additionally, 42.0% of unsupported relations can be verified by three or more PubMed publications, affirming the efficacy of MeSH keywords for fact-checking semantic connections, particularly taxonomic and generic relation types[23]. The third facet of the work is to identify irrelevant semantic relations between MeSH keywords based on their reduced pointwise mutual information (PMI) values in PubMed[23]. We consider the mode of the integer parts of the PMI values of the Wikidata relations as the PMI threshold[23], and we identify the Wikidata relations needing human attention as those with PMI values below this threshold[23]. We find that 12,898 out of 109,302 Wikidata relations between MeSH keywords (11.8%) cannot be verified, as they involve Wikidata items with wrong MeSH descriptor IDs (3,243 items, 8.3%)[23]. We also find that the PMI threshold is 2 and that 40,725 out of 109,302 Wikidata relations between MeSH keywords (37.2%) are below the PMI threshold and need human attention, as they can be wrong[23]. As this list of 40,725 relations to be verified is far beyond the human capacity to edit biomedical knowledge in Wikidata, we used lightweight LLMs (particularly Llama and Phi) to go through the relations and eliminate those that are plausible to the LLMs (probability of the TRUE token > 0.8). The final refined list is available at https://github.com/csisc/ValiRel/tree/main/phi.
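
The LLM-based triage can be sketched as follows: each candidate relation is verbalized into a TRUE/FALSE prompt, and the model's probability mass on a "TRUE" continuation is compared with the 0.8 cut-off. This is a hedged illustration using Hugging Face Transformers with an illustrative small model; the exact prompts, models, and post-processing are those of the ValiRel repository, not this sketch.

```python
# Hedged sketch of the plausibility filter: score P("TRUE") as the next token
# after a verbalized relation. Model name and prompt wording are illustrative,
# and scoring only the first sub-token of "TRUE" is an approximation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "microsoft/phi-2"  # illustrative lightweight model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def true_probability(subject: str, relation: str, obj: str) -> float:
    """Probability mass the model puts on 'TRUE' as the next token."""
    prompt = (f"Statement: {subject} {relation} {obj}.\n"
              f"Is this statement biomedically correct? Answer TRUE or FALSE: ")
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)
    true_id = tokenizer.encode("TRUE", add_special_tokens=False)[0]
    return probs[true_id].item()

p = true_probability("tuberculosis", "is treated with", "isoniazid")
print(p, "keep for human review" if p < 0.8 else "drop (plausible)")
```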

Integrating SPARQL and ShEx for the intrinsic validation of Wikidata clinical knowledge (S2.A2 and S2.A3)


So far, Shape Expressions (ShEx) have proven efficient in validating the layout of an RDF knowledge graph[11]. This implies that ShEx can be very useful for identifying whether a Wikidata item is well defined in the form of triples, particularly from the perspective of completeness[11]. Evaluating other aspects of open knowledge graphs, such as consistency, requires the involvement of other validation techniques[11]. One method proposed in previous research works is the combination of ShEx with the Shapes Constraint Language (SHACL), an RDF validation language that defines conditions for the evaluation of RDF graphs[11]. In practice, the use of Shape Expressions (ShEx) in a knowledge graph is based on explicitly defining a set of conditions that specify which entities should be validated using a given schema[25]. In the context of Wikidata, the use of SPARQL to implement logical constraints to validate the knowledge graph has proven successful in comparatively evaluating semantic relations and even in evaluating non-relational statements[11]. Preliminary tests done by Houcemeddine Turki and Daniel Mietchen on the Quest tool developed by Magnus Manske confirmed the possibility of defining logical constraints as SPARQL queries for eliminating obsolete Wikidata statements or inferring new statements from existing ones. Concerning the automation of the validation of Wikidata entities based on ShEx EntitySchemas, the direction taken by Wikimedia Deutschland is to define a new property linking Wikidata classes to their corresponding EntitySchemas[25]. It is evident that this method will not be efficient in situations where EntitySchemas have not been created to validate the elements of a given class[25]. Instead, it has been shown that the set of conditions for the Wikidata entities to be validated by a given EntitySchema can be inferred from the EntitySchema itself[25]. This is enabled by extracting closed constraints from EntitySchemas and converting them to SPARQL queries that identify the corresponding Wikidata entities[25]. Beyond this, we also verified whether all the medical classes in Wikidata are represented by EntitySchemas. We found a significant lack of EntitySchemas, and we therefore generated a list of missing EntitySchemas and made it available to the community, with guidelines on how to solve this important matter, at https://www.wikidata.org/wiki/Wikidata:WikiProject_Medicine/Schemas.
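
As an example of the SPARQL side of this validation, the sketch below runs a logical constraint in the spirit of our COVID-19 work[11]: it flags items whose reported number of deaths (P1120) exceeds their reported number of cases (P1603). The query wording is a simplified illustration, executed against the public endpoint with the requests library.

```python
# Hedged sketch: a SPARQL logical constraint flagging items where the number of
# deaths (P1120) exceeds the number of cases (P1603); simplified illustration.
import requests

ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = """
SELECT ?item ?cases ?deaths WHERE {
  ?item wdt:P1603 ?cases ;
        wdt:P1120 ?deaths .
  FILTER(?deaths > ?cases)
}
LIMIT 10
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "ClinicalWikidataValidation/0.1 (research prototype)"},
    timeout=60,
)
for row in response.json()["results"]["bindings"]:
    print(row["item"]["value"], row["cases"]["value"], row["deaths"]["value"])
```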

Defining the research and development ecosystem of Wikidata as an open knowledge graph (S3.A2)


The Wikidata research and development community is mainly concentrated in the Global North[26][27]. This is mainly linked to the increasing interest in Europe and North America, particularly in Germany, in open knowledge databases as free resources for computer science research[26][27]. The advantage of Germany over other developed countries contrasts with the domination of the United States of America and China over open knowledge graph research in general[28]. The prominence of Germany in Wikidata research and development may be partly due to the long history of German scientists in developing open knowledge graphs such as DBpedia[26][28], and partly due to the involvement of Wikimedia Deutschland in maintaining Wikidata[26][27]. Wikimedia Deutschland plays an important role in carrying out critical coding projects for the development of Wikidata and in encouraging the dissemination of the knowledge graph in different circles, such as scientific societies and digital heritage institutions[26][27]. The lack of interest of the African community in using Wikidata to create knowledge-based systems, particularly in the context of biomedical machine learning, is mainly related to the higher motivation of African scientists to work on few-shot learning from raw data rather than on data pre-processing based on knowledge resources[29]. The few efforts of African institutions in open research and development come either from grassroots (self-organized and distributed research communities) such as Masakhane and SisonkeBiotik[30] or from research structures working in public universities, like the Data Engineering and Semantics Research Unit[31]. To address this matter, we created a new user group at the Wikimania 2023 conference for the Wikidata Arabic Community, located in the Middle East and North Africa region, on October 29, 2023.

When comparing Wikidata research and development to that of Wikipedia, we find a lack of involvement of tech giants in research and development related to Wikidata[27]. This can be due to the interest of tech giants in enhancing the user experience of visitors to top websites like Wikipedia[27]. We also identify a higher involvement of active Wikimedians in research and development about Wikidata[26][27], mainly due to the flexible and structured format of Wikidata, which allows intermediate coders to process it easily[27]. Finally, we find that the implementation of new features in open knowledge graphs such as Wikidata runs ahead of the research literature about these updates[28]. This demonstrates the function of open knowledge graphs, particularly Wikidata, as incubators for developing semantic web research[28].

Defining the medical applications of Wikidata as an open knowledge graph (S3.A2)


Thanks to its structured format, Wikidata can be leveraged to create knowledge-based systems for real-world applications[10]. This is mainly enabled by the Wikidata Query Service, the MediaWiki API, the Wikibase Integrator Python library, the Wikidata RDF dumps, and the Wikidata Hub API, among other programming tools[2][10].

One way to do this is to mirror Wikidata as an offline database and trim it to keep only the Wikidata items related to medical practice[32]. The output is a reduced RDF dump that includes multilingual biomedical knowledge in the form of triples and that can be used for named entity recognition and entity linking from electronic health records, and then for customized medical reasoning based on patient data[32]. Another efficient approach for developing medical applications of Wikidata is real-time biomedical information retrieval from Wikidata to construct knowledge-based systems driven by the open knowledge graph[10]. This approach directly exploits the latest edition of the medical and bibliometric knowledge in Wikidata to generate its output and returns updated results for the same request as Wikidata is updated with new knowledge[10].
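
A minimal version of the trimming step could look like the sketch below: it streams a compressed Wikidata truthy N-Triples dump and keeps only triples whose subject belongs to a pre-computed set of medical QIDs. The dump file name and the QID set are placeholders; a real run would derive the QID set from SPARQL queries over medical classes.

```python
# Hedged sketch: trimming a Wikidata truthy N-Triples dump to medical items.
# The dump file name and the medical QID set are placeholders.
import gzip

MEDICAL_QIDS = {"Q12204", "Q12136"}  # placeholder set, e.g. built from a SPARQL query
PREFIX = "<http://www.wikidata.org/entity/"

with gzip.open("latest-truthy.nt.gz", "rt", encoding="utf-8") as dump, \
     open("medical-subset.nt", "w", encoding="utf-8") as out:
    for line in dump:
        # N-Triples: the subject is the first whitespace-separated token.
        subject = line.split(" ", 1)[0]
        if subject.startswith(PREFIX) and subject[len(PREFIX):-1] in MEDICAL_QIDS:
            out.write(line)
```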

Examples of such systems are SPARQL-based web tools for visualizing the status of COVID-19 knowledge (i.e., COVID-19 dashboards)[10] or for getting clinical decision support outputs (i.e., medical web apps)[10]. One advantage of developing these Wikidata-based web tools is that they can be implemented using elementary web coding (e.g., HTML and JavaScript) and semantic web (e.g., SPARQL) skills. Another advantage is that the development of such web tools is free of charge, allowing developing countries to have cutting-edge web tools beyond paywalls[10]. However, this can come at the expense of privacy, as the SPARQL queries generated by these tools are stored in the Wikidata SPARQL query logs[33]. Other limitations include the lack of precision of several Wikidata relations, which alters the efficiency of using them for biomedical reasoning[10], the lack of a full mass import of external resources, including core medical knowledge, into Wikidata[10], and the lack of coverage of certain aspects of medical practice[10].

Challenges


Open Science


Our research project in the realm of Open Science encountered several notable challenges. Firstly, the decision to publish our research papers as preprints had a significant impact on the range of target journals available to us, as many journals have stringent licensing guidelines that conflict with preprint publishing, while others require anonymity in the peer review process. Consequently, this limitation, along with the often lengthy and time-consuming peer review procedures, resulted in substantial delays for our project. Additionally, the substantial costs associated with open-access Article Processing Charges (APCs) proved to be a significant financial burden. Furthermore, the open nature of preprints made our research discoveries vulnerable to being reproduced and modified by unscrupulous scientists, who could then publish the altered findings, potentially undermining the credit due to our work. Moreover, while Open Access primarily focuses on research papers, it does not extend to short communications, leaving these valuable contributions less accessible. Publishing preprints for every iteration of our research posed challenges related to the granularity of citations received, potentially distorting citation counts for our research publications. Legal concerns also arose concerning the open-access availability of bibliometric study datasets, particularly when dealing with proprietary databases like Scopus. Lastly, there were reservations about the storage of voluminous source codes and data in repositories, highlighting the need for practical and scalable solutions to address these issues. These multifaceted challenges underscore the complexities and trade-offs associated with practicing Open Science in contemporary research endeavors.

Computational Complexity


A significant challenge that our research project faces is the need to process large volumes of data, which necessitates the parallelization of algorithms. Processing large datasets efficiently can be a formidable task, as it demands substantial computational resources and intricate algorithmic design. Parallelization involves breaking down complex computations into smaller, manageable tasks that can be executed simultaneously on multiple processors or cores. While parallelization can significantly expedite data processing, it brings its own set of challenges. Firstly, designing and implementing parallel algorithms requires a deep understanding of both the problem domain and the underlying hardware architecture. This demands expertise in distributed computing, multi-threading, and synchronization techniques, which can be a resource-intensive process. Additionally, identifying the optimal level of parallelism and load balancing is crucial to ensure that all computational resources are fully utilized, without creating bottlenecks. Furthermore, debugging and testing parallelized algorithms can be more intricate than their sequential counterparts. Issues such as race conditions, deadlocks, and data consistency must be carefully managed to avoid unexpected errors and ensure the reliability of the results. As the scale of data processing increases, these challenges become more pronounced, demanding robust testing and debugging strategies. The choice of a parallel computing framework or platform can also be a challenge. Selecting the most suitable technology for the specific research project, whether it's distributed computing clusters, GPU acceleration, or cloud-based solutions, requires careful consideration and may involve a learning curve. Lastly, scalability is an ongoing concern. As data continues to grow, ensuring that the parallelized algorithms can adapt and scale efficiently without compromising performance becomes a pressing issue. In sum, while parallelization of algorithms is essential for processing large datasets, it introduces complexities that demand careful planning, expertise, and ongoing management to meet the data processing needs of our research project.

Outputs


Wikimedia Events

  • 30 June 2022: Wikidata as a resource for enriching Medical Wikipedia (in Arabic, 18 attendees). Venue: Arabic Wikipedia Day 2022. Materials: Abstract, Slides.
  • 12 July 2022: Bibliometric-Enhanced Information Retrieval as a tool for enriching and validating Wikidata (in English, 65 attendees). Venue: 2022 LD4 Conference on Linked Data. Materials: Slides, Video, Notes.
  • 13 August 2022: Let us play with PubMed to enrich Wikidata with medical information (in English, 30 online and 8 in-person attendees). Venue: 2022 Wikimania Hackathon, Wikimania 2022 in Tunisia. Materials: Slides, Video, Notes.
  • 11 May 2023: Developing a Wikimedia-related research structure in a developing country (in English, 20 attendees). Venue: Wiki Workshop. Materials: Paper, Video, Notes.
  • 17 August 2023: Let us upgrade medical practice with Wikidata (in English, 29 attendees). Venue: Wikimania 2023. Materials: Slides, Video, Notes.
  • 17 August 2023: Mastering the Secrets of the Wikimedia Research and Development Ecosystem in Africa (in English, 34 attendees). Venue: Wikimania 2023. Materials: Slides, Video, Notes.
  • 18 August 2023: Empowering Wikidata and Wikipedia with Generative AI: Unlocking the Potential of Scholarly Publications (in English, 44 attendees). Venue: Wikimania 2023. Materials: Slides, Video, Notes.
  • 29 October 2023: Wikidata Birthday Presents Lightning Talks Session (in English, 29 attendees). Venue: WikidataCon 2023. Materials: Video I, Video II, Notes.

Office Hours

  • 25-26 August 2022: Growing AI for Healthcare in Africa: Telling our story (in English, 118 attendees). Venue: SisonkeBiotik: Africa Machine Learning and Health Workshop. Materials: Slides.
  • 23 October 2022: Let us solve the mysteries behind Wikidata (in French, 16 participants, including 8 attendees and 10 contributors to the tutorial). Venue: Wikidata Tenth Birthday. Materials: Video, Notes.
  • 09 March 2023: Empowering Biomedical Informatics with Open Resources: Unleashing the Potential of Data and Tools (in English, 8 attendees). Venue: The Stanford MedAI Group Exchange Sessions. Materials: Slides, Video, Notes.
  • 20 May 2023: Schemas for Medical Entities (in English, 10 on-site and 6 remote participants). Venue: Wikimedia Hackathon 2023. Materials: Slides, Abstract.
  • 16 August 2023: How to use Wikidata to build web tools for the social good (in English, 19 attendees). Venue: Wikimania Hackathon 2023. Materials: Slides, Video, Abstract and Notes.
  • 30 September 2023: Wikidata for Better Health (in French, 16 attendees). Venue: Project Showcase. Materials: Slides, Video, Abstract.

Tutorials

  • 16 and 23 July 2022: Introduction to Wikidata, User Scripts, Wikidata Query Service, and OpenRefine (in French). Venue: Wiki Wake Up Afrique.

Reports

  • 1 April 2022: WikiProject Clinical Trials for Wikidata. Materials: Full Text.
  • 28 April 2023: Machine Learning for Healthcare: A Bibliometric Study of Contributions from Africa. Materials: Full Text.
  • 23 October 2023: From MeSH Keywords to Biomedical Knowledge in Wikidata: The giant move. Materials: Full Text.
  • 29 October 2023: Wikidata Arabic Community. Materials: Full Text.

Research Venues

  • 16 November 2022: Letter to the Editor: FHIR RDF - Why the world needs structured electronic health records. Venue: Journal of Biomedical Informatics. Materials: Abstract, Full Text.
  • 5 May 2023: A Framework for Grassroots Research Collaboration in Machine Learning and Global Health. Venue: Machine Learning and Global Health Workshop (MLGH@ICLR 2023). Materials: Full Text.
  • 7 November 2023: Ten years of Wikidata: A bibliometric study. Venue: Wikidata Workshop (Wikidata@ISWC 2023). Materials: Slides, Full Text.
  • 7 November 2023: Automating the use of Shape Expressions for the validation of semantic knowledge in Wikidata. Venue: Wikidata Workshop (Wikidata@ISWC 2023). Materials: Slides, Full Text.
  • 7 November 2023: Preregistration: Comparing the use of Wikidata and Wikipedia by open-source software programmers on GitHub repositories. Venue: Wikidata Workshop (Wikidata@ISWC 2023). Materials: Slides, Full Text.
  • 20-25 May 2024: A Decade of Scholarly Research on Open Knowledge Graphs. Venue: The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). Materials: Full Text.
  • 4 September 2024: MeSH2Matrix: Combining MeSH keywords and machine learning for biomedical relation classification based on PubMed. Venue: Journal of Biomedical Semantics. Materials: Full Text.
  • 17 September 2024: A framework for integrating biomedical knowledge in Wikidata with open biomedical ontologies and MeSH keywords. Venue: Heliyon. Materials: Full Text.

Conference

  • 11 October - 31 December 2024: Arabic Wikidata Days 2024 (in Arabic). Materials: Report.

References

  1. Callahan, T. J., Tripodi, I. J., Pielke-Lombardo, H., & Hunter, L. E. (2020). Knowledge-based biomedical data science. Annual review of biomedical data science, 3, 23-41. PMC:8095730.
  2. a b c d e f g h i j k Turki, H., Shafee, T., Hadj Taieb, M. A., Ben Aouicha, M., Vrandečić, D., Das, D., & Hamdi, H. (2019). Wikidata: A large-scale collaborative ontological medical database. Journal of biomedical informatics, 99, 103292. doi:10.1016/j.jbi.2019.103292.
  3. Sun, H., Depraetere, K., De Roo, J., Mels, G., De Vloed, B., Twagirumukiza, M., & Colaert, D. (2015). Semantic processing of EHR data for clinical research. Journal of biomedical informatics, 58, 247-259. doi:10.1016/j.jbi.2015.10.009.
  4. a b Kang, N., Singh, B., Bui, C., Afzal, Z., van Mulligen, E. M., & Kors, J. A. (2014). Knowledge-based extraction of adverse drug events from biomedical text. BMC Bioinformatics, 15(1), 64:1-64:8. doi:10.1186/1471-2105-15-64.
  5. Hong, G., Kim, Y., Choi, Y., & Song, M. (2021). BioPREP: Deep learning-based predicate classification with SemMedDB. Journal of Biomedical Informatics, 122, 103888. doi:10.1016/j.jbi.2021.103888.
  6. Nicholson, N. C., Giusti, F., Bettio, M., Negrao Carvalho, R., Dimitrova, N., Dyba, T., et al. (2021). An ontology-based approach for developing a harmonised data-validation tool for European cancer registration. Journal of Biomedical Semantics, 12(1), 1:1-1:15. doi:10.1186/s13326-020-00233-x.
  7. Slater, L. T., Bradlow, W., Ball, S., Hoehndorf, R., & Gkoutos, G. V. (2021). Improved characterisation of clinical text through ontology-based vocabulary expansion. Journal of Biomedical Semantics, 12(1), 7:1-7:9. doi:10.1186/s13326-021-00241-5.
  8. Tehrani, F. T., & Roum, J. H. (2008). Intelligent decision support systems for mechanical ventilation. Artificial Intelligence in Medicine, 44(3), 171-182. doi:10.1016/j.artmed.2008.07.006.
  9. Odekunle, F. F., Odekunle, R. O., & Shankar, S. (2017). Why sub-Saharan Africa lags in electronic health record adoption and possible strategies to increase its adoption in this region. International Journal of Health Sciences, 11(4), 59. PMC:5654179.
  10. a b c d e f g h i j k l m n o Turki, H., Hadj Taieb, M. A., Shafee, T., Lubiana, T., Jemielniak, D., Ben Aouicha, M., ... & Mietchen, D. (2021). Representing COVID-19 information in collaborative knowledge graphs: the case of Wikidata. Semantic Web, 13(2), 233-264. doi:10.3233/SW-210444.
  11. a b c d e f g h Turki, H., Jemielniak, D., Hadj Taieb, M. A., Labra Gayo, J. E., Ben Aouicha, M., Banat, M., ... & Mietchen, D. (2022). Using logical constraints to validate statistical information about COVID-19 in collaborative knowledge graphs: the case of Wikidata. PeerJ Computer Science, 8, e1085. doi:10.7717/peerj-cs.1085.
  12. a b Labra Gayo, J. E. (2022). WShEx: A language to describe and validate Wikibase entities. In Proceedings of the 3rd Wikidata Workshop 2022 (Wikidata 2022) (pp. 4:1-4:12). Hangzhou, China: CEUR-WS.org. doi:10.48550/arXiv.2208.02697
  13. a b Tran, B. X., Vu, G. T., Ha, G. H., Vuong, Q. H., Ho, M. T., Vuong, T. T., et al. (2019). Global evolution of research in artificial intelligence in health and medicine: a bibliometric study. Journal of clinical medicine, 8(3), 360. doi:10.3390/jcm8030360
  14. Mora-Cantallops, M., Sánchez-Alonso, S., & García-Barriocanal, E. (2019). A systematic literature review on Wikidata. Data Technologies and Applications, 53(3), 250-268. doi:10.1108/DTA-12-2018-0110
  15. Prana, G. A. A., Treude, C., Thung, F., Atapattu, T., & Lo, D. (2019). Categorizing the content of GitHub README files. Empirical Software Engineering, 24(3), 1296-1327. doi:10.1007/s10664-018-9660-3
  16. Van Eck, N., & Waltman, L. (2010). Software survey: VOSviewer, a computer program for bibliometric mapping. Scientometrics, 84(2), 523-538. doi:10.1007/s11192-009-0146-3
  17. a b c d e f g h i j Turki, H., Dossou, B. F. P., Emezue, C. C., Hadj Taieb, M. A., Ben Aouicha, M., Ben Hassen, H., & Masmoudi, A. (2022). MeSH2Matrix: Machine learning-driven biomedical relation classification based on the MeSH keywords of PubMed scholarly publications. In BIR 2022: 12th International Workshop on Bibliometric-enhanced Information Retrieval at ECIR 2022 (pp. 45-60). Stavanger, Norway: CEUR-WS.org. https://ceur-ws.org/Vol-3230/paper-07.pdf
  18. Vasiliev, Y. (2020). Natural Language Processing with Python and SpaCy: A Practical Introduction. No Starch Press.
  19. Delpeuch, A. (2020). A Survey of OpenRefine Reconciliation Services. In Proceedings of the 15th International Workshop on Ontology Matching co-located with the 19th International Semantic Web Conference (ISWC 2020) (pp. 82-86). Athens, Greece: CEUR-WS.org. https://ceur-ws.org/Vol-2788/om2020_STpaper3.pdf
  20. Turki, H., Hadj Taieb, M. A., & Ben Aouicha, M. (2022). How knowledge-driven class generalization affects classical machine learning algorithms for mono-label supervised classification. In International Conference on Intelligent Systems Design and Applications (pp. 637-646). Springer, Cham. doi:10.1007/978-3-030-96308-8_59
  21. Sundararajan, M., Taly, A., & Yan, Q. (2017, July). Axiomatic attribution for deep networks. In International conference on machine learning (pp. 3319-3328). PMLR. http://proceedings.mlr.press/v70/sundararajan17a.html
  22. a b c d e f Rasberry, L., Tibbs, S., Hoos, W., Westermann, A., Keefer, J., Baskauf, S. J., ... & Mietchen, D. (2022). WikiProject Clinical Trials for Wikidata. medRxiv. doi:10.1101/2022.04.01.22273328.
  23. a b c d e f g h i j k l m n o p q r s t u v w Turki, H., Chebil, K., Dossou, B. F. P., Emezue, C. C., Owodunni, A. T., Hadj Taieb, M. A., & Ben Aouicha, M. (2024). A framework for integrating biomedical knowledge in Wikidata with open biomedical ontologies and MeSH keywords. Heliyon.
  24. a b Turki, H., Dossou, B. F. P., Emezue, C. C., Owodunni, A. T., Hadj Taieb, M. A., Ben Aouicha, M., Ben Hassen, H., & Masmoudi, A. (2024). MeSH2Matrix: Combining MeSH keywords and machine learning for biomedical relation classification based on PubMed. Journal of Biomedical Semantics. doi:10.1186/s13326-024-00319-w.
  25. a b c d e Turki, H., Hadj Taieb, M. A., Ben Aouicha, M., Rasberry, L., & Mietchen, D. (2023). Automating the use of Shape Expressions for the validation of semantic knowledge in Wikidata. In Wikidata Workshop 2023 (Wikidata@ISWC 2023).
  26. a b c d e f Turki, H., Hadj Taieb, M. A., Ben Aouicha, M., Rasberry, L., & Mietchen, D. (2023). Ten years of Wikidata: A bibliometric study. In Wikidata Workshop 2023 (Wikidata@ISWC 2023).
  27. a b c d e f g h Turki, H., Hadj Taieb, M. A., Ben Aouicha, M., Rasberry, L., & Mietchen, D. (2023). Preregistration: Comparing the use of Wikidata and Wikipedia by open-source software programmers on GitHub repositories. In Wikidata Workshop 2023 (Wikidata@ISWC 2023).
  28. a b c d Turki, H., Owodunni, A. T., Hadj Taieb, M. A., Bile, R. F., Ben Aouicha, M., & Zouhar, V. (2023). A Decade of Scholarly Research on Open Knowledge Graphs. In LREC-COLING 2024. doi:10.48550/arXiv.2306.13186.
  29. Turki, H., et al. (2023). Machine Learning for Healthcare: A Bibliometric Study of Contributions from Africa. SisonkeBiotik. doi:10.20944/preprints202302.0010.v2.
  30. Currin, C., Asiedu, M. N., Fourie, C., Rosman, B., Turki, H., et al. (2023). A Framework for Grassroots Research Collaboration in Machine Learning and Global Health. In MLGH@ICLR 2023. https://zenodo.org/record/7859696.
  31. Turki, H., Akermi, M., Amara, A., Hadj Taieb, M. A., Chebil, K., Mietchen, D., & Ben Aouicha, M. (2023). Developing a Wikimedia-related research structure in a developing country. In Wiki Workshop 2023. https://wikiworkshop.org/2023/papers/WikiWorkshop2023_paper_51.pdf.
  32. a b Turki, H., Rasberry, L., Hadj Taieb, M. A., Mietchen, D., Ben Aouicha, M., Pouris, A., & Bousrih, Y. (2022). Letter to the Editor: FHIR RDF - Why the world needs structured electronic health records. Journal of Biomedical Informatics, 136, 104253. doi:10.1016/j.jbi.2022.104253.
  33. Bonifati, A., Martens, W., & Timm, T. (2019, May). Navigating the maze of Wikidata query logs. In The World Wide Web Conference (pp. 127-138). doi:10.1145/3308558.3313472.