WikiFAIR

From Meta, a Wikimedia project coordination wiki

WikiFAIR is a set of ideas, instructions and helpful examples for achieving the FAIR principles in research projects using Wikimedia systems and technologies. More specifically, it is about integrating Wikimedia platforms into scientific workflows to promote free knowledge while reducing the hurdles of building research infrastructure.

FAIR principles[edit]

Graphic showing the four FAIR principles
FAIR principles

The FAIR data principles are guidelines designed to improve the Findability, Accessibility, Interoperability, and Reusability of digital assets. These principles emphasize the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals.

Full implementation of every aspect of the FAIR principles is difficult for most research projects, especially smaller ones. Depending on whether a new digital presentation platform needs to be built or research data needs to be published, the scale of possible hurdles and potential costs is significant. But most researchers want to, and often need to, follow these guidelines to achieve the best scientific outcomes.

Integrating research projects with Wikimedia's ecosystem allows research teams to benefit from adhering to FAIR data standards and to make the research data consumable in many different formats. The specific results depend on the chosen integration model, which can range from simply linking the data sets to fully using multiple Wikimedia projects to host structured data, media files and other formats about the research.

Wikimedia and FAIR[edit]

Logos of the various Wikimedia projects
Wikimedia's projects

The Wikimedia Movement's goal is to become the essential infrastructure of the ecosystem of free knowledge, and accordingly implementing the FAIR principles across all projects is a core goal. As WikiFAIR mostly focuses on the semantic knowledge base Wikidata and the free media repository (Wikimedia) Commons, we will explore those two in particular. Most of the details about Wikidata also apply to Wikibase, the open source software powering it, which can also be run independently in a federated system.

Evaluating these projects against the FAIR definitions from GO FAIR:

Findable[edit]

All Wikimedia projects use unique and persistent identifiers (QIDs for Wikidata, file names for Commons) as required in F1, with descriptive metadata for F2. Every entry includes its identifier (F3) and is indexed by multiple search engines (F4).
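These identifiers resolve to stable URIs. As a minimal sketch (the QID and file name below are illustrative placeholders), such persistent identifiers can be constructed like this:

```python
# Building persistent identifiers for Wikimedia resources (F1).
# Q42 and the file name are illustrative placeholders.

def wikidata_concept_uri(qid: str) -> str:
    """Canonical concept URI of a Wikidata item."""
    return f"http://www.wikidata.org/entity/{qid}"

def commons_file_url(filename: str) -> str:
    """File page URL of a Commons file, derived from its unique name."""
    return "https://commons.wikimedia.org/wiki/File:" + filename.replace(" ", "_")

print(wikidata_concept_uri("Q42"))
print(commons_file_url("Example image.jpg"))
```

Dereferencing the concept URI redirects to the item's data, which makes the identifier both persistent and actionable.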

Accessible[edit]

Wikimedia sites offer free content ranging from CC BY-SA (often found on Commons) to public domain (all of Wikidata), retrievable by identifier with standardized, open protocols (A1, A1.1). There is also a Single User Login system spanning multiple projects (A1.2).

Metadata about deleted or removed objects can be retrieved via deletion logs, as required by A2.

Interoperable[edit]

Structured data on Wikidata is presented through a completely multilingual user interface and is available in the most common and FAIR (I2) export formats such as RDF, JSON, TTL and more (I1). The dataset is interlinked with many other authority databases (I3).
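These export formats can be retrieved per item through Special:EntityData. The following sketch only builds the URLs; Q42 is a placeholder item:

```python
# Each Wikidata item is exported in several serializations (I1) via
# Special:EntityData; this sketch constructs the corresponding URLs.

FORMATS = ("json", "ttl", "rdf", "nt")

def entity_data_url(qid: str, fmt: str = "json") -> str:
    """URL of one item's data in the requested serialization."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    return f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.{fmt}"

for fmt in FORMATS:
    print(entity_data_url("Q42", fmt))
```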

Map of the world showing the distribution of Wikidata items
Map of places in Wikidata (2023)

Reusable[edit]

The complete Wikidata space features well over 100 million items containing 1.5 billion triples, created from a pool of over 10,000 properties (R1). All data is released under clear and open licensing (R1.1, see #Accessible), and a referencing system is applied across all Wikimedia projects, requiring citations and provenance information in most circumstances (R1.2).

The data model implements or approximates various community standards by linking properties to their relevant equivalents in other standards (R1.3).

Additional Features of Integration[edit]

Aside from helping with FAIR requirements, integrating data with the broader Wikimedia community can help with various aspects of maintaining, visualizing or simply hosting your data set.

Community[edit]

The community can help with long-term quality management through crowdsourced improvements to the data. Thousands of volunteers and bots work through the data each day, patrolling changes, adding or removing statements, and performing other helpful tasks. This can mean that a newly created Wikidata item from your project receives a lifetime of maintenance as part of the wider dataset.

The community can also help with questions of data modeling, writing SPARQL queries or improving related properties. On Commons, volunteers can categorize uploaded media and add new depicts statements, further enriching the data.

Allowing the community access can also help fulfil the expanded set of criteria in the CARE Principles for Indigenous Data Governance, which require access for a diverse set of related groups, some of which are already represented in the Wikimedia Movement.

5-Star Open Data[edit]

picture showing the 5-star ratings of the scheme
5-Star Open Data deployment scheme

A concept different from FAIR, but with similar intentions, is the 5-Star Open Data system proposed by Tim Berners-Lee. It categorizes data by qualities of accessibility and machine readability. With the addition of structured data on Commons, both it and Wikidata offer the highest level in this scheme, and by linking or integrating data with them, research projects can do the same with little effort.

Software[edit]

Logo of Mediawiki
Logo of Mediawiki

From a software point of view, both MediaWiki and Wikibase are well-tested, open source tools that offer full versioning, multi-user editing and caching for scalability. They can accommodate editor teams of most sizes and bring a well-developed extension system for further customization.

External Tools[edit]

Data on Wikimedia projects can be processed with a plethora of diverse external tools, allowing for more possibilities than many other platforms. For an overview see Wikidata:Tools and Commons:Tools or visit Toolhub.

Possible Caveats[edit]

Before continuing further, please be advised that there could be downsides to some parts of this process.

(Partial) loss of control[edit]

Wikimedia projects are generally open to everyone to edit, which can be very appealing. But it's also possible to get your edits reverted, or to have your data vandalized by malicious editors.

If something like this happens, adhere to community guidelines or ask for help if anything remains unclear. Depending on your research goals and methods, you may want to monitor your data after upload. See WikiFAIR#Monitoring Uploaded Data for more information.

Alternatively it could be helpful to run your own Wikibase instance to better control editing access to the data set, see #Running your own Knowledge Graph with Wikibase.

Sourced statements requirement[edit]

Citing sources is generally required for your data to be accepted into any Wikimedia project (more information on that here). This can be a problem for original research or hard-to-source facts. If the source is a paper your team is going to write, it can also be cited, but the problem remains for some special cases. Asking in the Project Chat can be helpful for questions regarding sources.

Integrating Structured Data in Wikidata[edit]

Wikidata's logo
Logo of Wikidata

As stated on the Main Page of Wikidata:

Wikidata is a free and open knowledge base that can be read and edited by both humans and machines.

This already suggests some ways to proceed, either by adding the data by hand or by using tools to modify Wikidata on a bigger scale.

Inclusion Criteria[edit]

Most kinds of structured data fit into Wikidata, which allows for great freedom in how you model your contribution. There are also (deliberately broad) rules about inclusion in the scope of the project; for researchers, the second is the most important:

[The item] refers to an instance of a clearly identifiable conceptual or material entity that can be described using serious and publicly available references.

Having publicly available sources for the added data is key to the future of Wikidata as a quality source, and quite easy to implement, see #Technical Process.

Also, please apply a measure of diligence to how you contribute to the dataset. If your data only adds some statements to various items, there shouldn't be any problem, but be careful when creating new items unless you are sure they actually contribute relevant knowledge.

Technical Process[edit]

To learn about the process of adding and editing existing items, follow a tour on Wikidata:Tours. There are many routes for uploading data with external tools; some popular options are OpenRefine, bots or QuickStatements.
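As a hedged sketch of what a QuickStatements (V1) batch looks like, the snippet below generates tab-separated commands; the item, properties and values are illustrative examples only:

```python
# Generating QuickStatements (V1) commands from tabular data: each command
# is an item, a property and a value separated by TAB characters.
# Q42/P31/P569 are illustrative; use the properties your data model needs.

def qs_statement(item: str, prop: str, value: str) -> str:
    return "\t".join([item, prop, value])

rows = [
    ("Q42", "P31", "Q5"),                          # instance of: human
    ("Q42", "P569", "+1952-03-11T00:00:00Z/11"),   # date of birth, day precision
]
batch = "\n".join(qs_statement(*row) for row in rows)
print(batch)
```

The resulting text can be pasted into the QuickStatements batch interface for review before executing.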

Attribution[edit]

To collect and show the contributions made by a project, it can be useful to create a project user account. Editing with the project account lets you identify contributions as belonging to the project. Linking and referencing Wikidata statements back to the project database or publications can also be helpful, especially via the relevant properties.

Monitoring Uploaded Data[edit]

You can monitor uploaded data via watchlists or maintenance queries. See for example the WikiProject Chess maintenance queries.
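As a sketch of such a maintenance query, the snippet below builds a request against the Wikidata Query Service that lists items linking to a project website via P973 (described at URL); the domain is a placeholder, and the exact pattern depends on how your project's items are linked:

```python
# Building a maintenance query for the Wikidata Query Service (SPARQL over
# HTTP GET). "example.org" stands in for your project's domain.
from urllib.parse import urlencode

SPARQL = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P973 ?url .
  FILTER(CONTAINS(STR(?url), "example.org"))
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
"""

endpoint = "https://query.wikidata.org/sparql"
request_url = endpoint + "?" + urlencode({"query": SPARQL, "format": "json"})
print(request_url[:60])
```

Saving such queries on a WikiProject page lets the whole team (and the community) re-run them regularly.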

Creating a Knowledge Graph with Wikibase[edit]

a logo
Logo of Wikibase

If you prefer to host the data partially or completely outside of Wikidata but still want a similar feature set, you might want to use Wikibase. It is the same open source software Wikidata uses, but completely under your control. With it, you can model the data however you see fit while still using some of the ecosystem and tools described on this page.

Technical Process[edit]

Wikibase can be set up in two major ways: as a hosted service on Wikibase.cloud, or as an on-premises installation using Docker with Wikibase Suite. Here is a guide to help you choose.

Free Media on Wikimedia Commons[edit]

Commons logo
Logo of Wikimedia Commons

Wikimedia Commons, sometimes referred to simply as "Commons", is a media repository of free-to-use images, sounds, videos and other media.

Inclusion Criteria[edit]

Many types of files can be hosted on Wikimedia Commons if they carry applicable licensing, CC BY-SA or more permissive; see Commons:Licensing.

Not every kind of data is allowed on Commons, even if the licensing is correct. The uploaded data should be "educational", which is normally not a problem for research data, and should fit into the project scope of Wikimedia Commons.

If you are unsure whether your media is suitable for upload to Commons, you can ask here. There are also other projects that can host freely licensed media, e.g. Archive.org.

Technical Process[edit]

For uploading images to Wikimedia Commons, follow this tutorial. There is also this list showing the supported file types; nearly every relevant open format is supported.
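For scripted uploads, the MediaWiki action API offers an `upload` action. The sketch below only assembles the request parameters; actually sending the request additionally requires an authenticated session, a CSRF token and the file contents, and the file name and description here are placeholders:

```python
# Assembling parameters for a POST to https://commons.wikimedia.org/w/api.php
# (action=upload). A real upload also needs authentication, a valid CSRF
# token and the file bytes as a multipart upload.

def build_upload_params(filename: str, page_text: str, token: str) -> dict:
    return {
        "action": "upload",
        "filename": filename,
        "text": page_text,   # file page wikitext: license, source, author
        "token": token,
        "format": "json",
    }

params = build_upload_params(
    "Example_research_chart.png",
    "{{Information|description=...|source=...|author=...}}",
    "CSRF-TOKEN-HERE",
)
print(sorted(params))
```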

Attribution[edit]

There are multiple ways to attribute the research project or organization, depending on the platform. Creating a User Account can help with attribution too.

Media files on Commons can be labeled (via structured data on Commons and traditional wikitext) with a creator property. Alternatively, if the files were not created by the researchers themselves, templates indicating the origin and context of the upload can be added to the file page; the user name in the file history or the source statement is always a way to reference the project account.

Monitoring Uploaded Data[edit]

You can monitor uploaded data via watchlists or maintenance queries. See for example the WikiProject Chess maintenance queries.

Other Wikimedia Projects[edit]

In some cases, integration with other Wikimedia projects could also be beneficial. Please note that those projects, unlike Commons or Wikidata, are generally not multilingual, so you should use the appropriate language version.

Wikipedia[edit]

Citing your own research in Wikipedia can be allowed in some cases, but overdoing it should be avoided. Following this guide for academics and researchers is generally a good start.

Another logo
Logo of Wikisource

Wikisource[edit]

Adding new source texts or improving translation of an existing one would be possible ways to use this Wikimedia project. Using Wikisource as a platform to publish historical or non-copyrighted texts allows access to features like:

  • Different export formats (EPUB and MOBI for eReaders, PDF, RTF and more)
  • Configurable reader
  • The Wikisource Community for proofreading of translations and transliterations
  • A correlated Wikidata item to the text
Another logo
Logo of Wikispecies

Wikispecies[edit]

Wikispecies could be interesting for adding, or linking to, specific species.

A logo
Logo of Wiktionary

Wiktionary[edit]

For lexicographers, adding new dictionaries as sources, or adding new senses, could be a worthwhile endeavor. Additionally, Wikidata has supported Lexemes since 2018 and is an easy platform to (re)use, since its licensing is public domain.

Integration Models[edit]

Depending on your requirements, the integration into the Wikimedia ecosystem can vary drastically, from simple links or statements to most of the research data being hosted there. Deeper levels of integration require more effort to achieve.

1: Project documentation[edit]

Logo of a project, uploaded to Commons

Every research project is relevant for inclusion in Wikidata, so an easy first step is to create or improve a Wikidata item for your research project. If you don't find your project listed in Wikidata, you can create a new item yourself and follow the model from this WikiProject, which is also a place to ask for help with the process. This item could include:

  • Involved research institutions and researchers
  • The project logo on Commons
  • Additional items for your publications

Integration results[edit]

The different elements of your research are now easier to find and link, both in Wikimedia projects and on the Internet in general. This allows the use of services like the Wikidata Query Service to show information about the project, e.g. this query, which displays a graph of connected information about a project.

Furthermore, there is a specialized service called Scholia that displays relevant information about researchers or projects from Wikidata. See for example this page on the Technical University of Denmark. There are also some tools that can highlight different aspects of a single Wikidata item.

Example[edit]

VerbaAlpina, a research project on languages in the Alpine region, is described as Wikidata item Q66817486.

2: Linking to a separate research database[edit]

Connect your research database to Wikidata by matching Wikidata items to entries in your database and adding links to one or both. This link could be modeled through specific Wikidata statements, e.g. by using P973 (described at URL), or by proposing a project-related property.
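A hedged sketch of what such a link could look like as a QuickStatements (V1) command, with a reference attached; the QID and URLs are placeholders, and S854 is the reference form of P854 (reference URL):

```python
# One QuickStatements (V1) command: link an item to a database entry via
# P973 (described at URL) and reference the claim with S854 (reference URL).
# String values are wrapped in double quotes in V1 syntax.

def qs_link(item: str, url: str, source_url: str) -> str:
    fields = [item, "P973", f'"{url}"', "S854", f'"{source_url}"']
    return "\t".join(fields)

command = qs_link("Q12345", "https://example.org/entry/123",
                  "https://example.org/about")
print(command)
```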

Adding a property allows for more long-term quality control through tools like the constraint report system, or pregenerated queries for quality violations.

Integration Results:[edit]

  • Improved visibility of the original database
  • Ability to query your own data with the additional data from Wikidata

Example[edit]

Deckenmalerei.eu links most entries to Wikidata

3: Linking to your own MediaWiki[edit]

Especially interesting are projects that use MediaWiki as their base software, since it allows for interesting crossovers. MediaWiki can be used as a simple text and media base for people to edit, while more complex templates pull data out of Wikidata to display in local infoboxes.

Integration Results:[edit]

  • Easy to use interface for editing multimedia content with text and images
  • Possibility to use the extensive Wikipedia template ecosystem
  • Showing live metadata from Wikidata in infoboxes using Wikibase Client
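The infobox integration above can be sketched in wikitext, assuming the Wikibase Client extension is configured against Wikidata; the property and item IDs below are placeholders:

```wikitext
<!-- Show a raw Wikidata value on a local page; P571 = inception -->
Founded: {{#property:P571|from=Q52}}

<!-- #statements renders the value with formatting (links, images, units) -->
Logo: {{#statements:P154|from=Q52}}
```

Both parser functions come with Wikibase Client; local templates can wrap them so infoboxes fall back to local values when a Wikidata statement is missing.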

Example[edit]

WIP

Connected nodes showing different Wikimedia projects and some independent Wikibases
Vision of the Linked Open Data Web

4: Linking to your own Wikibase[edit]

Alternatively, following the Vision of a Linked Open Data future, Wikibase could be run as a completely independent instance, or hosted on Wikibase Cloud, while federating with Wikidata via the related property.

Integration Results:[edit]

  • Use a semantic database, but with complete control over every aspect
  • Model data in different ways than Wikidata

Example[edit]

FactGrid has its own property in Wikidata while being completely independent of Wikimedia as a whole, and the two knowledge bases are linked via that property.

5: Enriching data with your research[edit]

After matching with your data, the next step could be to enhance Wikimedia projects by adding data from your project, while citing your research as a source.

If your project doesn't provide an online database, you could simply add the data and use your publication as a reference: instead of hosting your own database, publish your research table as a file and reference it.

For mass-uploading metadata or media files there are multiple routes, e.g. OpenRefine.

6: Integrating Media on Commons[edit]

Add images, videos, sounds and PDFs to Wikimedia Commons, including structured data for those media files. This can be more troublesome, as copyright compatibility needs to be checked, but it removes the complexity of hosting those files yourself.

Because the project is integrated with Wikipedia, its capabilities and uptime are production-ready, and it is very well integrated with other large hubs for media data; e.g. it is completely indexed by popular search engines.

Additionally, structured data on Commons allows interesting uses for metadata tagging.

7: Even more?[edit]

Let's discuss!