This project aims to explore what a software solution that allows relations to be explicitly formalized, such as Wikibase, could bring to meet the needs of wiktionarian projects in terms of cohesively structured data.
The various linguistic versions of Wiktionary collect a lot of redundant information; that is, they hold much of the same information without actually sharing it. Even within a given instance, a piece of information like a quotation or a definition on one page is often fully duplicated by hand onto another page, and nothing prevents these duplicates from diverging over time. Furthermore, the data is not structured in a fashion that eases querying at a fine level of granularity or simplifies cross-referencing.
On the other hand, projects like Wikidata and its Lexicographical data follow a path toward more cohesively structured data that addresses these points. But currently they do not provide much to leverage for tackling Wiktionary-specific needs, as they are oriented toward very different goals and priorities. Furthermore, since Wiktionary is licensed under CC BY-SA while Wikidata uses CC0, any significant transfer of information from Wiktionary to Wikidata and its Lexeme extension is legally impossible.
Of course, Wiktionary as it is does have many conveniences, like the flexibility of structuring data through simple wikicode, templates, modules and so on. It has several solid linguistic communities with over a decade of common work, and an international user group, the Tremendous Wiktionary User Group (TWUG).
No obvious quick path is known to get the best out of these two approaches, so this project doesn't come with any grand scheme to reach that goal. Instead, it will proceed in small steps: experiment, gather feedback, improve, repeat.
This project specifically aims to help wiktionarian communities, so contributors from its different linguistic versions would be warmly welcome. A simple hello on the talk page would already be greatly appreciated, and more thorough comments are encouraged.
We also specifically need people with:
- skills to spread the word both within Wikimedia circles and beyond (communication facilitators)
- will to formalize lexicological/lexicographic data models (ontologist)
- interest in developing MediaWiki/Wikibase extensions (developers)
- experience with Wikibase deployment and maintenance, especially of tools in Wikimedia Cloud Services (sysops)
The project currently focuses on setting up a Wikibase instance on Wikimedia Cloud Services (WCS) and filling it with some quotes imported from wiktionarian projects. Quantitatively, it is not expected to go beyond importing a few thousand items at most, if bots are to be used.
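As a rough sketch of what such a bot import could involve, the following builds the JSON payload for the Wikibase API's wbeditentity action. The property ID P1 and the use of a monolingual-text value are pure assumptions for illustration; the actual quotation model is worked out later in the project.

```python
import json

# Hypothetical sketch: build the payload a bot could send to a Wikibase
# instance's wbeditentity API action to create one quotation item.
# "P1" is an invented property ID, not part of any agreed model.

def build_quotation_payload(text, language, label):
    """Assemble the JSON data for a new item holding one quotation."""
    return {
        "labels": {language: {"language": language, "value": label}},
        "claims": {
            "P1": [{  # hypothetical "quotation text" property
                "mainsnak": {
                    "snaktype": "value",
                    "property": "P1",
                    "datavalue": {
                        "value": {"text": text, "language": language},
                        "type": "monolingualtext",
                    },
                },
                "type": "statement",
                "rank": "normal",
            }],
        },
    }

payload = build_quotation_payload(
    "All human beings are born free and equal in dignity and rights.",
    "en",
    "UDHR, article 1",
)
# A bot would POST this JSON as the "data" parameter of
# action=wbeditentity&new=item, along with an edit token.
print(json.dumps(payload, indent=2))
```

The payload-building step is kept as a pure function so it can be reviewed and tested independently of any live instance.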
Please note that this first experiment will deliberately not include material such as definitions, grammatical classes, and so on. This choice of focusing on quotations is made to get something going, to build a team with experience in deploying and maintaining a Wikibase instance, and to transfer some wiktionarian data into it. That way, the whole project won't be stuck on the data modeling part before anything browsable can be shown. This approach nonetheless already requires a proper model for quotations. Luckily, the Structured Wikiquote project has already paved the way in this regard.
This section gathers some data on what was already done and what is expected over the course of this project.
- Done: data gathering about the possibility to host a Wikibase in WCS during the Wikimedia Hackathon 2021
- State of the art
- Done: find whether other initiatives have already made something around Wikibase and quotations
- Done: fill the See also section below with related links
- Structuring the project
- Team building and community involvement
- making wikimedians aware of the project
- on wiki calls to join the project
- spread the word on instant messaging platforms and social media
- determine and announce needed skills and resources
- Wikibase instance
- deployment with the required ontology to test the import of quotes extracted from wiktionarian projects
- Lexical data model
- facilitate conversations around what is needed and ideas to meet these requirements
- work out at least one specific proposal, build consensus around it, and refine it into a data model
- implement the data model and deploy it on the Wikibase instance
- test the model and specify which data should be imported
- call for more tests from community
- Assessment of obtained results, determination of next steps
- https://www.wbstack.com/ allows launching a MediaWiki/Wikibase instance very quickly, which should be great for our first drafts
- The WBStack Telegram group has been joined, and some discussion about in situ has started there.
- GreenReapder indicated that to set a specific licence and allow importing CC-BY-SA-3.0 information, it will be necessary to use the front page, sidebar and/or MediaWiki:Editnotice-[namespaceID] to announce it, rather than MediaWiki:Copyright, due to current permission restrictions on the platform. More information is given in Allow users to alter the sidebar · Issue #52 · wbstack/mediawiki
- https://www.openresearch.org/mediawiki/index.php?title=ISWC&action=formedit showcases a very different approach, not relying on Wikibase, but one which does bring more cohesive data structures with the ability to edit fields in forms
- https://www.mediawiki.org/wiki/Help:Tabular_Data eases the use of cohesive tabular data within MediaWiki without relying on the Wikibase backend: the data is stored as JSON pages and accessed from Scribunto modules. That might be an intermediate way to propose data relations which are more explicitly encoded and digitally tractable, without requiring any special access to a Wikibase instance.
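To illustrate that intermediate way, here is roughly the JSON shape of a tabular data page, built as a Python dict; the lemma/translation columns are invented for the example and are not an actual proposal.

```python
import json

# Rough sketch of the JSON shape of a MediaWiki tabular data page
# (a Data:*.tab page). On Wikimedia Commons such pages must be CC0,
# which would matter for any CC BY-SA wiktionarian content.
page = {
    "license": "CC0-1.0",
    "description": {"en": "Hypothetical lemma/translation pairs"},
    "schema": {
        "fields": [
            {"name": "lemma", "type": "string"},
            {"name": "translation", "type": "string"},
        ]
    },
    "data": [
        ["chat", "cat"],
        ["chien", "dog"],
    ],
}
print(json.dumps(page, indent=2))
```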
Csisc and psychoslave talked for 40 minutes, as an instance was finally set up for prototyping ideas around Trans Situ
- https://trans-situ.wiki.opencura.com is the dedicated instance
- psychoslave presented a UML use case that underlies the linguistic perspective of this project, and an admittedly rather cluttered class diagram produced to support such a view, while expressing that something far simpler should be targeted as a first prototype
- Csisc proposed to reduce the model to two classes and will draft something in the next few days based on that idea. The two classes discussed were:
- 1. Relation
- 2. Utterance
- matter: text, for example "All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood."
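As a very rough sketch of this two-class idea, only the "matter" attribute of Utterance appears in the notes above; every other field name and the exact shape of Relation are assumptions for illustration.

```python
from dataclasses import dataclass

# Sketch of the two-class model discussed in the meeting; only
# Utterance.matter comes from the notes, the rest is hypothetical.

@dataclass
class Utterance:
    matter: str            # the quoted text itself
    language: str = "und"  # assumed: a language tag for the utterance

@dataclass
class Relation:
    kind: str              # assumed: e.g. "translation-of", "quotes"
    subject: Utterance
    object: Utterance

udhr = Utterance(
    matter="All human beings are born free and equal in dignity and rights.",
    language="en",
)
french = Utterance(
    matter="Tous les êtres humains naissent libres et égaux "
           "en dignité et en droits.",
    language="fr",
)
link = Relation(kind="translation-of", subject=french, object=udhr)
```

The point of the reduction is that almost everything (translations, citations, derivations) can be expressed as a typed Relation between Utterances, postponing a richer class hierarchy.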
- psychoslave will ask what can be done to handle licensing properly and switch the wiki statements to CC BY-SA instead of the current CC0 labels
Data model proposals
Several data models (ontologies) and approaches might be envisioned to meet the requirements of in situ. Two main roads are already identified:
- a single Wikibase instance to be used by all Wiktionary versions and other projects, hence called trans situ;
- one Wikibase instance for each linguistic version of Wiktionary, hence called per situ.
Other approaches are still warmly welcome for now.
- Cognitive NLP: The wondrous challenge of human language, Dave Raggett, ERCIM, 11 January 2021
- Computers with Common Sense, Dave Raggett, 2010/03/05
- Web of Thought: The logical next step after the Web of Things, Dave Raggett, 18 October 2014
- Vers une Wikibase dédiée aux Wiktionnaires (Towards a Wikibase dedicated to the Wiktionaries), 31/05/2021
- Requests for comment/Cross-wiki management of Wiktionary headwords using a Wikidata-like approach
- Structured Wikiquote
- Wiktionary/Tremendous Wiktionary User Group
- Web Ontology Language
- Wikimedia Cloud Services
- Csisc (talk): volunteer. Interested in getting involved in Wikibase Installation and Lexical Data Modelling. 17:52, 31 May 2021 (UTC)
- Psychoslave (talk) : coordinator, creator and volunteer. Interested in Wikibase Installation, Lexical Data Modelling, roadmap and team building. 10:14, 1 June 2021 (UTC)
- Vis M (talk) Volunteer. Interested in Lexemes project and hoping for a structured Wiktionary in future. 07:58, 1 September 2021 (UTC)