Grants:PEG/Interglider.ORG/Wiktionary Meets Matica Srpska
At the moment, the Wikimedia Community in Serbia has a unique opportunity to seize momentum by establishing a strategic partnership with Matica Srpska, the oldest cultural institution in the region, thus paving the path to opening and digitizing copious material of the highest quality and accuracy. Matica Srpska is a highly valued cultural and scientific institution with globally recognized expertise in lexicography. Such a partnership would definitely contribute to an increase in the quality of Wikimedia content, but also to the credibility of Wikimedia projects and community, and it would advance the chances of cooperation with similar institutions that are still reserved towards the idea of free licensing and opening up their content.
In this pilot phase of cooperation, Matica Srpska is offering digitization and free licensing of two dictionaries - a Serbian ornithological dictionary and the Dictionary of Serbian speeches of Vojvodina. The ornithological dictionary is the first of its kind, and due to its structure, it provides us with an opportunity to develop and test a model of cooperation across Wiktionaries, including development of software that should facilitate parsing dictionaries and a more formalized structure of Wiktionary entries, if agreed upon within particular Wiktionary communities.
The overall strategic goal of the project is to improve the quality of Wikimedia projects, and to a lesser extent to improve participation, while the specific project goals are the following:
- Increase support for the Open Knowledge / Free Content movement by establishing a long-term strategic partnership with the venerable cultural institution Matica Srpska
- Increase the quality, accuracy and volume of Serbian Wiktionary by 180% through digitization of the Dictionary of Serbian speeches of Vojvodina
- Increase the quality of Wiktionaries through the digitization of the Serbian ornithological dictionary, and its use to establish and test a potential model for future development of cooperation across Wiktionaries
- Reinvigorate communities around Wiktionaries through two targeted mobilization campaigns
Project activities will develop in the following phases:
Initial phase (Pre-grant activities)
- Establishing contact with Matica Srpska
- Organizing communication with Matica Srpska, including two meetings in Novi Sad
- Recruiting a potential programmer
- Recruiting a few volunteers to contribute to the organizational and communication parts of the project
- Preparing the list of lexicography terms for translations.
Q1 – Software Development
- Assessing the potential issues in the digitization, and the challenges for software development
- Opening a page on Meta Wiki to document the process, the challenges and the issues regarding software development
- Constantly mobilizing Wiktionary communities to participate in the process using pages on MetaWiki, in order to gain valuable input on the desired format of Wiktionary entries
- Mobilizing community to translate the list of lexicography terms
- Developing software
Q2, Q3, Q4
- Input of the dialects of Vojvodina dictionary
- Input of the dialects of Vojvodina dictionary addendum
- Input of the Serbian ornithological dictionary
Activities in the last three phases include:
- Upload of entries to Serbian Wiktionary
- Upload of illustrations to Commons
- Maintaining discussion on meta pages
- Documenting process on meta pages
- Preparing material for mobilizing campaigns
- Mobilizing communities to participate in the project
- Tweaking the parsing software according to community suggestions
- Developing recnici.maticasrpska.org.rs as a platform for Matica Srpska lexicography activities
- Developing dictionaries.interglider.org as an experimental platform for work on dictionaries using the parsing software tool and the Wiktionary model
Cooperation with Matica Srpska
Detailed information about the beginning of cooperation with Matica Srpska - the oldest cultural institution in Serbia and the first of its kind in Slavic countries - can be found on the Wikimedia blog.
Matica Srpska is the first Serbian literary, cultural and scientific society. It was founded in 1826 in Pest, during the liberation of Serbia from centuries of Ottoman occupation, and upon the strengthening of the awareness of the need to fully include Serbian people in modern European trends, and to maintain their cultural identity at the same time. For that purpose, Matica developed a rich publishing activity. The basis of this activity was the famous Letopis (Chronicle), first published in 1824, and currently the oldest continuously published cultural and scientific magazine in the world. During the 1840s, Matica developed the conditions required for scientific work. It was then that a library containing literary and manuscript collections from various scientific fields was formed.
Today, Matica Srpska has almost 2,000 associates. They take part in dozens of scientific and development projects within the Department of Literature and Language, Department of Lexicography, Department of Social Sciences, Department of Natural Sciences, Department of Fine Arts, Department of Performance Arts and Music, and the Manuscript Department. The Matica Srpska Library has over 3,500,000 books, and the Matica Srpska Gallery houses a rich collection of Serbian eighteenth and nineteenth century paintings. The Publishing Center continues the tradition of the former Matica Srpska Publishing Company, whose editions were for decades recognizable throughout Southeastern Europe by the emblem MS, which signified high-quality and carefully selected literature from various fields. Matica Srpska annually awards worthy accomplishments in various fields of culture and science.
Matica Srpska has been an example to many Slavic nations. Based on this model the following institutions were established: Czech Matica in 1831; Illyrian Matica in 1842 (renamed Matica Hrvatska in 1874); Matica Lužičkosrpska in 1847; Halych-Russian Matica in Lviv in 1848; Moravian Matica in 1849; Matica Dalmatinska in Zadar in 1861; Slovak Matica in 1863; Slovenian Matica in 1864; Matica Opava in 1877; Matica in the Teschen Princedom in 1898 (from which Silesian Matica came to be in 1968); Polish Matica in Lvov in 1882; Educational Matica in the Teschen Princedom in 1885; Educational Matica in Warsaw in 1905; Bulgarian Matica in Constantinople in 1909; and the new Bulgarian Matica in 1989.
Besides two initial meetings (July and December) and ongoing communication, Matica Srpska's input consists of two dictionaries to be digitized: the Serbian ornithological dictionary (370 primary entries) and the dialects of Vojvodina Dictionary (31,309 primary entries). Throughout the digitization process, MS experts will be involved and available for expert consultations.
Translating the list of lexicography terms
An important introductory activity would be preparing the list of lexicography terms and organizing its translation. There would be approximately 100 terms. For this task we would use a MediaWiki instance with the Wikidata extension, but all of the terms would be inserted into Wiktionaries as well (making approximately 10,000 new entries per Wiktionary, assuming that the terminology is translated into 100 languages).
Importing Serbian ornithological dictionary into Wikimedia projects
The Serbian ornithological dictionary encompasses all local names of birds living in the Serbian-speaking territory. All names are specified under the appropriate Latin name in accordance with the contemporary classification system.
The aim of this dictionary is to record all local names of birds and their determinations within the standard Serbian language, and to provide recommendations for standardizing the most frequent bird names in Serbian linguistics, but also in ornithology and zoology in general.
The tasks include the overview and classification of existing linguistic and ornithological literature, excerption of the material, sorting the material in alphabetical order, identification of the species, processing the entries, the creation of a registry, and the creation of illustrations. The final product is the creation of the central registry of Serbian names of birds.
Each entry will also contain an illustration. 370 illustrations of the birds living on Serbian speaking territory would be uploaded to Commons.
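The entry structure described above (Latin name, all Serbian names, a recommended name, an illustration) could be modelled roughly as follows. This is only a sketch: the class name `BirdEntry`, its field names, the helper `missing_translations`, the sample words and the Commons file name are illustrative assumptions, not the actual format of the dictionary or of the parsing software.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BirdEntry:
    """One primary entry, keyed by the Latin (scientific) name."""
    latin_name: str
    local_names: List[str]      # all recorded Serbian names
    recommended_name: str       # the name proposed as the norm
    illustration: str           # hypothetical Commons file name
    translations: Dict[str, str] = field(default_factory=dict)


def missing_translations(entry: BirdEntry, language_codes: List[str]) -> List[str]:
    """Return the language codes that still lack a translation for this entry."""
    return [code for code in language_codes if code not in entry.translations]


# Illustrative data only; not taken from the dictionary itself.
swallow = BirdEntry(
    latin_name="Hirundo rustica",
    local_names=["ласта", "ластавица"],
    recommended_name="ласта",
    illustration="Hirundo_rustica_placeholder.png",
    translations={"en": "barn swallow"},
)
todo = missing_translations(swallow, ["en", "fr", "fi"])
```

Keying each record by the Latin name keeps the entry language-neutral, which is what would make a list of 370 such records straightforward to hand to translators in other Wiktionary communities.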
Importing The Dictionary of Serbian speeches of Vojvodina into Wikimedia projects
The Dictionary of Serbian speeches of Vojvodina encompasses dialects and includes material from several Serbian speaking areas in the territory of Hungary and Romania. It contains 31,309 primary entries and more than 150,000 samples of live speech and selected written sources, from all localities across Vojvodina.
Parsing this dictionary into the Wiktionary format would demand expert consultations throughout the process and Matica Srpska will provide its experts for this task.
Mobilizing community across Wiktionaries
Having in mind the structure of the Serbian ornithological dictionary and the relatively limited number of primary entries (370), we would like to try to mobilize Wiktionary communities to translate those entries to their local languages. We plan to run two mobilization campaigns in the second half of the project, in order to promote the ornithological dictionary and to ask people to translate it. In order to facilitate the process, we will prepare the page with terms from ornithological dictionary and the accompanying material in a manner utilized in current translation projects, but we would add advocating campaigns every three months. Campaigns would be run by Interglider.COM community managers and Wikimedia volunteers.
Some Wiktionary Structure Issues to be Discussed
The following links (Wiktionary future, Wikidata:Wiktionary, Wikidata/Notes/Future#Wiktionary) contain information regarding various challenges related to the structure of Wiktionary, potential projects that could resolve some issues (OmegaWiki) and input from community members interested in this issue.
Regular philological dictionaries use lexemes of a particular meaning for their entries. Thus, for example, the noun "process" will be one entry in its case-neutral form (usually nominative singular), while the verb "process" will be another entry (usually in the infinitive, the present form or something similar; in English, the same form serves as both the short infinitive of "to process" and the present tense).
Unlike in philological dictionaries, an entry (page) in Wiktionary is form-based. Thus, one page will be used for both (the nominative singular of) the noun "process" and the present/short infinitive of the verb "process".
However, Wiktionary will also have pages for the nominative plural of the noun and third person singular present of the verb -- "processes" --, as well as present ("processing") and past participle ("processed") forms of the verb.
In languages where analytic processes dominate, such as English, this does not make the number of possible Wiktionary entries much different from the number of possible entries in a philological dictionary. In languages where synthetic processes dominate, such as Serbian, the difference is significant.
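The difference between lexeme-based and form-based entries can be sketched with the "process" example. The structure below is a simplified assumption for illustration (the names `LEXEMES` and `pages_by_form` are hypothetical); it is not how Wiktionary stores its pages.

```python
from collections import defaultdict

# Lexeme-based view: one dictionary entry per (lemma, part of speech),
# each listing the written forms mentioned in the text.
LEXEMES = {
    ("process", "noun"): ["process", "processes"],
    ("process", "verb"): ["process", "processes", "processing", "processed"],
}


def pages_by_form(lexemes):
    """Regroup lexemes by written form, the way Wiktionary pages are keyed."""
    pages = defaultdict(set)
    for lexeme, forms in lexemes.items():
        for form in forms:
            pages[form].add(lexeme)
    return dict(pages)


pages = pages_by_form(LEXEMES)
# Two philological entries become four form-based pages;
# the pages "process" and "processes" each serve both noun and verb.
```

For a language with rich inflection, the same regrouping produces many more pages per lexeme, which is the source of the large entry counts discussed in this section.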
Our primary numbers are based on the most conservative approach to adding content: only the main forms from the dictionaries would be added. However, it is possible to build an inflection generator for the Serbian language, which would add all of the forms to the Wiktionaries. In that case, it is fair to say that the number of entries would increase twentyfold. Serbian nouns have seven cases and two grammatical numbers, while Serbian verbs and adjectives have more than 50 grammatical forms each.
Besides that, we are getting a proper philological dictionary of the Serbian language, which means that the entries will come with the diacritics used for marking the stress and tone of words. As a result, most dictionary entries will have two entries in Wiktionary: one with diacritics and one without them.
Finally, the Serbian language uses two alphabets: Cyrillic and Latin. While two differently written forms are not valid entries on most Wikipedia projects, they are valid entries in Wiktionaries.
If all of the forms are counted, and assuming that the 150,000 secondary references of the Dictionary of Serbian speeches of Vojvodina are actually standard words in Serbian, the total would be approximately 180,000*20*2*2=14,400,000 entries for the Serbian Wiktionary.
However, this kind of addition should be handled carefully. Most importantly, we should determine whether it is possible to add the 150,000 standard Serbian words in a regular way. If so, Serbian Wiktionary would not end up flooded with millions of dialect forms while lacking common words.
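The estimate can be reproduced with a few lines of arithmetic. The function and variable names below are hypothetical; the factors are the ones given in the text (about twenty forms per word, two accent variants, two alphabets).

```python
def estimated_pages(base_words, forms_per_word, accent_variants, scripts):
    """Multiply out the factors used in the upper-bound estimate."""
    return base_words * forms_per_word * accent_variants * scripts


PRIMARY_ENTRIES = 31_309        # primary entries of the Vojvodina dictionary
SECONDARY_REFERENCES = 150_000  # secondary references, assumed to be standard words

exact_base = PRIMARY_ENTRIES + SECONDARY_REFERENCES  # rounded to 180,000 in the text
estimate = estimated_pages(180_000, 20, 2, 2)
```

This is an upper bound under the stated assumptions, since not every word has twenty distinct forms and not every form differs between its accented and unaccented spelling.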
Freeing the content
Matica Srpska is being introduced to the concept of free licenses for the first time. They have agreed to publish material of two dictionaries under the CC-BY license on Wiktionaries, while they will maintain their right to publish printed or pdf version under CC-BY-NC-ND. On one hand this is the usual practice with material that is to be printed, and on the other hand it provides assurance that financial sustainability of Matica Srpska would not be neglected. It is important to emphasize to institutions that freeing content does not impair their ability to have financial gain, especially if that is one of their significant sources of funding.
Developing online platforms for dictionaries
During the project, two online platforms would be created. The first is a platform for the lexicography activities of Matica Srpska, which would include PDF versions of their dictionaries and a verbatim searchable version. This platform would be created using the software developed during the project, which means future cooperation with Matica Srpska on lexicography content would be very easy.
The second online platform would be hosted on dictionaries.interglider.org and would also be built on the parsing software. This platform is envisioned as an experimental playground for Wiktionary, where we would be able to test various options that we cannot test on Wiktionary itself.
- Establishing cooperation with Matica Srpska would significantly contribute to credibility of Wikimedia projects in Slavic countries and open possibility of cooperation with more GLAM institutions, thus providing high quality contribution to the Open Knowledge / Free Content movement.
- The content provided by Matica Srpska is to be labeled as high-quality content, coming from a major lexicographical authority for the Serbian language.
- The contribution of Matica Srpska extends to Commons as well, since the ornithological dictionary will contain illustrations of birds.
- The dialects of Vojvodina dictionary has 31,309 primary entries of the highest quality and around 150,000 secondary references.
- According to http://meta.wikimedia.org/wiki/Wiktionary, the number of entries on Serbian Wiktionary is 17,289 (as of 2014-12-15). Adding approximately 31,000 entries would increase the number of entries in the Serbian Wiktionary by about 180%.
- Wiktionary is an important project, but its development occasionally stalls, since there is not yet a common solution for the future of this project. This project would initiate a structured discussion about the future of Wiktionary.
- Developing a software tool for parsing philological dictionaries would be an asset for all future Wiktionary efforts.
- By organizing campaigns for translating the ornithological dictionary, we will reinvigorate not only Serbian Wiktionary community, but communities globally.
- Having in mind the structure of the Serbian ornithological dictionary (Latin name of the bird species, all Serbian names, recommended name, illustration of the bird) and the relatively limited number of unique entries (370), it would be easy to create a pattern for easy translation of the list of these terms.
- So, for example, if we succeed in motivating people from a hundred Wiktionary communities to participate, the number of primary entries added to these hundred Wiktionaries would be 3,700,000. If we succeed in motivating just twenty Wiktionary communities, the number of basic entries added to these twenty Wiktionaries would be 148,000.
- The potential benefit of the ornithological dictionary extends to all existing Wiktionaries.
- Mobilization campaigns would definitely increase participation of Wiktionary community and possibly, during this project, some solutions for future of Wiktionary would be crystallized.
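The figures in the list above can be checked with a short calculation. The function names are hypothetical, and the second formula rests on an assumption inferred from the numbers in the text: each participating Wiktionary ends up with one entry per term for each participating language, so the total grows with the square of the number of communities.

```python
def growth_percent(existing_entries, added_entries):
    """Relative growth of a Wiktionary, in percent of its current size."""
    return 100 * added_entries / existing_entries


def translated_entries(terms, communities):
    """Total new entries if every participating Wiktionary receives every
    term in every participating language (assumed reading of the text)."""
    return terms * communities * communities


serbian_growth = growth_percent(17_289, 31_000)  # a little under 180%
hundred = translated_entries(370, 100)
twenty = translated_entries(370, 20)
```

Under the same assumption, each individual Wiktionary would gain 370 entries per participating language, so the per-project workload stays modest even when the movement-wide total is large.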
Fit with strategy
What crucial thing will the project try to change or benefit in the Wikimedia movement? Please select the Wikimedia strategic priority(ies) that your project most directly aims to impact and explain how your project fits. Most projects fit all strategic priorities. However, we would like project managers to focus their efforts on impacting 1–2 strategic priorities. Examples of strategic priorities can be found here.
The major impact of the project is achieved through the increase in high-quality content by digitizing two dictionaries provided by Matica Srpska, the leading lexicography institution in Serbia, and by providing accessibility of reliable sources to editors throughout Wiktionary. The content of the Serbian Wiktionary would be increased by 180%, but the impact would also be visible on other Wiktionaries, through translation of the Ornithological dictionary.
The project also significantly increases the credibility of Wikimedia projects in Slavic countries through the establishment of a long-term strategic partnership with the venerable institution Matica Srpska. This cooperation should ensure that Matica Srpska eventually incorporates free licensing in its modus operandi. It also opens major potential for GLAM programs in Slavic countries, since Matica Srpska is very influential and serves as an example to similar institutions. Having in mind that this is the pilot phase of cooperation, we will explore various modalities for future cooperation throughout the project and through regular meetings with Matica Srpska.
On the level of participation, mobilization of the community around Wiktionaries is one of the crucial aspects of the project, and two mobilization campaigns would be organized in order to include as many people as possible in the work on Wiktionaries.
Measures of success
Please provide a list of both quantitative and qualitative criteria that will be used to determine how successful the project is. You will need to report on the success of the project according to these measures after the project is completed. See the PEG program resources for suggested measures of success.
While creating monitoring and evaluation strategy for the project, we did take into account global metrics.
(a) Increase support for the Open Knowledge / Free Content movement by establishing a long-term strategic partnership with the venerable cultural institution Matica Srpska
In order to measure the success of this goal we propose indicators referring to the quality and effects of established partnership, especially interaction and multiplication.
- Interaction indicators
- a.1 Number of successful interactions between Matica Srpska staff and Wikipedians (e.g. number of feedback given and taken into account, number of requested media files that were provided by the institution, number of requested pieces of information that were delivered, etc.)
- Methodology: Use a wiki page to register and track requests or feedback by Wikipedians
- a.2 Satisfaction of the Matica Srpska regarding feedback quality
- Methodology: Interview(s) with Matica Srpska staff at mid term and the end of the project, to be part of the reports
- a.3 Extent of the Matica Srpska senior management support
- Methodology: Interview(s) with Matica Srpska staff at mid term and the end of the project, to be part of the reports
- a.4 Extent to which Matica Srpska staff know about "free" copyright licenses and licensing requirements of Wikipedia/Wikimedia
- Multiplication indicators
- a.5 The extent to which the project was further elaborated after the first year
- Methodology: Project monitoring
- a.6 Number of new partnerships initiated thanks to the cooperation with Matica Srpska
- Methodology: Project monitoring, reports and eventual agreements with new partners
(b) Increase the quality, accuracy and volume of Serbian Wiktionary by 180% through digitization of the dialects of Vojvodina dictionary
- Quantitative indicators
- b.1 # of new entries into Serbian Wiktionary
- b.2 # of articles created or improved in Wikimedia projects using new entries in Serbian Wiktionary during the project timeline
- b.3 # of bytes added to or removed from Wikimedia projects using new entries in Serbian Wiktionary during the project timeline
- Methodology: Wikimetrics
(c) Increase the quality of Wiktionaries through digitization of Serbian ornithological dictionary and use it to establish and test a potential model for future development of cooperation across Wiktionaries and (d) Reinvigorating communities around Wiktionaries through 2 targeted mobilization campaigns
The same quantitative indicators should be used for both goals (c) and (d).
- Quantitative indicators
- c.1 # of new entries into Serbian Wiktionary
- c.2 # of articles created or improved in Wikimedia projects using entries from ornithological dictionary during the project timeline
- c.3 # of bytes added to or removed from Wikimedia projects using entries from ornithological dictionary during the project timeline
- c.4 # of images uploaded to Commons
- c.5 # of images or media uploaded to Commons, used in Wikimedia projects during the project timeline
- d.1 # of Wiktionaries participating in cooperation on translating of ornithological dictionary
- d.2 # of new entries into other Wiktionaries - translation of ornithological dictionary
- d.3 # of active editors involved in work on Wiktionaries - translation of ornithological dictionary
- Methodology: Wikimetrics
a.b.c.d The quality analysis of the process would be done through analysis of the pages on MetaWiki, where the entire process for digitization of both dictionaries would be followed, and where the communication with the global community would be established. The quality analysis should provide answers related to the questions:
- What were major lessons learned from this process?
- What were the main problems and challenges?
- What was successful during process?
- What were the most successful methods of advocating?
- What would be the future recommendation?
Note: In addition to your project-specific measures of success, you will also be asked to report on some global metrics at the end of your final report. Please keep this in mind as you plan, and we'll support you as you begin your project.
Resources and risks
Interglider.ORG is currently a non-incorporated group of Wikipedians, organized around this project. We plan to formalize our status in the coming period and become an organization working on the promotion of free culture and free knowledge.
- Milos Rancic (languages: sr, bs, hr, sh, en-3, ru-2, mk-2, sl-2, bg-2, uk-1, be-1)
- Milica Gudovic (languages: sr, bs, hr, sh, en-4, mk-2, sl-1, fr-1, ru-1)
- Tuuli Pollanen -- community and developer support (languages: fi, en-4, sl-4, sv-3, de-1)
- Milos Trifunovic
- Interglider.COM community managers
- Senka Latinovic (languages: sr, bs, hr-5, en-4, mk-2, sl-2)
- Irena Antonijevic (languages: sr, en-3)
- Matica Srpska
Besides general volunteer support (e.g. translation of the Ornithological dictionary), we are open to Wikimedia volunteers participating more substantially and thus building knowledge inside the community about how to deal with this kind of data. For example, if you are willing to join the core team and help us communicate with the Wiktionary communities of the languages you speak, please contact Milica or Milos via email. The same goes if you are willing to contribute by coding in Python and/or PHP. (So, before you sign up under "other Wikimedia volunteers", please contact us.)
- JP Béland (Amqui) -- Communication with French Wiktionary, as well as other projects where French is the lingua franca
- other Wikimedia volunteers
Evidence of past success
The main part of the project relies on Matica Srpska, an institution with almost two centuries of tradition.
Interglider.ORG is a newly formed group, consisting of Milos Rancic, an experienced Wikimedian (member of Language committee, co-founder and the first president of Wikimedia Serbia etc.) and Milica Gudovic, an experienced activist (feminist activist for the past twenty years working on the intersection between gender, technology and economy; co-founder (1996) and CEO of the organization Women at Work till 2012; participant in the localization of Creative Commons licenses for the jurisdiction of the Republic of Serbia etc.).
It could happen that Matica Srpska needs more time than planned to provide content, having in mind that the dictionaries, especially the ornithological dictionary, are still being compiled. The mitigation strategy is to negotiate with Matica Srpska to release the content in parts.
Low response of Wiktionary communities
We need the community to participate in the discussion on the software and on the future of Wiktionary, and to participate in translations of the ornithological dictionary. The mitigation strategy includes hiring two community managers who would be responsible for maintaining communication throughout the project, and who would ensure that participation is as easy and effortless as possible by utilizing all possible communication channels within the community, by making personal contact with interested editors, and by facilitating the overall process.
Grantees are subject to line-item scrutiny of expenses. Changes to the approved budget beyond 10% in any category must be approved in advance.
- Project budget table
|Number||Category||Item Description||Unit||Number of Units||Cost per Unit||Total Costs||Currency||WMF||Other Funds*||Notes|
|Developer's tasks include: parsing software development, cooperation with experts, input of material to Wiktionaries. It is estimated that software development will take approximately 400 hours, while input of entries of Ornithological dictionary, Dictionary of Serbian Speeches of Vojvodina and addendum to the latter will take 280 hours.|
|Technical Support||Ornithological dictionary – Expert Consultancy||Hour||
|Expert from Matica Srpska will be available to developer and our supporting team for the following tasks: overview and classification of existing linguistic and ornithological literature, excerption of material, sorting material in alphabetical order, identification of species, processing of entries, creation of registry, creation of expert feedback to community questions.|
|Technical Support||Dictionary of Serbian Speeches of Vojvodina – Expert Consultancy||Hour||
|Expert from Matica Srpska will be available to developer and our supporting team for the following tasks: overview and classification of existing linguistic literature, excerption of material, sorting material in alphabetical order, identification and processing of entries, creation of expert feedback to community questions.|
|Technical Support||Community Manager||hour||
|Community manager will help us keep the momentum in discussions with communities, provide support in organizing mobilization campaigns and be responsible for overview of communication within the project.|
|Travel||Travel costs for meetings||Month||
|During project time, developer, experts and organization team need to meet for various purposes. Matica Srpska is in Novi Sad and organization team is in Belgrade. The amount covers 2 round trips for 2 persons per month, including daily allowance.|
|Technical Support||2 Online platforms||Item||
|Developing online platforms for MS and experimental playground|
|Technical Support||Hosting and administration||Month||
|*Interglider.com will fund additional costs of the project|
- Total cost of project
- Total amount requested from the Project and Event Grants program
- Additional sources of revenue that may fund part of this project, and amounts funded
- $19,840 is contribution of Interglider.com, private company owned by Milos Rancic
See a description of non-financial assistance available. Please inform the Wikimedia Foundation (WMF) of any requests for non-financial assistance now.
- Requests for non-financial assistance, if any
- No non-financial assistance is needed.
You are responsible for notifying relevant communities of your proposal, so that they can help you! Depending on your project, notification may be most appropriate on a village pump, talkpage, mailing list. Please paste a link below to where the relevant communities have been notified of this proposal, and to any other relevant community discussions. Need notification tips?
- Serbian Wiktionary
- Serbo-Croatian Wiktionary
- English Wiktionary
- Slovak Wiktionary
- Greek Wiktionary
- Italian Wiktionary
- Turkish Wiktionary
- Spanish Wiktionary
- Portuguese Wiktionary
- Lithuanian Wiktionary
- Tamil Wiktionary
- Dutch Wiktionary
- Kannada Wiktionary
- Vietnamese Wiktionary
- Esperanto Wiktionary
- Danish Wiktionary
- Bosnian Wiktionary
- Chinese Wiktionary
- Cherokee Wiktionary
- Kurdish Wiktionary
- Norwegian Bokmål Wiktionary
- Simple English Wiktionary
- Wikipedia communities
- Relevant mailing lists
- Language committee
Do you think this project should be selected for a Project and Event Grant? Please add your name and rationale for endorsing this project in the list below. Other feedback, questions or concerns from community members are also highly valued, but please post them on the talk page of this proposal.
- I think this project has great potential to increase content in Serbian language in different Wiktionaries and paving the road for more cooperation with Matica Srpska as well as starting global linguistic cooperation on Wikimedia projects. I will provide support and help for French projects. I trust Milos to bring this project to success. Amqui (talk) 20:39, 27 December 2014 (UTC)