Grants:Project/jakerylandwilliams/Language learning tools that help expand Wiktionaries through attestation, citation, and example usage

status: inactive
Language learning tools that help expand Wiktionaries through attestation, citation, and example usage
summary: This project aims to help individuals understand languages through the Wiktionaries' resources while building up their repositories of information in the process.
target: Wiktionary
amount: 94,597 USD
contact: Jake[dot]Williams[at]drexel[dot]edu
created on: 21:52, 28 February 2017 (UTC)


Project idea

What is the problem you're trying to solve?

This project seeks to solve several problems centered around the Wiktionaries. The responsibilities of the Wiktionaries' editorial communities include the recording of citations, attestations, and example usages. These are challenging tasks that must account for diversity among dialects, the constant emergence of new words and phrases, and their ongoing evolution in rapidly changing online environments. Even becoming aware of new phrases is challenging, let alone finding appropriate example usages and citations. At the same time, the Wiktionaries have become core resources for language learners, who rely on their extensive records of conversational, idiomatic, and slang phrases that are generally not found in traditional dictionaries. When using the Wiktionaries, language learners are often focused on understanding specific passages in their possession; if reviewed by the editorial communities, these passages could contribute significantly as examples in the dictionaries. Finally, while the Wiktionaries' data have a strong record of use in the natural language processing (NLP) community, a number of intrinsic features prevent their full potential from being realized. So, while the projects produce data that could drive a full suite of open-source tools offering users more advanced language support features, such as machine translation, the present modes of data collection and annotation are not ideal.

What is your solution to this problem?

This project jointly addresses issues surrounding (1) the Wiktionary's continued growth; (2) its support of language learners; and (3) its capacity for use in language processing scenarios. The proposed solution consists of the development of a single interactive feature, geared towards language learners, who will voluntarily contribute information as a side effect of their education. In language learning classrooms, the production of phrase-cued text has become a common exercise for the development of prose and fluency. Creating phrase-cued text is an annotation exercise in which learners read passages and place dividers, chunking text into "meaningful units" and identifying key pauses. These annotations help learners recognize challenging phrases and pause appropriately for comprehension while they read, even while unaware of the grammatical structures and definitions that underpin their annotations. The proposed interactive feature may be thought of as a reading tool that allows users to upload text and, as an active reading process, identify meaningful phrases and their definitions.
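As an illustration of the annotation exercise described above, the following minimal sketch shows one way divider placements could be stored and turned into phrase-cued text. The function name `chunk_tokens`, the representation of dividers as token indices, and the sample sentence are all assumptions made for illustration; the proposal does not specify a data format.

```python
def chunk_tokens(tokens, dividers):
    """Split tokens into chunks.

    `dividers` holds token indices i, meaning the reader placed a
    divider immediately after tokens[i] (a hypothetical encoding).
    """
    cut_after = set(dividers)
    chunks, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if i in cut_after:
            chunks.append(current)
            current = []
    if current:  # keep any trailing tokens after the last divider
        chunks.append(current)
    return chunks


tokens = "She decided to call it a day after lunch".split()
# The reader marks pauses after "decided" (index 1) and "day" (index 6).
chunks = chunk_tokens(tokens, dividers=[1, 6])
print(" / ".join(" ".join(chunk) for chunk in chunks))
# She decided / to call it a day / after lunch
```

Storing divider positions, not rewritten text, would leave the uploaded passage untouched, so the same passage could carry annotations from several readers.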

Project goals

Through this tool, the phrases that users identify will be linked back to entries on the Wiktionary so that they can see definitions as they read. Reading in this way will provide readers with the benefits of the phrase-cued text exercise, in addition to the full reference of the Wiktionary as a resource. While the users of this feature stand to gain fluency and expand their vocabularies, records of the readers' annotated data will be kept for other uses. These data may contain excellent example usages of phrases, which are conveniently linked to dictionary entries by human annotators. Additionally, the linked data may include phrases that are not yet stored in the Wiktionary, resulting in expanded coverage. Finally, the annotations would be of great value to the language processing community, who could use them as "gold standards" for development. The data resulting from this feature are essentially similar to what is needed for the development of algorithms that identify multiword expressions (MWEs). Automatic MWE identification is a very challenging language processing task that remains largely unsolved because of a lack of "gold standard" training data. In the long term, satisfying this project's immediate goals for this task may result in much-improved machine translation tools, capable of translating non-literal expressions. Working with the Wiktionary to produce data in this way will place it in a central position, driving the development of open-source tools for the use and benefit of humanity.
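To make the connection to MWE "gold standard" data concrete, the sketch below converts reader annotations into token-level tags. It assumes a B/I/O tagging convention (B for the first token of a linked phrase, I for its continuation, O for everything else), which is a common encoding in sequence labeling but is not prescribed by this proposal. The function name `spans_to_bio`, the span format, and the example entry title are hypothetical.

```python
def spans_to_bio(tokens, linked_spans):
    """Convert linked phrase annotations to (token, tag) pairs.

    `linked_spans` is a list of (start, end, entry_title) tuples, where
    tokens[start:end] is the phrase a reader linked to a dictionary
    entry. This span format is an assumption for illustration.
    """
    tags = ["O"] * len(tokens)
    for start, end, _entry_title in linked_spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return list(zip(tokens, tags))


tokens = "She decided to call it a day after lunch".split()
# A reader links tokens 3..6 to a (hypothetical) entry titled "call it a day".
annotations = [(3, 7, "call it a day")]
for token, tag in spans_to_bio(tokens, annotations):
    print(f"{token}\t{tag}")
```

This simple version assumes that linked spans do not overlap; discontinuous or nested expressions would need a richer scheme.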

Project impact

How will you know if you have met your goals?

Success at accomplishing this project’s goals may be measured by a number of criteria. If there is significant use of the reading tool, i.e., traffic on the developed web interface, then the project will have found utility amongst individuals who use the Wiktionaries' reference materials. Furthermore, success will be measured by the project's effects on the growth of the Wiktionary. From the data that this feature collects and relies upon, it will be possible to monitor the inclusion of found attestations, examples, and even definitions in the Wiktionaries. Finally, the success of this project will be measured by its ability to generate valuable data for language processing. As the data from this project are collected, it will be straightforward to measure improvements to existing systems. In the long term, success in this area will be measured by the improvements to, and the increased availability of, the resulting open-source tools. For example, this and similar strategies may be used far down the line for the Wiktionary to offer its own suite of language processing tools, e.g., a Wikimedia-powered natural language translation service.

Do you have any goals around participation or content?

This project has both long- and short-term goals aimed at expanding participation in the Wikimedia movement as well as improving and expanding its content. These goals all concern the Wiktionaries specifically, and apply directly to the third shared metric: Number of content pages created or improved. By engaging users to upload and annotate the text they aim to understand through a Wiktionary's reference materials, the project will provide the Wikimedia community with numerous in-context examples of terms. These may be used to expand the example usages and attestations present in the Wiktionaries, which will in turn become richer resources. Additionally, for terms that have no reference in a language's Wiktionary, this project creates a novel procedure for users to offer new entries with example usage attached. It is also worth noting that while this project increases and improves the contents of the Wiktionaries, it will do so by engaging a broader community of volunteers: the Wiktionary users themselves.

Project plan

Activities

Technical activities

The main activities required to carry out this project center on the development of a reading interface that links to Wiktionary content. Linking to Wiktionary content is not a trivial task, as these resources are only semi-structured and are designed for human readers. To provide a user with detailed information on a phrase's possible definitions as it is being annotated, it will be necessary to machine-process existing entries for their different senses and example usages. This is a specific task with which the project's proposer, a language processing researcher, already has extensive experience. However, time will be taken to refine the processing methods to be as extensive and correct as possible. The most significant activity is the development of a web server and front end. These facilities must be capable of accepting text uploads and rendering them in a format suitable for annotation and linking to Wiktionary data. As these skills fall outside of the project proposer's expertise, it will be necessary to hire programming support. Once these two basic elements are established, the project can be deployed and can begin interacting with users and collecting their data. Finally, in order to expand and improve Wiktionary content, it will be necessary to establish protocols with the editorial communities for passing along the annotated data to be incorporated into Wiktionary references.
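The machine-processing step described above can be illustrated with a minimal sketch. It assumes the English Wiktionary convention in which definition lines begin with "#" and usage-example lines begin with "#:", and it ignores templates, quotations, subsenses, and other markup that real entries contain. The function name `parse_senses` and the sample text are hypothetical; the sample is a simplified stand-in and not the text of any actual entry.

```python
def parse_senses(wikitext):
    """Extract senses and their usage examples from simplified wikitext.

    Assumes definition lines start with "# " and example lines start
    with "#:"; every other line (headings, quotations, subsenses) is
    skipped. Real entries need far more careful handling.
    """
    senses = []
    for raw_line in wikitext.splitlines():
        line = raw_line.strip()
        if line.startswith("#:") and senses:
            # Attach the example to the most recent definition.
            senses[-1]["examples"].append(line[2:].strip())
        elif line.startswith("# "):
            senses.append({"definition": line[2:].strip(), "examples": []})
    return senses


# Hypothetical, simplified entry text used only for illustration.
SAMPLE = """\
===Verb===
# To end one's work for the day.
#: Let's call it a day and go home.
# To stop an activity.
"""

for number, sense in enumerate(parse_senses(SAMPLE), start=1):
    print(number, sense["definition"], sense["examples"])
```

A structured list of senses like this is what a reading interface would need in order to show candidate definitions while a phrase is being annotated.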

Community engagement

How will you let others in your community know about your project? Why are you targeting a specific audience? How will you engage the community you’re aiming to serve at various points during your project? Community input and participation helps make projects successful.

Wikimania 2017 workshop

Budget

The proposed project requires funds to pay a programmer to develop its web interface. This development will be concentrated in six months of full-time work over the summer and fall of 2017. The proposer, a faculty data scientist, will be paid for focused work performed in the summer of 2017 in conjunction with the staff programmer, and will volunteer time throughout the remainder of the project's term. The proposer will also volunteer time on an ongoing basis for maintenance and improvement of the project.

Item | Cost | Note
Full-time software developer support (40 hours per week for 6 months) | 33,500 (base) + 11,022 (benefits) = 44,522 USD | 1
Full-time data science faculty support (40 hours per week for 1 month) | 10,853 (base) + 3,571 (benefits) = 14,424 USD | 2
Travel for offline workshop at Wikimania 2017 (1 round-trip airfare, five nights room and board) | 1,500 USD | 3
Indirect costs (56.5% of direct costs, above) | 34,152 USD | 4
Total | 94,597 USD |

Notes

[1] The average salary for full-time software developers in Philadelphia, PA is 67k USD[1]; the project plans to offer a six-month salary of 33.5k USD plus benefits contributions, for a total of 44,522 USD.

[2] Faculty salary is based on one month of an academic year appointment (10,853 USD), and is budgeted in addition to benefits contributions for a total of 14,424 USD.

[3] Based on average cost of round trip flight on expedia.com (500 USD), average cost of three-star hotel room at trivago.com (600 USD), and per-diem estimates from gsa.gov (400 USD).

[4] Drexel University's indirect cost rate is calculated at 56.5% (34,152 USD) of direct costs (60,445 USD).
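The budget arithmetic above can be checked with a short calculation. The sketch below uses only the rounded line-item figures stated in the table and notes; the variable names are illustrative. Summing the rounded figures gives values one dollar higher than the stated direct costs (60,445 USD) and total (94,597 USD), presumably because the stated amounts were computed from unrounded figures; that explanation is an assumption, not something the proposal states.

```python
# Line items as stated in the budget table and notes (USD).
developer = 33_500 + 11_022      # base + benefits = 44,522
faculty = 10_853 + 3_571         # base + benefits = 14,424
travel = 500 + 600 + 400         # flight + hotel + per diem = 1,500

direct = developer + faculty + travel

# Indirect costs at the stated rate of 56.5% of direct costs.
indirect = round(direct * 565 / 1000)

total = direct + indirect

print(f"direct:   {direct:,} USD")    # stated in note 4 as 60,445
print(f"indirect: {indirect:,} USD")  # stated as 34,152
print(f"total:    {total:,} USD")     # stated as 94,597
```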

Sustainability

This project expects adoption by a broad web community of language learners as an extension of Wiktionary reference. Through the use of the educational facilities that this project offers, the resulting data may be adopted by a number of parties, including Wiktionary editors and scientific researchers. As an ongoing utility, this project will continue to develop through its use. Given its educational value and its contribution to the language processing community as a resource for data collection, the proposer is committed to ongoing maintenance and expansion of the project, both as volunteer work for the Wikimedia community and as a regular work-affiliated research activity.

Opportunities for growth

The scope of this project is immediately focused on the development of materials for language learning that expand Wiktionary capabilities. However, the project's conceptual intent to simultaneously provide valuable research data may have lasting impacts both on the development of open-source tools and as a model for further scientific data collection methodologies. While this project's output data have been identified as valuable for a specific language processing task, its success and the Wikimedia Foundation's philosophical alignment with open science could open the door to similar projects that directly provide services while collecting essential open data for scientific advances that may positively impact a broad community without relying on commercialization. As a result, the data produced directly through this project, and any produced by projects modeled on it, may lead to the development of improved language processing tools that, being based on open data, will have the capacity to benefit the Wikimedia community and its user base in general.

References

Get involved

Participants

This grant would be awarded to Dr. Jake Ryland Williams as an Assistant Professor of Information Science at Drexel University. Jake has been deeply invested in the Wikimedia movement since its inception and has a long history of using its projects' data in education and research, going back to 2009. He received his Doctor of Philosophy in Mathematical Sciences from the University of Vermont in 2015, where his dissertation research resulted in work, published through the statistical physics community, on the automated identification of missing dictionary entries from the English-language Wiktionary. This research led to the inclusion of a number of dictionary entries, whose success is to be monitored as an ongoing experiment. After receiving his doctoral degree, he moved into a postdoctoral position at the University of California, Berkeley, where he performed self-directed research in data science and taught scalable machine learning coursework at the master's level. During this period, he also presented his past research on the English-language Wiktionary at the Wikimedia Foundation's headquarters in San Francisco, where it met a warm reception and interest in continued work. Since leaving the University of California for a faculty position at Drexel University, Jake has engaged in the development of an undergraduate curriculum in data science, and continues research in natural language processing, machine learning, and computational social science, committed to data science research and the development of open-source tools and methodologies that benefit humanity.

Community notification

You are responsible for notifying relevant communities of your proposal, so that they can help you! Depending on your project, notification may be most appropriate on a Village Pump, talk page, mailing list, etc. Please paste links below to where relevant communities have been notified of your proposal, and to any other relevant community discussions.

Endorsements

Do you think this project should be selected for a Project Grant? Please add your name and rationale for endorsing this project below! (Other constructive feedback is welcome on the discussion page).