Grants:IEG/A visual representation of the etymology of words using trees

status: draft

A visual representation of the etymology of words using trees

summary: The aim of the project is to develop a tool to extract etymological relationships from Wiktionary, to build a database of etymological relationships, and to develop an interactive and open source tool to visualize the etymological tree of words in Wiktionary.

target: Wiktionary potentially in any language (starting with English)

strategic priority: improving quality

amount: 30000 USD (in EUR)

grantee: Epantaleo

contact: esterpantaleo (at) gmail.com

this project needs: volunteer

created on: 00:30, 15 December 2015 (UTC)

round 2 2015

Note

This is an older version of the project funded by an IEG grant which can be found at https://meta.wikimedia.org/wiki/Grants:IEG/A_graphical_and_interactive_etymology_dictionary_based_on_Wiktionary.

Project idea

Etymological definitions in Wiktionary often consist of fairly long and complex sentences that describe etymological relationships between words; these relationships are often spread over multiple Wiktionary pages (i.e., pages of etymologically related words).

We believe that a visual, rather than textual, representation of etymologies in the form of a tree (or more precisely, a graph) that uses nodes for words and links for etymological relationships will be more intuitive, easier to maintain, easier to develop, more informative, and also multilingual.

We propose to develop a visual and interactive tool that we call etytree. A demo is available here. The tool is based on d3.js, a JavaScript library for manipulating documents based on data. The tool will use data from a database of etymologically related words, which we will build with an etymology extractor for Wiktionary (using DBpedia's Wiktionary extraction framework after editing its configuration files). The database, the visualization tool, and the etymology extractor will all be open source and available to the community.

When searching for the etymology of a word (e.g., the English word 'butter'), the tool will show the etymological tree of the searched word, with links to the words that are etymologically related to it across languages (a static screenshot of the etymological tree of the English word 'door' is shown below). Users will be able to interact with the tree to view the definitions and properties of each word and to expand or collapse nodes.

What is the problem you're trying to solve?

We believe that Wiktionary etymologies can be:

More intuitive: Wiktionary etymological definitions can be fairly long and complex. For example, the Wiktionary etymology of the English word butter is 'From Middle English, from Old English butere (“butter”), from Proto-Germanic *buterǭ (“butter”) (compare West Frisian buter, Dutch boter, German Butter), from Latin būtȳrum, from Ancient Greek βούτῡρον (boútūron, “cow cheese”), compound of βοῦς (boûs, “ox, cow”) and τῡρός (tūrós, “cheese”), from Scythian.'; a tree can represent all these etymological relationships in a single image;
More informative/Less redundant/Less prone to errors: Etymologically related words are described on separate Wiktionary pages, so the user has to browse or edit multiple pages (cf. butter, butere, *buterǭ, buter, boter, Butter, būtȳrum, βούτῡρον, βοῦς, τῡρός);
Easier to export into a database: Wiktionary etymological definitions use templates for words and properties (cf. {{etyl|enm|en}}, {{etyl|ang|en}}, {{term|butere|butter|lang=ang}}, {{etyl|gem-pro|en}}, {{m|gem-pro|*buterǭ|butter}}) but don't have templates for etymological relationships (cf. 'from', 'compound of', 'compare' in the example above); a representation of etymologies using trees, i.e., data structures with nodes representing words and links representing etymological relationships, can be naturally converted into a database of (multilingual) etymological relationships;
Shareable across language versions of Wiktionary: Etymological relationships (e.g., word A in language 1 derives from word B in language 2) are independent of the language used to describe them and could be shared across language versions, thus increasing the completeness of Wiktionary etymologies. For example, the etymology of the Italian word latte ("milk") in the Italian Wiktionary, 'dal Latino lac lactis' ('from Latin lac lactis'), links to the nonexistent entry lac lactis, while the etymology of the Italian word latte in the English Wiktionary (which has more users) links to a detailed etymology of the Latin word lac, lactis: 'From Proto-Indo-European *ǵlákts (gen. *ǵlaktós) (compare Greek γάλα (gála, “milk”), Old Armenian կաթն (katʿn), Albanian dhallë (“buttermilk”), Waigali zōr (“milk”), Hittite [script needed] (galaktar, “balm, resin”))'.

What is your solution?

We propose a visual representation of etymologies using trees with nodes for words and links for etymological relationships that, compared to the current textual representation of etymologies, can be:

More intuitive/informative/interactive: In a single image, the tree can visualize the etymology of multiple words that are etymologically related across languages. Furthermore, the interactive tool can make the search for etymologies a discovery process: users can discover new words that derive from the same ancestral word, both in their own language and in other languages, and can click on nodes (to collapse/expand the tree), mouse over nodes and links to see their properties, zoom/pan, and navigate the dictionary;
Easier to edit/Less redundant/Less prone to errors: Etymological trees are independent of the language in which they are presented, and thus can be shared across different language versions of Wiktionary. Editors can edit the same data structure for etymologically related words in any language and can interactively edit both the structure of the tree (by adding/removing/moving nodes and links) and the properties attached to nodes (word definitions, translations, year of origin, type - e.g. noun, verb, adverb - etc.) and to links (the word is an abbreviation, a derived word, a borrowed word, etc.). The appropriate Wiktionary template will be automatically called by the editing tool;
Naturally exportable into a database: The tree data structure is based on a database of nodes (words) and links (etymological relationships) built from Wiktionary data. The database can be made available to the research community.

Project plan

The aim of the project is to develop etytree, the etymology visualization tool, which should handle different scenarios (abbreviations, compound words, controversies, etc.), and to build a database of etymological relationships from the English Wiktionary for at least 10000 words. To generate the database we will develop ety2data, an extraction tool that populates the database from Wiktionary etymology entries, and data2tree, a tool that generates the tree data structure from the database. Furthermore, we aim to develop tree2ety, an interactive tool to edit the tree (nodes, links, and their properties) and generate a textual etymological definition that can be exported to Wiktionary.

The most complex part of the project will be the development of ety2data, the extraction tool that will populate the database using Wiktionary etymology entries. This is because the etymology section of words in Wiktionary is textual and because an etymological tree is built from the etymological definition of multiple words. Templates used in etymology definitions (at least in the English version of Wiktionary) will facilitate the extraction of the tree structure from Wiktionary.

The project will iterate over 4 steps:

Develop etytree to visualize an etymological tree from a JSON file. A demo of etytree is available here; the associated code is available in the GitHub repository and uses the open source JavaScript library d3.js;
Develop ety2data, an extraction tool that extracts etymologies from the English Wiktionary and converts them into a graph database of nodes (words) and links (etymological relationships). For this purpose, we want to use the DBpedia XML Wiktionary configuration files and modify their etymology block extractor to match wikitext template patterns such as {{etyl}}, {{term}}, {{m}}, {{back-form}}, {{compound}}, {{blend}}, {{rfe}}, {{etystub}}, {{derived}}, {{inherited}}, {{cognate}}, {{suffix}}, {{prefix}}, {{calque}}, {{borrowing}}, {{learned borrowing}}, {{rfv-etymology}} using regular expressions, and we want to use inference to derive the tree structure of etymological relationships;
Build a graph database of nodes and links for a sample of words from the English Wiktionary;
Develop data2tree, a tool that, from a database of nodes and links, infers the JSON file.
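As a rough illustration of the template matching described in step 2, the following Python sketch pulls templates out of a fragment of the 'butter' etymology and chains the words into nodes and links. It is not the DBpedia extractor: the function names (parse_templates, extract_chain), the 'derives_from' label, and the rule that each word-bearing template names the ancestor of the previous word are assumptions made for this example only.

```python
import re

# Matches a non-nested template of the form {{name|arg1|arg2|...}}.
TEMPLATE_RE = re.compile(r"\{\{([^{}|]+)((?:\|[^{}|]*)*)\}\}")


def parse_templates(wikitext):
    """Return (name, positional_args, named_args) tuples in order of appearance."""
    parsed = []
    for match in TEMPLATE_RE.finditer(wikitext):
        name = match.group(1).strip()
        positional, named = [], {}
        # group(2) looks like "|a|b|k=v"; the first split item is always empty.
        for arg in match.group(2).split("|")[1:]:
            if "=" in arg:
                key, value = arg.split("=", 1)
                named[key.strip()] = value.strip()
            else:
                positional.append(arg.strip())
        parsed.append((name, positional, named))
    return parsed


def extract_chain(word, lang, wikitext):
    """Build nodes and 'derives_from' links from a linear 'from X, from Y' etymology.

    Simplifying assumption (hypothetical rule): every template that carries a
    word names the ancestor of the previously seen word.
    """
    nodes = [(lang, word)]
    links = []
    for name, positional, named in parse_templates(wikitext):
        if name == "term" and "lang" in named and positional:
            node = (named["lang"], positional[0])
        elif name == "m" and len(positional) >= 2:
            node = (positional[0], positional[1])
        else:
            continue  # e.g. {{etyl}} carries only language codes, no word
        links.append((nodes[-1], "derives_from", node))
        nodes.append(node)
    return nodes, links


SAMPLE = (
    "From {{etyl|enm|en}}, from {{etyl|ang|en}} {{term|butere|butter|lang=ang}}, "
    "from {{etyl|gem-pro|en}} {{m|gem-pro|*buterǭ|butter}}"
)
nodes, links = extract_chain("butter", "en", SAMPLE)
for source, relation, target in links:
    print(source, relation, target)
```

The Middle English step is skipped here because its template names a language but no word, which is one of the cases where the inference mentioned in step 2 would be needed.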

Activities

We will start from step 2 above, since we have already produced a demo of the tool in step 1: we will first develop the etymology extraction tool ety2data and, simultaneously, build a graph database of the extracted nodes and links (step 3).

Then we will test the extraction tool on a small sample of 100 words, using the visualization tool etytree to inspect the results. This will require developing data2tree (step 4), a tool to infer the JSON file of the tree from the database of nodes and links.
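A minimal sketch of what data2tree could do, assuming the database can be read as a list of (word, relation, ancestor) links and that etytree accepts the nested name/children layout commonly used with d3.js hierarchies. The actual JSON schema is not specified in this proposal, so the field names and the function links_to_tree are hypothetical.

```python
import json
from collections import defaultdict


def links_to_tree(root, links):
    """Nest the ancestors of `root` as children, following the links recursively."""
    ancestors = defaultdict(list)
    for child, relation, parent in links:
        ancestors[child].append((relation, parent))

    def expand(node, path):
        lang, word = node
        entry = {"name": word, "lang": lang, "children": []}
        for relation, parent in ancestors[node]:
            if parent in path:
                continue  # ignore links that would create a cycle
            child_entry = expand(parent, path | {parent})
            child_entry["relation"] = relation  # label of the link to this ancestor
            entry["children"].append(child_entry)
        return entry

    return expand(root, {root})


# Links taken from the 'butter' etymology quoted earlier; language codes
# follow Wiktionary usage (ang = Old English, gem-pro = Proto-Germanic).
LINKS = [
    (("en", "butter"), "derives_from", ("ang", "butere")),
    (("ang", "butere"), "derives_from", ("gem-pro", "*buterǭ")),
    (("gem-pro", "*buterǭ"), "derives_from", ("la", "būtȳrum")),
]

tree = links_to_tree(("en", "butter"), LINKS)
print(json.dumps(tree, ensure_ascii=False, indent=2))
```

Keeping the relation label on each nested entry lets the visualization distinguish, for example, a borrowing from a compound without consulting the database again.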

At this point we will iteratively extend the sample, testing the extraction tool and the visualization tool at each iteration, until we reach a sample of 10000 words.

When the size of the sample increases, we expect the number of extraction rules in ety2data and the different visualization options in etytree to increase.

This work will result in a database of nodes and links that can eventually be made part of Dbnary or DBpedia.

Finally we will develop tree2ety, a tool to edit the tree and print a textual equivalent to the etymology tree.
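The text-generation half of tree2ety could look roughly like the sketch below, which renders a linear chain of ancestors as an English sentence. The function chain_to_text and the small language-name table are assumptions for illustration; a real tool would also need to cover compounds, cognates, and the Wiktionary templates themselves.

```python
# Hypothetical lookup table; only the codes used in the example are listed.
LANGUAGE_NAMES = {
    "ang": "Old English",
    "gem-pro": "Proto-Germanic",
    "la": "Latin",
}


def chain_to_text(chain):
    """Render a list of (language_code, word) ancestors, nearest first, as a sentence."""
    parts = []
    for code, word in chain:
        language = LANGUAGE_NAMES.get(code, code)  # fall back to the raw code
        parts.append(f"from {language} {word}")
    if not parts:
        return ""
    sentence = ", ".join(parts) + "."
    return sentence[0].upper() + sentence[1:]


print(chain_to_text([("ang", "butere"), ("gem-pro", "*buterǭ"), ("la", "būtȳrum")]))
```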

Budget

Cost of one project leader and developer over 6 months: 30000 USD
Total Budget: 30000 USD (in EUR)

Community engagement

During the development of the project, input will be gathered from mailing lists and discussion groups: the Beer Parlour, the Etymology Scriptorium, and the Grease Pit for etymology-related issues, and the DBpedia developers and ontology mailing lists (dbpedia-developers@lists.sourceforge.net and dbpedia-ontology@lists.sourceforge.net) for DBpedia-related issues. Volunteers can join at any time.

Sustainability

All tools (etytree, ety2data, data2tree, and tree2ety) will be open source, and users will be able to contribute to them at any time, both during the project and after its completion. The software will be available on GitHub. In particular, etymological trees will be editable both directly, by clicking on or moving nodes through tree2ety, and indirectly, by editing the textual part of the etymology in Wiktionary.

The project could be extended to include more words, ideally the full set of words in the English Wiktionary and in additional language versions. Because the structure of the tree is language independent, the textual part of the tree (definitions of words, language names, etc.) could be translated into different languages. The procedure outlined for this project could also be repeated for other language versions of Wiktionary.

Measures of success

A database of 10000 words will be generated, describing nodes and etymological relationships between words.
Using etytree, ety2data, and data2tree a corresponding set of etymology visualizations will be automatically generated from the database.
An interactive Wiktionary etymology editing tool tree2ety will be developed.

Get involved

Participants

Ester Pantaleo holds a PhD in Physics and is a freelance data scientist. She has done research on different types of data, including finance and genomics data. She recently became interested in data visualization and the semantic web.

Volunteer I am not sure how much time I have to contribute, or what exactly needs to be done, but I am a software developer with a very strong interest in historical linguistics. I would like to know more about what I would be able to do to help. Arun.alejandro (talk) 04:47, 6 October 2016 (UTC)

Community Notification

The community has been notified of my proposal through the Wiktionary Etymology Scriptorium and the Wiktionary Beer Parlour.

Endorsements

Do you think this project should be selected for an Individual Engagement Grant? Please add your name and rationale for endorsing this project in the list below. (Other constructive feedback is welcome on the talk page of this proposal).

Community member: add your name and rationale here.
This would be a useful tool to understand the "spatial" relationships of language 2601:145:4002:4889:C86D:A7EA:F7F5:3690 19:53, 8 October 2016 (UTC)