Grants:IEG/A graphical and interactive etymology dictionary based on Wiktionary/Renewal

From Meta, a Wikimedia project coordination wiki

This project is requesting an 8-month renewal of the grant.

In the first six months of this project we developed a tool to extract etymological relationships from the English Wiktionary, built a database of etymological relationships, and developed etytree, an interactive, open-source tool to visualize the network of words etymologically related to a searched word. For a short guide on how to use etytree and how to query the database, see here.

We are requesting a renewal to improve both the visualization and the extraction method, and to disseminate the project, as described below.

Scope

The etytree project proposed in the initial IEG proposal had the complex goal of creating from scratch a new interactive tool to explore etymological relationships as well as the goal of extracting etymological relationships from the English Wiktionary. The new tool is now available at http://tools.wmflabs.org/etytree/etymology/resources/html/index.html together with a database of more than 6 million etymology entries in 3365 languages that can be queried at http://etytree-virtuoso.wmflabs.org/sparql.

Although the goal of the IEG project has been reached, six months were too short to optimise the tool as well as to attract a large community of users: priority was given to setting up the infrastructure and basic functionalities.

There are three main aspects that would require further work:

  • Visual interface improvement;
  • Improvement of Etymology sections parsing and improvement of queries;
  • Community dissemination.

The following sections summarize major changes that we want to implement in etytree to improve its performance. Once the performance has improved, the data produced could be exported to Wikidata using the Primary Sources Tool, i.e., after validation.

Visual interface improvement

The current version of etytree uses graphs instead of trees, as originally proposed, because etymological relationships extracted from the English Wiktionary can form loops and loops cannot be represented with trees (tree branches don't merge back to the tree). For example, the etymology section of a Wiktionary page could say that word A derives from word B, which in turn derives from word C, and the etymology section of another page could say that word A derives from word C. This will result, after extraction, in a graph with loops which cannot be represented with a tree.

Graphs are therefore more appropriate to represent the underlying data. However, a tree structure is much easier to read.

Data with a cyclic graph structure can be filtered into a graph with a tree-like structure through an algorithm that breaks loops and only visualizes some of the extracted relationships.
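As a rough illustration of such a filter, here is a minimal Python sketch. It assumes one possible loop-breaking rule (a breadth-first spanning tree that keeps only the first edge reaching each word); the function name `spanning_tree` and the rule itself are illustrative assumptions, not the algorithm etytree will necessarily use.

```python
from collections import deque

def spanning_tree(edges, root):
    """Keep only the first edge that reaches each word, starting from root.

    edges: iterable of (word, ancestor) pairs. They are treated as
    undirected here so that both ancestors and descendants are reached.
    Returns kept edges as (already visited word, newly reached word).
    """
    neighbours = {}
    for word, ancestor in edges:
        neighbours.setdefault(word, []).append(ancestor)
        neighbours.setdefault(ancestor, []).append(word)

    visited = {root}
    kept = []
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for nxt in neighbours.get(current, []):
            if nxt not in visited:  # an edge to a visited word would close a loop
                visited.add(nxt)
                kept.append((current, nxt))
                queue.append(nxt)
    return kept

# A derives from B, B derives from C, and A also derives from C.
edges = [("A", "B"), ("B", "C"), ("A", "C")]
tree = spanning_tree(edges, "C")
```

With three words, the filtered result keeps two of the three extracted relationships; the dropped one remains available in the unfiltered graph.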

We would like to have two different versions of etytree:

  • A version for general users that uses trees, with graphics based on the demo here;
  • A version for expert users and editors that uses graphs where all extracted data is shown, without filtering (the current version here).

We want to have two versions of the visualization because we want to faithfully represent data contained in Wiktionary: we don't want to just show the filtered data, but we want to have both a version with the original unfiltered data, i.e., the version that uses graphs, and the version with trees, as in the demo. The idea is that users interact with the tree visualization: if they want to see the original data structure they can switch to the graph visualization (with loops) and understand where the data comes from.

Another possibility could be using graphs only, with a new type of visualization where ancestors are always positioned to the left of descendants. This could be achieved using d3.js force-directed graphs with the addition of a new force (besides a repulsion force and a spring force) from a magnetic field that orients all arrows in the graph from left to right. See the illustration below. Although interesting, this is not easy to implement.

This is a mock-up of what the visualization of a directed graph in d3 would look like if a magnetic field were added to the set of forces.
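The extra force can be sketched in isolation. The Python snippet below is only an illustration of the idea, not d3 code: it ignores the repulsion and spring forces and the vertical coordinate, and the names `orient_left_to_right`, `layout`, `min_gap` and `strength` are assumptions made for this example.

```python
def orient_left_to_right(x, edges, min_gap=1.0, strength=0.5):
    """One tick of the extra force: nudge each edge so the ancestor sits left.

    x: dict mapping word -> horizontal position (modified in place).
    edges: list of (ancestor, descendant) pairs.
    """
    for ancestor, descendant in edges:
        gap = x[descendant] - x[ancestor]
        if gap < min_gap:
            # Move both endpoints apart by a fraction of the missing gap.
            push = strength * (min_gap - gap) / 2.0
            x[ancestor] -= push
            x[descendant] += push

def layout(x, edges, ticks=200):
    """Apply the orientation force repeatedly and return the positions."""
    for _ in range(ticks):
        orient_left_to_right(x, edges)
    return x

# C is an ancestor of B and A, B is an ancestor of A; start in the wrong order.
positions = layout({"A": 0.0, "B": 1.0, "C": 2.0},
                   [("C", "B"), ("B", "A"), ("C", "A")])
```

In a real force-directed layout this nudge would be one term among several, so its strength would have to be balanced against the other forces.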

Improvement of Etymology sections parsing and improvement of queries

Graphs produced by etytree sometimes contain incorrect etymological relationships because of imprecisions in the data extraction method. Below we list some of the issues we have identified that can be fixed to return more precise data. In some cases, etytree also returns the error message "Sorry, the server cannot extract etymological relationships correctly for this word. We are working to fix this!" (try, for example, the word "parsley"). This happens because, when users search for the etymology of a word, etytree queries the database, and the response can be slow if the graph contains words that have many connections, e.g. affixes. This can be fixed by improving the queries and filtering out highly connected nodes.

To improve data extraction we would like to:

  • Tailor parsing of etymology sections based on the specific language;
  • Process wiki links in long detailed etymologies;
  • Process diacritical marks;
  • Add transliteration;
  • Improve software documentation;
  • Improve queries.

In section "More details about planned improvements" we describe each point in detail.

Community dissemination

This is the most important task we want to accomplish in this renewal for three different reasons:

  • To attract expert Wiktionarians who can help parse etymologies and spot both inconsistencies in the English Wiktionary and bugs in the procedures used by etytree to extract data;
  • To attract users (Wiktionary users and new users);
  • To attract editors (Wiktionary editors and new editors).

For this reason we would like to present etytree both at small meetings (in coworking spaces or meetups) and at bigger ones: Wikimania (see talk proposal here), local chapters, and academic conferences. Besides presentations, we also want to post about this project on Twitter, Reddit, visualization and linguistics websites, and mailing lists.

Budget

Budget Breakdown

| Item | Area | Description | Commitment | Cost |
| --- | --- | --- | --- | --- |
| Visual interface and parsing method improvement | Software development | Responsible for the whole work package | Full time (40 hrs/week), 8 PM (1) | 38,400 € |
| Dissemination | Participation in at least one international conference (on linguistics/semantics/open data or related) | Registration, travel, board & lodging | One-off | 2,079 € |
| Dissemination | Presentation in England, Italy, France, Germany, Switzerland, Portugal at a local Chapter/University/Meetup/Coworking space (2) | Travel & lodging | One-off | 6 × 400 € |
| Dissemination | Presentation at http://www.polyglotbratislava.com/en/ (accepted) (3) | Registration, travel, board & lodging | One-off | 690 € |
| Total | | | | 43,549 € |

(1) PM = person-months, at 30 € per hour

(2) We picked these countries because of personal contacts, but more/different countries can be visited if the opportunity arises (invitations are welcome).

(3) 70 € (registration fee) + 50 € (per night accommodation) x 7 days + 250 € (transportation from Bari to Bratislava)

Rationale

The current version of the tool, although functional, is somewhat difficult to use, as the graphical interface is not as linear as it was in the demo with trees.

In order to have a tree structure, the data need to be filtered. If loops exist, they need to be broken to produce a tree structure from a graph structure.

Also, the first version of the software uses routines that are independent of the language of the extracted lexeme. However, some languages use specific structures that are not common to all languages. To improve performance, etytree needs to be tailored to specific languages.

Finally, interaction with the community throughout the project is fundamental, as contributors know common practices and specific aspects of Wiktionary. We want to interact during the whole eight months through frequent meetings in person (at least one per month) and emails/chats/social media at least twice a week. Also, at the beginning of the project (in the first month) we will draft a plan for participation in a conference and plan meetings for the following seven months.

More details about planned improvements

Improve parsing of specific languages

For some languages (e.g., Ido, sign languages, Esperanto, Lojban), etymology sections use a non-standard structure. The first version of etytree looks for the main recurrent structure and parses etymology sections in the same way for every language. More specifically, it looks for the following regular expression:

   (FROM )?(LANGUAGE LEMMA |LEMMA )(COMMA |DOT |OR )

where

  • FROM can be any of the following:
   "[Ff]rom", "[Bb]ack-formation (?:from)?", "[Aa]bbreviat(?:ion|ed)? (?:of|from)?", "[Cc]oined from",
and many more,
  • LANGUAGE corresponds to the etyl template,
  • LEMMA corresponds in practice to different templates (e.g. m, l, etc., which generally embed lexemes) or to wiki links,
  • COMMA corresponds to “,”,
  • DOT corresponds to “.” or “;”,
  • OR corresponds to “or” (neither followed nor preceded by a character).

For example, a typical etymology section for Ido lexemes (Ido "abato") looks like this:

   ===Etymology===
   From {{etyl|en|eo}} {{m|en|abbot}}, {{etyl|de|eo}} {{m|de|Abt}}, {{etyl|ru|eo}} {{m|ru|абба́т}}, {{etyl|it|eo}} {{m|it|abate}}.

This section follows a different pattern,

   (FROM )?(LANGUAGE LEMMA |LEMMA )(COMMA |AND )

in which all listed lemmas are ancestors of Ido "abato".
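A language-specific rule for sections of this kind could look like the following Python sketch. It is a simplified illustration that only recognizes an etyl template immediately followed by an m template, as in the Ido example; the names `PAIR` and `parallel_ancestors` are assumptions made here and are not part of etytree.

```python
import re

# {{etyl|SOURCE_LANG|TARGET_LANG}} followed by {{m|LANG|LEMMA}}.
PAIR = re.compile(
    r"\{\{etyl\|([^|}]+)\|[^|}]+\}\}\s*\{\{m\|[^|}]+\|([^|}]+)\}\}"
)

def parallel_ancestors(section):
    """Return (language code, lemma) pairs, each taken as a direct ancestor."""
    return PAIR.findall(section)

section = (
    "From {{etyl|en|eo}} {{m|en|abbot}}, {{etyl|de|eo}} {{m|de|Abt}}, "
    "{{etyl|ru|eo}} {{m|ru|абба́т}}, {{etyl|it|eo}} {{m|it|abate}}."
)
ancestors = parallel_ancestors(section)
```

Every pair found is attached directly to the entry being parsed, rather than being chained one after the other as in the default rule.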

Process wiki links in long detailed etymologies

In the current version, etytree extracts from Wiktionary etymology sections words embedded in both links and templates. However, Wiktionary etymology sections sometimes use wiki links not only for words that are etymologically related to the entry but also for words that describe the entry, which means that some of the extracted relationships are possibly wrong.


One example of an etymology section that cannot currently be parsed correctly by etytree is Davidsen:

===Etymology===

Originally a [[patronymic]] from {{suffix|David|sen|lang=da}}.

The word "patronymic" here is not etymologically related to "Davidsen". To facilitate data extraction, [[Appendix:Glossary#patronymic|patronymic]] could be used instead. The same applies to words like "ablative", "zero-grade", etc.

Other lexemes that usually have non-standard etymology sections are phrases, e.g. until the cows come home has the following etymology section:

===Etymology===

Possibly from the fact that [[cattle]] let out to pasture may be only expected to return for milking the next morning; thus, for example, a party that goes on “until the cows come home” is a very long one. Alternatively, the phrase may have a Scottish origin,<ref>See, for example, {{cite-web|title=Till the cows come home|url=http://www.phrases.org.uk/meanings/382900.html|archiveurl=https://web.archive.org/web/20160611134612/http://www.phrases.org.uk/meanings/382900.html|archivedate=11 June 2016|work=Phrase Finder|accessdate=30 March 2013}}.</ref> and may derive from the fact that cattle in the [[w:Scottish Highlands|Highlands]] are put out to graze on the [[common#Noun|common]] where grass is plentiful. They stay out for months before scarcity of food causes them to find their way home in the autumn for feeding.

To address this problem we want to both interact with the editing community to encourage the use of more standard rules and apply some more refined extraction procedures to reduce the number of etymological relationships that are incorrectly extracted by the tool. A template (for example {{detailed etymology}}) before long descriptive etymologies that don’t have a standard chain of etymological relationships could signal to the extraction algorithm to ignore that section.

In the current version, etytree parses links like cattle in the etymology section above as an ancestor of until the cows come home and therefore infers an incorrect etymological relationship. While in the first version we decided to keep those links, we would like to open a discussion with contributors about introducing a new template to deal with such situations. At the same time we will refine the extraction method to handle cases like these.
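The two ideas in this section (skipping sections flagged by a marker template, and ignoring links that point to glossary or Wikipedia pages) can be sketched in Python as follows. The {{detailed etymology}} template is only the hypothetical example proposed above, and `candidate_links`, `SKIP_PREFIXES` and `SKIP_MARKER` are names invented for this illustration.

```python
import re

# [[target]] or [[target|label]]; the captured group is the link target.
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")
SKIP_PREFIXES = ("Appendix:", "w:")     # glossary and Wikipedia links
SKIP_MARKER = "{{detailed etymology}}"  # hypothetical template, see above

def candidate_links(section):
    """Return link targets that may be etymologically related to the entry."""
    if SKIP_MARKER in section:
        return []  # editors flagged this section as descriptive prose
    targets = WIKILINK.findall(section)
    # Drop descriptive links, then strip section anchors such as "#Noun".
    return [t.split("#")[0] for t in targets if not t.startswith(SKIP_PREFIXES)]

plain = "Originally a [[patronymic]] from {{suffix|David|sen|lang=da}}."
glossary = ("Originally a [[Appendix:Glossary#patronymic|patronymic]] "
            "from {{suffix|David|sen|lang=da}}.")
```

Here `candidate_links(plain)` still returns the misleading "patronymic" link, while `candidate_links(glossary)` returns nothing, which is why linking to the glossary helps extraction.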

Diacritical marks processing

For some languages, there are variant spellings of the same word: for example the Arabic "قَهْوَة" ‎(qahwa) with diacritics and "قهوة" without are variant spellings of the same word, and the same thing is true for Old English "ċēap", "cēap", and "ceap". The tool should show the form with the most diacritics, as it has the most information (that is, "قَهْوَة" ‎and "ċēap"). This specific improvement has been suggested in the Etymology scriptorium. Such rules are described in special Wiktionary pages, e.g., page About Arabic.
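A minimal Python sketch of this selection rule is shown below. It assumes a single language-independent rule (group spellings that are identical once Unicode combining marks are removed, then keep the one with the most marks), whereas the actual rules are language-specific as noted above; the function names are illustrative.

```python
import unicodedata

def strip_marks(word):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def mark_count(word):
    """Count the combining marks in the decomposed form of the word."""
    decomposed = unicodedata.normalize("NFD", word)
    return sum(1 for ch in decomposed if unicodedata.combining(ch))

def preferred_forms(words):
    """Group spellings differing only in diacritics; keep the richest one."""
    best = {}
    for word in words:
        key = strip_marks(word)
        if key not in best or mark_count(word) > mark_count(best[key]):
            best[key] = word
    return best

old_english = preferred_forms(["ceap", "cēap", "ċēap"])
```

For the Old English example, the three spellings fall into one group keyed by "ceap", and the form kept for display is "ċēap".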

Transliteration

Translit_module automatically generates transliterations for the English Wiktionary (for example, transliteration of Russian words) and its functionalities should be ported into etytree.

Software documentation

We want to write a detailed description of rules used by etytree to extract data from the English Wiktionary.

Queries

Highly connected nodes

For some words ("water" for example), etytree fails to return a graph of etymological relationships. It returns an error: the server is taking too long to return data. The server is slow because some lexemes (affixes, for example) are connected to a large number of lexemes and the query used to retrieve them is slow. For example, English "-ly" has 7070 etymological connections, Italian "-mente" has 2035 connections. A more efficient query to the Virtuoso database management system could improve performance.
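One way to keep hubs such as affixes from dominating a result is to stop the traversal at them. The Python sketch below works on an in-memory adjacency list rather than on the Virtuoso database, and the threshold `max_degree` and the function name `related_words` are assumptions chosen for illustration only.

```python
from collections import deque

def related_words(neighbours, start, max_degree=100):
    """Breadth-first search that does not expand very highly connected nodes.

    neighbours: dict mapping word -> list of etymologically linked words.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        word = queue.popleft()
        links = neighbours.get(word, [])
        if word != start and len(links) > max_degree:
            continue  # keep the hub in the result, but do not follow its links
        for other in links:
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return seen

graph = {
    "quickly": ["quick", "-ly"],
    "quick": ["quickly"],
    "-ly": ["quickly", "slowly", "badly"],
}
filtered = related_words(graph, "quickly", max_degree=2)
```

With the low threshold used here, "-ly" appears in the result but the other words formed with it are not pulled in.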

Extract the full tree

As of now, queries do not return the full etymological tree/graph but only extract partial trees/graphs. We would like to rewrite queries to generate the full graph of etymological connections.

Additional (optional) improvements

We list here some optional improvements that we would like to work on but that we believe are not essential for a successful outcome of this project.

Internationalization/Localization

The structure of the etymology tree/graph of any word (e.g., its descendants, its ancestors) is the same in any language: only language names (for example, the node tag "eng" would correspond to "English" in English and to "Inglese" in Italian), parts of speech (for example, in the tooltip, where we put the part of speech, we would have "Adjective" in English and "Aggettivo" in Italian), and definitions need to be changed based on the user's language. For definitions, the DBnary translation extraction tool, which has already been implemented for 16 languages, could be used.

For this purpose, the jQuery plugin jQuery.i18n could be used together with queries to the database.

Specification of etymological relationships

Right now only 4 types of links exist:

  • "etymologically equivalent to",
  • "etymologically derived from",
  • "derived from",
  • "descends from".

If two nodes are linked by "etymologically equivalent to" they are merged into one node. If two nodes are linked by "etymologically derived from", "derived from", or "descends from" they are simply linked in the visualization (no special link exists for now, the predicate only specifies which section the triple was extracted from: Etymology section, Derived terms section or Descendants section). We could expand the ontology to include more predicates like

  • "borrowed from",
  • "calque of",
  • "inherited from",

etc., and add a tooltip on links between nodes in the visualization to specify the relationship type.
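The merging rule described above can be sketched with a small union-find structure. This Python example is an illustration under assumptions: triples are given as plain (subject, predicate, object) tuples, and the function name `merge_equivalent` is invented here rather than taken from etytree.

```python
EQUIVALENT = "etymologically equivalent to"

def merge_equivalent(triples):
    """Collapse nodes joined by EQUIVALENT; keep other links between groups."""
    parent = {}

    def find(node):
        # Return the representative of the group that node belongs to.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for subject, predicate, obj in triples:
        if predicate == EQUIVALENT:
            parent[find(subject)] = find(obj)  # merge the two groups

    links = set()
    for subject, predicate, obj in triples:
        if predicate != EQUIVALENT:
            # The predicate is kept so it can label the link, e.g. in a tooltip.
            links.add((find(subject), predicate, find(obj)))
    return links

triples = [
    ("a1", EQUIVALENT, "a2"),
    ("a2", "etymologically derived from", "b"),
    ("a1", "etymologically derived from", "b"),
]
merged = merge_equivalent(triples)
```

Because "a1" and "a2" are merged, the two derivation triples collapse into a single link, and any new predicate such as "borrowed from" would be carried through unchanged.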

The Virtuoso faceted browser

As of now, the Virtuoso faceted browser links to linguistic data at http://kaiko.getalp.org/. However, that database is not identical to ours, as it contains data extracted from an older version of the English Wiktionary with an older version of DBnary_etymology. The browser should instead link to internal data.

Measures of success

At the end of the project we will have:

  • Two versions of etytree, one for everyone (with trees), and one for expert users, where the full network of etymological connections is shown (with graphs);
  • Presentation of the project at 6 or more meetings in at least 4 different countries with people from the Wikimedia community;
  • Dissemination on the media described in the Notification section;
  • Proof of interaction with at least 30 Wiktionary contributors;
  • Improvements to parsing of Ido, Esperanto, Lojban, Arabic, Latin, and other languages (at least 5) that use special rules in etymology sections, including diacritics and transliterations as specified in section More details about planned improvements;
  • Improvements of queries as specified in section Queries.

Community discussion

Notification

  • Local Chapters (Wikimedia UK, Wikimédia France, Wikimedia Deutschland, Wikimedia CH, Wikimedia Italia)
  • Reddit
  • Twitter
  • Wikimania (see lightning talk proposal here)
  • People that are learning languages (at least 3 websites)
  • People interested in Dictionaries (at least 3 websites)
  • People interested in Etymologies (at least 3 websites)
  • Linguistics Departments through mailing lists
  • NLP Departments through mailing lists
  • Wiktionary Etymology Scriptorium, Beer parlour, Grease Pit
  • ... suggestions?

Endorsements:

Do you think this project should be continued for another 8 months? Please add your name and comments here. Other feedback, questions or concerns from community members are also highly valued, but please post them on the talk page of this proposal.

  • Continue: This project is only just beginning to touch the richness of the Wiktionary project, which is widely recognized in the academic community as the largest repository of linguistic and translingual data. - Amgine/meta wikt wnews blog wmf-blog goog news 16:32, 16 March 2017 (UTC)
  • Continue: A very promising future feature, a procedure for standardising the data of English (and any) wiktionary, to later export it to WikiData, to check errors, to contrast sources. I'm just waiting to see more. Sobreira (parlez) 14:15, 17 March 2017 (UTC)
  • I endorsed the original project. Epantaleo has clearly delivered on what was promised in that work. What's proposed here seems like a logical next step. If the work plan described here can be completed in 6 months, that will mean that we won't *need* another renewal. --EpochFail (talk) 14:09, 20 March 2017 (UTC)
  • Good idea!!--Ferdi2005 (Posta) 15:38, 28 March 2017 (UTC)
  • Continue: in order to be truly usable and visually appealing, the project needs more work, and I would really like to see it come to fruition. Andrew Sheedy (talk) 20:48, 29 March 2017 (UTC)
  • Continue: I followed closely Ester's work and I must say that I am impressed by the job already done. It would be a pity to left her in midstream. Work is very promising and I will do my best to support her for the next steps. As the maintainer of DBnary, I will provide her with the appropriate support to go further. Dodecaplex (talk) 13:27, 4 April 2017 (UTC)
  • Continue: THis is a very good big first step. We can see lots of possibility with more work. Lyokoï (talk) 09:25, 6 April 2017 (UTC)
  • Support + Continue -- It's the beginning of this project and we need it to continue. It will be a tool that will benefit the community greatly. -- Erika aka BrillLyle (talk) 01:54, 8 April 2017 (UTC)
  • Continue: the delivered project showed the power of using proper visualization tools to analyze big data as those which come from Wiktionary, but now it needs to be continued Luca.Barbi 14:44, 4 April 2017 (UTC)
  • Continue: Such projects contribute to making the world a better place. Thank you for your work and keep it up! CAD. 17:38, 11 April 2017 (UTC)
  • Continue: This project seems to me culturally intriguing and fully consistent with the Wikimedia strategy of connectivity improvement among languages as well as among project. I really hope WMF would give Epantaleo (and the possible other users working on it) strong commitment and appropriate support to both improve usability and visual interface, add functionalities in all the (at least, main) languages. --Nicolabel (talk) 07:48, 13 April 2017 (UTC)
  • Continue: A very nice idea, but it requires extra effort to reach its maturity. --alfredosette 13:46, 18 April 2017 (UTC)
  • Continue: Impressive project, full support. GrandCelinien (talk) 11:39, 1 June 2017 (UTC)
  • Continue: useful project, very interested in the Wikidata integration part. – Jberkel (talk) 08:45, 2 June 2017 (UTC)
  • Continue: this is cool! Aryamanarora (talk) 00:58, 7 June 2017 (UTC)
  • Continue: Great, I will need to look at it closer, but it seems to go in really good way. If that's not already the case, we should be able to query the database to feed Wiktionaries articles. I think the interface should provide a way to feed the database directly (creating new nodes and edges). If that's not already the case, relationship should be sourceable, as well as "earliest written attestations". Thank you for the project. --Psychoslave (talk) 11:59, 10 July 2017 (UTC)