Grants:IEG/A graphical and interactive etymology dictionary based on Wiktionary/Timeline

Timeline for A graphical and interactive etymology dictionary based on Wiktionary

Timeline | Date
Having a Functional Database Extractor | 30 August 2016
Having a Functional Database Management System | 30 September 2016
Structuring an appropriate query to generate the etymology tree | 30 October 2016
Having a Functional Visualization that takes as input the etymology tree; having 1000 words plotted | 30 November 2016
Testing - Dissemination - Finalization | 30 December 2016


Monthly updates

Please prepare a brief project update each month, in a format of your choice, to share progress and learnings with the community along the way. Submit the link below as you complete each update.

July

I will report my activities here, following the Activities section of my grant proposal, and then add some comments.

Database Extraction

  • I am writing code that extends DBnary to extract etymologies from Wiktionary. The developer of DBnary and I have agreed on a way to collaborate: he has created a repository, to which I have already started submitting pull requests.
  • To develop the software I had 3 meetings with my project advisor and a two-day visit to Grenoble, where I met the developer of the DBnary software.

Test

  • I am testing the code on single Wiktionary pages, including both English and foreign words. I can currently extract etymological relationships from the Etymology section, descendants from the Descendants section and derived terms from the Derived Terms section, although there are still bugs that I need to fix.

Community Dissemination

  • I gave two talks in Bari, one in the Department of Engineering Informatics and one in the Department of Informatics, where I got interesting technical feedback.
  • I met with Wiktionary contributors in Lyon to discuss etymology and the integration of Wiktionary into Wikidata.
  • Furthermore, I participated in Wikimania last month, which gave me the opportunity to talk with many people, especially Wiktionary and Wikidata contributors.

Integration With Wikidata

  • During Wikimania last month, I had a conversation with some members of the Wikidata "core team" and then we joined a meeting of Wiktionary contributors. During the meeting we discussed the export of Wiktionary data into Wikidata. The discussion was constructive and resulted in the creation of a Wiktionary group (the Wiktionary Tremendous Group) that gathers Wiktionary users of different language versions. This group will definitely help develop a conversation between Wiktionary as a whole and Wikidata.
  • Tobias created a WikiProject Etymology: this is a first step towards discussing etymology in Wikidata and gathering feedback from the community.

Comments

Meeting people at Wikimania was a very helpful experience and has helped me build a network of people I can interact with and who can give me feedback. It was also great to see that people were really interested in my project and that everyone was ready to help.

Getting feedback from experts in the field outside of the Wikimedia community (e.g., academics or industry people) is also helping me develop the project as well as possible. Giving talks also helps me formalize my work.

August

Database Extraction

The latest extraction code can be found on Bitbucket.

The etymology extraction tool ety2data is being merged with DBnary on Bitbucket; we are still working on the final merge step. In the meantime I fixed some major bugs and was able to run the code on the full English Wiktionary and obtain the RDF serialization for both English and foreign words (available upon request).

September

The etymology extraction tool ety2data has been merged with DBnary (code is available here), and the RDF serialization has been produced (available upon request).

October

A Virtuoso server with the data can be queried at kaiko. We are currently working on how to set up inference rules and ontology reasoning.

November

Testing

I have developed a testing strategy after querying the server for a generic word. I have posted some sample visualizations produced by the test on my Tool Labs web page.

Setting up a new server

So far I have been using a server managed by the DBnary developer. At this point I need to install my own server, as I need to update the database frequently for testing while I debug.

With this in mind, I have created accounts on Tool Labs and Wikimedia Labs, and I will try to set up a Virtuoso server on Wikimedia Labs.

However, I need to be granted enough RAM to run the appropriate queries on the server.

Languages are now automatically downloaded from the Wiktionary list of languages

I have modified the way the Java application handles languages when parsing Wiktionary, so that the list of Wiktionary languages is automatically downloaded from Wikimedia. For this I have used the module List of Languages in CSV format. This module does not export etymology-only languages, so I have created a new module, Etymology-only languages in CSV format, to export those languages as well.

December

A Virtuoso server has been set up

A new Virtuoso server has been set up and the data has been loaded here. You may access the data online using SPARQL (a specialised query language for RDF data) here.

The next steps

After trying different queries I have realized that I need additional data for efficient testing and to reduce query time.

First, in order to query for a particular word, e.g. door, with the current data I need to use a regex filter, which is slow. To speed things up I will add RDFS labels (rdfs:label) to etymology entries. These labels are indexed by Virtuoso, which makes queries much faster.
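The difference between the two lookups can be sketched as follows. This is only an illustration: the `dbetym` prefix is the one used in the etytree queries, but the class name `dbetym:EtymologyEntry`, the assumption that the word ends the entry IRI, and the assumption that labels are stored as plain literals are all hypothetical.

```javascript
// Prefix taken from the etytree queries; rdfs is the standard RDF Schema namespace.
const PREFIXES =
    "PREFIX dbetym: <http://etytree-virtuoso.wmflabs.org/dbnaryetymology#> " +
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> ";

// Escape backslashes and double quotes so the word can sit inside a SPARQL string literal.
const escapeLiteral = (word) => word.replace(/\\/g, "\\\\").replace(/"/g, '\\"');

// Slow variant: the IRI of every entry is converted to a string and matched with a regex.
// Assumptions: dbetym:EtymologyEntry is a hypothetical class name, and the word is
// assumed to appear at the end of the IRI. Regex metacharacters are not escaped here.
const regexQuery = (word) =>
    PREFIXES +
    "SELECT DISTINCT ?entry { " +
    "   ?entry a dbetym:EtymologyEntry . " +
    '   FILTER regex(str(?entry), "' + escapeLiteral(word) + '$") ' +
    "} ";

// Fast variant: an exact match on rdfs:label, which the triple store can answer from an index.
// Assumption: labels are stored as plain literals without a language tag.
const labelQuery = (word) =>
    PREFIXES +
    "SELECT DISTINCT ?entry { " +
    '   ?entry rdfs:label "' + escapeLiteral(word) + '" . ' +
    "} ";

console.log(regexQuery("door"));
console.log(labelQuery("door"));
```

The label variant replaces a scan-and-filter step with a direct triple pattern, which is why it is expected to be faster.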

Second, for each etymology entry I want to save the information about where it was found, i.e., in which Wiktionary page it was mentioned. This way I can go back to that page and see if the extracted data is consistent with the original information contained in Wiktionary.

Third, I need to polish SPARQL queries to make them more efficient.

After those three steps have been completed I will need to write a format converter that turns the output of the SPARQL query into the input of the visualization code. At the same time I am going to start looking for appropriate venues for dissemination.
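A minimal sketch of such a format converter is shown below. It assumes that the query returns rows with a `?child` and a `?parent` variable in the standard SPARQL JSON results layout, and that the visualization expects a list of nodes and a list of edges; both the variable names and the output shape are hypothetical.

```javascript
// Convert SPARQL JSON results into a node list and an edge list.
// Assumption: every binding row holds a ?child IRI and a ?parent IRI.
function bindingsToGraph(sparqlJson) {
    const nodes = new Map();   // keyed by IRI so each node is emitted once
    const edges = [];
    for (const row of sparqlJson.results.bindings) {
        const child = row.child.value;
        const parent = row.parent.value;
        for (const iri of [parent, child]) {
            if (!nodes.has(iri)) {
                nodes.set(iri, { id: iri });
            }
        }
        // Edges point from the ancestor to the word derived from it.
        edges.push({ source: parent, target: child });
    }
    return { nodes: Array.from(nodes.values()), edges: edges };
}

// Usage with a hand-written response (placeholder identifiers, not real data).
const sample = {
    results: {
        bindings: [
            { child: { value: "wordC" }, parent: { value: "wordB" } },
            { child: { value: "wordB" }, parent: { value: "wordA" } }
        ]
    }
};
console.log(bindingsToGraph(sample));
```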



Extension request

New end date

February 15, 2017

Rationale

Mostly for personal reasons, and also because of the complexity of the project. I will need some days to complete a first release and then some time to involve the community through dissemination.

Extension request

New end date

March 3, 2017

Rationale

I need a few more days to write up the final report and outline the renewal proposal.

Renewal

June 2017

VISUALIZATION IMPROVEMENT

During this month I have mostly worked on improving the visualization. I have tested different visualizations:

  1. d3 dendrogram (see [1]) with collapsible nodes, which is hard to use in this case as the graph is not a tree (i.e. not a dendrogram).
  2. d3 force-directed graph (see here), which gives a cluttered graphic presentation of the data as it has no preferred directionality.
  3. d3 dagre (see here), a JavaScript library to lay out directed graphs.
  4. webcola downwardedges (see http://marvl.infotech.monash.edu/webcola/examples/downwardedges.html), which I find very interesting, although it is quite hard to prevent node labels from overlapping.
  5. cola.js online graph exploration (see here) with expandable nodes, which I plan to test more thoroughly in the future.

For now I have decided that the best alternative is d3 dagre, as I find it stable and clear, although I would like to add one feature to it: collapsible/expandable nodes.

DISSEMINATION

Polyglot Gathering 2017

During this month I presented my work at an international meeting: the Polyglot Gathering, which was held in Bratislava from the 31st of May to the 4th of June. Five hundred people, speaking more than 125 languages, joined this meeting. I gave a 45-minute talk (see the program) titled "Etytree: a graphical multilingual etymology dictionary using data extracted from the English Wiktionary". At the meeting I received a lot of interesting feedback and started a collaboration with one of the participants, who is now actively contributing to the development of the software. I also collected email addresses from interested participants, whom I will contact during the final testing phase of etytree.

Wikimedia France

I spent almost a week working from the Wikimedia France office in Paris, where I had the opportunity to interact with the local Wikimedia community. I also had a chance to present my work to some Wiktionarians and Wikidata experts.

July 2017

DATA EXTRACTION

During this month I have tried to improve the extraction algorithm. The main modifications I have worked on are:

  • extraction of special languages: Ido, Esperanto
  • improvement of queries: for this purpose I have changed the ontology in the database
  • diacritics: I started working with Latin diacritics to see how to deal with those cases
  • extraction of reconstructed words: I have now incorporated data on reconstructed words in Wiktionary, thanks to an improvement to DBnary
  • debugging of compound extraction: there were some bugs in the extraction of compounds, which includes the extraction of sentences like "from wordA + wordB" and "from wordA and wordB", and the parsing of templates like affix, prefix, confix, infix etc.
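The two compound cases mentioned in the last item can be illustrated with a small sketch. It is written in JavaScript for consistency with the other code on this page and is not the actual extraction code. The function names are hypothetical, and the template layout (language code first, then the parts, with named arguments skipped) is an assumption.

```javascript
// Split a sentence such as "from wordA + wordB" or "from wordA and wordB" into its parts.
function parseCompoundSentence(sentence) {
    const match = /^from\s+(.+)$/i.exec(sentence.trim());
    if (match === null) {
        return [];
    }
    return match[1]
        .split(/\s*\+\s*|\s+and\s+/)   // separators: "+" or the word "and"
        .map(part => part.trim())
        .filter(part => part.length > 0);
}

// Read the positional arguments of a template such as {{affix|en|door|bell}}.
// Assumption: the first positional argument is the language code.
function parseAffixTemplate(template) {
    const match = /^\{\{(affix|prefix|confix|infix)\|([^}]*)\}\}$/.exec(template.trim());
    if (match === null) {
        return null;
    }
    // Named arguments (key=value) are skipped in this sketch.
    const args = match[2].split("|").filter(arg => !arg.includes("="));
    return { template: match[1], language: args[0], parts: args.slice(1) };
}

console.log(parseCompoundSentence("from wordA + wordB"));
console.log(parseCompoundSentence("from wordA and wordB"));
console.log(parseAffixTemplate("{{affix|en|door|bell}}"));
```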

August 2017

This month, besides regular software development, two things happened: first, I had the opportunity to participate in Wikimania in Montreal, Canada; second, a volunteer with expertise in linguistics and programming joined the project and is giving substantial help.

In terms of development, the main change has been the adoption of RxJS (The Reactive Extensions for JavaScript), which allowed me to speed up etytree through asynchronous requests to the Virtuoso server. Another major improvement has been the use of Node.js modules, which has made it easier to test different types of visualization: improving the visualization is one of the purposes of this project.

At Wikimania I met a number of Wikidata developers and I managed to arrange a visit to the Wikidata offices in Berlin, which would be very helpful for the possible integration of this project into Wikidata. After talking with some of the people in the Wikidata team I got two main suggestions: first, that I try vis.js as a library to visualize graphs, and second, that I use "POST" instead of "GET" when a request (the query string) to the server is too long. Next month I'll work on those two aspects.

September 2017

This month we have worked mostly on vis.js (the graph visualization library recommended by some Wikidata developers) and on the implementation of "POST" requests to the server. We have also improved the visualization by improving the query that returns the set of ancestors of a word.

Testing a new visualization library: vis.js

The JavaScript library vis.js was recommended to us because it is used in Wikidata and because graphs in vis.js can be collapsed or expanded. Collapsing or expanding nodes would be helpful in etytree, as some nodes have many derived terms and collapsing them would reduce the clutter in the visualization. For these reasons we decided to modify etytree to use the vis.js library.

After much effort spent on implementing and testing vis.js, however, we finally decided to stay with d3.js, for a few reasons.

The main reason is that vis.js does not guarantee directionality, and having directionality is more important than being able to expand/collapse nodes. The following screenshot illustrates this.

This is a screenshot of the directed graph representation for the word "door" using the JavaScript graph visualization library vis.js

Even though vis.js has been set to use a "left to right" layout for directed graphs, the resulting graph often does not have a left-to-right orientation. This happens, for example, for the directed graph of the word "door": not all arrows go from left to right.

Another reason is that vis.js renders to a canvas while our d3.js visualization is vector-based, and the final rendering is visually more effective.

You can see how vis.js performs by checking out the visjs-feature branch of etytree.

Dealing with long queries: "POST" compared to RxJS

I implemented a POST request to the Virtuoso server using the following function:

    // `endpoint` (the URL of the SPARQL endpoint) is not defined in this function
    // and is expected to be available in the enclosing scope.
    var postXMLHttpRequest = function(content) {
        // Ask the server for results in SPARQL JSON format.
        const params = new URLSearchParams();
        params.set("format", "application/sparql-results+json");
        var url = endpoint + "?" + params;
        // Send the query text in the request body instead of the URL.
        var formData = new FormData();
        var blob = new Blob([content], { type: "text/xml" });
        formData.append("query", blob);
        var settings = {
            method: "POST",
            body: formData
        };
        return fetch(url, settings)
            .then(res => { return res.json(); })
            .then(json => { return json.results.bindings; });
    };

We noticed, though, that the alternative method, i.e. splitting a long query into many small queries that we submit asynchronously using RxJS (function Rx.Observable.zip.apply), is much faster than a single POST request, which takes longer as it is run synchronously. For this reason we are using RxJS instead of posting the query.
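The idea of splitting one long request into several small concurrent ones can be sketched as follows. To keep the example free of dependencies it uses the standard `Promise.all` in place of `Rx.Observable.zip`, so it illustrates the approach rather than the RxJS code used in etytree. The names `chunk`, `runInParallel`, `buildQuery` and `sendQuery` are hypothetical.

```javascript
// Split a list of IRIs into groups of at most `size` elements.
function chunk(items, size) {
    const groups = [];
    for (let i = 0; i < items.length; i += size) {
        groups.push(items.slice(i, i + size));
    }
    return groups;
}

// Build one small query per group and start all requests without waiting for each other.
// `buildQuery` turns a group of IRIs into a query string; `sendQuery` sends one query
// and returns a promise of its bindings. Both are supplied by the caller.
function runInParallel(iris, size, buildQuery, sendQuery) {
    const requests = chunk(iris, size).map(group => sendQuery(buildQuery(group)));
    // Promise.all resolves once every request has resolved and keeps the request order.
    return Promise.all(requests).then(results => [].concat(...results));
}

// Usage with a fake sender that echoes the query instead of contacting a server.
runInParallel(
    ["iriA", "iriB", "iriC"],
    2,
    group => group.join(","),
    query => Promise.resolve([query])
).then(bindings => console.log(bindings));
```

Each small query fits in a GET request, and the server can work on them concurrently.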

Improving ancestors query

We replaced a simple query like the following

    var ancestorQuery = function(iri, queryDepth) {
        var query = "PREFIX dbetym: <http://etytree-virtuoso.wmflabs.org/dbnaryetymology#> ";
        if (queryDepth === 1) {
            // Ancestors up to 5 steps away, plus (optionally) the ancestors
            // of entries that are equivalent to them.
            query +=
                "SELECT DISTINCT ?ancestor1 ?ancestor2 " +
                "{ " +
                "   <" + iri + "> dbetym:etymologicallyRelatedTo{0,5} ?ancestor1 . " +
                "   OPTIONAL { " +
                "       ?eq dbetym:etymologicallyEquivalentTo ?ancestor1 . " +
                "       ?eq dbetym:etymologicallyRelatedTo* ?ancestor2 . " +
                "   } " +
                "} ";
        } else if (queryDepth === 2) {
            // Only the ancestors up to 5 steps away.
            query +=
                "SELECT DISTINCT ?ancestor1 " +
                "{ " +
                "   <" + iri + "> dbetym:etymologicallyRelatedTo{0,5} ?ancestor1 . " +
                "} ";
        }
        return query;
    };

with a query that returns an ordered set of ancestors, whose response is parsed as follows

    var parseAncestors = function(response) {
        var ancestorArray = response.reduce((all, a) => {
            // Merge the bindings of all responses into a single array.
            return all.concat(JSON.parse(a).results.bindings);
        }, [])
            .reduce((ancestors, a) => {
                ancestors.push(a.ancestor1.value);
                // Follow the chain ancestor1 -> ancestor5 and stop as soon as the
                // derN flag of a level is not "0", there is no further ancestor,
                // or the lemma at that level starts or ends with a dash.
                for (var i = 1; i < 5; i++) {
                    var next = a["ancestor" + (i + 1)];
                    if (a["der" + i].value !== "0" ||
                        undefined === next ||
                        !lemmaNotStartsOrEndsWithDash(a["ancestor" + i].value)) {
                        break;
                    }
                    ancestors.push(next.value);
                }
                return ancestors;
            }, []).filter(etyBase.helpers.onlyUnique);
        return ancestorArray;
    };

Because ancestors are ordered with this new query (ancestor1 is the direct ancestor of the searched word, ancestor2 is the direct ancestor of ancestor1, and so on), they can be filtered based on their order. For example, if the word we are searching for is a compound word (say "doorbell"), we only want to visualize the first ancestors ("door" and "bell"); we do not want to also see the ancestors of the words that make up the searched word (ancestors of "door" and ancestors of "bell"), which would just clutter the visualization. Using a query that returns an ordered set of ancestors, we can stop at the desired depth (at ancestor1, or at ancestor2, and so on) depending on the word of interest.
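Stopping at a chosen depth can be sketched as below. The row layout (`ancestor1`, `ancestor2`, ...) follows the parsing code above; the function name `ancestorsUpToDepth` and the sample rows are hypothetical placeholders.

```javascript
// Collect the ancestors of one result row, stopping at `maxDepth` levels.
function ancestorsUpToDepth(row, maxDepth) {
    const ancestors = [];
    for (let depth = 1; depth <= maxDepth; depth++) {
        const binding = row["ancestor" + depth];
        if (binding === undefined) {
            break;   // the chain ends before the requested depth
        }
        ancestors.push(binding.value);
    }
    return ancestors;
}

// Hand-written rows for a compound word; the second-level values are placeholders.
const rows = [
    { ancestor1: { value: "door" }, ancestor2: { value: "ancestor-of-door" } },
    { ancestor1: { value: "bell" }, ancestor2: { value: "ancestor-of-bell" } }
];

// For a compound, keep only the direct ancestors (depth 1).
const direct = rows.map(row => ancestorsUpToDepth(row, 1));
console.log(direct);
```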

October 2017

This month I participated in WikidataCon 2017 in Berlin. I was also given the opportunity to work from the Wikidata offices for a bit more than 2 weeks.

At WikidataCon I had the opportunity to present etytree in a lightning talk (video available here - minute 19:48). I got interesting questions about the tool (whether I want to use more Wiktionary versions besides the English one, and how I deal with inconsistencies). I also got comments about the visualization: a couple of users thought that the tree is too big and difficult to read.

Before the conference I had already started exploring other kinds of visualizations, where I don't represent the full tree but only the chain of ancestors. If this was the full tree I was showing before:

A screenshot of the etymological tree of English word tear

now I'm showing instead only the chain of ancestors:

A screenshot of the visualization of the etymology of English word tear

The user can then click on a node and see the list of its descendants, grouped by language. For example, if the user clicks on the reconstructed Proto-Indo-European word *dáḱru- and then chooses Italian, they get the list of Italian descendants:

The list of Italian descendants of Proto-Indo-European *dáḱru-


It is interesting to see how different words (English tear, Italian lacrima and zaccheroso) derive from the same ancestors (and have a similar meaning).
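Grouping the descendants of a clicked node by language can be sketched as follows. The row layout (a `language` and a `label` value per descendant) and the function name `groupByLanguage` are assumptions; the sample rows reuse the words mentioned above.

```javascript
// Group descendant rows by language.
// Assumption: each row carries a ?language and a ?label value.
function groupByLanguage(bindings) {
    const groups = new Map();
    for (const row of bindings) {
        const language = row.language.value;
        if (!groups.has(language)) {
            groups.set(language, []);
        }
        groups.get(language).push(row.label.value);
    }
    return groups;
}

const descendants = [
    { language: { value: "English" }, label: { value: "tear" } },
    { language: { value: "Italian" }, label: { value: "lacrima" } },
    { language: { value: "Italian" }, label: { value: "zaccheroso" } }
];

// The list shown after the user picks a language.
console.log(groupByLanguage(descendants).get("Italian"));
```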

Most of the work this month has been about filtering and optimizing the visualization, improving queries to gather more ancestors or more descendants, and speeding up queries. Next month I'll work on user experience: I'll show the tool to different people and see how they react and what they get out of it.

November 2017

Notes on how to extract "dates" from Wiktionary

This note follows a tweet by Wikimedia Research about the etytree talk at WikidataCon17 in Berlin, and a subsequent comment by a third person asking whether etytree contains information on when a specific word was introduced into a specific language.

My answer to the tweet was that etytree does not contain that temporal information. Etymology sections in Wiktionary rarely specify dates and, when they do, they do not follow a standard format.

However, the English Wiktionary uses a few date-related templates outside of etymology sections, namely {{defdate}} (used on ~4000 Wiktionary pages, I believe mostly for individual senses of lexical entries), {{ante}}, {{circa}}, {{C.E.}}, {{B.C.E.}} (~500 times), {{circa2}} (~100 times), {{C.}} (~50 times). For example, the {{defdate}} template is attached to a single sense. As a lexical entry can have multiple senses, and an etymology entry can refer to multiple lexical entries, there is some ambiguity. To resolve this ambiguity, the oldest date among all dates of all senses of all lexical entries described by the etymology entry might be chosen. However, of the ~5 million pages that make up Wiktionary, only about 4,000 use the {{defdate}} template.
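The "oldest date" rule can be sketched as follows. The data layout (an etymology entry holding lexical entries, each holding senses with an optional numeric `year`) is hypothetical, and the years in the example are made-up placeholders, not values taken from Wiktionary.

```javascript
// Return the oldest year found among all senses of all lexical entries
// described by an etymology entry, or null if no sense carries a date.
function oldestYear(etymologyEntry) {
    let oldest = null;
    for (const lexicalEntry of etymologyEntry.lexicalEntries) {
        for (const sense of lexicalEntry.senses) {
            if (typeof sense.year === "number" && (oldest === null || sense.year < oldest)) {
                oldest = sense.year;
            }
        }
    }
    return oldest;
}

// Made-up example: two lexical entries, three senses, one of them undated.
const entry = {
    lexicalEntries: [
        { senses: [{ year: 1500 }, { year: 1350 }] },
        { senses: [{}] }
    ]
};
console.log(oldestYear(entry));
```

Entries whose senses carry no date yield `null`, which, given how few pages use {{defdate}}, would be the common case.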

To wrap up, temporal information seems to be encoded in Wiktionary in an almost machine-readable way. However, the data is very incomplete.

Interaction with the Wikidata Lab

During my stay in Berlin I had the opportunity to work for a couple of weeks from the Wikidata offices and interact with the people who work there. I got interesting feedback on my work on etytree from them. The UX people recommended that I tailor the UI to the specific type of users I have in mind for the tool. Lydia Pintscher (product manager for Wikidata) offered me the opportunity to write about etytree on the blog of Wikimedia Deutschland. The purpose of writing about this would be twofold: to attract users to the tool and to show how Wikidata would be the perfect resource to produce this kind of visualization, once the Wikibase Lexeme data model is ready. She suggested that I describe etytree, its purpose and its usage, but also the hurdles I have been facing when extracting data from Wiktionary etymology sections, and how this would be much easier with Wikidata.

Lydia also suggested that I improve the landing page of etytree with 3 to 5 quick examples and that I post about the tool on Wiktionary once I feel that the tool is in good shape.


December 2017

Refactoring the JavaScript code

Following suggestions from the Wikidata developers team I have split the JavaScript code on GitHub into separate files for the view (graph.js), the model (datamodel.js) and the controller (app.js). This has required a good amount of work.

Improving extraction code (language parsing)

I have improved parsing of Chinese and Japanese words in three ways:

  • I added parsing of the template "zh-psm", a template for phono-semantic matching.
  • I modified the extraction code to ignore links for the word ateji, which is a qualifier and not an ancestor in etymology sections.
  • I used to not allow users to search for an individual character (e.g., a, b, etc.). I changed this to allow users to search for individual characters if they are not European, so now they can search for a single non-European character and see its etymology.
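The single-character rule from the last item can be sketched with Unicode script properties. Treating "European" as the Latin, Greek and Cyrillic scripts is an assumption made for this example, and the function name `isSearchable` is hypothetical.

```javascript
// Assumption: "European" characters are approximated by the Latin, Greek and Cyrillic scripts.
const EUROPEAN_SCRIPT = /^[\p{Script=Latin}\p{Script=Greek}\p{Script=Cyrillic}]$/u;

// Decide whether a search string is accepted.
function isSearchable(query) {
    // Array.from splits by code point, so characters outside the BMP count as one.
    const characters = Array.from(query.trim());
    if (characters.length === 0) {
        return false;
    }
    if (characters.length > 1) {
        return true;
    }
    // A single character is accepted only if it is not in a European script.
    return !EUROPEAN_SCRIPT.test(characters[0]);
}

console.log(isSearchable("a"));     // single Latin letter
console.log(isSearchable("門"));    // single Han character, used here only as an illustration
console.log(isSearchable("door"));  // longer strings are always accepted
```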

Final report

Part 1: The Project

Summary

The etytree tool

Methods and activities

  1. Visual interface improvement: the previous visual interface showed a full tree with all ancestors and all descendants of all ancestors. etytree's visual interface now only shows the chain of ancestors, but users can click on each ancestor and then visualize its descendants in a language of their choice, as shown in the animated gif Add link
  2. Parsing improvement: etytree now uses special rules to parse 15 languages, and it filters out sign languages as, unfortunately, they are not easy to parse because of their format. The list of languages with the associated templates is the following:
    1. Ido: different parsing of the whole etymology paragraph;
    2. Esperanto: different parsing of the whole etymology paragraph;
    3. Vietnamese: parsing of specific templates {{vi-etym-sino}}, {{vi-el}};
    4. Korean: {{ko-etym-Sino}}, {{ko-etym-sino}}, {{ko-etym-native}}, {{ko-l}};
    5. Hungarian: {{hu-prefix}}, {{hu-suffix}};
    6. Japanese: {{ja-r}}, {{ja-l}}, parsing of the special word "ateji";
    7. Finnish: {{fi-form of}};
    8. Gothic: {{got_nom form of}};
    9. Arabic: {{ar_root}};
    10. Chinese: {{Han compound}}, {{Han sim}}, {{zh-psm}}, {{zh-l}}, {{zh-mm}};
    11. Old Chinese: {{och-l}};
    12. Thai: {{th-l}};
    13. Middle Chinese: {{ltc-l}};
    14. Hebrew: {{he-m}}, {{m/he}};
    15. Lojban: {{jbo-etym}};
    16. Sign languages: ignored;
  3. Diacritical marks: started with Latin
  4. Transliteration
  5. Software documentation improvement
  6. Queries improvement: for this purpose I have changed the ontology in the database
  7. Dissemination: we participated in 8 meetings in 6 different countries:
    1. Polyglot Gathering (Bratislava, Slovakia)
    2. Wikimania 2017 (Montreal, Canada)
    3. WikidataCon 2017 (Berlin, Germany)
    4. itWikiCon 2017 (Trento, Italy)
    5. Balab (Bari, Italy)
    6. Impact Hub (Bari, Italy)
    7. WWW17 (Perth, Australia)
    8. Wikimedia France (Paris, France)
  8. More tasks
    1. Interaction with the Wikidata developers community: spent some days in the Wikidata Offices in Berlin interacting with the developers community.
    2. Improvement of the Virtuoso faceted browser: fixed the link issue on Virtuoso.

Part 2: The Grant

Finances

Item | Area | Description | Commitment | Cost
Visual interface and parsing method improvement | Software development | Responsible for the whole work package | Full time (40 hrs/week), 8 PM(1) | 38,400 €
Dissemination | Participation in Wikimania 2017 in Montreal (1) | Travel & lodging | One-off | 408.4 €
Dissemination | Presentation at WikidataCon17 in Berlin, Germany + collaboration with the developers lab for 15 days (2) | Travel & lodging | One-off | 754.85 €
Dissemination | Presentation at Wikimedia France in Paris, France (3) | Travel & lodging | One-off | 276.99 €
Dissemination | Presentation at http://www.polyglotbratislava.com/en/ (accepted) (4) | Registration, travel, board & lodging | One-off | 532.79 €
Dissemination | Presentation at itWikiCon 2017 Trento, Italy (5) | Travel & lodging | One-off | 207.75 €
Dissemination | Presentation at WWW17 Perth, Australia | Poster | One-off | 47.6 €
Dissemination | Presentation at Balab and Impact Hub, Bari, Italy | - | One-off | 0 €
Total (6) | | | | 40,219.98 €

(1) 132.56 € (travel: airplane + bus)+ 275.84 € (board + lodging) = 408.4 €

(2) 184.85 € (travel: airplane + airport transfer + train) + 570 € (board + lodging) = 754.85 €

(3) 174.43 € (travel: airplane + airport transfer + train) + 102.56 € (board + lodging) = 276.99 €

(4) 391.6 € (registration + board + lodging) + 141.19 € (airplane + bus + airport transfer) = 532.79 €

(5) 132 € (board + lodging) + 75.75 € (airplane + train + airport transfer) = 207.75 €

(6) predicted expenses = 43,549

Grantee reflection