Grants talk:IEG/A graphical and interactive etymology dictionary based on Wiktionary

April 12 Proposal Deadline: Reminder to change status to 'proposed'


@Epantaleo: The deadline for Individual Engagement Grant (IEG) submissions this round is April 12th, 2016. To submit your proposal, you must (1) complete the proposal entirely, filling in all empty fields, and (2) change the status from "draft" to "proposed." As soon as you’re ready, you should begin to invite any communities affected by your project to provide feedback on your proposal talkpage. If you have any questions about finishing up or would like to brainstorm with us about your proposal, we're hosting an IEG proposal help session tomorrow on April 12th from 16:00 - 17:00 UTC on Google Hangouts.

I'm also happy to set up an individual session tomorrow if needed.

With thanks, I JethroBT (WMF) (talk) 22:30, 11 April 2016 (UTC)

Integration with Wikidata


@User:Jura1: Thanks a lot for your comment. I just posted a question in Wikidata about that (here). I was assuming once generated, the output RDF file would be easy/standard to export to Wikidata. Is that correct? I would like to ask you other questions too, if you have time. Where do you prefer to chat? — The preceding unsigned comment was added by Epantaleo (talk)

Just a quick answer to your question: Wikidata is currently primarily concept-based. There were several proposals about inclusion of Wiktionary (most recent at d:Wikidata:Wiktionary/Development/Proposals/2015-05), but I don't think there is a plan, budget or target date for what will actually happen. It could be that there will be a separate instance that would hold only special items for words from Wiktionary. The only exceptions we currently have are first names and family names (each variation gets an item). There is also d:Property:P138 and a complex structure for taxon names.

We can chat on a user page on Meta. Maybe this is less formal than on your proposal page. BTW for the ping to work, don't add "d:". --Jura1 (talk) 16:19, 15 April 2016 (UTC)

Other users encouraged me to involve the Wikidata community more directly. I sent a message to the Wikidata mailing list wikidata@lists.wikimedia.org and I'm waiting for feedback. Epantaleo

Maintainability


As both a Wiktionarian and an etymology researcher, I am finding this proposal highly interesting. This is though more due to the database side of the project than the visualization aspect. Etymological data greatly benefits from being structured; as you note, e.g. due to the ability to link multiple words to each other at once, instead of having to individually maintain separate etymology sections for multiple closely related words (such as cognates in a set of sister languages).

This project sounds like it has the potential to become a seed (other similar works probably exist elsewhere) for a truly global etymological database, cataloguing the products of research worldwide. On the other hand — it also has the potential to become an unmaintainable source of editorial busywork, littered with outdated and/or duplicate data.

Some questions on what the plans for the future maintenance of the etymological database are (a few of these I have seen answered in some of the initial notifications, but the proposal would benefit from spelling this out more exactly):

Where will the etymological database be hosted? As a part of Wiktionary?
Is data going to be periodically extracted from the textual contents of Wiktionary, to maintain up-to-date-ness, or is only manual database extraction planned — leaving the database possibly slowly drifting ever more out of sync with Wiktionary (and with ongoing primary etymological research)?
- Or is the proposal for the database to supersede current etymological information at Wiktionary altogether?
Etymology, like all historical sciences, does not involve exact and provable facts. Will the database reflect this reality?
- If competing etymologies have been presented for a word, will this be indicated?
- Will it be possible for a reader to somehow examine the actual arguments that have been presented in favor of one out of a competing etymology?
- Will there be any attempt to model the hypotheticality (and non-exactness) of reconstructed proto-forms?
- If no to all or some of the above, what will be the method of deciding when an etymology has been established firmly enough that it can be stated as a simple matter of fact? Derivations between attested languages only? Naive belief that whatever has been inserted at Wiktionary is correct?
Etymology is not delimitable. All languages are ultimately etymologically connected, through e.g. loanwords. Are there any limits planned on what current-day languages' etymological information will be included?
- As the division between "language" and "dialect" is arbitrary, will the database implement its own list of canonical languages to etymologically track, or will it be somehow synced with e.g. Wiktionary's own (and still constantly evolving) list of distinguished language varieties?
Will there be a beta test on the editability of the data?

Some of these questions may exceed the scope of the project; I'm open to taking parts of the discussion elsewhere, if there is interest. --Tropylium (talk) 16:45, 22 April 2016 (UTC)

@Tropylium: Thanks for your insightful comments and for showing your interest! I'll try to answer your questions here first and then I will try to integrate these comments into the grant.

I am discussing on the wikimedia multimedia IRC channel and on the wikimedia database channel where to host the database. Apparently wikidata is not ready to host it. I will know more about this soon.
The idea is to synchronize this database with Wiktionary (see DBpedia live) or update it periodically (as it is the case for the database built with dbnary).
In a first version, for simplicity, no conflicting etymology will be included, or only one of the conflicting versions. The plan is to have multiple (linked) visualizations when there are conflicting etymologies, and notes on branches of the tree (or elsewhere) if there are controversies about etymological relationships.
When I started developing the etymological relationships extraction tool I looked into how to extract conflicting etymologies from Wiktionary. I couldn't find any clear pattern or standard rule (suggestions welcome!). This is the main problem with a textual only etymology: following/setting rules or standards when editing. Hopefully this project will help set standards, otherwise an important part of the information contained in Wiktionary etymologies will not be machine readable and usable for research.
Note that conflicting etymologies could be assigned some kind of confidence with a default "null prior" probability (0.5 probability to each alternative if there are only two alternatives), with the etymological tree that editors believe is more likely being displayed first (I envision a stack of multiple etymological trees that the user can navigate through when conflicting etymologies are available). In a database this would mean attaching a probability (a number from 0 to 1) to the etymological relationship. I plan on working on this at a later stage though.
I am not sure what you mean when you say "Will there be any attempt to model the hypotheticality (and non-exactness) of reconstructed proto-forms?". As of now an * precedes a word if the word is reconstructed. Is this enough? Or maybe a special label for the branch or a special color for the branch could be helpful? Suggestions are very welcome.
In a first version, I plan on using all Wiktionary languages as well as special languages but I think for consistency maybe these should be reduced to a smaller set later on.
There will be.
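
The "null prior" idea from the answers above can be sketched roughly as follows. This is only an illustration, not the project's code: the `EtymologyEdge` class, the function names and the placeholder words X, Y1, Y2 are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class EtymologyEdge:
    """One etymological relationship with a confidence attached (hypothetical)."""
    descendant: str
    ancestor: str
    probability: float  # a number from 0 to 1


def with_null_prior(descendant, alternatives):
    """Give every competing ancestor the same prior, 1/n."""
    n = len(alternatives)
    return [EtymologyEdge(descendant, a, 1.0 / n) for a in alternatives]


def display_order(edges):
    """Most likely alternative first; ties keep their input order."""
    return sorted(edges, key=lambda e: e.probability, reverse=True)


# Two competing etymologies for a placeholder word X.
edges = with_null_prior("X", ["Y1", "Y2"])
priors = [e.probability for e in edges]   # 0.5 each with two alternatives

# Hypothetical editor judgement: Y2 is considered more likely than Y1.
edges[0].probability = 0.3
edges[1].probability = 0.7
stack = display_order(edges)              # Y2's tree would be shown first
```

With more than two alternatives the same function gives 1/n to each, which matches the 0.5 default only in the two-alternative case.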

I am looking forward to your following comments. Thanks again! --Epantaleo

@Tropylium:, @Epantaleo: As the main developer and maintainer of DBnary that is mentioned by the proposal, I'll add my point of view on the questions raised here.

If the extraction programs are integrated into DBnary extraction framework (as it is the case for the proof of concept now), then, the extracted dataset will be integrated into DBnary and made available as a separate download file (1 per source language) AND available as linked data in the DBnary server. As mentioned earlier, there are no problems for the data to be included in other datasets (e.g. WikiData, should it be ready to receive dictionary information).
DBnary is updated as soon as new dumps are available. This maintains up-to-dateness and also allows for the study of the data evolution (all extracted history is also available). DBnary is also able to monitor the performance of the extractor and detect when the wiktionary community changes its templates, leading to a more reactive maintenance of the extractors.
One interesting aspect of DBnary is the archive of all extractions. This may be interesting to detect when communities frequently change some etymology data...

Other questions have been addressed by the proposal submitter. --dodecaplex

What about "popular" but wrong etymologies? There are articles which do include such data. Would you include that as a "conflicting" etymology with some label on the tree? It would be interesting to have an associated tag/category for these kinds of cases. --Psychoslave (talk) 18:02, 6 November 2016 (UTC)

@Psychoslave: Thanks for your comment, Psychoslave! The method I'm using to extract etymological information from Wiktionary is automated, and unfortunately there is no standard to tag "popular etymologies" in Wiktionary - it would be great to have that. As a consequence I'm not capable of extracting that information unless some kind of manual editing is done. Does that make sense?

Eligibility confirmed


This Individual Engagement Grant proposal is under review!

We've confirmed your proposal is eligible for review and scoring. Please feel free to ask questions and make changes to this proposal as discussions continue during this community comments period (through 2 May 2016).

The committee's formal review begins on 3 May 2016, and grants will be announced 17 June 2016. See the round 1 2016 schedule for more details.

Questions? Contact us at iegrantswikimedia · org .

--Marti (WMF) (talk) 04:45, 28 April 2016 (UTC)

Pointers to related work


Hi there,

Interesting proposal. I thought the following couple of links could be useful for your project:

Cheers, --Hjfocs (talk) 09:50, 6 May 2016 (UTC)

@Hjfocs: Thanks for the pointers!

I can see that people have been discussing for a long time about how to integrate Wiktionary data into Wikidata and I believe that with some efforts and a tight collaboration with the Wikidata community this project could really be a first big step towards the much needed integration.

For this purpose there are many open source frameworks that have been built outside of Wikimedia to extract machine readable information from Wiktionary and that could be used in Wikimedia. I chose Dbnary because the software is well documented, because the same group is developing Blexisma (a multi agent system for the creation of a semantic lexical database), and because Dbnary has already been developed for other languages besides English (namely Bulgarian, Dutch, Finnish, French, German, Greek, Italian, Japanese, Polish, Portuguese, Russian, Serbo-Croat, Spanish, Swedish and Turkish). Another interesting alternative is JWKTL to parse a Wiktionary dump together with UBY a standardized linked lexical resource including among others WordNet and Wiktionary and sense alignments between them. Also people at WordNet seem very interested in integrating this visual extension with the semantic network through the Wordnet.

Any suggestions about how to increase participation from the Wikidata community are really welcome.

Cheers! --Epantaleo

Comments


I was asked to give some comments on the proposal:

As with the other user above, I agree that the most interesting aspect of this proposal is the extraction and database part, not the visualisation part. So I would focus more on this than on the visualisation (which, given a reasonable database format, can be done fairly easily).
The proposal would be much more compelling if it were to import the data into Wikidata, and then users would be able to query it in creating new etymologies, or at least use it as some kind of validator for manually creating entries. As someone who works a lot with data conversion, I know how frustrating it is to have two (or thirteen!) copies of the same data which are then independently updated or out of sync.

- Francis Tyers (talk) 10:24, 7 May 2016 (UTC)

@Francis Tyers: Thanks for your comment!

I agree with you that the database part is the most interesting part because of its many potential applications (e.g., many types of visualizations could be developed from a Wiktionary database). That is why, as I write in the project, 50% of the time will be spent on developing the extraction code and only 5% of the time on the visualization; the rest of the time will be spent on other aspects, especially on checking that the output is consistent.

However at this stage I think visualizing the tree of etymologies is useful to:

developers to check that the extracted data are consistent
editors to check that their edits are consistent with what others have contributed (on different pages)

As a bonus, through the visualization, users will be able to discover new etymological relationships and new words in different languages.

Also I think the visualization will help define new standards and templates within etymology sections as editors will see what is currently exportable and which type of information instead is not.

Synchronizing the database with Wiktionary is also fundamental and this is what I want to work on.

Finally I would be happy to find ways to export data into Wikidata and collaborate with the Wikidata community.

-- Epantaleo

Translation to the Wikidata Data Model


The latest proposal for the integration of Wiktionary into Wikidata seems to be compatible, at least from a conceptual perspective.

For instance, let's compare its supercompact mockup with a sample output of this proposal:

eng:trance__Etymology_1
    a dbnary:EtymologyEntry ;
    dbnary:etymologicallyDerivesFrom enm-eng:traunce ;
    dbnary:refersTo eng:trance__Noun__1 .

The eng:trance__Noun__1 node would become a Wikidata Item, e.g., L1234 with Label trance, Language eng, and Word type noun, while the etymological-specific relations like dbnary:etymologicallyDerivesFrom would be attached as statements via the Etymology Wikidata property.

I would love to read further investigation on the topic, as it seems to me a way to get this proposal actually integrated into Wikidata. --Hjfocs (talk) 16:58, 31 May 2016 (UTC)

Wikidata Toolkit and DBnary


Keeping in mind the integration of this proposal into Wikidata, here is a more technical thought.

Some DBnary components could probably be reused to extract the data from Wiktionary. Then, the extraction output could be connected to the Wikidata Toolkit, which will be responsible for the translation into the Wikidata data model. --Hjfocs (talk) 17:09, 31 May 2016 (UTC)

Primary Sources Tool


With respect to the potential inclusion of the output dataset of this proposal into Wikidata, the primary sources tool could be an optimal candidate, since I expect the dataset to:

possibly contain extraction errors, thus needing a validation step;
be too large for a direct inclusion via a bot.

In this case, the dataset should be serialized into the QuickStatements syntax. --Hjfocs (talk) 17:18, 31 May 2016 (UTC)
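
As a rough illustration of that serialization step, the sketch below writes one tab-separated command per statement (subject, property, value), which is the basic shape of QuickStatements commands. Everything specific here is an assumption: the IDs L1234, L5678 and L4321 are placeholders in the style of the mockup discussed above, P9999 stands in for an etymology property that would first have to exist, and whether the tool accepts such word-level IDs is not something this sketch can confirm.

```python
def to_quickstatements(statements, property_id):
    """Serialize (subject_id, value_id) pairs as tab-separated commands."""
    return "\n".join(f"{subject}\t{property_id}\t{value}"
                     for subject, value in statements)


# Hypothetical IDs: two extracted "derives from" relationships for one word.
extracted = [("L1234", "L5678"), ("L1234", "L4321")]
commands = to_quickstatements(extracted, "P9999")
print(commands)
```

Each output line would then be one candidate statement for human validation, which fits the validation step mentioned in the list above.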

Aggregated feedback from the committee for A graphical and interactive etymology dictionary based on Wiktionary


Scoring rubric	Score
(A) Impact potential Does it have the potential to increase gender diversity in Wikimedia projects, either in terms of content, contributors, or both? Does it have the potential for online impact? Can it be sustained, scaled, or adapted elsewhere after the grant ends?	7.0
(B) Community engagement Does it have a specific target community and plan to engage it often? Does it have community support?	6.1
(C) Ability to execute Can the scope be accomplished in the proposed timeframe? Is the budget realistic/efficient ? Do the participants have the necessary skills/experience?	5.3
(D) Measures of success Are there both quantitative and qualitative measures of success? Are they realistic? Can they be measured?	5.9
Additional comments from the Committee: OMG this is an awesome proposal. Can't believe I missed it on the mailing list. Love the idea of merging the efforts of Wiktionary users across languages this way. What an amazing idea. I hate the idea of generating images per word, though. That should be generated on-the-fly (maybe on Toollabs). This proposal fits with the strategic priority to develop high-priority curation and creation tools as well as provide improved programs, experiences, and resources. I think there is some interesting potential here: not only is it an experiment that could make it easier for editors to update content/data currently hosted in many different places, but it could also reveal important takeaway lessons for the community's move towards structuring Wiktionary data and working with Wikidata. There does not seem to be much planning for sustainability of the project after the grant, but apart from having the code made open source (a prerequisite of IEG funding anyways), there's no good reason to assume the project would be maintained. Looks like it could really help the spread of WikiData, too, if the applicant could find a way to make that work. Great addition to a dictionary. Clear path and plan. Yes, all language Wiktionaries could take advantage of this. Measures of success are not defined. There is a significant amount of innovation here. While this is the largest request under "Tools," it is asking for 6 months of full-time pay and the work looks like it might actually take that much time. At the same time, this will require work in Natural Language Processing. While I don't doubt the ability of the applicant, no demonstrated prior knowledge creates a risk. Good visualization idea. The measures of success are somewhat unclear, which is worrying when looking at the project manager’s proposed salary. Looks doable, but I wouldn't do it in the proposed form, as it would generate lots of .png files that would require maintenance. 
The applicant seems to have relevant experience and is knowledgeable but does not appear to have much experience with WMF projects. The budget seems high for a project that is not solving a key problem or addressing a need (our standard hourly rate for IEG work also rarely exceeds $30). The proposed community target is Wiktionary, but speaking as a Wikidatan I see the use of this for the ontology of tags used in the "depicts" property of artworks, and maybe even in the ontology of classes and subclasses of things. The applicant seems to be very receptive to feedback and has some endorsements. However, no details are provided about how the Wiktionary community of etymology editors will be involved (in steps 4 and 5). I have no idea how big the target community of etymology editors is, or how much interest or demand there is for a project like this. There is strong community support, with a fitting target of Wiktionary. The project covers a lot of community and has many endorsements. Those are good things. It's foreseeable to see the project working for many Wiktionary users. This project has no .png generation, but on-the-fly generation of the word tree. Otherwise, go for it! It's an intriguing idea and would be an interesting experiment. But it's also quite a lot of funding for a project with no clear measures of success or a plan to sustain the work after the grant ends. I'd like the applicant to bring down their requested salary a bit, provide measures of success, and a plan for engaging the community. The project is interesting, but in my opinion there should be preliminary work to check if it could be associated with Wikimedia projects. Wiktionary is not mature in my opinion and a graphical result needs a review of Mediawiki and a big assessment of the community. This is a risky one since it is 30% of the 100,000 budget, but it is a needed tool and the future possibility for integration with WikiData is a real draw.
The project’s cost is quite expensive in that it consists mostly of a contractor’s salary. Project is of great interest to me. However, I do not believe full-time employment is necessarily warranted, as this tool simply isn't critical enough. I would suggest a part-time approach: after the first 6 months of development, there is a possibility of an extension if there is enough progress and need by then. Any conference activity should be considered based on the progress on the product.

-- MJue (WMF) (talk) 01:04, 3 June 2016 (UTC) on behalf of the IEG Committee

Round 1 2016 decision

Congratulations! Your proposal has been selected for an Individual Engagement Grant.

The committee has recommended this proposal and WMF has approved funding for the full amount of your request, $30,000.

Comments regarding this decision:
The committee is pleased to support your work to make etymological knowledge on Wiktionary more usable. We appreciate the greatly expanded applications that become possible through extraction of this data into a database, and we look forward to seeing your work incorporated into Wikidata.

Next steps:

You will be contacted to sign a grant agreement and set up a monthly check-in schedule.
Review the information for grantees.
Use the new buttons on your original proposal to create your project pages.
Start work on your project!

Questions? Contact us.

Website


https://www.epantaleo.com appears to be offline. --Nemo 08:18, 18 June 2016 (UTC)

http://www.epantaleo.com, no httpS. Sobreira (parlez) 14:47, 26 February 2017 (UTC)

@Sobreira: Thanks!

Suggestions on how to make Etymology Sections easy to parse

I am extracting etymological relationships from the English Wiktionary using automated code. The result is a database of etymological relationships and a tool to interactively explore etymological relationships in a graphical way. The tool is working pretty well, because etymology sections in the English Wiktionary are written using very well defined standards. Usually etymology sections have the following structure - for an arbitrary entry X:

From {{etyl|enm|en}} {{m|enm|Y}}, from {{etyl|ine-pro|en}} {{m|ine-pro|*Z}}, etc. Cognate to {{cog|en|C}}.

Alternatively they have the following structure:

From {{etyl|enm|en}} [[Y]], {{etyl|ine-pro|en}} [[*Z]], etc. Cognate to [[C]].

And variations of the above structures. Such standards imply that I can use an algorithm to extract etymological relationships:

X etymologically derives from Y

Y etymologically derives from *Z.

This automatic extraction breaks when the etymology section contains additional words that are embedded in links or templates but are not relevant for inferring etymological relationships.
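
A minimal sketch of this kind of extraction, assuming only the first structure shown above, is given below. It is not the actual extraction code: the regular expression and the function name `extract_chain` are made up for the example, and it handles only `{{m|lang|word}}` mentions.

```python
import re

# Matches {{m|lang|word}} mentions; {{etyl|...}} only names the language.
MENTION = re.compile(r"\{\{m\|([^|}]+)\|([^|}]+)\}\}")


def extract_chain(entry, etymology):
    """Return (descendant, ancestor) pairs read left to right.

    `entry` is a (language, word) tuple for the page X. Only the text
    before "Cognate" is scanned, since cognates are not ancestors.
    """
    ancestors_text = re.split(r"\bCognate\b", etymology, maxsplit=1)[0]
    chain = [entry] + MENTION.findall(ancestors_text)
    return list(zip(chain, chain[1:]))


section = ("From {{etyl|enm|en}} {{m|enm|Y}}, "
           "from {{etyl|ine-pro|en}} {{m|ine-pro|*Z}}. "
           "Cognate to {{cog|en|C}}.")
pairs = extract_chain(("en", "X"), section)
for descendant, ancestor in pairs:
    print(descendant, "etymologically derives from", ancestor)
```

A sketch like this also shows the failure mode described above: any extra `{{m|...}}` mention in the section, even for a word that is not an ancestor, would be picked up as the next link in the chain.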

A way around this would be to

1. Discourage usage of links for ancestors

Example

{{|m|en|bemro}}: A [[portmanteau]] of the [[Lojban]] words ''[[berti]]'' and ''[[merko]]'' (Lojban for North American) <ref>Etymologies are publicly listed on the [http://www.lojban.org/tiki/tiki-index.php?page=gismu_etymology official website].</ref>.

should be (I just edited it)

{{|m|en|bemro}}: A [[Appendix:Glossary#portmanteau|portmanteau]] of the {{etyl|jbo|en}} words {{m|jbo|berti}} and {{m|jbo|merko}} (Lojban for North American) <ref>Etymologies are publicly listed on the [http://www.lojban.org/tiki/tiki-index.php?page=gismu_etymology official website].</ref>.

also maybe here "and" should be replaced by "+"?

2. Link to Glossary (for words that are not ancestors)

Form of, abbreviation, onomatopoeia, ideophones, portmanteau, genitive, participle etc should always be linked to the Glossary in the Appendix. Words like genitive, ablative etc can be signaled by linking to the glossary: e.g.

From {{etyl|fr|en}} {{m|fr|Y}}, from [[Appendix:Glossary#genitive|genitive]] {{etyl|la|en}} {{m|la|Z}} etc. Cognate to {{cog|en|C}}.

3. Add a new template (qualifier?) or html code to specify that a word is not an ancestor

4. More suggestions

4.1. Proposal to distinguish "and" and "+"

"+" should be used for compounds only
"and" should be used for cases like

{{m|en|mongoe}}: From {{etyl|af|en}} {{m|af|moegoe}} and {{etyl|cmt|en}} {{m|cmt|moegoe}}. Further etymology uncertain.

4.2. Encourage correct usage of comma

Correct usage of the comma is important especially for compounds.

example

{{m|en|door}}, + {{m|en|bell}} is not correct

From {{etyl|enm|en}} {{m|enm|Y}} from{{etyl|ine-pro|en}} {{m|ine-pro|*Z}} is not correct

4.3. Proposal to set standards for alternative etymologies

Set a standard for alternative etymologies, e.g. {{alt1=From {{etyl|af|en}} {{m|af|moegoe}}, from ...|alt2=...}}

4.4. Encourage use of template for earliest known usage

Introduce a standard (e.g. for how to format years, centuries etc) for earliest known usage. The template {{defdate}} is a good one, it could be extended to include an argument for references.

Example

See {{m|en|weekend warrior}}

4.5. Deprecate use of "Via" in favor of "From"

4.6. Encourage use of `{{unk}}, {{etyl|und}}, {{etystub}}, {{rfe}}` instead of unknown etymology, disputed etymology etc

4.7. Encourage use of surface etymology to signal a surface etymology

4.8. Proposal to add new templates

4.8.1. Template "Detailed etymology"

Introduce a standard for longer discussions or non standard etymologies, e.g. {{detailed etymology}}.

Examples where it would be needed:

{{m|ase|6@Chin}}: From initial letter {{m|mul|W}} of the English word {{m|en|water}}. Using a location similar to the ASL sign {{m|ase|C@NearChin-TipForward C@NearChin-TipUp||drink}}.

{{m|en|put up one's dukes}}: Possibly by [[analogy]] to a [[king]] or other [[ruler]] [[summon]]ing his [[duke]]s, and by extension the duke's [[knight]]s or other [[soldier]]s, to [[battle]] an [[enemy]]. Another possibility is [[Cockney rhyming slang]] as explained at {{m|en|duke}}.

4.8.2. Template "developed from initialism"

4.8.3. Template "named-from"

Something similar to the already existing template {{named-after}}

example

{{m|en|brave new world}}: From the title of {{w|Aldous Huxley}}'s 1932 novel ''{{w|Brave New World}}'', which is in turn a reference to a line from {{w|William Shakespeare}}'s play ''{{w|The Tempest}}'' (first performed around 1611).

could be

{{m|en|brave new world}}: {{named-from|sub=title|what=novel|what_p=Brave New World|by=Aldous Huxley|date=1932}}, which is in turn a reference to {{named-from|sub=line|what=play|what_p=The Tempest|by=William Shakespeare|date=1611}}

{{m|en|all the world's a stage}}: From the beginning of a [[monologue]] delivered by the character [[w:Jaques (As You Like It)|Jaques]] in Act II, Scene VII, of the play ''[[w:As You Like It|As You Like It]]'' by [[w:William Shakespeare|William Shakespeare]] (baptised 1564; died 1616), believed to have been written in 1599.

could be

{{named-from|sub=beginning of a [[monologue]] delivered by the character [[w:Jaques (As You Like It)|Jaques]] in Act II, Scene VII|what=play|what_p=As You Like It|by=William Shakespeare|date=1599}}

4.8.4. Template "coined by" also specifying a year

4.8.5. Template for explanation of set phrases, or origin of a proverb

Comments on the interactive tool etytree

Please add a comment here if you have some kind of feedback on the interactive tool etytree. Consider that this is a first version. The javascript code I have used is available here and in the README file I have listed some aspects I would like to work on in the near future. Let me know if you would like to contribute!

Difficult to distinguish different types of etymological derivation


Some way of focusing on different directions (forward or backward derivations). For English words there are often a lot of compound derived terms that obscure the etymology in the other direction. DTLHS (talk) 16:32, 14 February 2017 (UTC)

@DTLHS: Agreed! I would like to extend this work and introduce a modification to the visualization. The idea is to orient all arrows in one direction, from left to right, going from the earliest ancestor to the most recent descendant. This requires some work with d3 to add some kind of electric field to the force field I am using right now. Epantaleo (talk) 16:56, 14 February 2017 (UTC)
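
One possible way to get such a left-to-right ordering is to compute a column for each word before handing the graph to the layout. The sketch below uses longest-path layering; it is an illustration in Python rather than d3 code, the function name is made up, and it assumes the graph has no loops (it raises an error otherwise).

```python
from collections import defaultdict, deque


def layer_positions(edges):
    """Map each word to a column index; edges are (ancestor, descendant).

    Longest-path layering on a DAG: a word sits one column to the right
    of its right-most ancestor. Raises ValueError if there is a loop.
    """
    children = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for ancestor, descendant in edges:
        children[ancestor].append(descendant)
        indegree[descendant] += 1
        nodes.update((ancestor, descendant))

    column = {n: 0 for n in nodes}
    # Start from the earliest ancestors (no incoming edges).
    queue = deque(sorted(n for n in nodes if indegree[n] == 0))
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for child in children[node]:
            column[child] = max(column[child], column[node] + 1)
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    if visited != len(nodes):
        raise ValueError("graph contains a loop; layering needs a DAG")
    return column


# Placeholder words: *Z is the earliest ancestor, X the latest descendant.
columns = layer_positions([("*Z", "Y"), ("Y", "X"), ("*Z", "X")])
print(columns)
```

The resulting column indices could then serve as target horizontal positions in the force layout, so that every arrow points from a lower column to a higher one.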

Parse derived terms, descendants sections


These are two other sections that also contain etymological information. DTLHS (talk) 16:44, 14 February 2017 (UTC)

@DTLHS: Thanks for your comment! I have parsed those. Epantaleo (talk) 16:52, 14 February 2017 (UTC)

Mapping between RDF data and Wikidata


The ontologies behind the RDF data structures are defined in the released software DBnary here and here for Lexical Entries and Etymology Entries, respectively, and are documented there. I will focus on Etymology Entries as I have been working on them. For Lexical Entries the reader can refer to the DBnary paper [1].

A Lexical Entry in the RDF database is a Lexeme in the Wiktionary-Wikidata proposal. An Etymology Entry in the RDF database corresponds to a set of Lexical Entries that share the same etymology but do not necessarily share the same lexical category. E.g. wiktionary:link#Noun and wiktionary:link#Verb are distinct lexical entries, and the etymology entry wiktionary:link#Etymology_1 refers to both of them because they share the same etymology from Middle English wiktionary:linke#Middle_English. At the moment there isn't a corresponding entity in the Wiktionary-Wikidata proposal. As of now there are 4 properties that can connect two Etymology Entries: etymologicallyDerivesTo, derivesFrom, descendsFrom, etymologicallyEquivalentTo. They are all subproperties of etymologicallyRelatedTo. If two Etymology Entries are connected by property etymologicallyDerivesTo, then their connection has been extracted from an Etymology Section. If two Etymology Entries are connected by property derivesFrom, then their connection has been extracted from a Derived terms Section. If two Etymology Entries are connected by property descendsFrom, then their connection has been extracted from a Descendants Section. While it is useful to define these properties in the RDF database (for practical reasons, mostly to know which section an etymological relationship has been extracted from, for debugging and for filtering), these could be statements in Wikidata. As of now I think only derived-from has been proposed as a statement.

More specific properties could be defined, like borrowing, back-formation, calque, etc., with some modifications to the extraction code. Epantaleo (talk) 15:32, 17 February 2017 (UTC)
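
The filtering use mentioned above can be sketched as a simple lookup from property to source section. The property names are the ones listed in this thread; the triples, the entry names in them and the function name are invented for the example.

```python
# Which Wiktionary section each property was extracted from, as described above.
SECTION_OF_PROPERTY = {
    "etymologicallyDerivesTo": "Etymology",
    "derivesFrom": "Derived terms",
    "descendsFrom": "Descendants",
}


def from_section(triples, section):
    """Keep only (subject, property, object) triples extracted from `section`."""
    return [t for t in triples if SECTION_OF_PROPERTY.get(t[1]) == section]


# Hypothetical extracted relationships between Etymology Entries.
triples = [
    ("eng:link__Etymology_1", "etymologicallyDerivesTo", "enm:linke__Etymology_1"),
    ("eng:linkage__Etymology_1", "derivesFrom", "eng:link__Etymology_1"),
    ("eng:link__Etymology_1", "descendsFrom", "enm:linke__Etymology_1"),
]
only_etymology = from_section(triples, "Etymology")
print(only_etymology)
```

etymologicallyEquivalentTo is left out of the lookup because the thread does not say which section it comes from.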

[1] Sérasset, Gilles (2014). DBnary: Wiktionary as a Lemon-Based Multilingual Lexical Resource in RDF. Semantic Web Journal (special issue on Multilingual Linked Open Data).

Number of distinct Etymology Entries

Query to the etytree Virtuoso SPARQL endpoint on wmflabs:

   select distinct count(?s) { 
       ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> dbetym:EtymologyEntry . 
       ?s <http://kaiko.getalp.org/dbnary#refersTo> ?e . 
       ?e <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> dbetym:EtymologyEntry . 
   }

output=819563
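As a side note on what such a count can mean, the sketch below evaluates the same graph pattern over a handful of made-up triples and contrasts two figures: the number of matching rows and the number of distinct subjects. In standard SPARQL 1.1 the latter would be written `SELECT (COUNT(DISTINCT ?s) AS ?n)`. All identifiers in the sample data are hypothetical, and the sketch does not settle which of the two figures the Virtuoso shorthand above returns.

```python
# Toy triples standing in for the RDF store; all identifiers are made up.
TYPE = "rdf:type"
REFERS_TO = "dbnary:refersTo"
ETYM_ENTRY = "dbetym:EtymologyEntry"

triples = [
    ("ee_1", TYPE, ETYM_ENTRY),
    ("ee_2", TYPE, ETYM_ENTRY),
    ("ee_3", TYPE, ETYM_ENTRY),
    ("ee_1", REFERS_TO, "ee_2"),
    ("ee_1", REFERS_TO, "ee_3"),
    ("ee_2", REFERS_TO, "ee_3"),
    ("ee_3", REFERS_TO, "lexical_entry_1"),  # target is not an Etymology Entry
]


def matching_rows(triples):
    """Return the (s, e) pairs matched by the graph pattern of the query."""
    entries = {s for s, p, o in triples if p == TYPE and o == ETYM_ENTRY}
    return [
        (s, o)
        for s, p, o in triples
        if p == REFERS_TO and s in entries and o in entries
    ]


rows = matching_rows(triples)
row_count = len(rows)                          # one row per (s, e) pair
distinct_subjects = len({s for s, _ in rows})  # each subject counted once
print(row_count, distinct_subjects)
```

The two figures differ whenever an Etymology Entry refers to more than one other Etymology Entry, as `ee_1` does in the sample data.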

Congratulations

Awesome. Coincidentally, and without knowing about this project, I recently tried to do the same thing with Graphviz and Gephi, but entering the data by hand. I just tried it with the not-yet-created en:wikt:Reconstruction:Proto-Indo-European/paw- (see What links here) < en:wikt:pavor (in order to gather data, create the article and fill the gaps). My results are so similar... and you achieved what I had been trying so hard to get. I'll try to find some time for more comments.

I'd suggest giving the option to choose different graph types (more linear).

Do you have something in mind? I thought a directed graph was the best way to represent this data. Do you have experience with d3.js and graphs? My demo was much more linear but did not allow for loops; I was using d3 trees. Epantaleo (talk) 15:13, 17 February 2017 (UTC)

@Epantaleo: 1/2 No experience with .js (isn't that a module of Java? Or is it .cs or .css?). I knew the Gephi formats and used the basic one (see the ones I added to commons:Category:Etymologic trees). I still have to add the Graphviz ones, but that will take me longer because I want to include the DOT code. This weekend.

And apply colours by distance from the lowest node (is it called a knot?).

I'm not sure if color will help but I'll submit this question to users!

@Epantaleo: 2/2 You can see the use of colours (for lines, depending on distance in the ones with white background, and depending on none/source/target language in the ones with black background). The ones by Graphviz use the colour only for the nodes.

@Sobreira: Cool! Thanks! :) Good idea to color links or dots.
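For anyone who wants to experiment with the idea, here is a minimal sketch of colouring nodes by their distance from a starting node and writing the result as Graphviz DOT. The tree, the palette and the function names are all made up for illustration. This is not the code behind etytree or behind the Commons images.

```python
from collections import deque

# Hypothetical tree: each ancestor maps to its direct descendants.
graph = {"root": ["a", "b"], "a": ["c"], "b": [], "c": []}
PALETTE = ["red", "orange", "yellow", "green"]  # one colour per distance


def distances(graph, start):
    """Breadth-first search: number of edges from `start` to each node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in dist:
                dist[child] = dist[node] + 1
                queue.append(child)
    return dist


def to_dot(graph, start):
    """Build a DOT digraph whose node fill colour depends on distance."""
    dist = distances(graph, start)
    lines = ["digraph etymology {", "  node [style=filled];"]
    for node, d in dist.items():
        colour = PALETTE[min(d, len(PALETTE) - 1)]  # reuse last colour if deep
        lines.append(f'  "{node}" [fillcolor={colour}];')
    for node, children in graph.items():
        for child in children:
            lines.append(f'  "{node}" -> "{child}";')
    lines.append("}")
    return "\n".join(lines)


dot = to_dot(graph, "root")
print(dot)
```

The printed text can be saved to a file and rendered with the Graphviz `dot` command. Colouring the links instead of the nodes would only require moving the colour attribute to the edge lines.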

Do these cognates in boxes interrupt parsing? Do you process cognates? Can it find inconsistencies (two different nodes of the same language pointing to the same descendant)? Epantaleo (talk) 15:13, 17 February 2017 (UTC)

I am not parsing cognates because I wouldn't know how to link them. In general, about incoherences... I am not sure I want to deal with them right now. Some of them are intrinsic to the data, perhaps because different users have different opinions, and I leave these decisions to them. Does this answer your question?
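For reference, the check asked about above (two different nodes of the same language pointing to the same descendant) could be sketched as follows. The edge format, the sample words and the `find_conflicts` helper are assumptions for illustration only. As noted, etytree does not currently perform such a check.

```python
from collections import defaultdict

# Hypothetical edges: ((ancestor language, ancestor word), descendant).
edges = [
    (("la", "word_x"), ("es", "word_z")),
    (("la", "word_y"), ("es", "word_z")),   # second Latin ancestor: conflict
    (("la", "word_x"), ("it", "word_w")),
    (("grc", "word_v"), ("es", "word_z")),  # different language: no conflict
]


def find_conflicts(edges):
    """Map (descendant, ancestor language) to its ancestors when there are 2+."""
    grouped = defaultdict(set)
    for (lang, word), descendant in edges:
        grouped[(descendant, lang)].add(word)
    return {key: sorted(words) for key, words in grouped.items() if len(words) > 1}


conflicts = find_conflicts(edges)
print(conflicts)
```

Such a report would only flag candidates for editors to review, since some of these cases reflect genuinely disputed etymologies rather than errors.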

I also recently discovered en:wikt:template:etymtree and en:wikt:Template:findetym: can you feed them (fewer than 100 have been created)?

I saw them, I am parsing etymtree in fact! Epantaleo (talk) 15:13, 17 February 2017 (UTC)

It could be used for Wikisaurus, as it is even more standardised.

Thanks for the pointer! In fact it's a possibility. Epantaleo (talk) 15:13, 17 February 2017 (UTC)

Sobreira (parlez) 13:29, 17 February 2017 (UTC)

Thank you so much @Sobreira: there is so much more that can be done to improve it so let me know if you would like to contribute :). See above for comments. Epantaleo (talk) 15:13, 17 February 2017 (UTC)

I would like to, Epantaleo, but I don't know much about data mining, nor about etymology beyond the Romance languages derived from Latin (I have been reading up on PIE and attending classes lately, but I still have a lot to learn). I do have a lot of ideas about technical lexicography, though I don't know how to develop many of them.

Sobreira (parlez) 15:23, 17 February 2017 (UTC)

@Sobreira: No worries. One thing that could be done is checking that there are no macroscopic bugs in the languages you know. For example, some users helped me by explaining how diacritics work in Arabic, which was very useful (now I can try to fix the bug). But otherwise, you have already done a lot by supporting my project. Epantaleo (talk) 15:36, 17 February 2017 (UTC)

