Community Wishlist Survey 2020/Wiktionary/What's in the newspaper today?

From Meta, a Wikimedia project coordination wiki

What's in the newspaper today?

  • Problem: Wiktionarians cannot detect every newly used word in real time and add it as soon as it appears, even though examples of its use are accessible online.
  • Who would benefit: Contributors and readers
  • Proposed solution: Develop a tool that harvests online newspapers and records words that are missing from the Wiktionaries' databases.
  • More comments: This tool would have to be adapted for each language and/or resource. Darkdadaah created a similar tool and ran it for French from 2010 to 2013.
  • Phabricator tickets:
  • Proposer: DaraDaraDara (talk) 14:54, 8 November 2019 (UTC)
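
The core of the proposed solution could be sketched roughly as follows. This is only a minimal illustration, not Darkdadaah's tool: `KNOWN_WORDS` is a hypothetical stand-in for the set of entries a Wiktionary already has, the function name `find_missing_words` is invented here, and the tokenizer and lowercasing are simplifying assumptions that would need adapting for each language.

```python
import re

# Hypothetical stand-in for the entries a Wiktionary already has;
# a real tool would load this from a database dump.
KNOWN_WORDS = {"the", "minister", "announced", "a", "new", "plan", "today"}

# Runs of letters, optionally joined by hyphens or apostrophes (an assumption
# that suits English or French better than many other languages).
WORD_RE = re.compile(r"[^\W\d_]+(?:[-'’][^\W\d_]+)*")


def find_missing_words(text, known_words):
    """Return words in `text` absent from `known_words`, in order of first appearance."""
    seen = set()
    missing = []
    for match in WORD_RE.finditer(text):
        word = match.group(0).lower()  # lowercasing is a simplification
        if word not in known_words and word not in seen:
            seen.add(word)
            missing.append(word)
    return missing


article = "The minister announced a new flexitarian plan today."
print(find_missing_words(article, KNOWN_WORDS))  # ['flexitarian']
```

The output is a list of candidates only; a contributor would still have to decide whether each one deserves an entry.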


  • Harvesting newspapers is a great way to detect new words, and it helps to have selected sentences to add as examples. Some manual selection is still needed, as some sentences are correct but too long or depend too heavily on their context. Thematic labelling may also help Wikinewsies and Wikipedians find more sources. Noé (talk) 11:27, 15 November 2019 (UTC)
What about the licenses of those newspapers? -Theklan (talk) 10:30, 22 November 2019 (UTC)
@Theklan: it does not matter. The same rule applies as for books. The idea here is just to crawl the newspapers every day and extract only the sentence containing the new word. In that case it is a short quotation, and using it is allowed. See this page with French words for example. Pamputt (talk) 13:40, 23 November 2019 (UTC)
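
The extraction step described here, keeping only the sentence that contains the new word, might look something like the sketch below. The function name `sentences_with_word`, the naive sentence splitter and the `max_chars` limit are all assumptions made for illustration; in particular, the 200-character default is an arbitrary cut-off for overly long sentences, not a legal threshold for quotations.

```python
import re


def sentences_with_word(text, word, max_chars=200):
    """Return sentences of `text` containing `word` as a whole word.

    Sentences longer than `max_chars` are skipped (arbitrary limit).
    """
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [s for s in sentences if len(s) <= max_chars and pattern.search(s)]


article = (
    "The council met on Monday. Several members described the budget as "
    "a flexitarian compromise. No vote was taken."
)
print(sentences_with_word(article, "flexitarian"))
# ['Several members described the budget as a flexitarian compromise.']
```

A splitter this simple will break on abbreviations such as "Mr. Smith", so the extracted sentences would still need the manual selection mentioned above.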
  • We already have "Wiktionary:Frequency lists", which is based on TV subtitles and has thousands of missing words in all languages, including English. Newspapers will give you a lot of typos and wordplay. This is not needed as long as we have big lists of missing words. We can also use aspell lists to locate more missing words. On Wikipedia there is a project named moss (under the Typo Team) that offers thousands of missing words that are used in Wikipedia. Uziel302 (talk) 21:18, 25 November 2019 (UTC)
    @Uziel302: indeed, we will get typos, but they should be limited because here we parse newspapers (not blogs and forums). And yes, there are already a lot of missing words, but parsing newspapers will help to identify neologisms and then to create the missing entries. This is worthwhile because people are more likely to look up neologisms than rarer words. Pamputt (talk) 06:45, 26 November 2019 (UTC)